Re: Unicode upper() bug still present
| From | Peter Eisentraut | 
|---|---|
| Subject | Re: Unicode upper() bug still present | 
| Date | |
| Msg-id | Pine.LNX.4.44.0310202235580.29086-100000@peter.localdomain | 
| In reply to | Re: Unicode upper() bug still present (Tom Lane <tgl@sss.pgh.pa.us>) | 
| Responses | Re: Unicode upper() bug still present, Re: Unicode upper() bug still present | 
| List | pgsql-hackers | 
Tom Lane writes:

> I'm not sure that "supporting our own locale subsystem" really qualifies
> as "sustainable" ... can you give an estimate of how big the code +
> supporting data is likely to be?

It's not much worse than supporting our own character conversion subsystem (which, btw., is something we could more likely do without, because the standard system facilities tend to be quite adequate), and certainly much less of a burden than maintaining our own set of translated strings.

For the "ctype" category, you can generate the code straight out of the Unicode tables, with a handful of hardcoded exceptions (like the Turkish i). For the "collate" category we need about 40 kB of language-specific data files plus a big master data file that is maintained by the Unicode consortium. (Those 40 kB correspond to the 22 files I currently have, which, together with the big default file, cover about 70 languages.) The other locale categories aren't of interest for string processing.

The code isn't large, but of course someone needs to write it. The algorithms are standardized (Unicode collation algorithm) and have several existing implementations. So this isn't something that we would need to maintain in a vacuum. (Note that I say Unicode a lot here because those people do a lot of research and standardization in this area, which is available for free, but this does not constrain the result to work only with the Unicode character set.)

> I agree that depending on the system-provided locale behavior has its
> downsides, but it has its upsides too; compatibility with the behavior
> of everything else on the machine being one big one. So the idea of
> being able to use glibc where available shouldn't be rejected out of
> hand, I think.

I like to think that in the end we can do much better than the POSIX framework can do. For instance, the character classification can have more useful categories, the case conversion can be context-dependent (which is a requirement in some languages), and users could more directly add their own collations or parametrize existing ones (because no one ever seems to agree on the details).

-- 
Peter Eisentraut   peter_e@gmx.net
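As a rough illustration of the "ctype" point in the message above (case mapping taken from the Unicode tables, plus a handful of hardcoded exceptions such as the Turkish i), here is a minimal Python sketch. It is not PostgreSQL code and not the author's implementation. It assumes Python's built-in `str.upper()`, which is backed by the Unicode character database, as the stand-in for "generated from the Unicode tables"; the names `SPECIAL_UPPER` and `locale_upper` are hypothetical and exist only for this example.

```python
# Hypothetical per-language exception table layered over the default
# Unicode case mapping. Only the dotted-i rule for Turkish and
# Azerbaijani is listed here; a real table would be curated separately.
SPECIAL_UPPER = {
    "tr": {"i": "\u0130"},  # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
    "az": {"i": "\u0130"},
}


def locale_upper(text: str, lang: str = "") -> str:
    """Uppercase text with default Unicode mappings plus language exceptions."""
    exceptions = SPECIAL_UPPER.get(lang, {})
    # Use the language-specific exception when one exists; otherwise fall
    # back to the default mapping, which may expand to several characters.
    return "".join(exceptions.get(ch, ch.upper()) for ch in text)
```

With this sketch, `locale_upper("istanbul")` gives `ISTANBUL`, while `locale_upper("istanbul", "tr")` gives `İSTANBUL`. The exception table is deliberately tiny, which matches the claim that the language-specific part is small compared with the generated default data.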