Re: Unicode upper() bug still present

From Peter Eisentraut
Subject Re: Unicode upper() bug still present
Date
Msg-id Pine.LNX.4.44.0310202235580.29086-100000@peter.localdomain
In response to Re: Unicode upper() bug still present  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Unicode upper() bug still present  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Unicode upper() bug still present  (Karel Zak <zakkr@zf.jcu.cz>)
List pgsql-hackers
Tom Lane writes:

> I'm not sure that "supporting our own locale subsystem" really qualifies
> as "sustainable" ... can you give an estimate of how big the code +
> supporting data is likely to be?

It's not much worse than supporting our own character conversion subsystem
(which, btw., is something we could more likely do without, because the
standard system facilities tend to be quite adequate), and certainly much
less of a burden than maintaining our own set of translated strings.

For the "ctype" category, you can generate the code straight out of the
Unicode tables, with a handful of hardcoded exceptions (like the Turkish
i).  For the "collate" category we need about 40 kB of language-specific
data files plus a big master data file that is maintained by the Unicode
consortium.  (Those 40 kB correspond to the 22 files I currently have,
which, together with the big default file, cover about 70 languages.)
The other locale categories aren't of interest for string processing.
The code isn't large, but of course someone needs to write it.  The
algorithms are standardized (Unicode collation algorithm) and have several
existing implementations.  So this isn't something that we would need to
maintain in a vacuum.
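
To illustrate the "ctype" point, here is a minimal sketch in Python, not proposed backend code.  It assumes that Python's built-in Unicode data can stand in for the Unicode tables, and it only handles one-to-one mappings (so a character like the German sharp s, whose uppercase form is two characters, is left alone).  The names build_upper_table, EXCEPTIONS and upper are hypothetical and exist only for this example.

```python
def build_upper_table(limit=0x250):
    """Generate a one-to-one uppercase table from Python's bundled Unicode data.

    Assumption: chr(cp).upper() stands in for reading the Unicode data files.
    Mappings that expand to more than one character are skipped on purpose.
    """
    table = {}
    for cp in range(limit):
        ch = chr(cp)
        up = ch.upper()
        if len(up) == 1 and up != ch:
            table[ch] = up
    return table


UPPER = build_upper_table()

# Hard-coded, language-specific exceptions layered over the generated table.
# In Turkish, "i" uppercases to U+0130 (capital I with dot above).
EXCEPTIONS = {
    "tr": {"i": "\u0130"},
}


def upper(text, lang=None):
    """Uppercase text, consulting the exceptions for lang before the table."""
    overrides = EXCEPTIONS.get(lang, {})
    return "".join(overrides.get(ch, UPPER.get(ch, ch)) for ch in text)


print(upper("istanbul"))        # generic mapping
print(upper("istanbul", "tr"))  # Turkish mapping, starts with U+0130
```

The generated table does nearly all of the work; the exception dictionary is the only hand-maintained part.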

(Note that I say Unicode a lot here because those people do a lot of
research and standardization in this area, which is available for free,
but this does not constrain the result to work only with the Unicode
character set.)

> I agree that depending on the system-provided locale behavior has its
> downsides, but it has its upsides too; compatibility with the behavior
> of everything else on the machine being one big one.  So the idea of
> being able to use glibc where available shouldn't be rejected out of
> hand, I think.

I like to think that in the end we can do much better than the POSIX
framework can do.  For instance, the character classification can have
more useful categories, the case conversion can be context-dependent
(which is a requirement in some languages), and users could more directly
add their own collations or parametrize existing ones (because no one ever
seems to agree on the details).
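
As a rough illustration of parametrizing a collation, the following Python sketch builds multi-level sort keys with a default rule plus an optional per-language override table.  It is a heavily simplified toy and not the Unicode collation algorithm: the default weights are derived from canonical decomposition instead of the consortium's master data file, and the names weights, sort_key and SWEDISH are hypothetical.

```python
import unicodedata


def weights(ch, tailoring):
    """Return (primary, secondary, tertiary) weights for one character.

    Toy defaults: primary is the base letter, secondary is the string of
    combining accents, tertiary is 1 for characters that change when
    lowercased and 0 otherwise.
    """
    lower = ch.lower()
    tertiary = 0 if ch == lower else 1
    if lower in tailoring:
        return tailoring[lower], "", tertiary
    decomposed = unicodedata.normalize("NFD", lower)
    return (decomposed[0], 0), decomposed[1:], tertiary


def sort_key(text, tailoring=None, strength=3):
    """Build a key that compares all primaries first, then secondaries, etc."""
    tailoring = tailoring or {}
    per_char = [weights(ch, tailoring) for ch in text]
    return tuple(tuple(w[level] for w in per_char) for level in range(strength))


# Hypothetical tailoring: Swedish sorts these letters after "z".
SWEDISH = {
    "\u00e5": ("z", 1),  # a with ring above
    "\u00e4": ("z", 2),  # a with diaeresis
    "\u00f6": ("z", 3),  # o with diaeresis
}

words = ["zebra", "\u00e4pple", "apple", "Apple"]
print(sorted(words, key=sort_key))
print(sorted(words, key=lambda w: sort_key(w, SWEDISH)))
```

The strength parameter is one example of a knob a user might want: with strength=1 the comparison ignores both accents and case, while a tailoring table changes where individual letters sort.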

-- 
Peter Eisentraut   peter_e@gmx.net


