Re: Unicode combining characters
От | Bruce Momjian |
---|---|
Тема | Re: Unicode combining characters |
Дата | |
Msg-id | 200110112043.f9BKhBK10486@candle.pha.pa.us обсуждение исходный текст |
Ответ на | Unicode combining characters (Patrice Hédé <phede-ml@islande.org>) |
Список | pgsql-hackers |
Can I ask about the status of this? > Hi all, > > while working on a new project involving PostgreSQL and making some > tests, I have come up with the following output from psql : > > lang | length | length | text | text > ------+--------+--------+-----------+----------- > isl | 7 | 6 | _l_ta | _leit > isl | 7 | 7 | _l_ta | _litum > isl | 7 | 7 | _l_ta | _liti_ > isl | 5 | 4 | ma_ur | mann > isl | 5 | 7 | ma_ur | m_nnum > isl | 5 | 5 | ma_ur | manna > isl | 5 | 4 | _ska | -a_i > > [the misalignment is what I got, it's not a copy-paste error] > > This is pasted from a UTF-8 xterm running psql under a UTF-8 locale, > querying a database created with -E UNICODE (by the way, these are > icelandic words :) ). > > What you see above is misleading, since it's not possible to see that > '_', '_', '_' and '_' are using combining marks, while '_' is not. > > As a reminder, a combining mark in Unicode is that _ is actually > encoded as a + ' (where ' is the acute combining mark). > > Encoded in UTF-8, it's then <61 cc 81> [UTF16: 0061 0301], > instead of <c3 a1> [UTF16: 00E1]. > > The "length" fields are what is returned by length(a.text) and > length(b.text). > > So, this shows two problems : > > - length() on the server side doesn't handle correctly Unicode [I have > the same result with char_length()], and returns the number of chars > (as it is however advertised to do), rather the length of the > string. > > - the psql frontend makes the same mistake. > > I am using version 7.1.3 (debian sid), so it may have been corrected > in the meantime (in this case, I apologise, but I have only recently > started again to use PostgreSQL and I haven't followed -hackers long > enough). > > > => I think fixing psql shouldn't be too complicated, as the glibc > should be providing the locale, and return the right values (is this > the case ? and what happens for combined latin + chinese characters > for example ? I'll have to try that later). If it's not fixed already, > do you want me to look at this ? [it will take some time, as I haven't > set up any development environment for postgres yet, and I'm away for > one week from thursday]. > > => regarding the backend, it may be more complex, as the underlaying > system may not provide any UTF-8 locale to use (!= from being UTF-8 > aware : an administrator may have decided that UTF-8 locales are > useless on a server, as only root connections are made, and he wants > only the C locale on the console - I've seen that quite often ;) ). > > > This brings me to another subject : I will need to support the full > Unicode collation algorithm (UCA, as described in TR#10 [1] of the > Unicode consortium), and I will need to be able to sort according to > locales which may not be installed on the backend server (some of > which may not even be recognised by GNU libc, which supports already > more than 140 locales -- artificial languages would be an example). I > will also need to be able to normalise the unicode strings (TR#15 [2]) > so that I don't have some characters in legacy codepoints [as 00E1 > above], and others with combining marks. > > There is today an implementation in perl of the needed functionality, > in Unicode::Collate and Unicode::Normalize (which I haven't tried yet > :( ). But as they are Perl modules, the untrusted version of perl, > plperlu, will be needed, and it's a pity for what I consider a core > functionality in the future (not that plperlu isn't a good thing - I > can't wait for it ! - but that an untrusted pl language is needed to > support normalisation and collation). > > Note also that there are a lot of data associated with these > algorithms, as you could expect. > > I was wondering if some people have already thought about this, or > already done something, or if some of you are interested in this. If > nobody does anything, I'll do something eventually, probably before > Christmas (I don't have much time for this, and I don't need the > functionality right now), but if there is an interest, I could team > with others and develop it faster :) > > Anyway, I'm open to suggestions : > > - implement it in C, in the core, > > - implement it in C, as contributed custom functions, > > - implement it in perl (by reusing Unicode:: work), in a trusted plperl, > > - implement it in perl, calling Unicode:: modules, in an untrusted > plperl. > > and then : > > - provide the data in tables (system and/or user) - which should be > available across databases, > > - load the data from the original text files provided in Unicode (and > other as needed), if the functionality is compiled into the server. > > - I believe the basic unicode information should be standard, and the > locales should be provided as contrib/ files to be plugged in as > needed. > > I can't really accept a solution which would rely on the underlaying > libc, as it may not provide the necessary locales (or maybe, then, > have a way to override the collating tables by user tables - actually, > this would be certainly the best solution if it's in the core, as the > tables will put an extra burden on the distribution and the > installation footprint, especially if the tables are already there, > for glibc, for perl5.6+, for other software dealing with Unicode). > > The main functions I foresee are : > > - provide a normalisation function to all 4 forms, > > - provide a collation_key(text, language) function, as the calculation > of the key may be expensive, some may want to index on the result (I > would :) ), > > - provide a collation algorithm, using the two previous facilities, > which can do primary to tertiary collation (cf TR#10 for a detailed > explanation). > > I haven't looked at PostgreSQL code yet (shame !), so I may be > completely off-track, in which case I'll retract myself and won't > bother you again (on that subject, that is ;) )... > > Comments ? > > > Patrice. > > [1] http://www.unicode.org/unicode/reports/tr10/ > > [2] http://www.unicode.org/unicode/reports/tr15/ > > -- > Patrice H_D_ ------------------------------- patrice _ islande.org ----- > -- Isn't it weird how scientists can imagine all the matter of the > universe exploding out of a dot smaller than the head of a pin, but they > can't come up with a more evocative name for it than "The Big Bang" ? > -- What would _you_ call the creation of the universe ? > -- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes > ------------------------------------------ http://www.islande.org/ ----- > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
В списке pgsql-hackers по дате отправления: