Re: Unicode combining characters

From: Patrice Hédé
Subject: Re: Unicode combining characters
Date:
Msg-id: 20011011232336.N14587@idf.net
In reply to: Unicode combining characters  (Patrice Hédé <phede-ml@islande.org>)
Replies: Re: Unicode combining characters  (Tatsuo Ishii <t-ishii@sra.co.jp>)
         Re: Unicode combining characters  (Bruce Momjian <pgman@candle.pha.pa.us>)
List: pgsql-hackers
* Bruce Momjian <pgman@candle.pha.pa.us> [011011 22:49]:
> 
> Can I ask about the status of this?

I sent a patch a few days ago that solves the client-side issue (on
the pgsql-patches mailing list) for review. I think Tatsuo said it
looked OK, but he should confirm or deny this.

There is still the issue of Unicode characters with code points
above U+FFFF, which should probably be rejected on the server side.
I have yet to update my patch for that. I'll probably do it tomorrow,
as I don't have any more time tonight, but I think it will be
trivial, so maybe Tatsuo can do it if he has some time before then :)
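
As a rough illustration of the kind of check meant here (a Python sketch
only, not the actual patch, which is not shown in this message; the
function names find_non_bmp and check_bmp_only are hypothetical):

```python
def find_non_bmp(text):
    """Return (index, 'U+XXXXXX') for every code point above U+FFFF."""
    return [
        (i, f"U+{ord(ch):06X}")
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


def check_bmp_only(text):
    """Return text unchanged, or raise ValueError on the first code point
    above U+FFFF (hypothetical stand-in for a server-side rejection)."""
    offenders = find_non_bmp(text)
    if offenders:
        index, code_point = offenders[0]
        raise ValueError(
            f"code point {code_point} at index {index} is above U+FFFF"
        )
    return text
```

A caller would run check_bmp_only() on incoming text and treat the
ValueError as the rejection.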

If there are other issues, I'd like to know :)

Regarding the implementation of Unicode functionality (normalisation,
collation, Unicode-aware regexes, uc/lc/tc (title-case) functions,...)
on the server side, it's definitely something for 7.3 (though it might
be available sooner). It will probably be just a contributed extension
first. I'm currently making an alpha version of the project I'm
working on in order to have sufficient "real-life" Unicode data to
work with, and make sure the design choices make sense :)

Patrice.

BTW, I tried to find web-accessible archives of pgsql-patches. Are
there any, or should each and every discussion be followed up on
pgsql-hackers (even though the description of pgsql-patches includes
discussion of patches)?

> > Hi all,
> > 
> > while working on a new project involving PostgreSQL and running some
> > tests, I came up with the following output from psql:
> > 
> >  lang | length | length |   text    |   text
> > ------+--------+--------+-----------+-----------
> >  isl  |      7 |      6 | álíta | áleit
> >  isl  |      7 |      7 | álíta | álitum
> >  isl  |      7 |      7 | álíta | álitið
> >  isl  |      5 |      4 | maður    | mann
> >  isl  |      5 |      7 | maður    | mönnum
> >  isl  |      5 |      5 | maður    | manna
> >  isl  |      5 |      4 | óska    | -aði
> > 
> > [the misalignment is what I got, it's not a copy-paste error]
> > 
> > This is pasted from a UTF-8 xterm running psql under a UTF-8 locale,
> > querying a database created with -E UNICODE (by the way, these are
> > Icelandic words :) ).
> > 
> > What you see above is misleading, since it's not possible to see that
> > 'á', 'í', 'ö' and 'ó' are using combining marks, while 'ð' is not.
> > 
> > As a reminder, with a combining mark in Unicode, á is actually
> > encoded as a + ' (where ' is the acute combining mark).
> > 
> > Encoded in UTF-8, it's then <61 cc 81> [UTF16: 0061 0301],
> > instead of <c3 a1> [UTF16: 00E1].
> > 
> > The "length" fields are what is returned by length(a.text) and
> > length(b.text).
> > 
> > So, this shows two problems :
> > 
> > - length() on the server side doesn't handle Unicode correctly [I get
> >   the same result with char_length()], and returns the number of chars
> >   (as it is, however, advertised to do), rather than the length of
> >   the string.
> > 
> > - the psql frontend makes the same mistake.
> > 
> > I am using version 7.1.3 (debian sid), so it may have been corrected
> > in the meantime (in this case, I apologise, but I have only recently
> > started again to use PostgreSQL and I haven't followed -hackers long
> > enough).
> > 
> > 
> > => I think fixing psql shouldn't be too complicated, as glibc should
> > provide the locale and return the right values (is this the case?
> > And what happens with combined Latin + Chinese characters, for
> > example? I'll have to try that later). If it's not fixed already,
> > do you want me to look at this? [It will take some time, as I haven't
> > set up a development environment for postgres yet, and I'm away for
> > one week from Thursday.]
> > 
> > => regarding the backend, it may be more complex, as the underlying
> > system may not provide any UTF-8 locale to use (which is different
> > from being UTF-8 aware: an administrator may have decided that UTF-8
> > locales are useless on a server, as only root connections are made,
> > and he wants only the C locale on the console - I've seen that quite
> > often ;) ).
> > 
> > 
> > This brings me to another subject: I will need to support the full
> > Unicode collation algorithm (UCA, as described in TR#10 [1] of the
> > Unicode consortium), and I will need to be able to sort according to
> > locales which may not be installed on the backend server (some of
> > which may not even be recognised by GNU libc, which already supports
> > more than 140 locales -- artificial languages would be an example). I
> > will also need to be able to normalise Unicode strings (TR#15 [2])
> > so that I don't have some characters in legacy code points [such as
> > 00E1 above] and others with combining marks.
> > 
> > There is today an implementation in Perl of the needed functionality,
> > in Unicode::Collate and Unicode::Normalize (which I haven't tried yet
> > :( ). But as they are Perl modules, the untrusted version of Perl,
> > plperlu, will be needed, and that's a pity for what I consider a core
> > functionality in the future (not that plperlu isn't a good thing - I
> > can't wait for it! - but that an untrusted PL language is needed to
> > support normalisation and collation).
> > 
> > Note also that there is a lot of data associated with these
> > algorithms, as you would expect.
> > 
> > I was wondering if some people have already thought about this, or
> > already done something, or if some of you are interested in this. If
> > nobody does anything, I'll do something eventually, probably before
> > Christmas (I don't have much time for this, and I don't need the
> > functionality right now), but if there is an interest, I could team
> > with others and develop it faster :)
> > 
> > Anyway, I'm open to suggestions :
> > 
> > - implement it in C, in the core,
> > 
> > - implement it in C, as contributed custom functions,
> > 
> > - implement it in perl (by reusing Unicode:: work), in a trusted plperl,
> > 
> > - implement it in perl, calling Unicode:: modules, in an untrusted
> >   plperl.
> > 
> > and then :
> > 
> > - provide the data in tables (system and/or user) - which should be
> >   available across databases,
> > 
> > - load the data from the original text files provided by Unicode (and
> >   others as needed), if the functionality is compiled into the server.
> > 
> > - I believe the basic Unicode information should be standard, and the
> >   locales should be provided as contrib/ files to be plugged in as
> >   needed.
> > 
> > I can't really accept a solution which would rely on the underlying
> > libc, as it may not provide the necessary locales (or maybe, then,
> > have a way to override the collating tables with user tables - actually,
> > this would certainly be the best solution if it's in the core, as the
> > tables will put an extra burden on the distribution and the
> > installation footprint, especially if the tables are already there,
> > for glibc, for perl5.6+, for other software dealing with Unicode).
> > 
> > The main functions I foresee are :
> > 
> > - provide a normalisation function to all 4 forms,
> > 
> > - provide a collation_key(text, language) function, as the calculation
> >   of the key may be expensive, some may want to index on the result (I
> >   would :) ),
> > 
> > - provide a collation algorithm, using the two previous facilities,
> >   which can do primary to tertiary collation (cf TR#10 for a detailed
> >   explanation).
> > 
> > I haven't looked at the PostgreSQL code yet (shame!), so I may be
> > completely off-track, in which case I'll withdraw the idea and won't
> > bother you again (on that subject, that is ;) )...
> > 
> > Comments ?
> > 
> > 
> > Patrice.
> > 
> > [1] http://www.unicode.org/unicode/reports/tr10/
> > 
> > [2] http://www.unicode.org/unicode/reports/tr15/
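
To make the <61 cc 81> versus <c3 a1> example from the quoted message
concrete, here is a minimal Python sketch (Python is used only for
illustration; the helper names utf8_hex and visible_length are
hypothetical). It compares the two encodings, shows how a plain code
point count differs from a count that skips combining marks, and
applies NFC, one of the four normalisation forms of TR#15. Skipping
combining marks is only an approximation of what a reader would count
as characters, not full Unicode text segmentation.

```python
import unicodedata


def utf8_hex(s):
    """Return the UTF-8 bytes of s as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))


def visible_length(s):
    """Count code points whose canonical combining class is 0,
    i.e. skip combining marks (an approximation, see above)."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE

print(utf8_hex(decomposed))    # 61 cc 81
print(utf8_hex(precomposed))   # c3 a1

# First word of the table, stored with combining marks.
word = "a\u0301li\u0301ta"
print(len(word))               # 7 code points
print(visible_length(word))    # 5 once combining marks are skipped

# NFC composes each base letter + mark pair into a single code point.
nfc_word = unicodedata.normalize("NFC", word)
print(len(nfc_word))           # 5
```

The 7 printed for len(word) matches the first "length" column of the
table, which is consistent with length() counting code points.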

-- 
Patrice Hédé
email: patrice hede à islande org
www  : http://www.islande.org/

