Re: Pre-proposal: unicode normalized text

From Nico Williams
Subject Re: Pre-proposal: unicode normalized text
Msg-id ZRsnnGfQ701AA091@ubby21
In response to Pre-proposal: unicode normalized text  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Pre-proposal: unicode normalized text  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> One of the frustrations with using the "C" locale (or any deterministic
> locale) is that the following returns false:
> 
>   SELECT 'á' = 'á'; -- false
> 
> because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
> respectively, so memcmp() returns non-zero. But it's really the same
> character with just a different representation, and if you normalize
> them they are equal:
> 
>   SELECT normalize('á') = normalize('á'); -- true

I think the issue here is equivalence rather than normalization as
such.  Normalizing both strings to the same form (NFC, NFD, NFKC or
NFKD) does make the predicate above true, since the two sequences are
canonically equivalent; it is the plain byte comparison of the stored
values that fails.

PostgreSQL already has Unicode normalization support, though it would be
nice to also have form-insensitive indexing and equality predicates.

There are two ways to write 'á' in Unicode: one is pre-composed (one
codepoint) and the other is decomposed (two codepoints in this specific
case), and it would be nice to be able to preserve input form when
storing strings but then still be able to index and match them
form-insensitively (in the case of 'á' both equivalent representations
should be considered equal, and for UNIQUE indexes they should be
considered the same).
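
To make the idea concrete, here is a minimal sketch in Python (not
PostgreSQL code) of form-preserving storage with form-insensitive
matching.  The names form_key, form_insensitive_eq and
FormInsensitiveUniqueIndex are hypothetical, and using NFC as the
comparison key is an assumption; any single normalization form would
serve for canonical equivalence.

```python
import unicodedata


def form_key(s: str) -> str:
    """Hypothetical comparison key: the NFC form of s."""
    return unicodedata.normalize("NFC", s)


def form_insensitive_eq(a: str, b: str) -> bool:
    """Equality that ignores composed vs. decomposed representation."""
    return form_key(a) == form_key(b)


class FormInsensitiveUniqueIndex:
    """Toy UNIQUE index: keyed on the normalized form, but it stores and
    returns the string exactly as it was inserted."""

    def __init__(self) -> None:
        self._rows = {}

    def insert(self, s: str) -> None:
        key = form_key(s)
        if key in self._rows:
            raise ValueError("duplicate key value violates unique constraint")
        self._rows[key] = s  # input form is preserved

    def lookup(self, s: str):
        return self._rows.get(form_key(s))


precomposed = "\u00e1"   # one codepoint: U+00E1
decomposed = "a\u0301"   # two codepoints: U+0061 U+0301
```

With this, inserting the decomposed string and then looking up the
pre-composed one finds the row and returns it in its original two
codepoint form, while inserting the pre-composed string afterwards is
rejected as a duplicate.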

You could also have functions that perform lossy normalization, in the
way soundex does: first normalize to NFD, then drop all combining
codepoints, which would allow 'á' to compare equal to 'a'.  But this
would not be a Unicode normalization function.
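
As a rough illustration of that kind of lossy folding, the following
Python sketch decomposes to NFD and then drops every codepoint whose
general category is a mark (Mn, Mc, Me).  The function name strip_marks
is hypothetical, and treating "combining codepoints" as "general
category M*" is an assumption about where to draw the line.

```python
import unicodedata


def strip_marks(s: str) -> str:
    """Lossy fold: decompose to NFD, then drop all mark codepoints.

    This is NOT a Unicode normalization form; distinct strings such as
    'á' and 'a' collapse to the same result.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
```

Both representations of 'á' fold to plain 'a' under this function, so
comparing strip_marks() results gives the accent-insensitive match
described above.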

Nico