Re: unicode match normal forms
| From | Daniel Verite |
|---|---|
| Subject | Re: unicode match normal forms |
| Date | |
| Msg-id | 48e7eaab-9403-4d65-8581-cd1e55231d28@manitou-mail.org |
| In response to | unicode match normal forms (hamann.w@t-online.de) |
| List | pgsql-general |
Hamann W wrote:
> in unicode letter ä exists in two versions - linux and windows use a
> composite whereas macos prefers
> the decomposed form. Is there any way to make a semi-exact match that
> accepts both variants?
Besides normalizing both strings into the same normal form
before comparing, you can use a non-deterministic ICU collation, which
treats them as equal (they're "canonically equivalent" in Unicode terms).
For instance,
CREATE COLLATION nd (
  provider = 'icu',
  locale = '',
  deterministic = false
);

SELECT
  nfc_form,
  nfd_form,
  nfc_form = nfd_form COLLATE nd AS equal1,
  nfc_form = nfd_form COLLATE "C" AS equal2 -- or any deterministic collation
FROM
  (VALUES
    (E'j\u00E4hrlich',
     E'j\u0061\u0308hrlich'))
  AS s(nfc_form, nfd_form);
 nfc_form | nfd_form | equal1 | equal2
----------+----------+--------+--------
 jährlich | jährlich | t      | f
(1 row)
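To get this behaviour in ordinary lookups without writing COLLATE in every query, the collation can be attached to a column. The following is only a sketch: it assumes the nd collation created above, and the table name words and column name w are made up for illustration.

```sql
-- Hypothetical table; the column uses the non-deterministic
-- collation "nd" defined earlier in this message.
CREATE TABLE words (
  w text COLLATE nd
);

-- Store the precomposed (NFC) spelling.
INSERT INTO words VALUES (E'j\u00E4hrlich');

-- Look it up with the decomposed (NFD) spelling; the comparison
-- uses the column's collation, so both forms should match.
SELECT count(*) AS matches
FROM words
WHERE w = E'j\u0061\u0308hrlich';
```

Keep in mind that the documentation lists some operations, such as pattern matching, as restricted with non-deterministic collations, so this fits equality comparisons best.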
Normalization has been available as a built-in function, normalize(),
since Postgres 13, and non-deterministic collations appeared in Postgres 12.
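The normalization route could look roughly like the query below. It is a sketch that assumes Postgres 13 or later and a UTF8 server encoding, and it reuses the two sample strings from the query above; the output column names are only illustrative.

```sql
SELECT
  -- plain comparison under the default collation
  nfc_form = nfd_form AS equal_raw,
  -- bring both sides to NFC before comparing
  normalize(nfc_form, NFC) = normalize(nfd_form, NFC) AS equal_normalized,
  -- check whether the decomposed string is already in NFC
  nfd_form IS NFC NORMALIZED AS nfd_is_nfc
FROM
  (VALUES
    (E'j\u00E4hrlich',
     E'j\u0061\u0308hrlich'))
  AS s(nfc_form, nfd_form);
```

With a deterministic default collation, equal_raw should come out false and equal_normalized true. Normalizing the data once when it is stored keeps later comparisons simple.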
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite