Re: Unicode normalization SQL functions

From	Daniel Verite
Subject	Re: Unicode normalization SQL functions
Date	28 January 2020, 12:48:45
Msg-id	623fa07e-348f-4273-afa4-7110ad43ca57@manitou-mail.org
In response to	Re: Unicode normalization SQL functions (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Responses	Re: Unicode normalization SQL functions
List	pgsql-hackers

    Peter Eisentraut wrote:

> Here is an updated patch set that now also implements the "quick check"
> algorithm from UTR #15 for making IS NORMALIZED very fast in many cases,
> which I had mentioned earlier in the thread.

I found a bug in unicode_is_normalized_quickcheck() which is
triggered when the last codepoint of the string is at or above
U+10000, i.e. outside the BMP. On encountering such a codepoint, it does:
+        if (is_supplementary_codepoint(ch))
+            p++;
When ch is the last codepoint, this makes p point to
the terminating zero, but the subsequent p++ done by
the for loop steps past it, so the loop misses its exit
condition and reads beyond the end of the string.
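
To make the pointer movement concrete, here is a minimal sketch of just the loop control, not the patch itself. The function name walk_codepoints, the limit parameter and the padded test buffer are all hypothetical; the buffer is padded so that the overshoot can be observed without actually reading out of bounds, and indexes are used instead of pointers so the sketch itself stays well defined.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk a zero-terminated array of 32-bit code points and return the index
 * at which the loop stops. "limit" is a safety bound for this sketch only;
 * the real loop has no such bound, which is why it over-reads.
 */
static size_t
walk_codepoints(const uint32_t *s, size_t limit, bool skip_after_supplementary)
{
    size_t i;

    for (i = 0; i < limit && s[i] != 0; i++)
    {
        /* Mimics the two lines in question: an extra step after a code
         * point outside the BMP. */
        if (skip_after_supplementary && s[i] > 0xFFFF)
            i++;
    }
    return i;
}
```

With the buffer {0x1F600, 0, 0x41, 0}, the logical string ends at index 1. walk_codepoints(buf, 4, false) stops at index 1, while walk_codepoints(buf, 4, true) jumps over the terminator and stops only at index 3, after having read the padding.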

But anyway, what's the reason for skipping the codepoint
following a codepoint outside of the BMP?
I've figured that it comes from porting the Java code in UAX#15:

public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codePointAt(i);
        // i indexes UTF-16 code units: skip the low surrogate of a pair
        if (Character.isSupplementaryCodePoint(ch)) ++i;
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}

source.length() is the length in UTF-16 code units, in which a surrogate
pair counts as 2. This would be why it does
   if (Character.isSupplementaryCodePoint(ch)) ++i;
which is meant to skip the second UTF-16 code unit of the pair.
As this does not apply to the 32-bit pg_wchar, where every codepoint
occupies a single array element, I think the two lines quoted above
from the C implementation should just be removed.
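
For illustration, this is roughly what the loop looks like over 32-bit code points once the skip is gone. It is only a sketch under assumptions: quickcheck_sketch, qc_result and the two toy_ lookup functions are hypothetical names, and the lookups cover just three combining marks (U+0301, U+0323, U+0340) instead of the real Unicode data tables that the patch generates.

```c
#include <stdint.h>

typedef enum { QC_NO, QC_YES, QC_MAYBE } qc_result;

/* Toy canonical combining class lookup (assumption: only three entries). */
static uint8_t
toy_canonical_class(uint32_t ch)
{
    switch (ch)
    {
        case 0x0301: return 230;    /* COMBINING ACUTE ACCENT */
        case 0x0323: return 220;    /* COMBINING DOT BELOW */
        case 0x0340: return 230;    /* COMBINING GRAVE TONE MARK */
        default:     return 0;
    }
}

/* Toy NFC quick-check property lookup (assumption: only three entries). */
static qc_result
toy_nfc_allowed(uint32_t ch)
{
    switch (ch)
    {
        case 0x0301:
        case 0x0323: return QC_MAYBE;
        case 0x0340: return QC_NO;
        default:     return QC_YES;
    }
}

/* One array element per code point, so the loop advances by exactly one. */
static qc_result
quickcheck_sketch(const uint32_t *input)
{
    uint8_t     last_class = 0;
    qc_result   result = QC_YES;

    for (const uint32_t *p = input; *p; p++)
    {
        uint32_t    ch = *p;
        uint8_t     cls = toy_canonical_class(ch);
        qc_result   check;

        /* Combining marks out of canonical order: not normalized. */
        if (last_class > cls && cls != 0)
            return QC_NO;

        check = toy_nfc_allowed(ch);
        if (check == QC_NO)
            return QC_NO;
        if (check == QC_MAYBE)
            result = QC_MAYBE;
        last_class = cls;
    }
    return result;
}
```

A string ending in a supplementary code point, such as {0x61, 0x1F600, 0}, now terminates normally and yields QC_YES, while {0x61, 0x0301, 0x0323, 0} yields QC_NO because class 230 is followed by class 220.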


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
