Re: Unicode normalization SQL functions

From: Peter Eisentraut
Subject: Re: Unicode normalization SQL functions
Msg-id: 2309023a-6f69-f049-70e5-3c70b4fb9672@2ndquadrant.com
In response to: Re: Unicode normalization SQL functions  ("Daniel Verite" <daniel@manitou-mail.org>)
Responses: Re: Unicode normalization SQL functions
List: pgsql-hackers

On 2020-01-06 17:00, Daniel Verite wrote:
>     Peter Eisentraut wrote:
> 
>> Also, there is a way to optimize the "is normalized" test for common
>> cases, described in UTR #15.  For that we'll need an additional data
>> file from Unicode.  In order to simplify that, I would like my patch
>> "Add support for automatically updating Unicode derived files"
>> integrated first.
> 
> Would that explain that the NFC/NFKC normalization and "is normalized"
> check seem abnormally slow with the current patch, or should
> it be regarded independently of the other patch?

That's unrelated.
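
For reference, the kind of shortcut the quoted UTR #15 remark points at can be illustrated with a much coarser fast path. The sketch below is not the quick-check algorithm from UTR #15 (that needs the extra Unicode data file mentioned above); it only assumes that code points below U+0300 are never changed by NFC, and falls back to a full normalization otherwise. The function name is hypothetical and Python is used purely for illustration, not as a proposal for the patch.

```python
import unicodedata

# Assumption: every code point below U+0300 (the first combining mark)
# is left unchanged by NFC, so such strings are already in NFC.
NFC_FAST_PATH_LIMIT = 0x0300


def is_nfc_with_fast_path(s: str) -> bool:
    """Return True if s is NFC-normalized, skipping full work when possible."""
    # Fast path: nothing in the string can compose or reorder.
    if all(ord(ch) < NFC_FAST_PATH_LIMIT for ch in s):
        return True
    # Slow path: normalize and compare.
    return unicodedata.normalize("NFC", s) == s
```

With md5() output, as in the test case below, every string is ASCII and takes the fast path.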

> For instance, testing 10000 short ASCII strings:
> 
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfc normalized ;
>   count
> -------
>   10000
> (1 row)
> 
> Time: 2573,859 ms (00:02,574)
> 
> By comparison, the NFD/NFKD case is faster by two orders of magnitude:
> 
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfd normalized ;
>   count
> -------
>   10000
> (1 row)
> 
> Time: 29,962 ms
> 
> Although NFC/NFKC has a recomposition step that NFD/NFKD
> doesn't have, such a difference is surprising.

It's very likely that this is because the recomposition calls 
recompose_code() which does a sequential scan of UnicodeDecompMain for 
each character.  To optimize that, we should probably build a bespoke 
reverse mapping table that can be accessed more efficiently.
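
To illustrate the idea only: the sketch below contrasts a per-lookup sequential scan with a reverse map built once and keyed on the (starter, combining) pair. The five-entry table is a toy stand-in for UnicodeDecompMain, and the function names are hypothetical; the real code is C and would need a suitable static data structure rather than a Python dict.

```python
# Toy stand-in for the decomposition table: (composed, (first, second)).
DECOMP_TABLE = [
    (0x00C0, (0x0041, 0x0300)),  # A + combining grave     -> U+00C0
    (0x00C9, (0x0045, 0x0301)),  # E + combining acute     -> U+00C9
    (0x00E9, (0x0065, 0x0301)),  # e + combining acute     -> U+00E9
    (0x00F1, (0x006E, 0x0303)),  # n + combining tilde     -> U+00F1
    (0x00FC, (0x0075, 0x0308)),  # u + combining diaeresis -> U+00FC
]


def recompose_linear(start, code):
    """Scan the whole table on every call: O(table size) per character."""
    for composed, pair in DECOMP_TABLE:
        if pair == (start, code):
            return composed
    return None


# Reverse mapping, built once: (first, second) -> composed.
RECOMP_MAP = {pair: composed for composed, pair in DECOMP_TABLE}


def recompose_mapped(start, code):
    """Single hash lookup per character pair."""
    return RECOMP_MAP.get((start, code))
```

Both functions return the same answers; only the cost per lookup differs, which matters because the lookup runs for each character of the input.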

> I've tried an alternative implementation based on ICU's
> unorm2_isNormalized() /unorm2_normalize() functions (which I'm
> currently adding to the icu_ext extension to be exposed in SQL).
> With these, the 4 normal forms are in the 20ms ballpark with the above
> test case, without a clear difference between composed and decomposed
> forms.

That's good feedback.

> Independently of the performance, I've compared the results
> of the ICU implementation vs this patch on large series of strings
> with all normal forms and could not find any difference.

And that too.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


