Re: speed up unicode decomposition and recomposition

From Daniel Verite
Subject Re: speed up unicode decomposition and recomposition
Msg-id 041d6bff-9b37-430c-80f6-67e1cb68dab2@manitou-mail.org
In reply to Re: speed up unicode decomposition and recomposition  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: speed up unicode decomposition and recomposition  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
John Naylor wrote:

> I'd be curious how it compares to ICU now

I've made another run of the test in [1] with your v2 patches
from this thread against icu_ext built with ICU-67.1.
The results show the times in milliseconds to process
about 10 million short strings:

 operation  | unpatched | patched | icu_ext
------------+-----------+---------+---------
 nfc check  |      7968 |    5989 |    4076
 nfc conv   |    700894 |   15163 |    6808
 nfd check  |     16399 |    9852 |    3849
 nfd conv   |     17391 |   10916 |    6758
 nfkc check |      8259 |    6092 |    4301
 nfkc conv  |    700241 |   15354 |    7034
 nfkd check |     16585 |   10049 |    4038
 nfkd conv  |     17587 |   11109 |    7086

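The ICU implementation still wins by a clear margin (roughly
1.4x to 2.6x faster than the patched code, depending on the
operation), but the improvements brought by the patch are
impressive, especially for the conversion to NFC/NFKC, which
drops from about 700 seconds to about 15 seconds (roughly 46x).
This test ran on a slower machine than the one I used for [1],
which is why all queries take longer.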
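
The ratios behind that comparison can be recomputed from the table. The following is only a sketch in Python: the timings are copied from the table in this message, while the names TIMINGS_MS and ratios are illustrative and not part of the patch or of icu_ext.

```python
# Timings in milliseconds, copied from the table in this message:
# operation -> (unpatched, patched, icu_ext)
TIMINGS_MS = {
    "nfc check":  (7968,   5989,  4076),
    "nfc conv":   (700894, 15163, 6808),
    "nfd check":  (16399,  9852,  3849),
    "nfd conv":   (17391,  10916, 6758),
    "nfkc check": (8259,   6092,  4301),
    "nfkc conv":  (700241, 15354, 7034),
    "nfkd check": (16585,  10049, 4038),
    "nfkd conv":  (17587,  11109, 7086),
}


def ratios(timings):
    """Map each operation to (unpatched/patched, patched/icu_ext)."""
    return {
        op: (unpatched / patched, patched / icu)
        for op, (unpatched, patched, icu) in timings.items()
    }


# One line per operation: how much the patch gains over unpatched,
# and how far ahead icu_ext remains over the patched code.
for op, (gain, gap) in ratios(TIMINGS_MS).items():
    print(f"{op:<10}  patch gain {gain:5.1f}x  ICU lead {gap:4.1f}x")
```

The first ratio is the gain of the patch over the unpatched code, the second is how far icu_ext remains ahead of the patched code.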

For the two queries upthread, I get this:

1)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 371.043 ms

VS ICU:

select count(icu_normalize(t, 'NFC')) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 125.809 ms


2)
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 428.214 ms


VS ICU:

select count(icu_normalize(t, 'NFC')) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 147.713 ms
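
For context on what the two queries feed into normalize(): query 1 uses md5() hex strings, which are pure ASCII and so are already normalized in every form, while query 2 uses a string with precomposed and compatibility characters that actually change under normalization. A minimal sketch, assuming Python's unicodedata module as a stand-in (it is neither the PostgreSQL nor the ICU implementation); the helper names md5_hex and form_lengths are illustrative only.

```python
import hashlib
import unicodedata

# Test string from query 2, written with Python escapes instead of U&'...'.
SAMPLE = "\u00E4\u00C5\u0958\u00F4\u1EBF\u3300\u1FE2\u3316\u2465\u322D"

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def md5_hex(i: int) -> str:
    """Python counterpart of md5(i::text): lowercase hex digest of the decimal text."""
    return hashlib.md5(str(i).encode("ascii")).hexdigest()


def form_lengths(s: str) -> dict:
    """Length in code points of s after each normalization form."""
    return {form: len(unicodedata.normalize(form, s)) for form in FORMS}


# Query 1 input: ASCII only, so every normalization form returns it unchanged.
t = md5_hex(1)
print(t, all(unicodedata.normalize(f, t) == t for f in FORMS))

# Query 2 input: decomposition expands the string, and the compatibility
# forms expand it further than the canonical ones.
print(len(SAMPLE), form_lengths(SAMPLE))
```

This only shows why the second query exercises the decomposition and recomposition paths while the first mostly measures per-string overhead; it says nothing about the relative speed of the implementations.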


[1]
https://www.postgresql.org/message-id/2c5e8df9-43b8-41fa-88e6-286e8634f00a%40manitou-mail.org


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite


