Re: Improve the performance of Unicode Normalization Forms.

Поиск

Список

Период

Сортировка

От	Victor Yegorov
Тема	Re: Improve the performance of Unicode Normalization Forms.
Дата	10 сентября 21:50:12
Msg-id	CAGnEbohehx6sty5LFBkXqYKs7sB1qpy2YV1yu=n9X63ereosjQ@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Improve the performance of Unicode Normalization Forms. (Alexander Borisov <lex.borisov@gmail.com>)
Ответы	Re: Improve the performance of Unicode Normalization Forms.
Список	pgsql-hackers

Дерево обсуждения

ср, 3 сент. 2025 г. в 09:35, Alexander Borisov <lex.borisov@gmail.com>:

Hi, Jeff, hackers!

As promised, refactoring the C code for Unicode Normalization Forms.

In general terms, here's what has changed:
1. Recursion has been removed; now data is generated using
a Perl script.
2. Memory is no longer allocated for uint32 for the entire size,
but uint8 is allocated for the entire size for the CCC cache, which
boosts performance significantly.
3. The code for the unicode_normalize() function has been completely
rewritten.

I am confident that we have achieved excellent results.

Hey.

I've looked into these patches.

Patches apply, compilation succeedes, make check and make installcheck shows
no errors.

Code quality is good, although I suggest a native english speaker to review
comments and commit messages — a bit difficult to follow.

Description of the Sparse Array approach is done in the newly introduced
GenerateSparseArray.pm module. Perhaps it'd be valuable to add a section into
the src/common/unicode/README, it'll get more visibility.
( Not insisting here. )

For performance testing I've used an approach by Jeff Davis. [1]
I've prepared NFC and NFD files, loaded them into UNLOGGED tables and measured
normalize() calls.

CREATE UNLOGGED TABLE strings_nfd (
str text STORAGE PLAIN NOT NULL
);
COPY strings_nfd FROM '/var/lib/postgresql/strings.nfd.txt';

CREATE UNLOGGED TABLE strings_nfc (
str text STORAGE PLAIN NOT NULL
);
COPY strings_nfc FROM '/var/lib/postgresql/strings.nfc.txt';

SELECT count( normalize( str, NFD ) ) FROM strings_nfd, generate_series( 1, 10 ) x;
SELECT count( normalize( str, NFC ) ) FROM strings_nfc, generate_series( 1, 10 ) x;

And I've got the following numbers:

Master
NFD Time: 2954.630 ms / 295ms
NFC Time: 3929.939 ms / 330ms

Patched
NFD Time: 1658.345 ms / 166ms / +78%
NFC Time: 1862.757 ms / 186ms / +77%

Overall, I find these patches and performance very nice and valuable.
I've added myself as a reviewer and marked this patch as Ready for Committer.

[1] https://postgr.es/m/adffa1fbdb867d5a11c9a8211cde3bdb1e208823.camel@j-davis.com

Victor Yegorov

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Improve the performance of Unicode Normalization Forms.