> I'll do some performance testing soon. Note that a 25kB increase in size
> could be present in frontend binaries as well in this case. While looking
> at decomposition, I noticed that recomposition does a linear search through
> all 6600+ entries, although it seems only about 800 are valid for that.
> That could be optimized as well now, since with hashing we have more
> flexibility in the ordering and can put the recomp-valid entries in front.
> I'm not yet sure if it's worth the additional complexity. I'll take a look
> and start a new thread.
The attached patch uses a perfect hash for codepoint decomposition, and for recomposing reduces the linear search from 6604 entries to 942.
The performance is very nice, and if I'd known better I would have done this first, since the decomp array is as big as the two quick check arrays put together:
Normalize, decomp only
select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
master patch
887ms 231ms
select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;
master patch
1110ms 208ms
Normalize, decomp+recomp (note: 100x less data)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;
master patch
194ms 50.6ms
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,1000) as i
) s;
master patch
137ms 39.4ms
Quick check is another 2x faster on top of previous gains, since it tests canonical class via the decomposition array:
-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s
where t is NFC normalized;
master patch
296ms 131ms
Some other considerations:
- As I alluded to above, this adds ~26kB to libpq because of SASLPrep. Since the decomp array was reordered to optimize the linear search, it can no longer be used for binary search. It's possible to build two arrays, one for frontend and one for backend, but that's additional complexity. We could also force the frontend to do a linear search all the time, but that seems foolish. I haven't checked whether it's possible to exclude the hash from the backend's libpq.
- I could split out the two approaches into separate patches, but it'd be rather messy.
I'll add a CF entry for this.
--