speed up unicode decomposition and recomposition

From: John Naylor
Subject: speed up unicode decomposition and recomposition
Date:
Msg-id: CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com
Replies: Re: speed up unicode decomposition and recomposition  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
After I committed the optimization for unicode normalization quick check, Michael Paquier suggested I might do the same for decomposition as well. I wrote:

> I'll
> do some performance testing soon. Note that a 25kB increase in size could
> be present in frontend binaries as well in this case. While looking at
> decomposition, I noticed that recomposition does a linear search through
> all 6600+ entries, although it seems only about 800 are valid for that.
> That could be optimized as well now, since with hashing we have more
> flexibility in the ordering and can put the recomp-valid entries in front.
> I'm not yet sure if it's worth the additional complexity. I'll take a look
> and start a new thread.

The attached patch uses a perfect hash for codepoint decomposition, and for recomposition reduces the linear search from 6604 entries to 942.
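
To make the idea concrete, here is a toy sketch of the lookup strategy (made-up names and a made-up three-entry table, not the code in the patch; the real table and hash function are generated from UnicodeData.txt). A perfect hash turns the decomposition lookup into "hash, then verify the slot":

/* Toy illustration only: three hard-coded entries and a modulus that
 * happens to be collision-free for them stand in for the generated
 * perfect hash over all 6604 codepoints. */
#include <stdio.h>
#include <stdint.h>

typedef struct
{
    uint32_t    codepoint;      /* 0 means "empty slot" */
    uint8_t     comb_class;     /* canonical combining class */
    uint8_t     dec_len;        /* length of canonical decomposition */
    uint32_t    decomp[2];      /* decomposition codepoints (toy: max 2) */
} decomp_entry;

#define TABLE_SIZE 5

static int
decomp_hash(uint32_t code)
{
    /* code % 5 maps the three keys below to distinct slots */
    return (int) (code % TABLE_SIZE);
}

static const decomp_entry decomp_table[TABLE_SIZE] = {
    [2] = {0x00C5, 0, 2, {0x0041, 0x030A}},     /* Å -> A + combining ring above */
    [3] = {0x00E4, 0, 2, {0x0061, 0x0308}},     /* ä -> a + combining diaeresis */
    [4] = {0x0301, 230, 0, {0, 0}},             /* combining acute accent */
};

static const decomp_entry *
lookup(uint32_t code)
{
    int h = decomp_hash(code);

    /* codepoints outside the table can hash anywhere, so verify the slot */
    if (decomp_table[h].codepoint != code)
        return NULL;
    return &decomp_table[h];
}

int
main(void)
{
    const decomp_entry *e = lookup(0x00C5);

    if (e)
        printf("U+00C5 decomposes to U+%04X U+%04X\n", e->decomp[0], e->decomp[1]);
    if (lookup(0x0041) == NULL)
        printf("U+0041 has no decomposition entry\n");
    return 0;
}

The recomposition side isn't shown above; the point is simply that hashing removes the need to keep the array sorted by codepoint, so the 942 recomposition-valid entries can be placed at the front and the linear scan stops there.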

The performance is very nice, and if I'd known better I would have done this first, since the decomp array is as big as the two quick check arrays put together:

Normalize, decomp only

select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;

master   patch
887ms    231ms

select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;

master   patch
1110ms   208ms


Normalize, decomp+recomp (note: 100x less data)

select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;

master   patch
194ms    50.6ms

select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,1000) as i
) s;

master   patch
137ms    39.4ms


Quick check is another 2x faster on top of the previous gains, since it looks up the canonical combining class via the decomposition array:

-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s
where t is NFC normalized;

master   patch
296ms    131ms
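
As an aside on the quick-check point above, the reason it benefits is easy to see from the same toy sketch (again hypothetical code, reusing the decomp_entry table and lookup() from the earlier sketch): the combining class that quick check needs is stored in the same entry as the decomposition, so it rides on the same hashed lookup.

static uint8_t
get_canonical_class(uint32_t code)
{
    const decomp_entry *e = lookup(code);

    /* codepoints absent from the table have combining class 0 */
    return e ? e->comb_class : 0;
}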


Some other considerations:
- As I alluded to above, this adds ~26kB to libpq because of SASLPrep. Since the decomp array was reordered to optimize linear search, it can no longer be used for binary search. It's possible to build two arrays, one for frontend and one for backend, but that's additional complexity. We could also force frontend to do a linear search all the time, but that seems foolish. I haven't checked if it's possible to exclude the hash from backend's libpq.
- I could split out the two approaches into separate patches, but it'd be rather messy.

I'll add a CF entry for this.

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company 
Attachments
