> I'll do some performance testing soon. Note that a 25kB increase in size
> could be present in frontend binaries as well in this case. While looking
> at decomposition, I noticed that recomposition does a linear search through
> all 6600+ entries, although it seems only about 800 are valid for that.
> That could be optimized as well now, since with hashing we have more
> flexibility in the ordering and can put the recomp-valid entries in front.
> I'm not yet sure if it's worth the additional complexity. I'll take a look
> and start a new thread.
The attached patch uses a perfect hash for codepoint decomposition, and for recomposing reduces the linear search from 6604 entries to 942.
The performance is very nice, and if I'd known better I would have done this first, since the decomp array is as big as the two quick check arrays put together:
Normalize, decomp only
select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
master patch
887ms 231ms
select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;
master patch
1110ms 208ms
Normalize, decomp+recomp (note: 100x less data)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;
master patch
194ms 50.6ms
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,1000) as i
) s;
master patch
137ms 39.4ms
Quick check is another 2x faster on top of previous gains, since it tests canonical class via the decomposition array:
-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s
where t is NFC normalized;
master patch
296ms 131ms
Some other considerations:
- As I alluded to above, this adds ~26kB to libpq because of SASLPrep. Since the decomp array was reordered to optimize the linear search, it can no longer be used for binary search. It's possible to build two arrays, one for frontend and one for backend, but that's additional complexity. We could also force the frontend to do a linear search all the time, but that seems foolish. I haven't checked whether it's possible to exclude the hash from the backend's libpq.
- I could split out the two approaches into separate patches, but it'd be rather messy.
I'll add a CF entry for this.
--