Re: speed up unicode decomposition and recomposition

From	Michael Paquier
Subject	Re: speed up unicode decomposition and recomposition
Date	16 October 2020, 06:32:08
Msg-id	20201016033208.GC1581@paquier.xyz
In response to	Re: speed up unicode decomposition and recomposition (John Naylor <john.naylor@enterprisedb.com>)
Responses	Re: speed up unicode decomposition and recomposition (John Naylor <john.naylor@enterprisedb.com>) Re: speed up unicode decomposition and recomposition (John Naylor <john.naylor@enterprisedb.com>)
List	pgsql-hackers

On Thu, Oct 15, 2020 at 01:59:38PM -0400, John Naylor wrote:
> I think I've seen a trie recommended somewhere, maybe the official website.
> That said, I was able to get the hash working for recomposition (split into
> a separate patch, and both of them now leave frontend alone), and I'm
> pleased to say it's 50-75x faster than linear search in simple tests. I'd
> be curious how it compares to ICU now. Perhaps Daniel Verite would be
> interested in testing again? (CC'd)

Yeah, that would be interesting to compare.  Now the gains proposed by
this patch are already a good step forward, so I don't think that it
should be a blocker for a solution we have at hand, as the numbers
speak for themselves here.  If something better gets proposed, we
could always change the decomposition and recomposition logic as
needed.

> select count(normalize(t, NFC)) from (
> select md5(i::text) as t from
> generate_series(1,100000) as i
> ) s;
>
> master     patch
> 18800ms    257ms

My environment was showing HEAD as being a bit faster with 15s, while
the patch gets "only" down to 290~300ms (compiled with -O2, as I guess
you did).  Nice.

+   # Then the second
+   return -1 if $a2 < $b2;
+   return 1 if $a2 > $b2;
Should say "second code point" here?
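
For reference, the two-level ordering in the quoted Perl lines (first code point, then second code point) can be sketched as a C qsort comparator.  The struct and function names below are hypothetical and do not come from the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pair of code points, e.g. a base character and a combiner. */
typedef struct
{
    uint32_t first;
    uint32_t second;
} sketch_pair;

/* qsort comparator: order by the first code point, then by the second. */
static int
sketch_pair_cmp(const void *pa, const void *pb)
{
    const sketch_pair *a = (const sketch_pair *) pa;
    const sketch_pair *b = (const sketch_pair *) pb;

    /* First compare the first code point */
    if (a->first < b->first)
        return -1;
    if (a->first > b->first)
        return 1;

    /* Then the second code point */
    if (a->second < b->second)
        return -1;
    if (a->second > b->second)
        return 1;

    return 0;
}
```

Passing this to qsort() over an array of sketch_pair gives the same ordering the Perl sort block is meant to produce.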

+       hashkey = pg_hton64(((uint64) start << 32) | (uint64) code);
+       h = recompinfo.hash(&hashkey);
This choice should be documented, and most likely we should have
comments on the perl and C sides to keep track of the relationship
between the two.
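
As a minimal sketch of what such a comment could describe: the key packs the first code point into the high 32 bits and the second into the low 32 bits, then fixes the byte order so that the bytes fed to the hash function do not depend on host endianness.  The assumption here, not confirmed in this message, is that the Perl generator builds the hash function over the same big-endian byte layout.  sketch_hton64() is a hypothetical portable stand-in for pg_hton64(), and sketch_recomp_key() is an illustrative name.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for pg_hton64(): return a value whose in-memory
 * representation is big-endian, whatever the host byte order is.
 */
static uint64_t
sketch_hton64(uint64_t v)
{
    unsigned char bytes[8];
    uint64_t    out;

    /* bytes[0] receives the most significant byte */
    for (int i = 0; i < 8; i++)
        bytes[i] = (unsigned char) (v >> (56 - 8 * i));

    memcpy(&out, bytes, sizeof(out));
    return out;
}

/*
 * Build the 8-byte lookup key for a pair of code points: "start" in the
 * high 32 bits, "code" in the low 32 bits, stored in network byte order.
 */
static uint64_t
sketch_recomp_key(uint32_t start, uint32_t code)
{
    return sketch_hton64(((uint64_t) start << 32) | (uint64_t) code);
}
```

With this layout, the key for the pair (U+0041, U+0300) is the byte sequence 00 00 00 41 00 00 03 00 on any platform, which is the property both the Perl and C sides would need to agree on.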

The binary sizes of libpgcommon_shlib.a and libpgcommon.a change
because Decomp_hash_func() gets included, impacting libpq.
Structurally, wouldn't it be better to move this part into its own,
backend-only, header?  It would be possible to mark the difference
with some ifdef FRONTEND of course, or to just keep things as they are
because this can be useful for some out-of-core frontend tool.  But if
we keep that as a separate header then any C part can decide to
include it or not, so frontend tools could also make this choice.
Note that we don't include unicode_normprops_table.h for frontends in
unicode_norm.c, but we do include unicode_norm_table.h.
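
A minimal sketch of the ifdef FRONTEND variant, assuming the hash function lives in a separate backend-only header.  The macro and function names prefixed with SKETCH_/sketch_ are hypothetical, and the header name is deliberately left out.

```c
#include <stdbool.h>

/*
 * FRONTEND is the macro defined when building PostgreSQL frontend code.
 * In a real tree, the #ifndef branch is where a backend-only header
 * holding the generated hash function would be included.
 */
#ifndef FRONTEND
#define SKETCH_USE_HASH_LOOKUP 1    /* backend: hash function compiled in */
#else
#define SKETCH_USE_HASH_LOOKUP 0    /* frontend: keep the existing lookup */
#endif

/* Report which lookup strategy this compilation unit was built with. */
static bool
sketch_uses_hash_lookup(void)
{
    return SKETCH_USE_HASH_LOOKUP != 0;
}
```

Compiling the same file with -DFRONTEND would leave the hash function out of the resulting object, which is how the size of the frontend libraries could be kept unchanged.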
--
Michael
