Re: Unicode normalization SQL functions

Поиск

Список

Период

Сортировка

От	John Naylor
Тема	Re: Unicode normalization SQL functions
Дата	3 апреля 2020 г. 04:01:47
Msg-id	CACPNZCsptwBXYGqxT8RNsYVAc0dncJkCvtchUDg=6Jbrbuo8Vw@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Unicode normalization SQL functions (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Список	pgsql-hackers

Дерево обсуждения

On Thu, Apr 2, 2020 at 3:51 PM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>
> On 2020-03-26 18:41, John Naylor wrote:
> > We don't have a trie implementation in Postgres, but we do have a
> > perfect hash implementation. Doing that would bring the tables back to
> > 64 bits per entry, but would likely be noticeably faster than binary
> > search. Since v4 has left out the biggest tables entirely, I think
> > this might be worth a look for the smaller tables remaining.
> >
> > In the attached v5, when building the hash tables, we sort the code
> > points by NO/MAYBE, and store the index of the beginning of the NO
> > block:
>
> This is a valuable idea, but I fear it's a bit late now in this cycle.
> I have questions about some details.  For example, you mention that you
> had to fiddle with the hash seed.  How does that affect other users of
> PerfectHash?

They would still try the same multipliers they use now, so no effect on them.

> What happens when we update Unicode data and the hash
> doesn't work anymore?

The script would choose different multipliers and/or seeds
automatically. Only if you're unlucky would you have to fiddle with
the hash parameters again. The last-resort multipliers in the v2 patch
in the other thread [1] seem very effective and easily build both the
quick check D tables, which I tried for amusement's sake.

That said, we could reduce the chances of that happening this way:
After trying all the shift-and-add multipliers, we could add another
loop to try a bunch of numbers in a range. We'd need a quick check to
weed out multiples of small primes so the number has a decent chance
of being prime. To save time, just try a few seeds and move on to the
next number. Maybe emit a warning that it exhausted the shift-and-add
multipliers in case the developer wanted to intervene.

If I resubmit this, I would split the build up into two steps: have
the current manual script build the quick check array for later commit
into the tree, and build the hash function separately from that as a
Makefile distprep target. There's no reason to have the hash functions
checked in as I did in v5, like we don't check in the keyword hash
functions.

I would also consider splitting it into two patches:

1. Keep binary search but with a more abstract array representation
(think PG_RMGR). This patch would be large but mostly mechanical.
2. Switch to hash lookup. A smaller one for ease of review.

[1] https://www.postgresql.org/message-id/CACPNZCvMMj88Bsnk1k%3DRffW6gBw%2BFH7wcwCBfcKLDM%3DUEG2UWg%40mail.gmail.com

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Unicode normalization SQL functions