Re: Better locale-specific-character-class handling for regexps

Поиск

Список

Период

Сортировка

От	Tom Lane
Тема	Re: Better locale-specific-character-class handling for regexps
Дата	4 сентября 2016 г. 21:17:57
Msg-id	13600.1473023873@sss.pgh.pa.us обсуждение
Ответ на	Better locale-specific-character-class handling for regexps (Tom Lane <tgl@sss.pgh.pa.us>)
Список	pgsql-hackers

Дерево обсуждения

I wrote:
> I got tired of hearing complaints about the issue described in
> this thread:
> https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us
> Here's a proposed fix.  I've not done extensive performance testing,
> but it seems to be as fast or faster than the old code in cases where
> there are not too many "large" characters in the input.  And, more
> to the point, it gets the right answer for such large characters.

I've now done some performance testing on this patch.  I ran the
attached scripts in cassert-off builds, using en_US.utf8 locale on
a RHEL6 box.  Each number is the best of 3 runs; all times in ms.

The first two test cases are meant to exercise regex compile speed; they
provide relatively short input strings, and each call provides a slightly
different pattern so as to defeat the compiled-pattern cache in
adt/regexp.c.

            HEAD        patched

compilespeed.sql    61959.793    32936.773
compilespeed09.sql    6411.957    5903.211

compilespeed.sql exercises compiling '\w' which of course is locale
dependent, while compilespeed09.sql compiles '[0-9]' which is not.
Since the patch sets MAX_SIMPLE_CHR to 0x7FF which is the same as
regc_pg_locale.c's old cutoff for how many characters to examine,
we must be doing about the same number of compile-time iswalnum()
probes in both old and new code.  The compile speedup must come entirely
from the fact that the new code uses a simple array for the low-char-codes
colormap, whereas the old code has a complicated multilevel colormap that
is more expensive to update.  Then the bigger win for compilespeed.sql
must be due to the fact that \w will require changing the colors of more
characters in the colormap.  I was not expecting to see such a big win,
but I'll take it ;-)

The remaining cases hold the pattern fixed (so it's not recompiled
each time) and are meant to measure the per-character scan speed
while executing a compiled regexp.

            HEAD        patched        ratio of per-char times

runspeedlo.sql        12951.588    12053.533    0.91
runspeedhi.sql        10410.140 *    36678.876    2.72 (estimated)
runspeednotlo.sql    12999.629    12130.439    0.91
runspeednothi.sql    15678.125    20053.861    1.36
dummy.sql        3444.696    3437.984    (to measure overhead)

* gives wrong answer

In the runspeedhi.sql case, HEAD thinks the first character doesn't match
the pattern, so it's not scanning the whole input, which makes that
measurement not useful.  We can assume though that the scan rate would
have been about the same as runspeednothi.sql, and the estimated per-char
ratio shown in the table is computed on that basis.

So the patched code's scan speed is about 9% better than HEAD for data
characters below MAX_SIMPLE_CHR, presumably because of the simpler color
map structure.  But it's noticeably worse for data chars above that, and
much worse if the pattern forces runtime iswalnum() tests.  That's what
I was expecting.

It's worth noticing that comparing runspeednotlo.sql and runspeednothi.sql
shows that HEAD is 28% worse per-character on the latter.  Those cases are
exactly the same so far as HEAD's regex engine is concerned, so I think
that that differential must be blamable entirely on the fact that we're
converting 2-byte UTF8 sequences into pg_wchars in the first case and
3-byte UTF8 sequences in the second case.  The patched code is also paying
that cost of course, so you should take it into account when comparing
numbers in different rows of this table.  And it's also an indication
that really, this code is pretty frickin fast --- the total time per
character is only circa 4x the extra logic to pick up and shift in
one more byte!

Overall I'm pretty happy with these results.  I think that most cases
will be as fast or faster than before, assuming that most data characters
are below MAX_SIMPLE_CHR for most use-cases.  In the worst cases, the
per-character cost can be several times what it was before, but those
are exactly the same cases where *we gave the wrong answer* before, so
it's hard to complain too much.

            regards, tom lane

\timing on
do $$
begin
  for i in 1..100000 loop
    perform 'aqaqaqaqaqaqaqaqaqaqaqaqaq' ~ ('\w\w\w\w\w\w\w\w\w\w\w\w\w\w' || i);
  end loop;
end$$;
\timing on
do $$
begin
  for i in 1..100000 loop
    perform 'a' ~ ('[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' || i);
  end loop;
end$$;
\timing on
do $$
declare
  txt text := repeat(chr(2037), 1000);
  pat text := '^\w*$';
begin
  raise notice 'result of txt ~ pat = %', txt ~ pat;
  for i in 1..1000000 loop
    perform txt ~ pat;
  end loop;
end$$;
\timing on
do $$
declare
  txt text := repeat(chr(2310), 1000);
  pat text := '^\w*$';
begin
  raise notice 'result of txt ~ pat = %', txt ~ pat;
  for i in 1..1000000 loop
    perform txt ~ pat;
  end loop;
end$$;
\timing on
do $$
declare
  txt text := repeat(chr(2037), 1000);
  pat text := '^[^x]*$';
begin
  raise notice 'result of txt ~ pat = %', txt ~ pat;
  for i in 1..1000000 loop
    perform txt ~ pat;
  end loop;
end$$;
\timing on
do $$
declare
  txt text := repeat(chr(2310), 1000);
  pat text := '^[^x]*$';
begin
  raise notice 'result of txt ~ pat = %', txt ~ pat;
  for i in 1..1000000 loop
    perform txt ~ pat;
  end loop;
end$$;
\timing on
do $$
declare
  txt text := repeat(chr(2310), 1000);
  pat text := '^\w*$';
begin
  for i in 1..1000000 loop
    perform txt;
  end loop;
end$$;

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Better locale-specific-character-class handling for regexps