Re: Some regular-expression performance hacking

Поиск
Список
Период
Сортировка
От Joel Jacobson
Тема Re: Some regular-expression performance hacking
Дата
Msg-id 913516b8-c3d7-4dea-bf60-d735cf4673f1@www.fastmail.com
обсуждение исходный текст
Ответ на Re: Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote:
>0001-invent-rainbow-arcs-2.patch
>0002-recognize-matchall-NFAs-2.patch

I've successfully tested both patches against the 1.5M regexes-in-the-wild dataset.

Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:

This 100577 bytes big regex (pattern_id = 207811)...

\.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mil\.ac| ... |wmflabs\.org|yolasite\.com|za\.net|za\.org)$

...previously raised...

    error invalid regular expression: regular expression is too complex

...but now goes through:

is_match <NULL> => t
captured <NULL> => {de}
error invalid regular expression: regular expression is too complex => <NULL>

Nice. The patched regex engine is apparently capable of handling even more complex regexes than before.

The test that found the deviation tests each (pattern, text string) individually,
to catch errors. But I've also made a separate query to just test regexes
known to not cause errors, to allow testing all regexes in one big query,
which fully utilizes the CPU cores and also runs quicker.

Below is the result of the performance test query:

\timing

SELECT
  tests.is_match IS NOT DISTINCT FROM (subjects.subject ~ patterns.pattern),
  tests.captured IS NOT DISTINCT FROM regexp_match(subjects.subject, patterns.pattern),
  COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
GROUP BY 1,2
ORDER BY 1,2;

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:

?column? | ?column? |  count
----------+----------+---------
t        | t        | 1448212
(1 row)

Time: 592196.145 ms (09:52.196)

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:

?column? | ?column? |  count
----------+----------+---------
t        | t        | 1448212
(1 row)

Time: 461739.364 ms (07:41.739)

That's an impressive 22% speed-up!

/Joel

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: How to get Relation tuples in C function
Следующее
От: Ranier Vilela
Дата:
Сообщение: Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)