Re: Some regular-expression performance hacking

Поиск
Список
Период
Сортировка
От Joel Jacobson
Тема Re: Some regular-expression performance hacking
Дата
Msg-id 12e6381c-adc1-4711-a619-864fb8df969c@www.fastmail.com
обсуждение исходный текст
Ответ на Re: Some regular-expression performance hacking  ("Joel Jacobson" <joel@compiler.org>)
Ответы Re: Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On Thu, Feb 18, 2021, at 11:30, Joel Jacobson wrote:
>SELECT * FROM vdeviations;
>-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------
>pattern          | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars ... abs\.org|yolasite\.com|za\.net|za\.org)$

Heh, what a funny coincidence:
The regex I used to shrink the very-long-pattern,
actually happens to run a lot faster with the patches.

I noticed it when trying to read from the vdeviations view in PostgreSQL 13.2.

Here is my little helper-function which I used to shrink patterns/subjects longer than N characters:

CREATE OR REPLACE FUNCTION shrink_text(text,integer) RETURNS text LANGUAGE sql AS $$
SELECT CASE WHEN length($1) < $2 THEN $1 ELSE
  format('%s ... %s chars ... %s', m[1], length(m[2]), m[3])
END
FROM (
  SELECT regexp_matches($1,format('^(.{1,%1$s})(.*?)(.{1,%1$s})$',$2/2)) AS m
) AS q
$$;

The regex aims to produce three capture groups,
where I wanted the first and third ones to be greedy
and match up to $2 characters (controlled by the second input param to the function),
and the second capture group in the middle to be non-greedy,
but match the remainder to make up a fully anchored match.

It works like expected in both 13.2 and HEAD+patches, but the speed-up it enormous:

PostgreSQL 13.2:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
ProjectSet  (cost=0.00..0.02 rows=1 width=32) (actual time=23600.816..23600.838 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
Planning Time: 0.432 ms
Execution Time: 23600.859 ms
(4 rows)

HEAD+0001+0002+0003+0004+0005:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
ProjectSet  (cost=0.00..0.02 rows=1 width=32) (actual time=36.656..36.661 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.002 rows=1 loops=1)
Planning Time: 0.575 ms
Execution Time: 36.689 ms
(4 rows)

Cool stuff.

/Joel

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Greg Nancarrow
Дата:
Сообщение: Re: Parallel INSERT (INTO ... SELECT ...)
Следующее
От: Masahiko Sawada
Дата:
Сообщение: Re: 64-bit XIDs in deleted nbtree pages