Re: Some regular-expression performance hacking

Поиск
Список
Период
Сортировка
От Joel Jacobson
Тема Re: Some regular-expression performance hacking
Дата
Msg-id 85140bfb-516c-449b-8a03-7930abb42b88@www.fastmail.com
обсуждение исходный текст
Ответ на Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Hi Tom,

On Thu, Feb 11, 2021, at 05:39, Tom Lane wrote:
>0001-invent-rainbow-arcs.patch
>0002-recognize-matchall-NFAs.patch

Many thanks for working on the regex engine,
this looks like an awesome optimization.

To test the correctness of the patches,
I thought it would be nice with some real-life regexes,
and just as important, some real-life text strings,
to which the real-life regexes are applied to.

I therefore patched Chromium's v8 regexes engine,
to log the actual regexes that get compiled when
visiting websites, and also the text strings that
are the regexes are applied to during run-time
when the regexes are executed.

I logged the regex and text strings as base64 encoded
strings to STDOUT, to make it easy to grep out the data,
so it could be imported into PostgreSQL for analytics.

In total, I scraped the first-page of some ~50k websites,
which produced 45M test rows to import,
which when GROUP BY pattern and flags was reduced
down to 235k different regex patterns,
and 1.5M different text string subjects.

Here are some statistics on the different flags used:

SELECT *, SUM(COUNT) OVER () FROM (SELECT flags, COUNT(*) FROM patterns GROUP BY flags) AS x ORDER BY COUNT DESC;
flags | count  |  sum
-------+--------+--------
       | 150097 | 235204
i     |  43537 | 235204
g     |  22029 | 235204
gi    |  15416 | 235204
gm    |   2411 | 235204
gim   |    602 | 235204
m     |    548 | 235204
im    |    230 | 235204
y     |    193 | 235204
gy    |     60 | 235204
giy   |     29 | 235204
giu   |     26 | 235204
u     |     11 | 235204
iy    |      6 | 235204
gu    |      5 | 235204
gimu  |      2 | 235204
iu    |      1 | 235204
my    |      1 | 235204
(18 rows)

As we can see, no flag at all is the most common, followed by the "i" flag.

Most of the Javascript-regexes (97%) could be understood by PostgreSQL,
only 3% produced some kind of error, which is not unexpected,
since some Javascript-regex features like \w and \W have different
syntax in PostgreSQL:

SELECT *, SUM(COUNT) OVER () FROM (SELECT is_match,error,COUNT(*) FROM subjects GROUP BY is_match,error) AS x ORDER BY count DESC;
is_match |                             error                             | count  |   sum
----------+---------------------------------------------------------------+--------+---------
f        |                                                               | 973987 | 1489489
t        |                                                               | 474225 | 1489489
          | invalid regular expression: invalid escape \ sequence         |  39141 | 1489489
          | invalid regular expression: invalid character range           |    898 | 1489489
          | invalid regular expression: invalid backreference number      |    816 | 1489489
          | invalid regular expression: brackets [] not balanced          |    327 | 1489489
          | invalid regular expression: invalid repetition count(s)       |     76 | 1489489
          | invalid regular expression: quantifier operand invalid        |     17 | 1489489
          | invalid regular expression: parentheses () not balanced       |      1 | 1489489
          | invalid regular expression: regular expression is too complex |      1 | 1489489
(10 rows)

Having had some fun looking at statistics, let's move on to look at if there are any
observable differences between HEAD (8063d0f6f56e53edd991f53aadc8cb7f8d3fdd8f)
and when these two patches have been applied.

To detect any differences,
for each (regex pattern, text string subject) pair,
the columns,
  is_match boolean
  captured text[]
  error text
were set by a PL/pgSQL function running HEAD:

  BEGIN
    _is_match := _subject ~ _pattern;
    _captured := regexp_match(_subject, _pattern);
  EXCEPTION WHEN OTHERS THEN
    UPDATE subjects SET
      error = SQLERRM
    WHERE subject_id = _subject_id;
    CONTINUE;
  END;
  UPDATE subjects SET
    is_match = _is_match,
    captured = _captured
  WHERE subject_id = _subject_id;

The patches

  0001-invent-rainbow-arcs.patch
  0002-recognize-matchall-NFAs.patch

were then applied and this query was executed to spot any differences:

SELECT
  is_match <> (subject ~ pattern) AS is_match_diff,
  captured IS DISTINCT FROM regexp_match(subject, pattern) AS captured_diff,
  COUNT(*)
FROM subjects
WHERE error IS NULL
AND (is_match <> (subject ~ pattern) OR captured IS DISTINCT FROM regexp_match(subject, pattern))
GROUP BY 1,2
ORDER BY 3 DESC
;

The query was first run on the unpatched HEAD to verify it detects no results.
0 rows indeed, and it took this long to finish the query:

Time: 186077.866 ms (03:06.078)

Running the same query with the two patches, was significantly faster:

Time: 111785.735 ms (01:51.786)

No is_match differences were detected, good!

However, there were 23 cases where what got captured differed:

-[ RECORD 1 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$
subject        | v-cloak
is_match_head  | t
captured_head  | {cloak,NULL,NULL}
is_match_patch | t
captured_patch | {NULL,NULL,v-cloak}
-[ RECORD 2 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$
subject        | v-if
is_match_head  | t
captured_head  | {if,NULL,NULL}
is_match_patch | t
captured_patch | {NULL,NULL,v-if}
-[ RECORD 3 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?a5oc.com).*
is_match_head  | t
captured_head  | {https://,a5oc.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 4 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?allfordmustangs.com).*
is_match_head  | t
is_match_patch | t
captured_patch |
-[ RECORD 5 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?audi-forums.com).*
is_match_head  | t
captured_head  | {https://,audi-forums.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 6 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?can-amforum.com).*
is_match_head  | t
captured_head  | {https://,can-amforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 7 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?contractortalk.com).*
is_match_head  | t
is_match_patch | t
captured_patch |
-[ RECORD 8 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?halloweenforum.com).*
is_match_head  | t
is_match_patch | t
captured_patch |
-[ RECORD 9 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?horseforum.com).*
is_match_head  | t
captured_head  | {https://,horseforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 10 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?passatworld.com).*
is_match_head  | t
captured_head  | {https://,passatworld.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 11 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?plantedtank.net).*
is_match_head  | t
captured_head  | {https://,plantedtank.net,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 12 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?vauxhallownersnetwork.co.uk).*
is_match_head  | t
is_match_patch | t
captured_patch |
-[ RECORD 13 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?volvov40club.com).*
is_match_head  | t
captured_head  | {https://,volvov40club.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 14 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?vwidtalk.com).*
is_match_head  | t
captured_head  | {https://,vwidtalk.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 15 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^.*://)?((www.)?yellowbullet.com).*
is_match_head  | t
captured_head  | {https://,yellowbullet.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 16 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^[^\?]*)?(\?[^#]*)?(#.*$)?
is_match_head  | t
is_match_patch | t
captured_patch |
-[ RECORD 17 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)?
subject        | /
is_match_head  | t
captured_head  | {/,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 18 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)?
subject        | /en.html
is_match_head  | t
captured_head  | {/en,.html}
is_match_patch | t
captured_patch |
-[ RECORD 19 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | (^https?:\/\/)?(((\[[^\]]+\])|([^:\/\?#]+))(:(\d+))?)?([^\?#]*)(.*)?
is_match_head  | t
is_match_patch | t
captured_patch | {NULL,https,https,NULL,https,NULL,NULL,://e.echatsoft.com/mychat/visitor,""}
-[ RECORD 20 ]-+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------
pattern        | (^|.)41nbc.com$|(^|.)41nbc.dev$|(^|.)52.23.179.12$|(^|.)52.3.245.221$|(^|.)clipsyndicate.com$|(^|.)michaelbgiordano.com$|(^|.)syndicaster.tv$|(^|.)wdef.com$|(^|.)wdef.dev$|(^|.)wxxv.mysiteserver.net$|(^|.)wxxv25.dev$|(^|.)clipsyndicate.com$|(^|.)syndicaster.tv$
subject        | wdef.com
is_match_head  | t
captured_head  | {NULL,NULL,NULL,NULL,NULL,NULL,NULL,"",NULL,NULL,NULL,NULL,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 21 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | ^((^\w+:|^)\/\/)?(?:www\.)?
subject        | https://www.deputy.com/
is_match_head  | t
captured_head  | {https://,https:}
is_match_patch | t
captured_patch |
-[ RECORD 22 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | ^((^\w+:|^)\/\/)?(?:www\.)?
subject        | https://www.westernsydney.edu.au/
is_match_head  | t
captured_head  | {https://,https:}
is_match_patch | t
captured_patch |
-[ RECORD 23 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern        | ^(https?:){0,1}\/\/|
subject        | https://ui.powerreviews.com/api/
is_match_head  | t
captured_head  | {https:}
is_match_patch | t
captured_patch | {NULL}

The code to reproduce the results have been pushed here:

Let me know if you want access to the dataset,
I could open up a port to my PostgreSQL so you could take a dump.

SELECT
    pg_size_pretty(pg_relation_size('patterns')) AS patterns,
    pg_size_pretty(pg_relation_size('subjects')) AS subjects;
patterns | subjects
----------+----------
20 MB    | 568 MB
(1 row)

/Joel

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: repeated decoding of prepared transactions
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Some regular-expression performance hacking