Re: Better locale-specific-character-class handling for regexps
От | Tom Lane |
---|---|
Тема | Re: Better locale-specific-character-class handling for regexps |
Дата | |
Msg-id | 13600.1473023873@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Better locale-specific-character-class handling for regexps (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-hackers |
I wrote: > I got tired of hearing complaints about the issue described in > this thread: > https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us > Here's a proposed fix. I've not done extensive performance testing, > but it seems to be as fast or faster than the old code in cases where > there are not too many "large" characters in the input. And, more > to the point, it gets the right answer for such large characters. I've now done some performance testing on this patch. I ran the attached scripts in cassert-off builds, using en_US.utf8 locale on a RHEL6 box. Each number is the best of 3 runs; all times in ms. The first two test cases are meant to exercise regex compile speed; they provide relatively short input strings, and each call provides a slightly different pattern so as to defeat the compiled-pattern cache in adt/regexp.c. HEAD patched compilespeed.sql 61959.793 32936.773 compilespeed09.sql 6411.957 5903.211 compilespeed.sql exercises compiling '\w' which of course is locale dependent, while compilespeed09.sql compiles '[0-9]' which is not. Since the patch sets MAX_SIMPLE_CHR to 0x7FF which is the same as regc_pg_locale.c's old cutoff for how many characters to examine, we must be doing about the same number of compile-time iswalnum() probes in both old and new code. The compile speedup must come entirely from the fact that the new code uses a simple array for the low-char-codes colormap, whereas the old code has a complicated multilevel colormap that is more expensive to update. Then the bigger win for compilespeed.sql must be due to the fact that \w will require changing the colors of more characters in the colormap. I was not expecting to see such a big win, but I'll take it ;-) The remaining cases hold the pattern fixed (so it's not recompiled each time) and are meant to measure the per-character scan speed while executing a compiled regexp. HEAD patched ratio of per-char times runspeedlo.sql 12951.588 12053.533 0.91 runspeedhi.sql 10410.140 * 36678.876 2.72 (estimated) runspeednotlo.sql 12999.629 12130.439 0.91 runspeednothi.sql 15678.125 20053.861 1.36 dummy.sql 3444.696 3437.984 (to measure overhead) * gives wrong answer In the runspeedhi.sql case, HEAD thinks the first character doesn't match the pattern, so it's not scanning the whole input, which makes that measurement not useful. We can assume though that the scan rate would have been about the same as runspeednothi.sql, and the estimated per-char ratio shown in the table is computed on that basis. So the patched code's scan speed is about 9% better than HEAD for data characters below MAX_SIMPLE_CHR, presumably because of the simpler color map structure. But it's noticeably worse for data chars above that, and much worse if the pattern forces runtime iswalnum() tests. That's what I was expecting. It's worth noticing that comparing runspeednotlo.sql and runspeednothi.sql shows that HEAD is 28% worse per-character on the latter. Those cases are exactly the same so far as HEAD's regex engine is concerned, so I think that that differential must be blamable entirely on the fact that we're converting 2-byte UTF8 sequences into pg_wchars in the first case and 3-byte UTF8 sequences in the second case. The patched code is also paying that cost of course, so you should take it into account when comparing numbers in different rows of this table. And it's also an indication that really, this code is pretty frickin fast --- the total time per character is only circa 4x the extra logic to pick up and shift in one more byte! Overall I'm pretty happy with these results. I think that most cases will be as fast or faster than before, assuming that most data characters are below MAX_SIMPLE_CHR for most use-cases. In the worst cases, the per-character cost can be several times what it was before, but those are exactly the same cases where *we gave the wrong answer* before, so it's hard to complain too much. regards, tom lane \timing on do $$ begin for i in 1..100000 loop perform 'aqaqaqaqaqaqaqaqaqaqaqaqaq' ~ ('\w\w\w\w\w\w\w\w\w\w\w\w\w\w' || i); end loop; end$$; \timing on do $$ begin for i in 1..100000 loop perform 'a' ~ ('[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' || i); end loop; end$$; \timing on do $$ declare txt text := repeat(chr(2037), 1000); pat text := '^\w*$'; begin raise notice 'result of txt ~ pat = %', txt ~ pat; for i in 1..1000000 loop perform txt ~ pat; end loop; end$$; \timing on do $$ declare txt text := repeat(chr(2310), 1000); pat text := '^\w*$'; begin raise notice 'result of txt ~ pat = %', txt ~ pat; for i in 1..1000000 loop perform txt ~ pat; end loop; end$$; \timing on do $$ declare txt text := repeat(chr(2037), 1000); pat text := '^[^x]*$'; begin raise notice 'result of txt ~ pat = %', txt ~ pat; for i in 1..1000000 loop perform txt ~ pat; end loop; end$$; \timing on do $$ declare txt text := repeat(chr(2310), 1000); pat text := '^[^x]*$'; begin raise notice 'result of txt ~ pat = %', txt ~ pat; for i in 1..1000000 loop perform txt ~ pat; end loop; end$$; \timing on do $$ declare txt text := repeat(chr(2310), 1000); pat text := '^\w*$'; begin for i in 1..1000000 loop perform txt; end loop; end$$;
В списке pgsql-hackers по дате отправления: