Re: Notes about fixing regexes and UTF-8 (yet again)

From Robert Haas
Subject Re: Notes about fixing regexes and UTF-8 (yet again)
Date
Msg-id CA+TgmoZtsKi8oQyjn=HMQW8JYnhYmDjzSwAA1-uZNWZBRK+g-A@mail.gmail.com
In response to Re: Notes about fixing regexes and UTF-8 (yet again)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Notes about fixing regexes and UTF-8 (yet again)  (Vik Reykja <vikreykja@gmail.com>)
Re: Notes about fixing regexes and UTF-8 (yet again)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, it's conceivable that we could implement something whereby
>> characters with codes above some cutoff point are handled via runtime
>> calls to iswalpha() and friends, rather than being included in the
>> statically-constructed DFA maps.  The cutoff point could likely be a lot
>> less than U+FFFF, too, thereby saving storage and map build time all
>> round.
>
> In the meantime, I still think the caching logic is worth having, and
> we could at least make some people happy if we selected a cutoff point
> somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
> what a good compromise cutoff would be.  One possibility is U+7FF, which
> corresponds to the limit of what fits in 2-byte UTF8; but I don't know
> if that corresponds to any significant dropoff in frequency of usage.

The problem, of course, is that this probably depends quite a bit on
what language you happen to be using.  For some languages, it won't
matter whether you cut it off at U+FF or U+7FF; while for others even
U+FFFF might not be enough.  So I think this is one of those cases
where it's somewhat meaningless to talk about frequency of usage.

In theory you can imagine a regular expression engine where these
decisions can be postponed until we see the string we're matching
against.  IOW, your DFA ends up with state transitions for characters
specifically named, plus a state transition for "anything else that's
a letter", plus a state transition for "anything else not otherwise
specified".  Then you only need to test the letters that actually
appear in the target string, rather than all of the ones that might
appear there.

But implementing that could be quite a lot of work.
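
To make the idea of deferred classification concrete, here is a minimal
sketch in Python. It is not how the PostgreSQL regex code works. The class
name LazyClassDFA, the LETTER and OTHER markers, and the example pattern
are all hypothetical, and str.isalpha() merely stands in for a runtime
call such as iswalpha(). Each state has transitions for specifically named
characters, an optional "any other letter" transition, and an optional
"anything else" transition. A character is classified only when it
actually shows up in the target string, and the answer is cached.

```python
# Hypothetical sketch of a DFA that postpones character classification
# until match time. Not taken from any real regex engine.

LETTER = object()  # marker: "anything else that's a letter"
OTHER = object()   # marker: "anything else not otherwise specified"


class LazyClassDFA:
    def __init__(self, start, accepting, transitions):
        self.start = start
        self.accepting = set(accepting)
        # transitions: state -> {explicit char | LETTER | OTHER: next state}
        self.transitions = transitions
        # Cache of letter tests, filled only for characters actually seen.
        self.letter_cache = {}

    def _is_letter(self, ch):
        result = self.letter_cache.get(ch)
        if result is None:
            result = ch.isalpha()  # stand-in for iswalpha()
            self.letter_cache[ch] = result
        return result

    def step(self, state, ch):
        edges = self.transitions.get(state, {})
        if ch in edges:                      # specifically named character
            return edges[ch]
        if LETTER in edges and self._is_letter(ch):
            return edges[LETTER]             # some other letter
        return edges.get(OTHER)              # catch-all, or None if absent

    def fullmatch(self, text):
        state = self.start
        for ch in text:
            state = self.step(state, ch)
            if state is None:                # no transition: reject
                return False
        return state in self.accepting


# Roughly the pattern [[:alpha:]]+=.* written out as states by hand.
dfa = LazyClassDFA(
    start=0,
    accepting=[2],
    transitions={
        0: {LETTER: 1},
        1: {LETTER: 1, "=": 2},
        2: {OTHER: 2},
    },
)
```

With this table, dfa.fullmatch("имя=42") only ever runs the letter test on
the three characters before the "=", so nothing needs to be precomputed
for the rest of the character set. A real engine would also have to deal
with character ranges, case folding, and the other character classes,
which is where the work comes in.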

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

