Re: Support regular expressions with nondeterministic collations
From | Jeff Davis |
---|---|
Subject | Re: Support regular expressions with nondeterministic collations |
Date | |
Msg-id | 4a1e185b7442e9f9c89be3d13aa4be148ce27b98.camel@j-davis.com |
In reply to | Re: Support regular expressions with nondeterministic collations (Tom Lane <tgl@sss.pgh.pa.us>) |
Replies | Re: Support regular expressions with nondeterministic collations |
List | pgsql-hackers |
On Mon, 2024-12-16 at 17:16 -0500, Tom Lane wrote:
> Yeah, there is some set of collations for which that would work.
> But I think it requires nontrivial assumptions both about how
> comparison works in the collation, and whether the available
> case-folding logic matches that. An important point here is
> that the results depend on which direction you choose to smash
> case, which is at best a bit uncomfortable-making. For instance,
> I believe in German "ß" upcases to "SS" and would therefore match
> "ss" if you choose to fold to upper, but not so much if you choose
> to fold to lower. (Possibly Peter will correct me on that, but the
> point is there are some weird rules out there.)

Unicode specifies case folding separately from case conversion
(lower/title/upper) to deal with these kinds of issues: "ß", "Ss",
"SS", and "ss" all fold to "ss".

I have a couple patches that create that infrastructure:

https://www.postgresql.org/message-id/flat/a1886ddfcd8f60cb3e905c93009b646b4cfb74c5.camel@j-davis.com

https://www.postgresql.org/message-id/flat/ddfd67928818f138f51635712529bc5e1d25e4e7.camel@j-davis.com

After that's in place, we can even discuss adding a builtin
case-insensitive collation that does memcmp() on the case-folded
strings.

> The existing logic in the regex engine for case-insensitive matching
> is to convert every letter to a bracket expression containing all
> its case variants. For example, "a" becomes "[aA]" and "[xY1]"
> becomes "[xXyY1]". This fails on "ß", so a better way would be
> nice...

We have a couple options:

* create more complex regexes like "(ß|[sS][sS])"
* case fold the pattern first, and then lazily case fold the string
  as we match against it

The former sounds faster but the latter sounds simpler.

Regards,
	Jeff Davis
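
For illustration, here is a minimal sketch of the folding behavior and the two options discussed above. It uses Python's str.casefold() as a stand-in for Unicode full case folding and Python's re module as a stand-in regex engine; it is not PostgreSQL code and not what the linked patches implement. The helper names (fold, folded_equal, match_folded_literal) are hypothetical, and the second option is shown with eager folding of a literal pattern, which is simpler than the lazy folding the message proposes.

```python
import re


def fold(s: str) -> str:
    """Case-fold a string (assumption: str.casefold() approximates the
    Unicode full case folding the message refers to)."""
    return s.casefold()


# Case conversion depends on the direction you pick.
upper_agrees = "ß".upper() == "ss".upper()   # both become "SS"
lower_agrees = "ß".lower() == "ss".lower()   # "ß" stays "ß", so no match

# Case folding has no direction: every variant maps to the same string.
variants = ["ß", "Ss", "SS", "ss"]
folded = {fold(v) for v in variants}


def folded_equal(a: str, b: str) -> bool:
    """Compare the folded strings byte by byte, in the spirit of
    memcmp() on case-folded strings."""
    return fold(a).encode("utf-8") == fold(b).encode("utf-8")


# Option 1: expand the pattern into a more complex regex by hand.
option1 = re.compile(r"(?:ß|[sS][sS])")


def match_folded_literal(literal: str, text: str) -> bool:
    """Option 2, simplified: fold the pattern and the subject string,
    then match. Only literal patterns are handled here, because naively
    folding regex syntax would corrupt escapes such as \\S."""
    return re.fullmatch(re.escape(fold(literal)), fold(text)) is not None
```

With these definitions, folded contains the single string "ss", folded_equal("Straße", "STRASSE") is true, and option1 accepts "ß", "ss", "sS", and "SS" but not a lone "s". The eager folding in match_folded_literal processes the whole subject string up front, which is the cost that folding lazily during matching would avoid.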