Re: Support LIKE with nondeterministic collations
От | Peter Eisentraut |
---|---|
Тема | Re: Support LIKE with nondeterministic collations |
Дата | |
Msg-id | ba4dc9f5-fb14-433f-8d53-3dd291585161@eisentraut.org обсуждение исходный текст |
Ответ на | Re: Support LIKE with nondeterministic collations (Paul A Jungwirth <pj@illuminatedcomputing.com>) |
Список | pgsql-hackers |
On 27.07.24 00:32, Paul A Jungwirth wrote: > On Thu, Jun 27, 2024 at 11:31 PM Peter Eisentraut <peter@eisentraut.org> wrote: >> Here is an updated patch for this. > > I took a look at this. I added some tests and found a few that give > the wrong result (I believe). The new tests are included in the > attached patch, along with the results I expect. Here are the > failures: Thanks, these are great test cases. > > -- inner %% matches b then zero: > SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents; > ?column? > ---------- > - t > + f > (1 row) > > -- trailing _ matches two codepoints that form one char: > SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents; > ?column? > ---------- > - t > + f > (1 row) > > -- leading % matches zero: > SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents; > ?column? > ---------- > - t > + f > (1 row) > > -- leading % matches zero (with later %): > SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents; > ?column? > ---------- > - t > + f > (1 row) > > I think the 1st, 3rd, and 4th failures are all from % not backtracking > to match zero chars. These are all because of this in like_match.c: * Otherwise, scan for a text position at which we can match the * rest of the pattern. The first remaining pattern char is known * to be a regular or escaped literal character, so we can compare * the first pattern byte to each text byte to avoid recursing * more than we have to. [...] This shortcut doesn't work with nondeterministic collations, so we have to recurse in any case here. I have fixed that in the new patch; these test cases work now. > The 2nd failure I'm not sure about. Maybe my expectation is wrong, but > then why does the same test pass with __ leading not trailing? Surely > they should be consistent. The question is why is SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents; -- false but SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents; -- true The second one matches because SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents; So the accent character will be ignored if it's adjacent to another fixed substring in the pattern.
Вложения
В списке pgsql-hackers по дате отправления: