Discussion: Support LIKE with nondeterministic collations
This patch adds support for using LIKE with nondeterministic collations.  So you can do things such as

    col LIKE 'foo%' COLLATE case_insensitive

This currently results in a "not supported" error.  The reason for that is that when I first developed support for nondeterministic collations, I didn't know what the semantics of that should be, especially since with nondeterministic collations, strings of different lengths could be equal, and so I dropped the issue for a while.

After further research, the SQL standard's definition of the LIKE predicate actually provides a clear definition of the semantics: The pattern is partitioned into substrings at wildcard characters (so 'foo%bar' is partitioned into 'foo', '%', 'bar'), and then the whole predicate matches if a match can be found for each partition under the applicable collation (so for 'foo%bar' we look to partition the input string into s1 || s2 || s3 such that s1 = 'foo', s2 is anything, and s3 = 'bar').  The only difference from deterministic collations is that for deterministic collations we can optimize this by matching character by character, but for nondeterministic collations we have to go by substring.
Attachments
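The partition-based semantics described in the patch message can be sketched in Python.  This is a simplified model, not the patch's actual C implementation (which walks the pattern greedily rather than backtracking); `like_nd` and the case-folding predicate `ci` are hypothetical stand-ins for a real nondeterministic collation.

```python
import re

def like_nd(s, pattern, eq):
    """Match s against a LIKE pattern ('%' and '_' wildcards), using
    eq(a, b) as the collation's string-equality predicate."""
    # Partition the pattern at wildcard characters:
    # 'foo%bar' -> ['foo', '%', 'bar']
    pieces = [p for p in re.split(r'([%_])', pattern) if p != '']

    def match(s, i):
        if i == len(pieces):
            return s == ''          # the whole input must be consumed
        p = pieces[i]
        if p == '%':
            # '%' may absorb any substring, including the empty one
            return any(match(s[k:], i + 1) for k in range(len(s) + 1))
        if p == '_':
            # '_' consumes exactly one character
            return len(s) >= 1 and match(s[1:], i + 1)
        # Literal piece: some prefix of s must be collation-equal to it
        return any(eq(s[:k], p) and match(s[k:], i + 1)
                   for k in range(len(s) + 1))

    return match(s, 0)

# Stand-in for a case-insensitive collation:
ci = lambda a, b: a.lower() == b.lower()

print(like_nd('FOOBAR', 'foo%', ci))   # True
print(like_nd('xFOOBAR', 'foo%', ci))  # False
```

Note how the literal pieces are compared as whole substrings via `eq`, which is exactly why character-by-character matching is not available here: under a nondeterministic collation a prefix of a different length could still be equal to the piece.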
Peter Eisentraut wrote:

> This patch adds support for using LIKE with nondeterministic
> collations.  So you can do things such as
>
>     col LIKE 'foo%' COLLATE case_insensitive

Nice!

> The pattern is partitioned into substrings at wildcard characters
> (so 'foo%bar' is partitioned into 'foo', '%', 'bar'), and then the
> whole predicate matches if a match can be found for each partition
> under the applicable collation

Trying with a collation that ignores punctuation:

postgres=# CREATE COLLATION "ign_punct" (
  provider = 'icu',
  locale = 'und-u-ka-shifted',
  deterministic = false
);

postgres=# SELECT '.foo.' like 'foo' COLLATE ign_punct;
 ?column?
----------
 t
(1 row)

postgres=# SELECT '.foo.' like 'f_o' COLLATE ign_punct;
 ?column?
----------
 t
(1 row)

postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
 ?column?
----------
 f
(1 row)

The first two results look fine, but the next one is inconsistent.

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On 30.04.24 14:39, Daniel Verite wrote:
> postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
>  ?column?
> ----------
>  f
> (1 row)
>
> The first two results look fine, but the next one is inconsistent.

This is correct, because '_' means "any single character".  This is
independent of the collation.

I think with nondeterministic collations, the single-character wildcard
is often not going to be all that useful.
On Thu, May 2, 2024 at 9:38 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 30.04.24 14:39, Daniel Verite wrote:
>> postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
>>  ?column?
>> ----------
>>  f
>> (1 row)
>>
>> The first two results look fine, but the next one is inconsistent.
>
> This is correct, because '_' means "any single character".  This is
> independent of the collation.

Seems really counterintuitive.  I had to think for a long time to be
able to guess what was happening here.  Finally I came up with this
guess:

If the collation-aware matching tries to match up f with the initial
period, the period is skipped and the f matches f.  But when the
wildcard is matched to the initial period, that uses up the wildcard
and then we're left trying to match o with f, which doesn't work.  Is
that right?

It'd probably be good to use something like this as an example in the
documentation.  My intuition is that if foo matches a string, then _oo,
f_o, and fo_ should also match that string.  Apparently that's not the
case, but I doubt I'll be the last one who thinks it should be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On 03.05.24 02:11, Robert Haas wrote:
> On Thu, May 2, 2024 at 9:38 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> On 30.04.24 14:39, Daniel Verite wrote:
>>> postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
>>>  ?column?
>>> ----------
>>>  f
>>> (1 row)
>>>
>>> The first two results look fine, but the next one is inconsistent.
>>
>> This is correct, because '_' means "any single character".  This is
>> independent of the collation.
>
> Seems really counterintuitive.  I had to think for a long time to be
> able to guess what was happening here.  Finally I came up with this
> guess:
>
> If the collation-aware matching tries to match up f with the initial
> period, the period is skipped and the f matches f.  But when the
> wildcard is matched to the initial period, that uses up the wildcard
> and then we're left trying to match o with f, which doesn't work.

Formally, what X like '_oo' means is: can X be partitioned into
substrings such that the first substring is a single character and the
second substring is equal to 'oo' under the applicable collation?  This
is false in this case; there is no such partitioning.

What the implementation does is, it walks through the pattern.  It sees
'_', so it steps over one character in the input string, which is '.'
here.  Then we have 'foo.' left to match in the input string.  Then it
takes from the pattern the next substring up to but not including either
a wildcard character or the end of the string, which is 'oo', and then
it checks if a prefix of the remaining input string can be found that is
"equal to" 'oo'.  So here it would try in turn

    '' = 'oo' collate ign_punct ?
    'f' = 'oo' collate ign_punct ?
    'fo' = 'oo' collate ign_punct ?
    'foo' = 'oo' collate ign_punct ?
    'foo.' = 'oo' collate ign_punct ?

and they all fail, so the match fails.

> It'd probably be good to use something like this as an example in the
> documentation.  My intuition is that if foo matches a string, then _oo,
> f_o, and fo_ should also match that string.  Apparently that's not the
> case, but I doubt I'll be the last one who thinks it should be.

This intuition fails because with nondeterministic collations, strings
of different lengths can be equal, and so the question arises: what does
the pattern '_' mean?  It could mean either (1) a single character, or
perhaps something like (2) a string that is equal to some other string
of length one.

The second definition would satisfy the expectation here, because then
'.f' matches '_' because '.f' is equal to some string of length one,
such as 'f'.  (And then 'oo.' matches 'oo' for the rest of the pattern.)
However, off the top of my head, this definition has three flaws: (1) it
would make the single-character wildcard effectively an
any-number-of-characters wildcard, but only in some circumstances, which
could be confusing; (2) it would be difficult to compute, because you'd
have to check equality against all possible single-character strings;
and (3) it is not what the SQL standard says.

In any case, yes, some explanation and examples should be added.
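The walk described above can be traced with a small Python sketch.  Here `eq_ign_punct` is a crude stand-in for ICU's und-u-ka-shifted comparison (real shifted handling is considerably more subtle); it just ignores non-alphanumeric characters when comparing.

```python
def eq_ign_punct(a, b):
    """Toy 'ignore punctuation' string equality (stand-in for ICU)."""
    strip = lambda s: ''.join(ch for ch in s if ch.isalnum())
    return strip(a) == strip(b)

s = '.foo.'
# Matching pattern '_oo': '_' steps over exactly one character ('.')
rest = s[1:]                        # 'foo.'

# Then try every prefix of the remainder against the literal 'oo':
attempts = [rest[:k] for k in range(len(rest) + 1)]
for prefix in attempts:
    print(f"{prefix!r} = 'oo' ? {eq_ign_punct(prefix, 'oo')}")

# All attempts fail, so '.foo.' LIKE '_oo' is false -- even though
# '.foo.' LIKE 'foo' succeeds, because there the single literal piece
# 'foo' can be compared against the whole string:
print(eq_ign_punct('.foo.', 'foo'))  # True
```

This makes the asymmetry concrete: the '_' wildcard consumes exactly one code point with no collation involvement, while the literal pieces are compared under the collation.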
On Fri, May 3, 2024 at 4:52 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> What the implementation does is, it walks through the pattern.  It sees
> '_', so it steps over one character in the input string, which is '.'
> here.  Then we have 'foo.' left to match in the input string.  Then it
> takes from the pattern the next substring up to but not including either
> a wildcard character or the end of the string, which is 'oo', and then
> it checks if a prefix of the remaining input string can be found that is
> "equal to" 'oo'.  So here it would try in turn
>
>     '' = 'oo' collate ign_punct ?
>     'f' = 'oo' collate ign_punct ?
>     'fo' = 'oo' collate ign_punct ?
>     'foo' = 'oo' collate ign_punct ?
>     'foo.' = 'oo' collate ign_punct ?
>
> and they all fail, so the match fails.

Interesting.  Does that imply that these matches are slower than normal
ones?

> The second definition would satisfy the expectation here, because then
> '.f' matches '_' because '.f' is equal to some string of length one,
> such as 'f'.  (And then 'oo.' matches 'oo' for the rest of the pattern.)
> However, off the top of my head, this definition has three flaws: (1) it
> would make the single-character wildcard effectively an
> any-number-of-characters wildcard, but only in some circumstances, which
> could be confusing; (2) it would be difficult to compute, because you'd
> have to check equality against all possible single-character strings;
> and (3) it is not what the SQL standard says.

Right, those are good arguments.

> In any case, yes, some explanation and examples should be added.

Cool.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On 03.05.24 15:20, Robert Haas wrote:
> On Fri, May 3, 2024 at 4:52 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> What the implementation does is, it walks through the pattern.  It sees
>> '_', so it steps over one character in the input string, which is '.'
>> here.  Then we have 'foo.' left to match in the input string.  Then it
>> takes from the pattern the next substring up to but not including either
>> a wildcard character or the end of the string, which is 'oo', and then
>> it checks if a prefix of the remaining input string can be found that is
>> "equal to" 'oo'.  So here it would try in turn
>>
>>     '' = 'oo' collate ign_punct ?
>>     'f' = 'oo' collate ign_punct ?
>>     'fo' = 'oo' collate ign_punct ?
>>     'foo' = 'oo' collate ign_punct ?
>>     'foo.' = 'oo' collate ign_punct ?
>>
>> and they all fail, so the match fails.
>
> Interesting.  Does that imply that these matches are slower than normal ones?

Yes, certainly, and there is also no indexing support (other than for
exact matches).
Peter Eisentraut wrote:

> Yes, certainly, and there is also no indexing support (other than for
> exact matches).

The ICU docs have this note about prefix matching:

https://unicode-org.github.io/icu/userguide/collation/architecture.html#generating-bounds-for-a-sort-key-prefix-matching

  * Generating bounds for a sort key (prefix matching)

  Having sort keys for strings allows for easy creation of bounds -
  sort keys that are guaranteed to be smaller or larger than any sort
  key from a given range.  For example, if bounds are produced for a
  sortkey of string “smith”, strings between upper and lower bounds
  with one level would include “Smith”, “SMITH”, “sMiTh”.  Two kinds
  of upper bounds can be generated - the first one will match only
  strings of equal length, while the second one will match all the
  strings with the same initial prefix.

  CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
  the maximum primary weight, so that for example the string
  “smith\uFFFF” can be used as the upper bound rather than modifying
  the sort key for “smith”.

In other words, it says that

  col LIKE 'smith%' collate "nd"

is equivalent to:

  col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"

which could be obtained from an index scan, assuming a btree index on
"col" collate "nd".

U+FFFF is a valid code point but a "non-character" [1], so it's not
supposed to be present in normal strings.

[1] https://www.unicode.org/glossary/#noncharacter

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
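The bound-rewrite quoted from the ICU docs can be illustrated with a minimal Python sketch.  Plain code-point string comparison stands in for the ICU sort-key comparison here, so this only demonstrates the shape of the transformation, not actual nondeterministic-collation behavior.

```python
def prefix_bounds(prefix):
    """Rewrite a LIKE 'prefix%' predicate as a half-open range.

    U+FFFF is mapped by CLDR >= 1.9 / ICU >= 4.6 to a collation element
    with the maximum primary weight, so prefix + '\uffff' sorts above
    every string starting with the prefix."""
    return prefix, prefix + '\uffff'

lo, hi = prefix_bounds('smith')

# With a real collation these comparisons would use the collation's
# comparator (e.g. ICU sort keys); codepoint order is a stand-in.
for s in ['smith', 'smithy', 'smitg', 'smiti']:
    print(s, lo <= s < hi)
```

A btree index built with the same collation can answer such a range predicate with an index scan, which is the point of the rewrite.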
Peter Eisentraut wrote:

> However, off the top of my head, this definition has three flaws: (1)
> It would make the single-character wildcard effectively an
> any-number-of-characters wildcard, but only in some circumstances, which
> could be confusing, (2) it would be difficult to compute, because you'd
> have to check equality against all possible single-character strings,
> and (3) it is not what the SQL standard says.

For #1, we're currently using the definition of a "character" as being
any single code point, but this definition fits poorly with
nondeterministic collation rules.  The simplest illustration I can
think of is the canonical-equivalence match between the NFD and NFC
forms of an accented character:

postgres=# CREATE COLLATION nd (
  provider = 'icu',
  locale = 'und',
  deterministic = false
);

-- match NFD form with NFC form of eacute

postgres=# select U&'e\0301' like 'é' collate nd;
 ?column?
----------
 t
(1 row)

postgres=# select U&'e\0301' like '_' collate nd;
 ?column?
----------
 f
(1 row)

I understand why the algorithm produces these opposite results.  But at
the semantic level, when asked if the left-hand string matches a
specific character, it says yes, and when asked if it matches any
character, it says no.  To me it goes beyond counterintuitive; it's not
reasonable enough to be called correct.

What could we do about it?  Intuitively, I think that our interpretation
of "character" here should be whatever sequence of code points lies
between character boundaries [1], and that the equality of such
characters would be the equality of their sequences of code points,
using the string equality check of the collation, whatever the length
of these sequences.

[1]: https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
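The canonical equivalence behind Daniel's example can be checked with Python's standard unicodedata module: the NFD sequence e + U+0301 and the precomposed U+00E9 normalize to each other, yet the NFD form is two code points long, which is why it cannot match the single-character '_'.

```python
import unicodedata

nfd = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT
nfc = '\u00e9'    # precomposed LATIN SMALL LETTER E WITH ACUTE

# Two code points vs. one -- this is the length mismatch '_' trips over
print(len(nfd), len(nfc))                        # 2 1

# Canonically equivalent: each normalizes to the other's form
print(unicodedata.normalize('NFC', nfd) == nfc)  # True
print(unicodedata.normalize('NFD', nfc) == nfd)  # True
```

A collation honoring canonical equivalence (like ICU's 'und') therefore reports the two strings as equal, while the code-point-counting '_' wildcard sees a two-character string.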
On 03.05.24 16:58, Daniel Verite wrote:
>   * Generating bounds for a sort key (prefix matching)
>
>   Having sort keys for strings allows for easy creation of bounds -
>   sort keys that are guaranteed to be smaller or larger than any sort
>   key from a given range.  For example, if bounds are produced for a
>   sortkey of string “smith”, strings between upper and lower bounds
>   with one level would include “Smith”, “SMITH”, “sMiTh”.  Two kinds
>   of upper bounds can be generated - the first one will match only
>   strings of equal length, while the second one will match all the
>   strings with the same initial prefix.
>
>   CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
>   the maximum primary weight, so that for example the string
>   “smith\uFFFF” can be used as the upper bound rather than modifying
>   the sort key for “smith”.
>
> In other words, it says that
>
>   col LIKE 'smith%' collate "nd"
>
> is equivalent to:
>
>   col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"
>
> which could be obtained from an index scan, assuming a btree index on
> "col" collate "nd".
>
> U+FFFF is a valid code point but a "non-character" [1], so it's not
> supposed to be present in normal strings.

Thanks, this could be very useful!
On 03.05.24 17:47, Daniel Verite wrote:
> Peter Eisentraut wrote:
>
>> However, off the top of my head, this definition has three flaws: (1)
>> It would make the single-character wildcard effectively an
>> any-number-of-characters wildcard, but only in some circumstances, which
>> could be confusing, (2) it would be difficult to compute, because you'd
>> have to check equality against all possible single-character strings,
>> and (3) it is not what the SQL standard says.
>
> For #1, we're currently using the definition of a "character" as
> being any single code point,

That is the definition that is used throughout SQL and PostgreSQL.  We
can't change that without redefining everything.  To pick just one
example, the various trim functions also behave in seemingly
inconsistent ways when you apply them to strings in different
normalization forms.  The better fix there is to enforce the
normalization form somehow.

> Intuitively, I think that our interpretation of "character" here should
> be whatever sequence of code points lies between character
> boundaries [1], and that the equality of such characters would be the
> equality of their sequences of code points, using the string equality
> check of the collation, whatever the length of these sequences.
>
> [1]:
> https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary

Even that page says that what we are calling "character" here is really
called a grapheme cluster.  In a different world, pattern matching,
character trimming, etc. would work by grapheme, but it does not.
Here is an updated patch for this.

I have added some more documentation based on the discussions, including
some examples taken directly from the emails here.

One thing I have been struggling with a bit is the correct use of
LIKE_FALSE versus LIKE_ABORT in the MatchText() code.  I have made some
small tweaks about this in this version that I think are more correct,
but it could use another look.  Maybe also some more tests to verify
this one way or the other.

On 30.04.24 14:39, Daniel Verite wrote:
> Peter Eisentraut wrote:
>
>> This patch adds support for using LIKE with nondeterministic
>> collations.  So you can do things such as
>>
>>     col LIKE 'foo%' COLLATE case_insensitive
>
> Nice!
>
>> The pattern is partitioned into substrings at wildcard characters
>> (so 'foo%bar' is partitioned into 'foo', '%', 'bar'), and then the
>> whole predicate matches if a match can be found for each partition
>> under the applicable collation
>
> Trying with a collation that ignores punctuation:
>
> postgres=# CREATE COLLATION "ign_punct" (
>   provider = 'icu',
>   locale = 'und-u-ka-shifted',
>   deterministic = false
> );
>
> postgres=# SELECT '.foo.' like 'foo' COLLATE ign_punct;
>  ?column?
> ----------
>  t
> (1 row)
>
> postgres=# SELECT '.foo.' like 'f_o' COLLATE ign_punct;
>  ?column?
> ----------
>  t
> (1 row)
>
> postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
>  ?column?
> ----------
>  f
> (1 row)
>
> The first two results look fine, but the next one is inconsistent.