Discussion: Use correct collation in pg_trgm
Hi hackers,
In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
for converting trigrams to lower-case. Here are some examples where
today the collation is ignored:
CREATE EXTENSION pg_trgm;
CREATE COLLATION turkish (provider = libc, locale = 'tr_TR.utf8');
postgres=# SELECT show_trgm('ISTANBUL' COLLATE "turkish");
show_trgm
---------------------------------------------
{" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
CREATE TABLE test(col TEXT COLLATE "turkish");
INSERT INTO test VALUES ('ISTANBUL');
postgres=# select show_trgm(col) FROM test;
show_trgm
---------------------------------------------
{" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
COLLATE "turkish");
similarity
------------
0.5
If the database is initialized via initdb --locale="tr_TR.utf8", the
output changes:
postgres=# SELECT show_trgm('ISTANBUL');
show_trgm
--------------------------------------------------------
{0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
and
postgres=# select show_trgm(col) FROM test;
show_trgm
--------------------------------------------------------
{0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
COLLATE "turkish");
similarity
------------
1
tr_TR.utf8 converts capital I to ı, which is a multibyte character,
while my default collation converts I to i.
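The casing difference can be checked directly: LOWER() is collation-aware, so something like the following illustrates it (a sketch; it assumes the "turkish" collation created above and a non-Turkish default collation in a UTF-8 database):

```sql
-- Default-style lowercasing maps 'I' to 'i':
SELECT lower('ISTANBUL');
-- Turkish lowercasing maps 'I' to dotless 'ı' (U+0131), which takes
-- two bytes in UTF-8:
SELECT lower('ISTANBUL' COLLATE "turkish");
-- 2 bytes, 1 character:
SELECT octet_length('ı'), char_length('ı');
```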
The attached patch attempts to fix that. I grepped for all occurrences
of DEFAULT_COLLATION_OID in contrib/pg_trgm and used the function's
collation OID instead.
The corresponding regression tests pass.
[1]
https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com
--
David Geier
Attachments
Hello!

The patch is simple and it does what it says it does; I verified the
difference in behavior with/without it.

I think the test case included in the email should be part of the
patch, maybe as a new file contrib/pg_trgm/sql/pg_trgm_collation.sql?
It also needs a proper commit message, and it seems like the affected
indexes will need a REINDEX after this fix.
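To illustrate the REINDEX point: a trigram index built before the fix holds trigrams extracted with the default collation, so affected users would rebuild indexes on columns with a non-default collation. A sketch (the index name is hypothetical):

```sql
-- Rebuilding re-extracts trigrams, now using the column's collation;
-- the index name here is illustrative only:
REINDEX INDEX test_col_trgm_idx;
-- or, more broadly, rebuild all indexes on the table:
REINDEX TABLE test;
```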
On Wed, 21 Jan 2026 at 20:36, David Geier <geidav.pg@gmail.com> wrote:
>
> Hi hackers,
>
> In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
> for converting trigrams to lower-case. Here are some examples where
> today the collation is ignored:
>
> CREATE EXTENSION pg_trgm;
> CREATE COLLATION turkish (provider = libc, locale = 'tr_TR.utf8');
>
> postgres=# SELECT show_trgm('ISTANBUL' COLLATE "turkish");
> show_trgm
> ---------------------------------------------
> {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
>
> CREATE TABLE test(col TEXT COLLATE "turkish");
> INSERT INTO test VALUES ('ISTANBUL');
>
> postgres=# select show_trgm(col) FROM test;
> show_trgm
> ---------------------------------------------
> {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
>
> postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
> COLLATE "turkish");
> similarity
> ------------
> 0.5
>
> If the database is initialized via initdb --locale="tr_TR.utf8", the
> output changes:
>
> postgres=# SELECT show_trgm('ISTANBUL');
> show_trgm
> --------------------------------------------------------
> {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
>
> and
>
> postgres=# select show_trgm(col) FROM test;
> show_trgm
> --------------------------------------------------------
> {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
>
> postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
> COLLATE "turkish");
> similarity
> ------------
> 1
>
> tr_TR.utf8 converts capital I to ı which is a multibyte character, while
> my default collation converts I to i.
>
> The attached patch attempts to fix that. I grepped for all occurrences
> of DEFAULT_COLLATION_OID in contrib/pg_trgm and used the function's
> collation OID instead.
>
> The corresponding regression tests pass.
>
> [1]
> https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com
>
> --
> David Geier
Hi!
LGTM
--
Best regards,
Kirill Reshke
Hi!

Thanks for reviewing.

On 22.01.2026 07:32, Zsolt Parragi wrote:
> Hello!
>
> The patch is simple and it does what it says it does, I verified the
> difference in behavior with/without it.

While reading through [1] I realized that the word boundary detection
also uses the wrong collation. Patch 0002 fixes that.

> I think the test case included in the email should be part of the
> patch, maybe as a new file contrib/pg_trgm/sql/pg_trgm_collation.sql?
> It also needs a proper commit message, and seems like the affected
> indexes will need a REINDEX after this fix.

- I've added tests to 0001 and 0002 based on what each commit fixes.
- I've improved the commit messages.

Looking at [2], it seems like we don't include release note changes in
bug fix commits but rather collect them retroactively before cutting
the release.

[1] https://www.postgresql.org/message-id/f30299bf-ad8e-4125-bf80-e0a8663991b6%40eisentraut.org
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fb1a18810f0

--
David Geier
Attachments
> While reading through [1] I realized that the word boundary detection
> also uses the wrong collation. Patch 0002 fixes that.

Good catch. I considered checking that when I reviewed the patch, but I
thought that it would be an issue with CJK languages, and I was
completely wrong about that.

The updated patch looks good, I only noticed two minor things:

* the new test should have a newline at the end of the file
* and probably a conditional skip based on locale availability; the
citext_utf8.sql test case does something similar
Hi!

Thank you for working on this.

On Mon, Jan 26, 2026 at 3:24 PM Zsolt Parragi <zsolt.parragi@percona.com> wrote:
>
> > While reading through [1] I realized that the word boundary detection
> > also uses the wrong collation. Patch 0002 fixes that.
>
> Good catch, I considered checking that when I reviewed the patch, but
> I thought that it would be an issue with CJK languages, and I was
> completely wrong about that.
>
> Updated patch looks good, I only noticed two minor things:
>
> * the new test should have a newline at the end of the file
> * and probably a conditional skip based on locale availability, the
> citext_utf8.sql test case does something similar

I wonder about existing indexes, which already use the default
collation, not the column collation. Should we add a release note
saying they might be obsolete? Alternatively, given we now have opclass
options, we may add a new opclass option defining whether to use the
column collation (must be false for existing indexes).

------
Regards,
Alexander Korotkov
Supabase
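The suggested opclass option is not implemented anywhere; purely as an illustration of the idea, reusing the existing opclass-parameter syntax, it might look like this (the option name is made up):

```sql
-- Hypothetical only: an opclass option selecting which collation
-- trigram extraction uses (not implemented; the name is invented):
CREATE INDEX ON test USING gin (col gin_trgm_ops (use_column_collation = true));
```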
Attached is v3 of the patch with the following changes:

On 26.01.2026 14:33, Alexander Korotkov wrote:
>> Updated patch looks good, I only noticed two minor things:
>>
>> * the new test should have a newline at the end of the file

Done.

>> * and probably a conditional skip based on locale availability, the
>> citext_utf8.sql test case does something similar

Done, the same way collate.icu.utf8.sql does it. This test also uses
the Turkish collation tr-x-icu.

> I wonder about existing indexes, which already use default collation
> not column collation. Should we add a release note saying they might
> be obsolete? Alternatively, given we now have opclass options, we may
> add a new opclass option defining whether to use column collation
> (must be false for existing indexes).

I think adding a release note is fine. We discussed this in [1]
already, and there's precedent for doing it this way. See [2].

Looking at [3], it seems like we don't include release notes in bug fix
commits but rather collect them retroactively before cutting the
release. The file doc/src/sgml/release-19.sgml is also still completely
empty on master.

--
David Geier

[1] https://www.postgresql.org/message-id/CAEze2WiUL9idZBbuUN%2BMuWqr6DcPr_-C91E9MTx%3DH62Xx5fHaQ%40mail.gmail.com
[2] https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fb1a18810f0
Attachments
>> * the new test should have a newline at the end of the file
>
> Done.

It is still missing. Otherwise it looks good to me.

(I don't see a commitfest entry for it?)
On 02.02.2026 21:33, Zsolt Parragi wrote:
>>> * the new test should have a newline at the end of the file
>>
>> Done.
>
> It is still missing.

Sorry, my bad. Now it's there.

> Otherwise it looks good to me.
>
> (I don't see a commitfest entry for it?)

I thought Alexander might pick it up without a CF entry. I created one
now, see [1]. Will you add yourself as reviewer?

--
David Geier

[1] https://commitfest.postgresql.org/patch/6454/
Attachments
> I thought Alexander might pick it up without CF entry. I created now
> one, see [1]. Will you add yourself as reviewer?

I've rebased the patches on latest master.

--
David Geier
Attachments
> I've rebased the patches on latest master.

Attached is v6 of the patch, which fixes a problem with the tests:
pg_trgm_collation_1.out mistakenly contained \endif at the end of the
file.

--
David Geier
Attachments
On Wed, 2026-01-21 at 16:36 +0100, David Geier wrote:
> Hi hackers,
>
> In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
> for converting trigrams to lower-case. Here are some examples where
> today the collation is ignored:
>
...
> The attached patch attempts to fix that. I grepped for all
> occurrences
> of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
> collation OID instead DEFAULT_COLLATION_OID.
Hi,
Thank you for working on this.
This area is a bit awkward conceptually. The case you found is not
about the *sort order* of the values; it's about the casing semantics.
We mix those two concepts into a single "collation oid" that determines
both sort order and casing semantics (and pattern matching semantics,
too).
LOWER() and UPPER() take the casing semantics from the inferred
collation, so that's a good argument that you're doing the right thing
here.
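The contrast can be sketched like this (assuming the Turkish collation from upthread and a non-Turkish default collation, so the outputs below are what one would expect in that setup):

```sql
-- LOWER() honors the inferred collation:
SELECT lower('ISTANBUL' COLLATE "turkish");   -- ıstanbul
-- to_tsvector() parses with the default collation, so the COLLATE
-- clause does not affect the produced lexeme:
SELECT to_tsvector('simple', 'ISTANBUL' COLLATE "turkish");  -- 'istanbul':1
```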
But full text search does not; it uses DEFAULT_COLLATION_OID for
parsing the input. That sort of makes sense, because tsvector/tsquery
don't have a collatable sort order -- it's more about the parsing
semantics to create the values in the first place, not about how the
tsvector/tsquery values are sorted.
So that leaves me wondering: why would pg_trgm use the inferred
collation and tsvector/tsquery use DEFAULT_COLLATION_OID? They seem
conceptually similar, and the only real difference I see is that
tsvector/tsquery are types and pg_trgm is a set of functions.
Note that I made some changes here recently: full text search and ltree
used to use libc unconditionally or a mix of libc and
DEFAULT_COLLATION_OID; that was clearly wrong and I changed it to
consistently use DEFAULT_COLLATION_OID. But I didn't resolve the
conceptual problem of whether we should use the inferred collation (as
you suggest) or not.
Regards,
Jeff Davis
Hi Jeff!

> This area is a bit awkward conceptually. The case you found is not
> about the *sort order* of the values; it's about the casing semantics.
> We mix those two concepts into a single "collation oid" that determines
> both sort order and casing semantics (and pattern matching semantics,
> too).
>
> LOWER() and UPPER() take the casing semantics from the inferred
> collation, so that's a good argument that you're doing the right thing
> here.
>
> But full text search does not; it uses DEFAULT_COLLATION_OID for
> parsing the input. That sort of makes sense, because tsvector/tsquery
> don't have a collatable sort order -- it's more about the parsing
> semantics to create the values in the first place, not about how the
> tsvector/tsquery values are sorted.

For pg_trgm it's also not only about casing but also about parsing: the
decision of what is considered an alpha-numeric character in
ISWORDCHR() depends on the collation.

> So that leaves me wondering: why would pg_trgm use the inferred
> collation and tsvector/tsquery use DEFAULT_COLLATION_OID? They seem
> conceptually similar, and the only real difference I see is that
> tsvector/tsquery are types and pg_trgm is a set of functions.

I agree. That is inconsistent. But if anything, shouldn't we change
tsvector/tsquery to as well adhere to the inferred collation?

For example, when a user specifies a collation for some table column,
he expects the collation to not only impact sort order. With say
collation en-US-x-icu, the B-tree lookup will be case-insensitive. Why
would a GIN index suddenly not adhere to the collation? That seems
counter-intuitive and confusing. The same when using tsvector/tsquery.

More generally: shouldn't it, from a user's point-of-view, be all or
nothing to avoid surprises? If not, we should come up with easy to
understand and easy to remember reasons for what adheres to the
inferred collation and what adheres to the default collation, and
document that.

> Note that I made some changes here recently: full text search and
> ltree used to use libc unconditionally or a mix of libc and
> DEFAULT_COLLATION_OID; that was clearly wrong and I changed it to
> consistently use DEFAULT_COLLATION_OID. But I didn't resolve the
> conceptual problem of whether we should use the inferred collation (as
> you suggest) or not.

Thanks for the heads up.

--
David Geier
On Thu, 2026-03-26 at 09:50 +0100, David Geier wrote:
> I agree. That is inconsistent. But if anything, shouldn't we change
> tsvector/tsquery to as well adhere to the inferred collation?
I am not sure either way.
It's easy to specify a COLLATE clause to affect the interpretation of
the input. But once you parse the inputs into a stored value, you can't
later reinterpret those values by specifying a COLLATE clause. The
parsing already happened and the original input string was lost.
You can end up with a table full of values, some of which were parsed
with one set of semantics, and others parsed with a different set of
semantics. That may make sense or it may just cause confusion. It's
tough for me to say.
Another consequence is that if we actually declare a type to be
collatable, then parsing will infer the collation all the way through.
Is that what we want?
In any case, I think we should make an explicit decision about which
way to go before making changes. Including Peter, who probably has an
opinion here.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Thu, 2026-03-26 at 09:50 +0100, David Geier wrote:
>> I agree. That is inconsistent. But if anything, shouldn't we change
>> tsvector/tsquery to as well adhere to the inferred collation?
> I am not sure either way.
> It's easy to specify a COLLATE clause to affect the interpretation of
> the input. But once you parse the inputs into a stored value, you can't
> later reinterpret those values by specifying a COLLATE clause. The
> parsing already happened and the original input string was lost.
> You can end up with a table full of values, some of which were parsed
> with one set of semantics, and others parsed with a different set of
> semantics. That may make sense or it may just cause confusion. It's
> tough for me to say.
The rule that text search goes by is that it's okay to be a bit
fuzzy about this because people are usually looking for approximate
matches, so that even if you have sets of lexemes that were extracted
under slightly different parsing rules you can probably still find
what you want. While that argument still works for pg_trgm's original
"similarity" functions, it falls flat for the LIKE/ILIKE/regex index
support functionality: people will be justifiably unhappy if the index
doesn't find the exact same matches that a seqscan-and-filter would.
I've not experimented, but I rather imagine that things are already
buggy as heck, in that optimizing a LIKE or regex expression that's
got collation A applied to it into an indexscan on a pg_trgm index
made with collation B will not work if different trigrams get
extracted. I think we have to insist that the index collation match
the query. Once we've done that, the concern about making a change
like this seems less: you will not get wrong answers, rather the
planner will refuse to use an incompatible index.
regards, tom lane
On 26.03.2026 19:26, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> On Thu, 2026-03-26 at 09:50 +0100, David Geier wrote:
>>> I agree. That is inconsistent. But if anything, shouldn't we change
>>> tsvector/tsquery to as well adhere to the inferred collation?
>
>> I am not sure either way.
>> It's easy to specify a COLLATE clause to affect the interpretation of
>> the input. But once you parse the inputs into a stored value, you can't
>> later reinterpret those values by specifying a COLLATE clause. The
>> parsing already happened and the original input string was lost.
>> You can end up with a table full of values, some of which were parsed
>> with one set of semantics, and others parsed with a different set of
>> semantics. That may make sense or it may just cause confusion. It's
>> tough for me to say.
>
> The rule that text search goes by is that it's okay to be a bit
> fuzzy about this because people are usually looking for approximate
> matches, so that even if you have sets of lexemes that were extracted
> under slightly different parsing rules you can probably still find
tsquery allows starts-with queries equivalent to LIKE 'foo%' via
to_tsquery('foo:*'). These two would then also behave differently.
Can you see any good reason that speaks against using the inferred
collation in tsquery / tsvector?
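A sketch of the two starts-with forms side by side (assuming the test table and Turkish-collated column from the earlier examples):

```sql
-- After the patch, the LIKE form honors the column's collation:
SELECT col FROM test WHERE col LIKE 'ist%';
-- The tsquery prefix form still parses with the default collation:
SELECT col FROM test
WHERE to_tsvector('simple', col) @@ to_tsquery('simple', 'ist:*');
```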
> what you want. While that argument still works for pg_trgm's original
> "similarity" functions, it falls flat for the LIKE/ILIKE/regex index
> support functionality: people will be justifiably unhappy if the index
> doesn't find the exact same matches that a seqscan-and-filter would.
Agreed. That was also one of the motivations to change it.
> I've not experimented, but I rather imagine that things are already
> buggy as heck, in that optimizing a LIKE or regex expression that's
> got collation A applied to it into an indexscan on a pg_trgm index
> made with collation B will not work if different trigrams get
> extracted. I think we have to insist that the index collation match
> the query. Once we've done that, the concern about making a change
> like this seems less: you will not get wrong answers, rather the
> planner will refuse to use an incompatible index.
I thought that happens already. In the following example no index scan
is used, even though sequential scan is disabled. FWICS,
IndexCollMatchesExprColl() takes care of that.
CREATE EXTENSION pg_trgm;
CREATE TABLE test(col TEXT COLLATE "tr-x-icu");
CREATE INDEX ON test USING GIN(col gin_trgm_ops);
SET enable_seqscan = FALSE;
EXPLAIN SELECT * FROM test WHERE col LIKE '%test%' COLLATE "C";
QUERY PLAN
-------------------------------------------------------------------------
Seq Scan on test (cost=10000000000.00..10000000001.01 rows=1 width=32)
Filter: (col = 'test'::text COLLATE "C")
If you have other cases in mind, pointers are appreciated.
--
David Geier