Re: Wrong results with equality search using trigram index and non-deterministic collation
| От | David Geier |
|---|---|
| Тема | Re: Wrong results with equality search using trigram index and non-deterministic collation |
| Дата | |
| Msg-id | 9b850976-8f0d-4957-9308-e1c053a35559@gmail.com обсуждение |
| Ответ на | Re: Wrong results with equality search using trigram index and non-deterministic collation (Laurenz Albe <laurenz.albe@cybertec.at>) |
| Ответы |
Re: Wrong results with equality search using trigram index and non-deterministic collation
|
| Список | pgsql-hackers |
>> I think we should change that. It's very counter intuitive that a query >> can change behavior when the planner flips from using e.g. a Seq Scan to >> a Bitmap Index Scan or the other way around. There's already a patch for >> that, see [1]. >> >> [1] >> https://www.postgresql.org/message-id/flat/db087c3e-230e-4119-8a03-8b5d74956bc2%40gmail.com > > > That's not only unintuitive, it is a clear bug. > An index is not allowed to change the semantics of a query. > > Does your patch fix the bug, that is, will the query with "WHERE t = 'right'" > return both results? That's the case that is mostly in need of fixing. > I am not sure if the behavior for the % operator should also be considered > a bug. Not yet :). >>> I don't know what the correct fix would be. Perhaps just refusing to use >>> the index for equality comparisons with non-deterministic collations. >> >> If we merge [1], then not only = but also LIKE would be incorrect. How >> about disabling CREATE INDEX USING gin on columns with non-deterministic >> collations? > > Oh, I see. So your patch won't fix the bug. Indeed. > I am not sure if refusing to *create* the index is the best solution. > Perhaps a warning will be better: > > WARNING: GIN indexes won't be used columns with non-deterministic collations At this point we could more generally say WARNING: pg_trgm does not use inferred column collations but always uses the default database collation. But I'm hoping the patch in [1] gets merged and we can also use the inferred collation to fix the bug you found. >> Or is there maybe a way to make these cases work correctly for >> non-deterministic collations by applying the collation when extracting >> the search trigrams? I take a look into that. Attached patch makes your case work, including the % case. It builds on top of the other patches from [1] that makes pg_trgm use the inferred collation trigram extraction. Instead of using btint4cmp() to compare trigrams, the patch uses a collation-aware string comparison function. This is just a PoC. I haven't given much thought to the details but e.g. when three consecutive characters exceed 3 bytes then compact_trigram() uses a truncated 32-bit hash value as trigram instead. Such trigrams won't work in all cases. We could omit them from the query string but for languages where the majority of trigrams are hashed or where the query string consists of only a few trigrams, the look-up performance would suffer. I guess better would be using a collation-aware hash function that maps different values that compare equal to the same hash value. hashtext() does that already. The new comparison function would then have to distinguish between plain text trigrams and hash trigrams. Alternatively, we could store all trigrams as hashes but that would break functions such as show_trgm(). -- David Geier
Вложения
В списке pgsql-hackers по дате отправления: