Re: Why do indexes and sorts use the database collation?

Поиск
Список
Период
Сортировка
От Jeff Davis
Тема Re: Why do indexes and sorts use the database collation?
Дата
Msg-id 0a0be9def13242b679c5ba4c6539bd62ad672c45.camel@j-davis.com
обсуждение исходный текст
Ответ на Re: Why do indexes and sorts use the database collation?  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Список pgsql-hackers
On Tue, 2023-11-14 at 13:01 +0100, Tomas Vondra wrote:

> Presumably we'd no generate incorrect
> results, but we'd not be able use an index, causing performance
> issues.

Couldn't use the index for its pathkeys or range scans, but could use
it for equality.

> AFAICS this is a trade-off between known benefits (faster equality
> searches, which are common for PK columns) vs. unknown downsides
> (performance penalty for operations with unknown frequency).

Don't forget the dependency versioning risk: changes to the collation
provider library can corrupt your indexes. That affects even small
tables where performance is not a concern.

> Not sure it's a decision we can make automatically. But it's mostly
> achievable manually, if the user specifies COLLATE "C" for the
> column.

Changing the column collation is the wrong place to do it, in my
opinion. It conflates semantics with performance considerations and
dependency risks.

We already have the catalog support for indexes with a different
collation:

  CREATE TABLE foo (t TEXT COLLATE "en_US");
  INSERT INTO foo SELECT g::TEXT FROM generate_series(1,1000000) g;
  CREATE INDEX foo_idx ON foo (t COLLATE "C");
  ANALYZE foo;

The problem is that:

  EXPLAIN SELECT * FROM foo WHERE t = '345678';

doesn't use the index. And also that there's no way to do it for a PK
index.

> I realize you propose to do this automatically for everyone,

I don't think I proposed that. Perhaps we nudge users in that direction
over time as the utility becomes clear, but I'm not trying to push for
a sudden radical change.

Perhaps many read $SUBJECT as a rhetorical question, but I really do
want to know if I am missing important and common use cases for indexes
in a non-"C" collation.

>  But maybe there's a way
> to make this manual approach more convenient? Say, by allowing the PK
> to
> have a different collation (which I don't think is allowed now).

Yeah, that should be fairly non-controversial.

> FWIW I wonder what the impact of doing this automatically would be in
> practice. I mean, in my experience the number of tables with TEXT (or
> types sensitive to collations) primary keys is fairly low, especially
> for tables of non-trivial size (where the performance impact might be
> measurable).

I think a lot more users would be helped than hurt. But in absolute
numbers, the latter group still represents a lot of regressions, so
let's not do anything radical.

> >
> True. What about trying to allow a separate collation for the PK
> constraint (and the backing index)?

+1.

> >
> OK. I personally don't recall any case where I'd see a collation on
> PK
> indexes as a performance issue. Or maybe I just didn't realize it.

Try a simple experiment building an index with the "C" collation and
then try with a different locale on the same data. Numbers vary, but
I've seen 1.5X to 4X on some simple generated data. Others have
reported much worse numbers on some versions of glibc that are
especially slow with lots of non-latin characters.

> But speaking of natural keys, I recall a couple schemas with natural
> keys in code/dimension tables, and it's not uncommon to cluster those
> slow-moving tables once in a while. I don't know if ORDER BY queries
> were very common on those tables, though.

Yeah, not exactly a common case though.

> >
> OK, I think this answers my earlier question. Now that I think about
> this, the one confusing thing with this syntax is that it seems to
> assign the collation to the constraint, but in reality we want the
> constraint to be enforced with the column's collation and the
> alternative collation is for the index.

Yeah, let's be careful about that. It's still technically correct:
uniqueness in either collation makes sense. But it could be confusing
anyway.

> >
Regards,
    Jeff Davis




В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: On non-Windows, hard depend on uselocale(3)
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: On non-Windows, hard depend on uselocale(3)