Thread: Re: [19] Proposal: function markers to indicate collation/ctype sensitivity

Re: [19] Proposal: function markers to indicate collation/ctype sensitivity

From: Peter Eisentraut
Date: 5 June 2025, 11:12:18

On 04.06.25 05:22, Jeff Davis wrote:
> This proposal would add that dependency information, and importantly,
> would be careful about which dependency entries are required for
> particular expressions and which are not.

> Introduce three new options when creating or altering a function,
> operator or index AM: COLLATE, CTYPE, or EQUALITY, representing the
> operations that the object is sensitive to.

Yes, this has been on my todo list since the day collations were added, 
but for a different reason:  We should be able to detect a failed 
collation derivation at parse time.  This is according to the SQL 
standard, and also because it's arguably a better user experience.  But 
we don't do that, so we have to check it at run time, which is what all 
these errmsg("could not determine which collation to use for string 
comparison") checks are for.

The reason we don't do it at parse time is that we don't have the 
information which functions care about collations, which is exactly what 
you are proposing here to add.

In my mind, I had this project listed under "procollate", but feel free 
to use a different name.  But I would consider making this one setting 
with multiple values instead of multiple boolean settings.

I don't mean to say that you should implement the parse-time collation 
derivation check as well, but we should design the catalog metadata so 
that that is possible.

Re: [19] Proposal: function markers to indicate collation/ctype sensitivity

From: Jeff Davis
Date: 5 June 2025, 22:56:19

On Thu, 2025-06-05 at 10:12 +0200, Peter Eisentraut wrote:
> The reason we don't do it at parse time is that we don't have the
> information which functions care about collations, which is exactly
> what you are proposing here to add.

Currently, we have:

   create table c(x text collate "C", y text collate "en_US");
   insert into c values ('x', 'y');
   select x < y from c; -- fails (runtime check)
   select x || y from c; -- succeeds

Surely, "<" would be marked as ordering-sensitive, and we could move
the error to parse-time.
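
To make that concrete, here is a minimal Python sketch of such a parse-time check. It is a toy model, not PostgreSQL code: the ORDER_SENSITIVE table, derive_collation() and check_at_parse_time() are hypothetical names, and the derivation rule is simplified to "two different implicit collations give an indeterminate result".

```python
# Toy model only: not PostgreSQL source code. All names are hypothetical.
from typing import Optional

# Assumed catalog marking: which operators are ordering-sensitive.
ORDER_SENSITIVE = {"<": True, "||": False}


class IndeterminateCollation(Exception):
    """Raised when an ordering-sensitive operator has conflicting inputs."""


def derive_collation(left: Optional[str], right: Optional[str]) -> Optional[str]:
    """Combine two implicit collations; None stands for 'indeterminate'."""
    return left if left == right else None


def check_at_parse_time(op: str, left: str, right: str) -> Optional[str]:
    """Return the derived collation, or raise if `op` needs one and none exists."""
    collation = derive_collation(left, right)
    if collation is None and ORDER_SENSITIVE[op]:
        raise IndeterminateCollation(
            f"could not determine which collation to use for operator {op}"
        )
    return collation
```

Under these assumptions, "<" on the two columns above would be rejected before execution, while "||" would still pass with an indeterminate collation.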

But what about UDFs? If we assume that all UDFs are ordering-sensitive
unless marked otherwise, then a user-defined version of "||" that
previously worked would now start failing, until they add the ordering-
insensitive mark.

We'd need some kind of migration path where we could retain the runtime
checks and disable the parse time checks until people have a chance to
add the right marks to their UDFs. Migration paths like that are not
great because they take several releases to work out, and we're never
quite sure when to finally remove the deprecated behavior.

If we make the opposite assumption, that none are ordering-sensitive
unless we mark them so, that would allow properly-marked functions to
fail at parse time, and the rest to fail at runtime. But this
assumption doesn't work as well for recording dependencies, because
we'd miss the dependencies for UDFs that aren't properly marked.

Thoughts?

Regards,
    Jeff Davis

Re: [19] Proposal: function markers to indicate collation/ctype sensitivity

From: Jeff Davis
Date: 5 June 2025, 23:47:22

On Thu, 2025-06-05 at 10:12 +0200, Peter Eisentraut wrote:
> But I would consider making this one setting
> with multiple values instead of multiple boolean settings.

While we're at it, CTYPE is not very descriptive for a user-facing
name. And COLLATE has become overloaded (expression clause,
pg_collation object, ordering, or the superset of behaviors that
includes CTYPE). Let's consider more user-friendly naming for the
markers:

  CASE: lower/upper/initcap/fold behavior
  CLASS: char classifications such as [[:punct:]]
  ORDER: comparisons

Internally, at least for the foreseeable future, CASE and CLASS would
be the same. They'd just be different markers to record the user's
intent.

Also, we could use keywords in the DDL syntax, or we could use a new
options syntax, or a comma-separated list as a string literal to
specify the markers. I don't have a strong opinion on which route to
take, but I chose the above names from existing keywords so we wouldn't
have to add any.
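
As a rough illustration of "one setting with multiple values" given as a comma-separated list, here is a small Python sketch. The Sensitivity type and parse_markers() are hypothetical names used only for this example; nothing here reflects actual catalog or grammar design.

```python
# Toy model only: marker names follow the proposal above, the rest is assumed.
from enum import Flag, auto


class Sensitivity(Flag):
    """Hypothetical marker set stored as a single multi-valued setting."""
    NONE = 0
    CASE = auto()   # lower/upper/initcap/fold behavior
    CLASS = auto()  # character classification such as [[:punct:]]
    ORDER = auto()  # comparisons


def parse_markers(spec: str) -> Sensitivity:
    """Parse a comma-separated marker list such as 'case, order'."""
    result = Sensitivity.NONE
    for name in spec.split(","):
        name = name.strip().upper()
        if name:
            result |= Sensitivity[name]  # KeyError on an unknown marker
    return result
```

For example, parse_markers("case, order") yields a value that contains CASE and ORDER but not CLASS, and an empty string yields no markers at all.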

Regards,
    Jeff Davis

Re: [19] Proposal: function markers to indicate collation/ctype sensitivity

From: Peter Eisentraut
Date: 11 June 2025, 10:03:46

On 05.06.25 21:56, Jeff Davis wrote:
> On Thu, 2025-06-05 at 10:12 +0200, Peter Eisentraut wrote:
>> The reason we don't do it at parse time is that we don't have the
>> information which functions care about collations, which is exactly
>> what you are proposing here to add.
> 
> Currently, we have:
> 
>     create table c(x text collate "C", y text collate "en_US");
>     insert into c values ('x', 'y');
>     select x < y from c; -- fails (runtime check)
>     select x || y from c; -- succeeds
> 
> Surely, "<" would be marked as ordering-sensitive, and we could move
> the error to parse-time.
> 
> But what about UDFs? If we assume that all UDFs are ordering-sensitive
> unless marked otherwise, then a user-defined version of "||" that
> previously worked would now start failing, until they add the ordering-
> insensitive mark.

I think no matter how we slice it, there is going to be some case that 
will be degraded until some update is applied.  I would be content to 
accept this particular variant, because it doesn't seem very realistic. 
Why would a user define their own concatenation function?  There already 
is one.  Unless your concatenation function does something special, in 
which case you should probably think about this collations topic.  More
generally, I think there are only so many operations on character
strings that you can do without considering the collation/ctype/etc.
These are essentially all the operations that you can do without
looking at the characters, like length(), ||, repeat().  Everything
beyond that looks at the characters and needs to take
collation/ctype/etc. into account.

> We'd need some kind of migration path where we could retain the runtime
> checks and disable the parse time checks until people have a chance to
> add the right marks to their UDFs. Migration paths like that are not
> great because they take several releases to work out, and we're never
> quite sure when to finally remove the deprecated behavior.

Perhaps pg_dump can apply some properties during upgrades?

> If we make the opposite assumption, that none are ordering-sensitive
> unless we mark them so, that would allow properly-marked functions to
> fail at parse time, and the rest to fail at runtime. But this
> assumption doesn't work as well for recording dependencies, because
> we'd miss the dependencies for UDFs that aren't properly marked.

That feels like the worst of both worlds.

Re: [19] Proposal: function markers to indicate collation/ctype sensitivity

From: Peter Eisentraut
Date: 11 June 2025, 10:06:33

On 05.06.25 22:47, Jeff Davis wrote:
> While we're at it, CTYPE is not very descriptive for a user-facing
> name. And COLLATE has become overloaded (expression clause,
> pg_collation object, ordering, or the superset of behaviors that
> includes CTYPE). Let's consider more user-friendly naming for the
> markers:
> 
>    CASE: lower/upper/initcap/fold behavior
>    CLASS: char classifications such as [[:punct:]]
>    ORDER: comparisons
> 
> Internally, at least for the foreseeable future, CASE and CLASS would
> be the same. They'd just be different markers to record the user's
> intent.

Under what scenario would they become different, and how would that 
matter in practice?

I would be worried that this could confuse users and they would apply 
these incorrectly, if the differences are too fine.

Re: [19] Proposal: function markers to indicate collation/ctype sensitivity

From: Jeff Davis
Date: 11 June 2025, 19:07:59

On Wed, 2025-06-11 at 09:06 +0200, Peter Eisentraut wrote:
> >    CASE: lower/upper/initcap/fold behavior
> >    CLASS: char classifications such as [[:punct:]]
> >    ORDER: comparisons
> >
> > Internally, at least for the foreseeable future, CASE and CLASS
> > would be the same. They'd just be different markers to record the
> > user's intent.
>
> Under what scenario would they become different, and how would that
> matter in practice?

I can't think of any reason those behaviors should diverge. If nothing
else, the "uppercase" property should be consistent with the results of
case mapping.

However, I have struggled to come up with a single word that includes
both casing behavior and character classification, but excludes
ordering behavior. Such a word would be useful for documentation, too.
I guess "CTYPE" works, but it's too technical and feels libc-specific.

Regards,
    Jeff Davis

Re: [19] Proposal: function markers to indicate collation/ctype sensitivity

From: Jeff Davis
Date: 12 June 2025, 01:01:19

On Wed, 2025-06-11 at 09:03 +0200, Peter Eisentraut wrote:
> I think no matter how we slice it, there is going to be some case
> that will be degraded until some update is applied.

The problem I see is a conflict between two goals:

1. Record appropriate dependencies if a function is sensitive to
ordering or ctype.

2. Raise parse errors if we cannot infer the collation for a function
call site where the function is sensitive to ordering or ctype.

The safest assumption with respect to the first goal is to assume that
UDFs are sensitive to ordering and ctype. Otherwise, we will miss
recording dependencies for, e.g., a validation function that uses a
regex that depends on character classification.

But the safest assumption with respect to the second goal is to assume
that UDFs are not sensitive to ordering or ctype. Otherwise, we'd throw
an error for queries that work today (see the example below).

To resolve this conflict I think we need some notion about whether the
markings are explicitly specified or left as the defaults. If CREATE
FUNCTION doesn't specify any markings, then the dependency tracking
code can make one assumption and the parser can make the opposite
assumption. We need to sort out the actual syntax of CREATE FUNCTION,
and I'm starting to think we need some options syntax (similar to
storage parameters for CREATE TABLE).
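
A minimal Python sketch of that idea, assuming the markings are stored in a way that distinguishes "not specified" from "explicitly empty". The names Markings, ALL_MARKERS, markers_for_dependencies() and markers_for_parser() are hypothetical and only illustrate the two opposite defaults.

```python
# Toy model only: not PostgreSQL code. All names are hypothetical.
from typing import Optional

# None means CREATE FUNCTION gave no markings; a set (possibly empty) means
# the user stated them explicitly.
Markings = Optional[frozenset]

ALL_MARKERS = frozenset({"order", "ctype"})


def markers_for_dependencies(markings: Markings) -> frozenset:
    """Dependency tracking: treat an unmarked function as fully sensitive."""
    return ALL_MARKERS if markings is None else markings


def markers_for_parser(markings: Markings) -> frozenset:
    """Parse-time check: treat an unmarked function as insensitive."""
    return frozenset() if markings is None else markings
```

An unmarked UDF would then get dependency entries for both ordering and ctype, yet would never raise a parse error, while an explicitly marked function is treated the same way by both consumers.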

> Why would a user define their own concatenation function?

It's more likely that someone combines a few primitive functions:

  CREATE OR REPLACE FUNCTION shorten(t TEXT) RETURNS TEXT
    LANGUAGE plpgsql AS $$
    BEGIN
      IF (length(t) < 4) THEN
        RETURN t;
      END IF;
      RETURN substr(t,1,1) ||
             (length(t) - 2)::text ||
             substr(t,length(t),1);
    END;
    $$;

  CREATE TABLE c(x TEXT COLLATE "C", y TEXT COLLATE "en_US");
  INSERT INTO c VALUES ('kuber','netes');

  SELECT x = y FROM c;
  ERROR:  could not determine which collation to use for string comparison

  -- currently succeeds even though collation cannot be inferred
  SELECT shorten(x || y) FROM c;
   shorten
  ---------
   k8s
  (1 row)

The example is a bit silly, but I think there are realistic cases along
those lines.

> Everything beyond that looks at the characters and needs to take
> collation/ctype/etc. into account.

I'm not sure. My guess would be that the various kinds of markings you
might want (or no markings at all) are all common enough cases that
they shouldn't be ignored.

> Perhaps pg_dump can apply some properties during upgrades?

Interesting idea. We'd still need to account for CREATE FUNCTION
statements that come from other places (e.g. direct from applications,
or migration scripts, or extension scripts).

Regards,
    Jeff Davis
