Re: PGDay.it collation discussion notes

From: Dave Gudeman
Subject: Re: PGDay.it collation discussion notes
Date:
Msg-id: 7b079fba0810221043o4d205782p883d8a8df84f54f9@mail.gmail.com
In reply to: Re: PGDay.it collation discussion notes  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses: Re: PGDay.it collation discussion notes  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List: pgsql-hackers


On Mon, Oct 20, 2008 at 2:28 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Tom Lane wrote:
Another objection to this design is that it's completely unclear that
functions from text to text should necessarily yield the same collation
that went into them, but if you treat collation as a hard-wired part of
the expression syntax tree you aren't going to be able to do anything else.
(What will you do about functions/operators taking more than one text
argument?)

Whatever the spec says. Collation is intimately associated with the comparison operations, and doesn't make any sense anywhere else.

Of course the comparison operator is involved in many areas such as index creation, ORDER BY, GROUP BY, etc. In order to support GROUP BY and hash joins on values with a collation type, you need to have a hash function corresponding to the collation.
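
To make the hashing point concrete: if a collation treats two strings as equal, the hash used for GROUP BY or a hash join has to send them to the same bucket. Below is a minimal Python sketch, not PostgreSQL code. It assumes a hypothetical case-insensitive collation modeled with str.casefold, and the names collation_key_ci, hash_raw, hash_ci and hash_group are made up for illustration.

```python
import zlib
from collections import defaultdict

def collation_key_ci(s):
    # Hypothetical case-insensitive collation: values compare by this key.
    return s.casefold()

def equal_ci(a, b):
    return collation_key_ci(a) == collation_key_ci(b)

def hash_raw(s):
    # Hashes the bytes as stored and ignores the collation.
    return zlib.crc32(s.encode("utf-8"))

def hash_ci(s):
    # Hashes the collation key, so values equal under the collation collide.
    return zlib.crc32(collation_key_ci(s).encode("utf-8"))

def hash_group(values, hash_fn, eq):
    # Toy hash aggregation: bucket by hash, then check equality inside the bucket.
    buckets = defaultdict(list)
    for v in values:
        groups = buckets[hash_fn(v)]
        for g in groups:
            if eq(g[0], v):
                g.append(v)
                break
        else:
            groups.append([v])
    return [g for groups in buckets.values() for g in groups]

names = ["abc", "ABC", "xyz"]
good = hash_group(names, hash_ci, equal_ci)   # "abc" and "ABC" share a group
bad = hash_group(names, hash_raw, equal_ci)   # equal values never meet in a bucket
```

With hash_raw the equality check never gets a chance to run on "abc" and "ABC", because they are placed in different buckets first. That is why the hash function has to be chosen together with the collation.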
 
The way the default collation for a given operation is determined, by bubbling up the collation from the operands, through function calls and other expressions, is just to make life a bit easier for the developer who's writing the SQL. We could demand that you always explicitly specify a collation when you use the text equality or inequality operators, but because that would be quite tiresome, a reasonable default is derived from the context.

In this sense, collation is no different from any other feature of the value's type. You could require explicit type annotations everywhere.
 
Looking at an individual value, collation just doesn't make sense.
Collation is a property of the comparison operation, not of a value.

Collation can't be a property of the comparison operation because you don't know what comparison to use until you know the collation type of the value. Collation is a property of string values, just like scale and precision are properties of numeric values. And like those properties of numeric values, collation can be statically determined. The rules for determining what collation to use in an expression are similar in kind to the rules for determining what the resulting scale and precision of an arithmetic expression are. If you consider collation as just part of the type, a lot of things are easier.
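
As a rough illustration of "statically determined": if the collation is carried in the type, the collation of an expression can be computed from the operand types alone, before any value is looked at. The Python sketch below is only a simplified assumption, loosely modeled on the idea that an explicit collation annotation wins over an implicit one. TextType, collate and derive are hypothetical names, and the rule set is not the SQL standard's full derivation rules.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TextType:
    collation: Optional[str]   # None means no collation could be derived
    explicit: bool = False     # True if set by an explicit collation annotation

def collate(t, name):
    # Behaves like a cast that touches only the collation part of the type.
    return TextType(name, explicit=True)

def derive(a, b):
    # Result type of a two-argument text expression, computed from types alone.
    if a.explicit and b.explicit:
        if a.collation != b.collation:
            raise TypeError("conflicting explicit collations")
        return a
    if a.explicit:
        return a
    if b.explicit:
        return b
    if a.collation == b.collation:
        return a
    # Implicit conflict: a comparison on this result needs an explicit collation.
    return TextType(None)

col_de = TextType("de_DE")     # e.g. a column declared with a German collation
col_sv = TextType("sv_SE")     # e.g. a column declared with a Swedish collation
same = derive(col_de, col_de)
mixed = derive(col_de, col_sv)
forced = derive(col_de, collate(col_sv, "sv_SE"))
```

This has the same shape as computing the precision and scale of an arithmetic result from its operand types: a function from types to a type, evaluated at parse or plan time.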

In the parser, we might have to do something like that though, because according to the standard you can tack the COLLATION keyword to string constants and have it bubble up. But let's keep that ugliness just inside the parser.

The COLLATION expression is no different in kind from a type cast. It just works on a restricted part of the type.
 
One, impractical, way to implement collation would be to have one operator class per collation. In fact you could do that today, with no backend changes, to support multiple collations. It's totally impractical, because for starters you'd need different comparison operators, with different names, for each collation. But it's the right mental model.

You can use that model, but it is simpler to view it as an overloaded function. You don't conceptually imagine that DECIMAL(10,4) and DECIMAL(20,2) have different comparison operations, so why would you think of two strings with different collations as having different comparison operations?

I think the right approach is to invent a new concept called "operator modifier". It's basically a 3rd argument to operators. It can be specified explicitly when an operator is used, with syntax like "<left> Op <right> USING <modifier>", or in case of collation, it's derived from the context, per SQL spec. The operator modifier is tacked on to OpExprs and SortClauses in the parser, and passed as a 3rd argument to the function implementing the operator at execution time.
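
The operator-modifier proposal can be sketched roughly as follows. This is an illustrative Python model only, assuming two made-up collations ("binary" and "case_insensitive") represented as key functions. text_lt, text_cmp and sort_using are hypothetical names and do not correspond to any backend function.

```python
import functools

# Hypothetical collations, each modeled as a key function.
COLLATIONS = {
    "binary": lambda s: s,
    "case_insensitive": lambda s: s.casefold(),
}

def text_lt(left, right, modifier="binary"):
    # The function behind "<": the modifier arrives as a third argument.
    key = COLLATIONS[modifier]
    return key(left) < key(right)

def text_cmp(left, right, modifier="binary"):
    # Three-way comparison returning -1, 0 or 1 under the given modifier.
    key = COLLATIONS[modifier]
    a, b = key(left), key(right)
    return (a > b) - (a < b)

def sort_using(values, modifier):
    # A sort clause carries the same modifier and hands it to every comparison.
    return sorted(values, key=functools.cmp_to_key(
        lambda l, r: text_cmp(l, r, modifier)))
```

For example, text_lt("B", "a") is true under the "binary" modifier and false under "case_insensitive". The operator is the same in both cases and only the third argument differs.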

This is a good way to implement collated comparisons, but it's not a new concept, just an additional argument to the comparison operator. It isn't necessary to create new concepts to handle collation when it fits so well into an existing concept, the type. For example, the difference between two indexes with different collations is a difference in the type of the index, just like the difference between a DECIMAL(10,4) index and a DECIMAL(20,2) index.
 
When I added collation to a commercial RDBMS, folding the collation into the type system made things a lot easier. After all, the type defines the operators that act on it, and collation is just a specialization of this notion. Incidentally, collation can easily be extended to non-string types; it is just the part of the type information that controls how values are compared (and hashed). This could be very useful for datetime values and user-defined types as well as strings.
