Обсуждение: Theory of operation of collation patch

Поиск
Список
Период
Сортировка

Theory of operation of collation patch

От
Tom Lane
Дата:
Is there any documentation of $SUBJECT?   Because the more I look at
this patch the more I think it's misdesigned; either that or I
fundamentally misunderstand what it's trying to do.  I complained
yesterday about why the planner wasn't using indcollation to identify
the sort order of an index.  I now think that the reason it doesn't
obviously fail to fail is that indcollation is dead code, and so is
approximately 99% of what you added to the planner, because two
expressions that are equal() must necessarily have the same collation
property.  Tracking the collation as a separate property of a pathkey
is thus a useless activity.  If this conclusion isn't correct, please
explain why not.
        regards, tom lane


Re: Theory of operation of collation patch

От
Peter Eisentraut
Дата:
On mån, 2011-03-07 at 11:43 -0500, Tom Lane wrote:
> Is there any documentation of $SUBJECT?   Because the more I look at
> this patch the more I think it's misdesigned; either that or I
> fundamentally misunderstand what it's trying to do.  I complained
> yesterday about why the planner wasn't using indcollation to identify
> the sort order of an index.  I now think that the reason it doesn't
> obviously fail to fail is that indcollation is dead code, and so is
> approximately 99% of what you added to the planner, because two
> expressions that are equal() must necessarily have the same collation
> property.  Tracking the collation as a separate property of a pathkey
> is thus a useless activity.  If this conclusion isn't correct, please
> explain why not.

I'll have to check into these details, but here is a test case that
shows that it's doing something with the collation of an index:

DROP TABLE test;
CREATE TABLE test (a text);
CREATE INDEX test_1 ON test (a);
CREATE INDEX test_d ON test (a COLLATE "de_DE");
CREATE INDEX test_e ON test (a COLLATE "es_ES");
CREATE INDEX test_f ON test (a COLLATE "fr_FR");
SET enable_seqscan TO off;
SET enable_bitmapscan TO off;
EXPLAIN SELECT * FROM test WHERE a > 'foo';                             QUERY PLAN                               
-----------------------------------------------------------------------Index Scan using test_1 on test
(cost=0.00..51.90rows=437 width=32)  Index Cond: (a > 'foo'::text)
 
(2 rows)

EXPLAIN SELECT * FROM test WHERE a > 'foo' COLLATE "de_DE";                             QUERY PLAN
        
 
-----------------------------------------------------------------------Index Scan using test_d on test
(cost=0.00..51.90rows=437 width=32)  Index Cond: (a > ('foo'::text COLLATE "de_DE"))
 
(2 rows)

EXPLAIN SELECT * FROM test WHERE a > 'foo' COLLATE "es_ES";                             QUERY PLAN
        
 
-----------------------------------------------------------------------Index Scan using test_e on test
(cost=0.00..51.90rows=437 width=32)  Index Cond: (a > ('foo'::text COLLATE "es_ES"))
 
(2 rows)

EXPLAIN SELECT * FROM test WHERE a > 'foo' COLLATE "fr_FR";                             QUERY PLAN
        
 
-----------------------------------------------------------------------Index Scan using test_f on test
(cost=0.00..51.90rows=437 width=32)  Index Cond: (a > ('foo'::text COLLATE "fr_FR"))
 
(2 rows)




Re: Theory of operation of collation patch

От
Tom Lane
Дата:
Peter Eisentraut <peter_e@gmx.net> writes:
> On mån, 2011-03-07 at 11:43 -0500, Tom Lane wrote:
>> ... I now think that the reason it doesn't
>> obviously fail to fail is that indcollation is dead code, and so is
>> approximately 99% of what you added to the planner, because two
>> expressions that are equal() must necessarily have the same collation
>> property.  Tracking the collation as a separate property of a pathkey
>> is thus a useless activity.  If this conclusion isn't correct, please
>> explain why not.

> I'll have to check into these details, but here is a test case that
> shows that it's doing something with the collation of an index:

[ pokes at it some more... ]  It looks like indcollation is acting as a
substitute for including a CollateClause in the index key expression,
which doesn't seem like a particularly good tradeoff considering all the
overhead you must introduce into the default case.

But more to the point, your examples do *not* work for me.  I can
reproduce both failing to use an index that should work, and selecting
an index that doesn't work:

d1u=# CREATE TABLE test (a text);
CREATE TABLE
d1u=# CREATE INDEX test_1 ON test (a);
CREATE INDEX
d1u=# CREATE INDEX test_d ON test (a COLLATE "de_DE");
CREATE INDEX
d1u=# CREATE INDEX test_e ON test (a COLLATE "es_ES");
CREATE INDEX
d1u=# CREATE INDEX test_f ON test (a COLLATE "fr_FR");
CREATE INDEX
d1u=# CREATE INDEX test_fz ON test ((a||'z') COLLATE "fr_FR");
CREATE INDEX
d1u=# explain select * from test order by a;                              QUERY PLAN                               
------------------------------------------------------------------------Index Scan using test_f on test
(cost=0.00..63.90rows=1310 width=32)
 
(1 row)

d1u=# explain select * from test order by a collate "fr_FR";                         QUERY PLAN
 
 
---------------------------------------------------------------Sort  (cost=90.93..94.20 rows=1310 width=32)  Sort Key:
((aCOLLATE "fr_FR"))  ->  Seq Scan on test  (cost=0.00..23.10 rows=1310 width=32)
 
(3 rows)

d1u=# set enable_seqscan TO 0;
SET
d1u=# explain select * from test order by a collate "fr_FR";                                   QUERY PLAN
                    
 
----------------------------------------------------------------------------------Sort
(cost=10000000090.93..10000000094.20rows=1310 width=32)  Sort Key: ((a COLLATE "fr_FR"))  ->  Seq Scan on test
(cost=10000000000.00..10000000023.10rows=1310 width=32)
 
(3 rows)

d1u=# explain select * from test order by a||'z';                                         QUERY PLAN
           
 
-------------------------------------------------------------------------Index Scan using test_fz on test
(cost=0.00..67.18rows=1310 width=32)
 
(1 row)

(This is in a database with encoding utf8 and lc_collate = c)
        regards, tom lane


Re: Theory of operation of collation patch

От
Martijn van Oosterhout
Дата:
On Mon, Mar 07, 2011 at 11:43:20AM -0500, Tom Lane wrote:
> Is there any documentation of $SUBJECT?   Because the more I look at
> this patch the more I think it's misdesigned; either that or I
> fundamentally misunderstand what it's trying to do.  I complained
> yesterday about why the planner wasn't using indcollation to identify
> the sort order of an index.  I now think that the reason it doesn't
> obviously fail to fail is that indcollation is dead code, and so is
> approximately 99% of what you added to the planner, because two
> expressions that are equal() must necessarily have the same collation
> property.  Tracking the collation as a separate property of a pathkey
> is thus a useless activity.  If this conclusion isn't correct, please
> explain why not.

The collation is a property of the operators/functions and not of the
values. An individual value does not have a collation, a column does.

A pathkey represents a sort order, right? To define a sort order you
need a collation and so the path key is the natural place to put it.
Two path keys that differ only by the collation are *not* compatable,
but obviously can be converted by a sort node.

After the plan phase, the collation will be ignored in 99% of the
executer, except for operators like =, < and > that need to look at it.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Theory of operation of collation patch

От
Tom Lane
Дата:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Mon, Mar 07, 2011 at 11:43:20AM -0500, Tom Lane wrote:
>> Is there any documentation of $SUBJECT?

> The collation is a property of the operators/functions and not of the
> values. An individual value does not have a collation, a column does.

OK.

> A pathkey represents a sort order, right? To define a sort order you
> need a collation and so the path key is the natural place to put it.

Only if the expression-to-be-sorted does not already fully specify the
collation, which so far as I can tell (either from the code or your
description above) it does.  I think that the explicit representation
of collation as part of the PathKey node is unnecessary, inefficient,
and bug-inducing --- the latter because it promotes fuzzy thinking about
where the collation information is coming from.  (And this isn't just
hypothetical: IMO the bugs I exhibited upthread are *directly* due to
fuzzy thinking about what defines an index's sort order.)

Or, to put it another way: the properties that define a sort order are
the sort comparison operator, the collation, the ASC/DESC bit, and the
NULLS FIRST/LAST bit.  Given the way that the SQL committee has
constructed the language, the operator and the two flag bits are
attached to the ORDER BY clause, but the collation is a property of the
expression-to-be-sorted.  If we fail to preserve that distinction in the
internal representation, we're just creating problems for ourselves.

I'm willing to take a pass at fixing this code during the alpha cycle,
but I want to be sure I understand it correctly first.  So, if there's
a hole in my thinking, please point it out.
        regards, tom lane


Re: Theory of operation of collation patch

От
Martijn van Oosterhout
Дата:
On Mon, Mar 07, 2011 at 07:15:28PM -0500, Tom Lane wrote:
> Only if the expression-to-be-sorted does not already fully specify the
> collation, which so far as I can tell (either from the code or your
> description above) it does.  I think that the explicit representation
> of collation as part of the PathKey node is unnecessary, inefficient,
> and bug-inducing --- the latter because it promotes fuzzy thinking about
> where the collation information is coming from.  (And this isn't just
> hypothetical: IMO the bugs I exhibited upthread are *directly* due to
> fuzzy thinking about what defines an index's sort order.)

Well, collation processing happens in two phases. Initially collate
information is provided by the columns in the query, explicit clauses,
etc. These are indeed attached to the values. From here the collations
of expressions are determined. The SQL committee thought up a bunch of
actually quite logical rules here. Explicit overrides implicit,
implicit overrides default. Both explicit or both implicit is an error,
etc. Note error state is only a problem if you use it for sorting or
comparison, otherwise it is ignored.

This phase of processing happens in the parse analysis, the end result
being that every expression should have a collation set, and every
operator where it matters has consistant collation information for its
arguments. So at this point the collation should be attached to the
expression.

In the planning phase however, all collation information is ignored
except where it matters, which is the comparison operators and ORDER BY
and similar. By the previous phase all comparison operators should have
a defined collation for its arguments and thus for itself, this allows
it to construct appropriate pathkeys for index scans, etc.

I havn't looked at the patch, perhaps it confuses the information in
the two phases but the basic idea seems to me:

- parse analysis - collation in expressions
- planning - collation in path key

> Or, to put it another way: the properties that define a sort order are
> the sort comparison operator, the collation, the ASC/DESC bit, and the
> NULLS FIRST/LAST bit.  Given the way that the SQL committee has
> constructed the language, the operator and the two flag bits are
> attached to the ORDER BY clause, but the collation is a property of the
> expression-to-be-sorted.  If we fail to preserve that distinction in the
> internal representation, we're just creating problems for ourselves.

The assigning of the collation to expressions is simply the method for
getting the collation information to the operators. Mainly because 99%
of the time there will be no explicit clauses in the queries, so the
information has to get there by other means.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Theory of operation of collation patch

От
Susanne Ebrecht
Дата:
On 07.03.2011 17:43, Tom Lane wrote:
> because two expressions that are equal() must necessarily have the same collation
> property.

Peter, Tom,

I am not able to see this.

If 'abc' == 'abc' is not collation depending at all. It is only
encoding depending.

Collation is only needed for upper(), lower() and sorting.
Means it tells if e.g. upper('i') will get Y or I.
It tells if the German s-umlaut will be sorted together with 's' or 
after 'z'.

Btw. the follows on implementing collations will be different -
and I hope Peter is aware of it.

My experience is that a huge follow will be that users will complain 
that the
sorting isn't correct even when it is correct.

Susanne

-- 
Susanne Ebrecht - 2ndQuadrant
PostgreSQL Development, 24x7 Support, Training and Services
www.2ndQuadrant.com



Re: Theory of operation of collation patch

От
Tom Lane
Дата:
Susanne Ebrecht <susanne@2ndQuadrant.com> writes:
> On 07.03.2011 17:43, Tom Lane wrote:
>> because two expressions that are equal() must necessarily have the same collation
>> property.

> Peter, Tom,

> I am not able to see this.

> If 'abc' == 'abc' is not collation depending at all. It is only
> encoding depending.

Sorry, I was using a term of art there.  This isn't about data values,
it's about parsed expression trees.  See the equal() function in
src/backend/nodes/equalfuncs.c.  If two expression trees match according
to equal(), then they must necessarily have the same collation property,
because equal() compares all the fields and in particular the collation
fields.
        regards, tom lane


Re: Theory of operation of collation patch

От
Tom Lane
Дата:
Martijn van Oosterhout <kleptog@svana.org> writes:
> This phase of processing happens in the parse analysis, the end result
> being that every expression should have a collation set, and every
> operator where it matters has consistant collation information for its
> arguments. So at this point the collation should be attached to the
> expression.

Right.

> I havn't looked at the patch, perhaps it confuses the information in
> the two phases but the basic idea seems to me:

> - parse analysis - collation in expressions
> - planning - collation in path key

Not so right.  A path key contains an expression tree, plus whatever
*additional* information is needed to fully specify the sort ordering.
If the collation is already fully determined by the expression tree,
there is no need to duplicate that information in the PathKey node.
And, as I said, doing so anyway has real negative consequences.
        regards, tom lane


Re: Theory of operation of collation patch

От
Greg Stark
Дата:
On Tue, Mar 8, 2011 at 3:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Not so right.  A path key contains an expression tree, plus whatever
> *additional* information is needed to fully specify the sort ordering.
> If the collation is already fully determined by the expression tree,
> there is no need to duplicate that information in the PathKey node.
> And, as I said, doing so anyway has real negative consequences.

Isn't the reason to copy that information outside the expression so
that we can choose sometimes to ignore it? Namely, for == we can use
an index with any defined collation even if it doesn't match the
collation in the pathkey we're looking for?

I think currently that's the only example but in theory we could have
collations that are "supersets" of the desired collation. For example
a UTF8 collation that sorts english in the desired way and sorts utf8
characters in some way that isn't relevant to the query.


--
greg


Re: Theory of operation of collation patch

От
Tom Lane
Дата:
Greg Stark <gsstark@mit.edu> writes:
> On Tue, Mar 8, 2011 at 3:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Not so right. �A path key contains an expression tree, plus whatever
>> *additional* information is needed to fully specify the sort ordering.
>> If the collation is already fully determined by the expression tree,
>> there is no need to duplicate that information in the PathKey node.
>> And, as I said, doing so anyway has real negative consequences.

> Isn't the reason to copy that information outside the expression so
> that we can choose sometimes to ignore it? Namely, for == we can use
> an index with any defined collation even if it doesn't match the
> collation in the pathkey we're looking for?

No, that's nonsense.  A PathKey is all about sort order.  It is not
relevant in any situation where sort order is ignorable.

> I think currently that's the only example but in theory we could have
> collations that are "supersets" of the desired collation. For example
> a UTF8 collation that sorts english in the desired way and sorts utf8
> characters in some way that isn't relevant to the query.

That sounds like nonsense as well.  There is no such thing as a superset
ordering: to have such a thing, you'd have to have some orderings that
didn't fully determine the ordering of data values, which is pretty much
unworkable.
        regards, tom lane


Re: Theory of operation of collation patch

От
Peter Eisentraut
Дата:
On mån, 2011-03-07 at 12:52 -0500, Tom Lane wrote:
> It looks like indcollation is acting as a
> substitute for including a CollateClause in the index key expression,
> which doesn't seem like a particularly good tradeoff considering all
> the overhead you must introduce into the default case.

Conceptually, this characterization is correct.  But assuming you do
away with the indcollation column and instead create expression indexes
on (col COLLATE "x"), you'd then have an interesting time matching those
expressions to expressions in the query when choosing an index.
Example:

CREATE TABLE test (a text);
CREATE INDEX test_idx ((a COLLATE "foo"));  -- expression index

-- easy to choose the index here
SELECT * FROM test WHERE a COLLATE "foo" > 'x';

-- not so easy here
SELECT * FROM test WHERE a > 'x' COLLATE "foo";

-- more difficult
CREATE TABLE test2 (b text COLLATE "foo");
SELECT * FROM test WHERE x > (SELECT b FROM test2 WHERE something);

Note that this situation already exists: If you create a plain column
index using CREATE INDEX test_idx (a COLLATE "foo"), all of the above
cases should pick up the index; if you create an expression index using
CREATE INDEX test_idx ((a COLLATE "foo")), they won't, except the first
one.  (That is modulo your earlier demonstration that it apparently
doesn't work at all for you, but I'm just saying in case you want to try
out various ways of changing this.)




Re: Theory of operation of collation patch

От
Peter Eisentraut
Дата:
On mån, 2011-03-07 at 19:15 -0500, Tom Lane wrote:
> Or, to put it another way: the properties that define a sort order are
> the sort comparison operator, the collation, the ASC/DESC bit, and the
> NULLS FIRST/LAST bit.  Given the way that the SQL committee has
> constructed the language, the operator and the two flag bits are
> attached to the ORDER BY clause, but the collation is a property of
> the expression-to-be-sorted.  If we fail to preserve that distinction
> in the internal representation, we're just creating problems for
> ourselves.

You are probably right; it could be simplified.




Re: Theory of operation of collation patch

От
Tom Lane
Дата:
Peter Eisentraut <peter_e@gmx.net> writes:
> On mån, 2011-03-07 at 12:52 -0500, Tom Lane wrote:
>> It looks like indcollation is acting as a
>> substitute for including a CollateClause in the index key expression,
>> which doesn't seem like a particularly good tradeoff considering all
>> the overhead you must introduce into the default case.

> Conceptually, this characterization is correct.  But assuming you do
> away with the indcollation column and instead create expression indexes
> on (col COLLATE "x"), you'd then have an interesting time matching those
> expressions to expressions in the query when choosing an index.

Hmm ... interesting point, but I think it already doesn't work in all
cases for expression indexes.  There is an analogous issue for
binary-compatible datatypes, which we hack around by stripping
RelabelType nodes before doing this type of comparison.  Maybe we'll
have to do something similar with CollateClauses.  I never much cared
for that hack, though ... wonder if there is a cleaner way.
        regards, tom lane


Re: Theory of operation of collation patch

От
Tom Lane
Дата:
Another interesting item ... I see that you added a collation field to
TypeName, apparently on the grounds that the SQL spec includes collation
in <data type>.  However, it seems to me that that is nonsense up with
which we should not put.  <data type> is basically only used in CAST and
column definitions (plus some features related to methods, which we
haven't got).  In CAST, they have to have a syntax rule stating that you
must not use COLLATE in the target type name.  In column definitions,
they call out <collate clause> as a separate syntactic element, and then
they have to have a syntax rule stating that you can't use both that
and a <collate clause> embedded in the type name.  This is just stupid.
And it conflates the notion of collation with that of type --- based on
our discussions so far, it seems like it'd be a lot better to keep those
separate.

I think we should drop <collate clause> from TypeName and just have it
in columnDef and the expression syntax.  This might also ease the
ambiguity problem that evidently led you to restrict the expression
production's argument to c_expr.  It would also allow us to meet the
letter of the spec for <column definition>, in that <collate clause>
is not required to immediately follow <data type>.  I see that right now
you don't allow <collate clause> in ColQualList, so it's not accepting
syntax that the spec says is valid; and you really can't fix that as long
as <collate clause> can be part of TypeName, because then there would be
two ways to parse the case where COLLATE does immediately follow the
type name.

Thoughts?
        regards, tom lane


Re: Theory of operation of collation patch

От
Martijn van Oosterhout
Дата:
On Tue, Mar 08, 2011 at 08:52:41PM -0500, Tom Lane wrote:
> Another interesting item ... I see that you added a collation field to
> TypeName, apparently on the grounds that the SQL spec includes collation
> in <data type>.  However, it seems to me that that is nonsense up with
> which we should not put.  <data type> is basically only used in CAST and

<snip>

Sounds like a good plan.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Theory of operation of collation patch

От
Greg Stark
Дата:
On Wed, Mar 9, 2011 at 1:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Another interesting item ... I see that you added a collation field to
> TypeName, apparently on the grounds that the SQL spec includes collation
> in <data type>.  However, it seems to me that that is nonsense up with
> which we should not put.

The SQL committee has demonstrably awful taste but they're usually not
entirely nutty. Usually whatever they do they had a reason. Is there
any hint at all of what they were trying to accomplish here? When you
say "basically only used in CAST and column definitions" are there
other less common cases where it's convenient?

Or was this just a case of some existing database allowed COLLATE
clauses in column definitions as part of the type and they preferred
to have it in a different syntax so they did this to allow either to
work? If we allow <collate clause> in ColQualList are we covering the
very case they were trying to deal with?

Can you give an example of what a column definition would look like if
you put the COLLATE clause in the <data type> in a way that wouldn't
be parsed according to your plan?

--
greg


Re: Theory of operation of collation patch

От
Tom Lane
Дата:
Greg Stark <gsstark@mit.edu> writes:
> Can you give an example of what a column definition would look like if
> you put the COLLATE clause in the <data type> in a way that wouldn't
> be parsed according to your plan?

Column definitions look and act the same.  The point of the change is to
not accept COLLATE in all the places where type names are used that are
*not* column definition lists.  The original patch accepted nonsense like

create cast (text collate "C" as text collate "de_DE") with ...

and we'd have had to do quite a lot of additional hacking to plug those
holes.
        regards, tom lane


Re: Theory of operation of collation patch

От
Peter Eisentraut
Дата:
On tis, 2011-03-08 at 20:52 -0500, Tom Lane wrote:
> I think we should drop <collate clause> from TypeName and just have it
> in columnDef and the expression syntax.

Yes, that sounds better in retrospect.  It's easier to see that now that
we see all the cases where it's used and not used.

> This might also ease the
> ambiguity problem that evidently led you to restrict the expression
> production's argument to c_expr.

Maybe, but I seem to recall that I did actually check and concluded that
c_expr covers all cases where <collate clause> is allowed.  We could of
course allow more cases, but maybe it's not necessary.

> It would also allow us to meet the
> letter of the spec for <column definition>, in that <collate clause>
> is not required to immediately follow <data type>.

Note that that case is listed under a separate feature.  I'm not sure if
it's worth supporting, but if they bothered putting it in it's probably
for compatibility with some existing implementation.