Обсуждение: Theory of operation of collation patch
Is there any documentation of $SUBJECT? Because the more I look at this patch the more I think it's misdesigned; either that or I fundamentally misunderstand what it's trying to do. I complained yesterday about why the planner wasn't using indcollation to identify the sort order of an index. I now think that the reason it doesn't obviously fail to fail is that indcollation is dead code, and so is approximately 99% of what you added to the planner, because two expressions that are equal() must necessarily have the same collation property. Tracking the collation as a separate property of a pathkey is thus a useless activity. If this conclusion isn't correct, please explain why not. regards, tom lane
On mån, 2011-03-07 at 11:43 -0500, Tom Lane wrote: > Is there any documentation of $SUBJECT? Because the more I look at > this patch the more I think it's misdesigned; either that or I > fundamentally misunderstand what it's trying to do. I complained > yesterday about why the planner wasn't using indcollation to identify > the sort order of an index. I now think that the reason it doesn't > obviously fail to fail is that indcollation is dead code, and so is > approximately 99% of what you added to the planner, because two > expressions that are equal() must necessarily have the same collation > property. Tracking the collation as a separate property of a pathkey > is thus a useless activity. If this conclusion isn't correct, please > explain why not. I'll have to check into these details, but here is a test case that shows that it's doing something with the collation of an index: DROP TABLE test; CREATE TABLE test (a text); CREATE INDEX test_1 ON test (a); CREATE INDEX test_d ON test (a COLLATE "de_DE"); CREATE INDEX test_e ON test (a COLLATE "es_ES"); CREATE INDEX test_f ON test (a COLLATE "fr_FR"); SET enable_seqscan TO off; SET enable_bitmapscan TO off; EXPLAIN SELECT * FROM test WHERE a > 'foo'; QUERY PLAN -----------------------------------------------------------------------Index Scan using test_1 on test (cost=0.00..51.90rows=437 width=32) Index Cond: (a > 'foo'::text) (2 rows) EXPLAIN SELECT * FROM test WHERE a > 'foo' COLLATE "de_DE"; QUERY PLAN -----------------------------------------------------------------------Index Scan using test_d on test (cost=0.00..51.90rows=437 width=32) Index Cond: (a > ('foo'::text COLLATE "de_DE")) (2 rows) EXPLAIN SELECT * FROM test WHERE a > 'foo' COLLATE "es_ES"; QUERY PLAN -----------------------------------------------------------------------Index Scan using test_e on test (cost=0.00..51.90rows=437 width=32) Index Cond: (a > ('foo'::text COLLATE "es_ES")) (2 rows) EXPLAIN SELECT * FROM test WHERE a > 'foo' COLLATE "fr_FR"; QUERY PLAN -----------------------------------------------------------------------Index Scan using test_f on test (cost=0.00..51.90rows=437 width=32) Index Cond: (a > ('foo'::text COLLATE "fr_FR")) (2 rows)
Peter Eisentraut <peter_e@gmx.net> writes: > On mån, 2011-03-07 at 11:43 -0500, Tom Lane wrote: >> ... I now think that the reason it doesn't >> obviously fail to fail is that indcollation is dead code, and so is >> approximately 99% of what you added to the planner, because two >> expressions that are equal() must necessarily have the same collation >> property. Tracking the collation as a separate property of a pathkey >> is thus a useless activity. If this conclusion isn't correct, please >> explain why not. > I'll have to check into these details, but here is a test case that > shows that it's doing something with the collation of an index: [ pokes at it some more... ] It looks like indcollation is acting as a substitute for including a CollateClause in the index key expression, which doesn't seem like a particularly good tradeoff considering all the overhead you must introduce into the default case. But more to the point, your examples do *not* work for me. I can reproduce both failing to use an index that should work, and selecting an index that doesn't work: d1u=# CREATE TABLE test (a text); CREATE TABLE d1u=# CREATE INDEX test_1 ON test (a); CREATE INDEX d1u=# CREATE INDEX test_d ON test (a COLLATE "de_DE"); CREATE INDEX d1u=# CREATE INDEX test_e ON test (a COLLATE "es_ES"); CREATE INDEX d1u=# CREATE INDEX test_f ON test (a COLLATE "fr_FR"); CREATE INDEX d1u=# CREATE INDEX test_fz ON test ((a||'z') COLLATE "fr_FR"); CREATE INDEX d1u=# explain select * from test order by a; QUERY PLAN ------------------------------------------------------------------------Index Scan using test_f on test (cost=0.00..63.90rows=1310 width=32) (1 row) d1u=# explain select * from test order by a collate "fr_FR"; QUERY PLAN ---------------------------------------------------------------Sort (cost=90.93..94.20 rows=1310 width=32) Sort Key: ((aCOLLATE "fr_FR")) -> Seq Scan on test (cost=0.00..23.10 rows=1310 width=32) (3 rows) d1u=# set enable_seqscan TO 0; SET d1u=# explain select * from test order by a collate "fr_FR"; QUERY PLAN ----------------------------------------------------------------------------------Sort (cost=10000000090.93..10000000094.20rows=1310 width=32) Sort Key: ((a COLLATE "fr_FR")) -> Seq Scan on test (cost=10000000000.00..10000000023.10rows=1310 width=32) (3 rows) d1u=# explain select * from test order by a||'z'; QUERY PLAN -------------------------------------------------------------------------Index Scan using test_fz on test (cost=0.00..67.18rows=1310 width=32) (1 row) (This is in a database with encoding utf8 and lc_collate = c) regards, tom lane
On Mon, Mar 07, 2011 at 11:43:20AM -0500, Tom Lane wrote: > Is there any documentation of $SUBJECT? Because the more I look at > this patch the more I think it's misdesigned; either that or I > fundamentally misunderstand what it's trying to do. I complained > yesterday about why the planner wasn't using indcollation to identify > the sort order of an index. I now think that the reason it doesn't > obviously fail to fail is that indcollation is dead code, and so is > approximately 99% of what you added to the planner, because two > expressions that are equal() must necessarily have the same collation > property. Tracking the collation as a separate property of a pathkey > is thus a useless activity. If this conclusion isn't correct, please > explain why not. The collation is a property of the operators/functions and not of the values. An individual value does not have a collation, a column does. A pathkey represents a sort order, right? To define a sort order you need a collation and so the path key is the natural place to put it. Two path keys that differ only by the collation are *not* compatable, but obviously can be converted by a sort node. After the plan phase, the collation will be ignored in 99% of the executer, except for operators like =, < and > that need to look at it. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
Martijn van Oosterhout <kleptog@svana.org> writes: > On Mon, Mar 07, 2011 at 11:43:20AM -0500, Tom Lane wrote: >> Is there any documentation of $SUBJECT? > The collation is a property of the operators/functions and not of the > values. An individual value does not have a collation, a column does. OK. > A pathkey represents a sort order, right? To define a sort order you > need a collation and so the path key is the natural place to put it. Only if the expression-to-be-sorted does not already fully specify the collation, which so far as I can tell (either from the code or your description above) it does. I think that the explicit representation of collation as part of the PathKey node is unnecessary, inefficient, and bug-inducing --- the latter because it promotes fuzzy thinking about where the collation information is coming from. (And this isn't just hypothetical: IMO the bugs I exhibited upthread are *directly* due to fuzzy thinking about what defines an index's sort order.) Or, to put it another way: the properties that define a sort order are the sort comparison operator, the collation, the ASC/DESC bit, and the NULLS FIRST/LAST bit. Given the way that the SQL committee has constructed the language, the operator and the two flag bits are attached to the ORDER BY clause, but the collation is a property of the expression-to-be-sorted. If we fail to preserve that distinction in the internal representation, we're just creating problems for ourselves. I'm willing to take a pass at fixing this code during the alpha cycle, but I want to be sure I understand it correctly first. So, if there's a hole in my thinking, please point it out. regards, tom lane
On Mon, Mar 07, 2011 at 07:15:28PM -0500, Tom Lane wrote: > Only if the expression-to-be-sorted does not already fully specify the > collation, which so far as I can tell (either from the code or your > description above) it does. I think that the explicit representation > of collation as part of the PathKey node is unnecessary, inefficient, > and bug-inducing --- the latter because it promotes fuzzy thinking about > where the collation information is coming from. (And this isn't just > hypothetical: IMO the bugs I exhibited upthread are *directly* due to > fuzzy thinking about what defines an index's sort order.) Well, collation processing happens in two phases. Initially collate information is provided by the columns in the query, explicit clauses, etc. These are indeed attached to the values. From here the collations of expressions are determined. The SQL committee thought up a bunch of actually quite logical rules here. Explicit overrides implicit, implicit overrides default. Both explicit or both implicit is an error, etc. Note error state is only a problem if you use it for sorting or comparison, otherwise it is ignored. This phase of processing happens in the parse analysis, the end result being that every expression should have a collation set, and every operator where it matters has consistant collation information for its arguments. So at this point the collation should be attached to the expression. In the planning phase however, all collation information is ignored except where it matters, which is the comparison operators and ORDER BY and similar. By the previous phase all comparison operators should have a defined collation for its arguments and thus for itself, this allows it to construct appropriate pathkeys for index scans, etc. I havn't looked at the patch, perhaps it confuses the information in the two phases but the basic idea seems to me: - parse analysis - collation in expressions - planning - collation in path key > Or, to put it another way: the properties that define a sort order are > the sort comparison operator, the collation, the ASC/DESC bit, and the > NULLS FIRST/LAST bit. Given the way that the SQL committee has > constructed the language, the operator and the two flag bits are > attached to the ORDER BY clause, but the collation is a property of the > expression-to-be-sorted. If we fail to preserve that distinction in the > internal representation, we're just creating problems for ourselves. The assigning of the collation to expressions is simply the method for getting the collation information to the operators. Mainly because 99% of the time there will be no explicit clauses in the queries, so the information has to get there by other means. Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
On 07.03.2011 17:43, Tom Lane wrote: > because two expressions that are equal() must necessarily have the same collation > property. Peter, Tom, I am not able to see this. If 'abc' == 'abc' is not collation depending at all. It is only encoding depending. Collation is only needed for upper(), lower() and sorting. Means it tells if e.g. upper('i') will get Y or I. It tells if the German s-umlaut will be sorted together with 's' or after 'z'. Btw. the follows on implementing collations will be different - and I hope Peter is aware of it. My experience is that a huge follow will be that users will complain that the sorting isn't correct even when it is correct. Susanne -- Susanne Ebrecht - 2ndQuadrant PostgreSQL Development, 24x7 Support, Training and Services www.2ndQuadrant.com
Susanne Ebrecht <susanne@2ndQuadrant.com> writes: > On 07.03.2011 17:43, Tom Lane wrote: >> because two expressions that are equal() must necessarily have the same collation >> property. > Peter, Tom, > I am not able to see this. > If 'abc' == 'abc' is not collation depending at all. It is only > encoding depending. Sorry, I was using a term of art there. This isn't about data values, it's about parsed expression trees. See the equal() function in src/backend/nodes/equalfuncs.c. If two expression trees match according to equal(), then they must necessarily have the same collation property, because equal() compares all the fields and in particular the collation fields. regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes: > This phase of processing happens in the parse analysis, the end result > being that every expression should have a collation set, and every > operator where it matters has consistant collation information for its > arguments. So at this point the collation should be attached to the > expression. Right. > I havn't looked at the patch, perhaps it confuses the information in > the two phases but the basic idea seems to me: > - parse analysis - collation in expressions > - planning - collation in path key Not so right. A path key contains an expression tree, plus whatever *additional* information is needed to fully specify the sort ordering. If the collation is already fully determined by the expression tree, there is no need to duplicate that information in the PathKey node. And, as I said, doing so anyway has real negative consequences. regards, tom lane
On Tue, Mar 8, 2011 at 3:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Not so right. A path key contains an expression tree, plus whatever > *additional* information is needed to fully specify the sort ordering. > If the collation is already fully determined by the expression tree, > there is no need to duplicate that information in the PathKey node. > And, as I said, doing so anyway has real negative consequences. Isn't the reason to copy that information outside the expression so that we can choose sometimes to ignore it? Namely, for == we can use an index with any defined collation even if it doesn't match the collation in the pathkey we're looking for? I think currently that's the only example but in theory we could have collations that are "supersets" of the desired collation. For example a UTF8 collation that sorts english in the desired way and sorts utf8 characters in some way that isn't relevant to the query. -- greg
Greg Stark <gsstark@mit.edu> writes: > On Tue, Mar 8, 2011 at 3:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Not so right. �A path key contains an expression tree, plus whatever >> *additional* information is needed to fully specify the sort ordering. >> If the collation is already fully determined by the expression tree, >> there is no need to duplicate that information in the PathKey node. >> And, as I said, doing so anyway has real negative consequences. > Isn't the reason to copy that information outside the expression so > that we can choose sometimes to ignore it? Namely, for == we can use > an index with any defined collation even if it doesn't match the > collation in the pathkey we're looking for? No, that's nonsense. A PathKey is all about sort order. It is not relevant in any situation where sort order is ignorable. > I think currently that's the only example but in theory we could have > collations that are "supersets" of the desired collation. For example > a UTF8 collation that sorts english in the desired way and sorts utf8 > characters in some way that isn't relevant to the query. That sounds like nonsense as well. There is no such thing as a superset ordering: to have such a thing, you'd have to have some orderings that didn't fully determine the ordering of data values, which is pretty much unworkable. regards, tom lane
On mån, 2011-03-07 at 12:52 -0500, Tom Lane wrote: > It looks like indcollation is acting as a > substitute for including a CollateClause in the index key expression, > which doesn't seem like a particularly good tradeoff considering all > the overhead you must introduce into the default case. Conceptually, this characterization is correct. But assuming you do away with the indcollation column and instead create expression indexes on (col COLLATE "x"), you'd then have an interesting time matching those expressions to expressions in the query when choosing an index. Example: CREATE TABLE test (a text); CREATE INDEX test_idx ((a COLLATE "foo")); -- expression index -- easy to choose the index here SELECT * FROM test WHERE a COLLATE "foo" > 'x'; -- not so easy here SELECT * FROM test WHERE a > 'x' COLLATE "foo"; -- more difficult CREATE TABLE test2 (b text COLLATE "foo"); SELECT * FROM test WHERE x > (SELECT b FROM test2 WHERE something); Note that this situation already exists: If you create a plain column index using CREATE INDEX test_idx (a COLLATE "foo"), all of the above cases should pick up the index; if you create an expression index using CREATE INDEX test_idx ((a COLLATE "foo")), they won't, except the first one. (That is modulo your earlier demonstration that it apparently doesn't work at all for you, but I'm just saying in case you want to try out various ways of changing this.)
On mån, 2011-03-07 at 19:15 -0500, Tom Lane wrote: > Or, to put it another way: the properties that define a sort order are > the sort comparison operator, the collation, the ASC/DESC bit, and the > NULLS FIRST/LAST bit. Given the way that the SQL committee has > constructed the language, the operator and the two flag bits are > attached to the ORDER BY clause, but the collation is a property of > the expression-to-be-sorted. If we fail to preserve that distinction > in the internal representation, we're just creating problems for > ourselves. You are probably right; it could be simplified.
Peter Eisentraut <peter_e@gmx.net> writes: > On mån, 2011-03-07 at 12:52 -0500, Tom Lane wrote: >> It looks like indcollation is acting as a >> substitute for including a CollateClause in the index key expression, >> which doesn't seem like a particularly good tradeoff considering all >> the overhead you must introduce into the default case. > Conceptually, this characterization is correct. But assuming you do > away with the indcollation column and instead create expression indexes > on (col COLLATE "x"), you'd then have an interesting time matching those > expressions to expressions in the query when choosing an index. Hmm ... interesting point, but I think it already doesn't work in all cases for expression indexes. There is an analogous issue for binary-compatible datatypes, which we hack around by stripping RelabelType nodes before doing this type of comparison. Maybe we'll have to do something similar with CollateClauses. I never much cared for that hack, though ... wonder if there is a cleaner way. regards, tom lane
Another interesting item ... I see that you added a collation field to TypeName, apparently on the grounds that the SQL spec includes collation in <data type>. However, it seems to me that that is nonsense up with which we should not put. <data type> is basically only used in CAST and column definitions (plus some features related to methods, which we haven't got). In CAST, they have to have a syntax rule stating that you must not use COLLATE in the target type name. In column definitions, they call out <collate clause> as a separate syntactic element, and then they have to have a syntax rule stating that you can't use both that and a <collate clause> embedded in the type name. This is just stupid. And it conflates the notion of collation with that of type --- based on our discussions so far, it seems like it'd be a lot better to keep those separate. I think we should drop <collate clause> from TypeName and just have it in columnDef and the expression syntax. This might also ease the ambiguity problem that evidently led you to restrict the expression production's argument to c_expr. It would also allow us to meet the letter of the spec for <column definition>, in that <collate clause> is not required to immediately follow <data type>. I see that right now you don't allow <collate clause> in ColQualList, so it's not accepting syntax that the spec says is valid; and you really can't fix that as long as <collate clause> can be part of TypeName, because then there would be two ways to parse the case where COLLATE does immediately follow the type name. Thoughts? regards, tom lane
On Tue, Mar 08, 2011 at 08:52:41PM -0500, Tom Lane wrote: > Another interesting item ... I see that you added a collation field to > TypeName, apparently on the grounds that the SQL spec includes collation > in <data type>. However, it seems to me that that is nonsense up with > which we should not put. <data type> is basically only used in CAST and <snip> Sounds like a good plan. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
On Wed, Mar 9, 2011 at 1:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Another interesting item ... I see that you added a collation field to > TypeName, apparently on the grounds that the SQL spec includes collation > in <data type>. However, it seems to me that that is nonsense up with > which we should not put. The SQL committee has demonstrably awful taste but they're usually not entirely nutty. Usually whatever they do they had a reason. Is there any hint at all of what they were trying to accomplish here? When you say "basically only used in CAST and column definitions" are there other less common cases where it's convenient? Or was this just a case of some existing database allowed COLLATE clauses in column definitions as part of the type and they preferred to have it in a different syntax so they did this to allow either to work? If we allow <collate clause> in ColQualList are we covering the very case they were trying to deal with? Can you give an example of what a column definition would look like if you put the COLLATE clause in the <data type> in a way that wouldn't be parsed according to your plan? -- greg
Greg Stark <gsstark@mit.edu> writes: > Can you give an example of what a column definition would look like if > you put the COLLATE clause in the <data type> in a way that wouldn't > be parsed according to your plan? Column definitions look and act the same. The point of the change is to not accept COLLATE in all the places where type names are used that are *not* column definition lists. The original patch accepted nonsense like create cast (text collate "C" as text collate "de_DE") with ... and we'd have had to do quite a lot of additional hacking to plug those holes. regards, tom lane
On tis, 2011-03-08 at 20:52 -0500, Tom Lane wrote: > I think we should drop <collate clause> from TypeName and just have it > in columnDef and the expression syntax. Yes, that sounds better in retrospect. It's easier to see that now that we see all the cases where it's used and not used. > This might also ease the > ambiguity problem that evidently led you to restrict the expression > production's argument to c_expr. Maybe, but I seem to recall that I did actually check and concluded that c_expr covers all cases where <collate clause> is allowed. We could of course allow more cases, but maybe it's not necessary. > It would also allow us to meet the > letter of the spec for <column definition>, in that <collate clause> > is not required to immediately follow <data type>. Note that that case is listed under a separate feature. I'm not sure if it's worth supporting, but if they bothered putting it in it's probably for compatibility with some existing implementation.