Обсуждение: Changes to pg_dump/psql following collation "C" in the catalog
Hi, One consequence of using the "C" collation in the catalog versus the db collation is that pg_dump -t with a regexp may not find the same tables as before. It happens when these conditions are all met: - the collation of the database is not "C" - the regexp has locale-dependant parts - the names to match include characters that are sensitive to locale-dependant matching For instance a table named "rapport_journée_12" in an fr_FR.UTF-8 db used to be found by pg_dump -t 'rapport_\w+_\d+', and is now missed in the devel version. It seems that to fix that, we could qualify the references to columns such as "relname" and "schema_name" with COLLATE "default" clauses in the queries that use pattern-matching in client-side tools, AFAICS pg_dump and psql. Before going any further with this idea, is there agreement that it's an issue to address and does this look like the best way to do that? Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes: > One consequence of using the "C" collation in the catalog versus > the db collation is that pg_dump -t with a regexp may not find > the same tables as before. It happens when these conditions are > all met: > - the collation of the database is not "C" > - the regexp has locale-dependant parts > - the names to match include characters that are sensitive to > locale-dependant matching Hm, interesting. > It seems that to fix that, we could qualify the references to columns such > as "relname" and "schema_name" with COLLATE "default" clauses in the > queries that use pattern-matching in client-side tools, AFAICS > pg_dump and psql. Seems reasonable. I was initially worried that this might interfere with query optimization, but some experimentation says that the planner successfully derives prefix index clauses anyway (which is correct, because matching a fixed regex prefix doesn't depend on locale). It might be better to attach the COLLATE clause to the pattern constant instead of the column name; that'd be less likely to break if sent to an older server. > Before going any further with this idea, is there agreement that it's an > issue to address and does this look like the best way to do that? That is a question worth asking. We're going to be forcing people to get used to this when working directly in SQL, so I don't know if masking it in a subset of tools is really a win or not. regards, tom lane
Tom Lane wrote: > "Daniel Verite" <daniel@manitou-mail.org> writes: > > One consequence of using the "C" collation in the catalog versus > > the db collation is that pg_dump -t with a regexp may not find > > the same tables as before. It happens when these conditions are > > all met: > > - the collation of the database is not "C" > > - the regexp has locale-dependant parts > > - the names to match include characters that are sensitive to > > locale-dependant matching > > Hm, interesting. > > > It seems that to fix that, we could qualify the references to columns such > > as "relname" and "schema_name" with COLLATE "default" clauses in the > > queries that use pattern-matching in client-side tools, AFAICS > > pg_dump and psql. > > Seems reasonable. I was initially worried that this might interfere with > query optimization, but some experimentation says that the planner > successfully derives prefix index clauses anyway (which is correct, > because matching a fixed regex prefix doesn't depend on locale). > > It might be better to attach the COLLATE clause to the pattern constant > instead of the column name; that'd be less likely to break if sent to > an older server. > > > Before going any further with this idea, is there agreement that it's an > > issue to address and does this look like the best way to do that? > > That is a question worth asking. We're going to be forcing people to get > used to this when working directly in SQL, so I don't know if masking it > in a subset of tools is really a win or not. I think psql and pg_dump need to adjust, just like the 3rd party tools will, at least those that support collation-aware search in the catalog. PFA a patch that implements the slight changes needed. I'll add an entry for it in the next CF. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Вложения
>> "Daniel Verite" <daniel@manitou-mail.org> writes: >>> One consequence of using the "C" collation in the catalog versus >>> the db collation As an intrigued Person Following At Home, I was happy when I found this little three-message thread had more context in [1] and [2]. :) -Chap [1] https://postgr.es/m/15938.1544377821@sss.pgh.pa.us [2] https://postgr.es/m/5978.1544030694@sss.pgh.pa.us
"Daniel Verite" <daniel@manitou-mail.org> writes: > I think psql and pg_dump need to adjust, just like the 3rd party tools > will, at least those that support collation-aware search in the catalog. > PFA a patch that implements the slight changes needed. > I'll add an entry for it in the next CF. Hm, if that's as much as we have to touch, I think there's a good argument for squeezing it into v12 rather than waiting. The point here is mostly to avoid a behavior change from pre-v12, but if we allow v12 to have a different behavior, it's questionable whether we'd want v13 to change it back. Just looking at the patch, I wonder whether it doesn't need some server-version checks. At the very least this would break with pre-9.1 servers, which lack COLLATE altogether. regards, tom lane
Chapman Flack wrote: > >> "Daniel Verite" <daniel@manitou-mail.org> writes: > >>> One consequence of using the "C" collation in the catalog versus > >>> the db collation > > As an intrigued Person Following At Home, I was happy when I found > this little three-message thread had more context in [1] and [2]. :) > > -Chap > > [1] https://postgr.es/m/15938.1544377821@sss.pgh.pa.us > [2] https://postgr.es/m/5978.1544030694@sss.pgh.pa.us Yes. The concrete issue that the patch addresses can be illustrated with this example: psql (12devel) Type "help" for help. postgres=# show lc_collate ; lc_collate ------------- fr_FR.UTF-8 (1 row) postgres=# create table année_2018(); CREATE TABLE postgres=# \dt '\\w+_\\d+' psql: error: Did not find any relation named "\w+_\d+". In previous versions it would have found the table with the accent in the name. With 12devel it doesn't, because the match is done with the collation of the column, now "C", which does not consider the accented character to be a letter. This also affects pg_dump with the -t and -n switches that accept patterns and I think pretty much all \d* commands that accept patterns too. The purpose of the fix is for the patterns to give the same results as before. It's done by simply adding explicit collate clauses to use the collation of the database for these comparisons. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Tom Lane wrote: > Hm, if that's as much as we have to touch, I think there's a good > argument for squeezing it into v12 rather than waiting. The point > here is mostly to avoid a behavior change from pre-v12 Yes. I was mentioning the next CF because ISTM that nowadays non-committers are expected to file patches in there, committers picking up patches both in the current and next CF based on their evaluation of priorities. But if you plan to process this one shortly, a CF entry is probably superfluous. > Just looking at the patch, I wonder whether it doesn't need some > server-version checks. At the very least this would break with > pre-9.1 servers, which lack COLLATE altogether. PFA a new version adding the clause for only 12 and up, since the previous versions are not concerned, and as you mention, really old versions would fail otherwise. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Вложения
"Daniel Verite" <daniel@manitou-mail.org> writes: >> Hm, if that's as much as we have to touch, I think there's a good >> argument for squeezing it into v12 rather than waiting. The point >> here is mostly to avoid a behavior change from pre-v12 > Yes. I was mentioning the next CF because ISTM that nowadays > non-committers are expected to file patches in there, committers > picking up patches both in the current and next CF based on > their evaluation of priorities. Yeah, it's good practice to make a CF entry to ensure the patch doesn't slip through the cracks. There's an awful lot of traffic on pgsql-hackers these days... >> Just looking at the patch, I wonder whether it doesn't need some >> server-version checks. At the very least this would break with >> pre-9.1 servers, which lack COLLATE altogether. > PFA a new version adding the clause for only 12 and up, since the > previous versions are not concerned, and as you mention, really old > versions would fail otherwise. Pushed with some fiddling with the comments, and changing the collation names to be schema-qualified for paranoia's sake. regards, tom lane
Tom Lane wrote: > > PFA a new version adding the clause for only 12 and up, since the > > previous versions are not concerned, and as you mention, really old > > versions would fail otherwise. > > Pushed with some fiddling with the comments, and changing the collation > names to be schema-qualified for paranoia's sake. Thanks ! Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite