Обсуждение: pgsql: Collations with nondeterministic comparison

Поиск
Список
Период
Сортировка

pgsql: Collations with nondeterministic comparison

От
Peter Eisentraut
Дата:
Collations with nondeterministic comparison

This adds a flag "deterministic" to collations.  If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal.  That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.

This functionality is only supported with the ICU provider.  At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.

The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).

This patch makes changes in three areas:

- CREATE COLLATION DDL changes and system catalog changes to support
  this new flag.

- Many executor nodes and auxiliary code are extended to track
  collations.  Previously, this code would just throw away collation
  information, because the eventually-called user-defined functions
  didn't use it since they only cared about equality, which didn't
  need collation information.

- String data type functions that do equality comparisons and hashing
  are changed to take the (non-)deterministic flag into account.  For
  comparison, this just means skipping various shortcuts and tie
  breakers that use byte-wise comparison.  For hashing, we first need
  to convert the input string to a canonical "sort key" using the ICU
  analogue of strxfrm().

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/5e1963fb764e9cc092e0f7b58b28985c311431d9

Modified Files
--------------
contrib/bloom/bloom.h                            |   1 +
contrib/bloom/blutils.c                          |   3 +-
doc/src/sgml/catalogs.sgml                       |   7 +
doc/src/sgml/charset.sgml                        |  61 +-
doc/src/sgml/citext.sgml                         |  21 +
doc/src/sgml/func.sgml                           |   6 +
doc/src/sgml/ref/create_collation.sgml           |  22 +
src/backend/access/hash/hashfunc.c               | 100 +++-
src/backend/access/spgist/spgtextproc.c          |   3 +-
src/backend/catalog/pg_collation.c               |   2 +
src/backend/commands/collationcmds.c             |  25 +-
src/backend/commands/extension.c                 |   6 +-
src/backend/executor/execExpr.c                  |   4 +-
src/backend/executor/execGrouping.c              |  14 +-
src/backend/executor/execPartition.c             |   1 +
src/backend/executor/execReplication.c           |   5 +-
src/backend/executor/nodeAgg.c                   |   9 +-
src/backend/executor/nodeGroup.c                 |   1 +
src/backend/executor/nodeHash.c                  |  14 +-
src/backend/executor/nodeHashjoin.c              |   5 +
src/backend/executor/nodeRecursiveunion.c        |   1 +
src/backend/executor/nodeSetOp.c                 |   2 +
src/backend/executor/nodeSubplan.c               |  14 +-
src/backend/executor/nodeUnique.c                |   1 +
src/backend/executor/nodeWindowAgg.c             |   2 +
src/backend/nodes/copyfuncs.c                    |   7 +
src/backend/nodes/outfuncs.c                     |   7 +
src/backend/nodes/readfuncs.c                    |   7 +
src/backend/optimizer/plan/createplan.c          |  54 +-
src/backend/optimizer/util/tlist.c               |  25 +
src/backend/partitioning/partbounds.c            |   4 +-
src/backend/partitioning/partprune.c             |   3 +-
src/backend/regex/regc_pg_locale.c               |   5 +
src/backend/utils/adt/arrayfuncs.c               |   2 +-
src/backend/utils/adt/like.c                     |  27 +-
src/backend/utils/adt/like_support.c             |  14 +
src/backend/utils/adt/name.c                     |  32 +-
src/backend/utils/adt/orderedsetaggs.c           |   3 +-
src/backend/utils/adt/pg_locale.c                |   1 +
src/backend/utils/adt/ri_triggers.c              |  33 +-
src/backend/utils/adt/varchar.c                  | 194 ++++++-
src/backend/utils/adt/varlena.c                  | 333 +++++++----
src/backend/utils/cache/catcache.c               |   9 +-
src/backend/utils/cache/lsyscache.c              |  16 +
src/bin/initdb/initdb.c                          |   4 +-
src/bin/pg_dump/pg_dump.c                        |  39 +-
src/bin/psql/describe.c                          |  17 +-
src/include/catalog/catversion.h                 |   2 +-
src/include/catalog/pg_collation.h               |   2 +
src/include/executor/executor.h                  |   4 +
src/include/executor/hashjoin.h                  |   1 +
src/include/executor/nodeHash.h                  |   2 +-
src/include/nodes/execnodes.h                    |   3 +
src/include/nodes/plannodes.h                    |   7 +
src/include/optimizer/planmain.h                 |   2 +-
src/include/optimizer/tlist.h                    |   1 +
src/include/partitioning/partbounds.h            |   1 +
src/include/utils/lsyscache.h                    |   1 +
src/include/utils/pg_locale.h                    |   1 +
src/test/regress/expected/collate.icu.utf8.out   | 710 +++++++++++++++++++++++
src/test/regress/expected/collate.linux.utf8.out |   5 +
src/test/regress/expected/collate.out            |  15 +
src/test/regress/expected/subselect.out          |  19 +
src/test/regress/sql/collate.icu.utf8.sql        | 250 ++++++++
src/test/regress/sql/collate.linux.utf8.sql      |   7 +
src/test/regress/sql/collate.sql                 |   5 +
src/test/regress/sql/subselect.sql               |  17 +
src/test/subscription/Makefile                   |   2 +
src/test/subscription/t/012_collation.pl         | 103 ++++
69 files changed, 2087 insertions(+), 239 deletions(-)


Re: pgsql: Collations with nondeterministic comparison

От
Tom Lane
Дата:
Peter Eisentraut <peter@eisentraut.org> writes:
> Collations with nondeterministic comparison

Buildfarm member tern doesn't like this patch.  I think the issue is
probably that the 012_collation.pl test script isn't being sufficiently
careful to force UTF8 database encoding.

            regards, tom lane


Re: pgsql: Collations with nondeterministic comparison

От
Christoph Berg
Дата:
Re: Tom Lane 2019-03-22 <17125.1553282378@sss.pgh.pa.us>
> Peter Eisentraut <peter@eisentraut.org> writes:
> > Collations with nondeterministic comparison
> 
> Buildfarm member tern doesn't like this patch.  I think the issue is
> probably that the 012_collation.pl test script isn't being sufficiently
> careful to force UTF8 database encoding.

Debian unstable is also unhappy as of b5fd4972a:

LC_ALL=POSIX

19:19:23 t/010_truncate.pl ..... ok
19:19:25 # Looks like your test exited with 29 before it could output anything.
19:19:25 t/012_collation.pl ....
19:19:25 Dubious, test returned 29 (wstat 7424, 0x1d00)
19:19:25 Failed 2/2 subtests

https://pgdgbuild.dus.dg-i.net/job/postgresql-12-binaries/architecture=amd64,distribution=sid/639/console

Christoph


Re: pgsql: Collations with nondeterministic comparison

От
Tom Lane
Дата:
Peter Eisentraut <peter@eisentraut.org> writes:
> Collations with nondeterministic comparison

Buildfarm member snapper doesn't like this.  I think the problem is
that the queries it's failing on do not bother to constrain their
output row order.  Adding an "EXPLAIN" right there indicates that
the INTERSECTs are being done via hashing, meaning that platform
specific output ordering is to be expected.  I don't have a
big-endian platform with ICU installed, but if I did I bet I could
replicate the failure on it.

            regards, tom lane


Re: pgsql: Collations with nondeterministic comparison

От
Peter Eisentraut
Дата:
On 2019-03-25 06:07, Tom Lane wrote:
> Buildfarm member snapper doesn't like this.  I think the problem is
> that the queries it's failing on do not bother to constrain their
> output row order.  Adding an "EXPLAIN" right there indicates that
> the INTERSECTs are being done via hashing, meaning that platform
> specific output ordering is to be expected.  I don't have a
> big-endian platform with ICU installed, but if I did I bet I could
> replicate the failure on it.

pushed a fix

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services