Обсуждение: Windows UTF-8, non-ICU collation trouble

Поиск
Список
Период
Сортировка

Windows UTF-8, non-ICU collation trouble

От
Noah Misch
Дата:
We use system UTF-16 collation to implement UTF-8 collation on Windows.  The
PostgreSQL security team received a report, from Timothy Kuun, that this
collation does not uphold the "symmetric law" and "transitive law" that we
require for btree operator classes.  The attached test program demonstrates
this.  http://www.delphigroups.info/2/62/478610.html quotes reports of that
problem going back eighteen years.  Most code points are unaffected.  Indexing
an affected code point using such a collation can cause btree index scans to not
find a row they should find and can make a UNIQUE or PRIMARY KEY constraint
admit a duplicate.  The security team determined that this doesn't qualify as a
security vulnerability, but it's still a bug.

All I can think to do is issue a warning whenever a CREATE DATABASE or CREATE
COLLATION combines UTF8 encoding with a locale having this problem.  In a
greenfield, I would forbid affected combinations of encoding and locale.  That
is too harsh, considering the few code points affected and the difficulty of
changing the collation of existing databases.  For CREATE DATABASE, all except
LOCALE=C would trigger the warning.  For CREATE COLLATION, ICU locales would
also not trigger the warning.  Hence, the chief workaround is to use LOCALE=C at
the database level and ICU collations for indexes and operator invocations.
(The ability to use an ICU collation at the database level would improve the
user experience here.)  Better ideas?

Вложения

Re: Windows UTF-8, non-ICU collation trouble

От
Thomas Munro
Дата:
On Fri, Dec 6, 2019 at 7:34 PM Noah Misch <noah@leadboat.com> wrote:
> We use system UTF-16 collation to implement UTF-8 collation on Windows.  The
> PostgreSQL security team received a report, from Timothy Kuun, that this
> collation does not uphold the "symmetric law" and "transitive law" that we
> require for btree operator classes.  The attached test program demonstrates
> this.  http://www.delphigroups.info/2/62/478610.html quotes reports of that
> problem going back eighteen years.  Most code points are unaffected.  Indexing
> an affected code point using such a collation can cause btree index scans to not
> find a row they should find and can make a UNIQUE or PRIMARY KEY constraint
> admit a duplicate.  The security team determined that this doesn't qualify as a
> security vulnerability, but it's still a bug.

Huh.  Does this apply in modern times?  Since Windows 10, I thought
they adopted[1] CLDR data to drive that, the same definitions used (or
somewhere in the process of being adopted by) GNU, Illumos, FreeBSD
etc.  Basically, everyone gave up on trying to own this rats nest of a
problem and deferred to the experts.  If you can still get
index-busting behaviour out of modern Windows collations, wouldn't
that be a bug that someone can file against SQL Server, Windows etc
and get fixed?

[1] https://blogs.msdn.microsoft.com/shawnste/2015/08/29/locale-data-in-windows-10-cldr/



Re: Windows UTF-8, non-ICU collation trouble

От
Noah Misch
Дата:
On Fri, Dec 06, 2019 at 07:56:08PM +1300, Thomas Munro wrote:
> On Fri, Dec 6, 2019 at 7:34 PM Noah Misch <noah@leadboat.com> wrote:
> > We use system UTF-16 collation to implement UTF-8 collation on Windows.  The
> > PostgreSQL security team received a report, from Timothy Kuun, that this
> > collation does not uphold the "symmetric law" and "transitive law" that we
> > require for btree operator classes.  The attached test program demonstrates
> > this.  http://www.delphigroups.info/2/62/478610.html quotes reports of that
> > problem going back eighteen years.  Most code points are unaffected.  Indexing
> > an affected code point using such a collation can cause btree index scans to not
> > find a row they should find and can make a UNIQUE or PRIMARY KEY constraint
> > admit a duplicate.  The security team determined that this doesn't qualify as a
> > security vulnerability, but it's still a bug.
> 
> Huh.  Does this apply in modern times?  Since Windows 10, I thought
> they adopted[1] CLDR data to drive that, the same definitions used (or
> somewhere in the process of being adopted by) GNU, Illumos, FreeBSD
> etc.  Basically, everyone gave up on trying to own this rats nest of a
> problem and deferred to the experts.

Based on my test program, it applies to Windows Server 2016.  I didn't test
newer versions.

> If you can still get
> index-busting behaviour out of modern Windows collations, wouldn't
> that be a bug that someone can file against SQL Server, Windows etc
> and get fixed?

Perhaps.  I wouldn't have high hopes, given the behavior's long tenure and the
risk of breaking a different set of applications.



Re: Windows UTF-8, non-ICU collation trouble

От
Thomas Munro
Дата:
On Fri, Dec 6, 2019 at 8:33 PM Noah Misch <noah@leadboat.com> wrote:
> On Fri, Dec 06, 2019 at 07:56:08PM +1300, Thomas Munro wrote:
> > On Fri, Dec 6, 2019 at 7:34 PM Noah Misch <noah@leadboat.com> wrote:
> > > We use system UTF-16 collation to implement UTF-8 collation on Windows.  The
> > > PostgreSQL security team received a report, from Timothy Kuun, that this
> > > collation does not uphold the "symmetric law" and "transitive law" that we
> > > require for btree operator classes.  The attached test program demonstrates
> > > this.  http://www.delphigroups.info/2/62/478610.html quotes reports of that
> > > problem going back eighteen years.  Most code points are unaffected.  Indexing
> > > an affected code point using such a collation can cause btree index scans to not
> > > find a row they should find and can make a UNIQUE or PRIMARY KEY constraint
> > > admit a duplicate.  The security team determined that this doesn't qualify as a
> > > security vulnerability, but it's still a bug.
> >
> > Huh.  Does this apply in modern times?  Since Windows 10, I thought
> > they adopted[1] CLDR data to drive that, the same definitions used (or
> > somewhere in the process of being adopted by) GNU, Illumos, FreeBSD
> > etc.  Basically, everyone gave up on trying to own this rats nest of a
> > problem and deferred to the experts.
>
> Based on my test program, it applies to Windows Server 2016.  I didn't test
> newer versions.

I ran a variation of your program on Appveyor's Studio/Server 2019
image, and the result was the same: it thinks that cmp(s1, s2) == 0,
cmp(s2, s3) == 0, but cmp(s1, s3) == 1, so the operator fails to be
transitive.

These strings include combining characters probably used in a weird
way[1][2], and they confuse my terminal software.  Although they don't
all agree, the 3 other implementations I tried gave self-consistent
answers.  All of the below expressions are true:

create table t as select e'\u11a7\u1188\ud7a2' s1,
e'\u11a7\ud7a2\u1188' s2, e'\ud7a2\u11a7\u1188' s3;
select s1 > s2 collate "ko-x-icu", s2 > s3 collate "ko-x-icu", s1 > s3
collate "ko-x-icu" from t; [ICU]
select s1 > s2 collate "ko_KR.UTF-8", s2 > s3 collate "ko_KR.UTF-8",
s1 > s3 collate "ko_KR.UTF-8" from t; [FreeBSD]
select s1 < s2 collate "ko_KR.utf8", s2 < s3 collate "ko_KR.utf8", s1
< s3 collate "ko_KR.utf8" from t; [GNU]

> > If you can still get
> > index-busting behaviour out of modern Windows collations, wouldn't
> > that be a bug that someone can file against SQL Server, Windows etc
> > and get fixed?
>
> Perhaps.  I wouldn't have high hopes, given the behavior's long tenure and the
> risk of breaking a different set of applications.

I found a SQL Server test website[3] and tried to get it to do
something strange, using "Windows" collations (the ones that are
supposed to be compatible with CompareString() AKA strcoll(), much
like our "libc" provider).  For Latin1_General_100_CI_AS_SC_UTF8 and
Korean_100_CS_AS it insisted that cmp(s1, s2) == 1, cmp(s2, s3) == 0,
cmp(s1, s3) == 1, while for Korean_90_CS_AS it said -1, 0, -1, all
self-consistent answers, matching neither your results nor the results
of other implementations.  Perhaps my query is wrong, or perhaps I
don't understand how to find the right collation name to break
something, or perhaps it does some kind of normalisation or other
transformation on the strings.  Clearly I failed to access the same
code that your sort-locale.c test reaches.

create table t (s1 nvarchar(10) collate Korean_100_CS_AS,
                s2 nvarchar(10) collate Korean_100_CS_AS,
                s3 nvarchar(10) collate Korean_100_CS_AS);

insert into t values (concat(nchar(0x11a7), nchar(0x1188), nchar(0xd7a2)),
                      concat(nchar(0x11a7), nchar(0xd7a2), nchar(0x1188)),
                      concat(nchar(0xd7a2), nchar(0x11a7), nchar(0x1188)));

select case when s1 < s2 then -1 when s1 > s2 then 1 else 0 end,
       case when s2 < s3 then -1 when s2 > s3 then 1 else 0 end,
       case when s1 < s3 then -1 when s1 > s3 then 1 else 0 end
  from t;

[1] https://en.wikipedia.org/wiki/List_of_Hangul_jamo
[2] https://www.unicode.org/faq/korean.html
[3] https://sqltest.net/#927795



Re: Windows UTF-8, non-ICU collation trouble

От
Noah Misch
Дата:
On Tue, Dec 10, 2019 at 03:41:15PM +1300, Thomas Munro wrote:
> On Fri, Dec 6, 2019 at 8:33 PM Noah Misch <noah@leadboat.com> wrote:
> > On Fri, Dec 06, 2019 at 07:56:08PM +1300, Thomas Munro wrote:
> > > On Fri, Dec 6, 2019 at 7:34 PM Noah Misch <noah@leadboat.com> wrote:
> > > > We use system UTF-16 collation to implement UTF-8 collation on Windows.  The
> > > > PostgreSQL security team received a report, from Timothy Kuun, that this
> > > > collation does not uphold the "symmetric law" and "transitive law" that we
> > > > require for btree operator classes.  The attached test program demonstrates
> > > > this.  http://www.delphigroups.info/2/62/478610.html quotes reports of that
> > > > problem going back eighteen years.  Most code points are unaffected.  Indexing
> > > > an affected code point using such a collation can cause btree index scans to not
> > > > find a row they should find and can make a UNIQUE or PRIMARY KEY constraint
> > > > admit a duplicate.  The security team determined that this doesn't qualify as a
> > > > security vulnerability, but it's still a bug.
> > >
> > > Huh.  Does this apply in modern times?  Since Windows 10, I thought
> > > they adopted[1] CLDR data to drive that, the same definitions used (or
> > > somewhere in the process of being adopted by) GNU, Illumos, FreeBSD
> > > etc.  Basically, everyone gave up on trying to own this rats nest of a
> > > problem and deferred to the experts.
> >
> > Based on my test program, it applies to Windows Server 2016.  I didn't test
> > newer versions.
> 
> I ran a variation of your program on Appveyor's Studio/Server 2019
> image, and the result was the same: it thinks that cmp(s1, s2) == 0,
> cmp(s2, s3) == 0, but cmp(s1, s3) == 1, so the operator fails to be
> transitive.

If that test is captured in self-contained artifacts (a few config files, a
public git repository, etc.), could you share them?  If not, no need to
assemble such artifacts.  I probably won't use them, but I'd be curious to
browse them if you've already assembled them.

> > > If you can still get
> > > index-busting behaviour out of modern Windows collations, wouldn't
> > > that be a bug that someone can file against SQL Server, Windows etc
> > > and get fixed?
> >
> > Perhaps.  I wouldn't have high hopes, given the behavior's long tenure and the
> > risk of breaking a different set of applications.
> 
> I found a SQL Server test website[3] and tried to get it to do
> something strange, using "Windows" collations (the ones that are
> supposed to be compatible with CompareString() AKA strcoll(), much
> like our "libc" provider).  For Latin1_General_100_CI_AS_SC_UTF8 and
> Korean_100_CS_AS it insisted that cmp(s1, s2) == 1, cmp(s2, s3) == 0,
> cmp(s1, s3) == 1, while for Korean_90_CS_AS it said -1, 0, -1, all
> self-consistent answers, matching neither your results nor the results
> of other implementations.

This does suggest some set of CompareString* parameters is free from the
problem.  If that's right, we could offer collations based on that.  (I'm not
sure it would be worth offering; ICU may be enough.)

Thanks for this extensive testing.



Re: Windows UTF-8, non-ICU collation trouble

От
Thomas Munro
Дата:
On Tue, Dec 10, 2019 at 10:29 PM Noah Misch <noah@leadboat.com> wrote:
> On Tue, Dec 10, 2019 at 03:41:15PM +1300, Thomas Munro wrote:
> > I ran a variation of your program on Appveyor's Studio/Server 2019
> > image, and the result was the same: it thinks that cmp(s1, s2) == 0,
> > cmp(s2, s3) == 0, but cmp(s1, s3) == 1, so the operator fails to be
> > transitive.
>
> If that test is captured in self-contained artifacts (a few config files, a
> public git repository, etc.), could you share them?  If not, no need to
> assemble such artifacts.  I probably won't use them, but I'd be curious to
> browse them if you've already assembled them.

https://ci.appveyor.com/project/macdice/locale-sort
https://github.com/macdice/locale-sort

To understand which operating systems the images mentioned in
appveyor.yml correspond to:

https://www.appveyor.com/docs/windows-images-software/

> This does suggest some set of CompareString* parameters is free from the
> problem.  If that's right, we could offer collations based on that.  (I'm not
> sure it would be worth offering; ICU may be enough.)

It would be nice to get to the bottom of that (for example, what is
the relationship between names like "Korean_XXX" and names like
"ko-KR"?), but I'm unlikely to investigate further (I have enough
trouble getting N kinds of Unix to do what I want).  Generally I like
the idea of continuing to support and recommend both operating system
and ICU locales for different use cases.  It should be easy to get all
the software on your system to agree on ordering, which seems like a
thing you should want as an application designer.  The lack of
versioning is not a problem on Windows (see
https://commitfest.postgresql.org/26/2351/).



Re: Windows UTF-8, non-ICU collation trouble

От
Noah Misch
Дата:
On Wed, Dec 11, 2019 at 01:54:47PM +1300, Thomas Munro wrote:
> On Tue, Dec 10, 2019 at 10:29 PM Noah Misch <noah@leadboat.com> wrote:
> > This does suggest some set of CompareString* parameters is free from the
> > problem.  If that's right, we could offer collations based on that.  (I'm not
> > sure it would be worth offering; ICU may be enough.)
> 
> It would be nice to get to the bottom of that (for example, what is
> the relationship between names like "Korean_XXX" and names like
> "ko-KR"?), but I'm unlikely to investigate further (I have enough
> trouble getting N kinds of Unix to do what I want).  Generally I like
> the idea of continuing to support and recommend both operating system
> and ICU locales for different use cases.  It should be easy to get all
> the software on your system to agree on ordering, which seems like a
> thing you should want as an application designer.  The lack of
> versioning is not a problem on Windows (see
> https://commitfest.postgresql.org/26/2351/).

I explored this further in a Windows-specific forum:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/7682927d-d0f3-4278-a0cc-3fefc2ca4b65

Key points relevant to PostgreSQL:
- SQL Server contains its own snapshot of string collation data, so its
  sorting is independent of Windows APIs on the machine hosting SQL Server.
  The source of the snapshot is long-past Windows API behavior.
- Passing older NLS version values to Windows string comparison functions does
  change collation of these particular test strings.  On my Windows Server
  2016 system, two older NLS versions were available, both of which exhibited
  transitivity for these strings.

Looking at my original proposal in that light:

On Thu, Dec 05, 2019 at 10:34:01PM -0800, Noah Misch wrote:
> All I can think to do is issue a warning whenever a CREATE DATABASE or CREATE
> COLLATION combines UTF8 encoding with a locale having this problem.  In a
> greenfield, I would forbid affected combinations of encoding and locale.  That
> is too harsh, considering the few code points affected and the difficulty of
> changing the collation of existing databases.

That's still the best I can think to do.

> For CREATE DATABASE, all except
> LOCALE=C would trigger the warning.  For CREATE COLLATION, ICU locales would
> also not trigger the warning.  Hence, the chief workaround is to use LOCALE=C at
> the database level and ICU collations for indexes and operator invocations.

While that is true for Windows Server 2016, it looks like Windows 7 and
Windows Server 2003 wouldn't trigger the warning.  (That's of rapidly decaying
interest, of course.)  If someone took your "Collation versions on Windows"
patch a step further and allowed "CREATE COLLATION x (LOCALE = 'de-DE',
VERSION = 'a.b,c.d')", one could specify the use of Windows 7 collation order
on Windows Server 2016, removing the problem (and stopping the warning).