Обсуждение: SQL/JSON path: collation for comparisons, minor typos in docs

Поиск
Список
Период
Сортировка

SQL/JSON path: collation for comparisons, minor typos in docs

От
Markus Winand
Дата:
Hi!

I was playing around with JSON path quite a bit and might have found one case where the current implementation doesn’t
followthe standard. 

The functionality in question are the comparison operators except ==. They use the database default collation rather
thenthe standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in
firstparagraph). 

I guess this is the relevant part of the code: src/backend/utils/adt/jsonpath_exec.c (compareItems)

                case jbvString:
                        if (op == jpiEqual)
                                return jb1->val.string.len != jb2->val.string.len ||
                                        memcmp(jb1->val.string.val,
                                                   jb2->val.string.val,
                                                   jb1->val.string.len) ? jpbFalse : jpbTrue;

                        cmp = varstr_cmp(jb1->val.string.val, jb1->val.string.len,
                                                         jb2->val.string.val, jb2->val.string.len,
                                                         DEFAULT_COLLATION_OID);
                        break;

Testcase:

        postgres 12beta3=# select * from jsonb_path_query('"dummy"', '$ ? ("a" < "A")');
         jsonb_path_query
        ------------------
         "dummy"
        (1 row)

In code points, lower case ‘a' is not less than upper case ‘A’—the result should be empty.

To convince myself:

        postgres 12beta3=# select datcollate, 'a' < 'A', 'a' <'A' COLLATE ucs_basic from pg_database where
datname=current_database();
         datcollate  | ?column? | ?column?
        -------------+----------+----------
         en_US.UTF-8 | t        | f
        (1 row)

I also found two minor typos in the docs. Patch attached.

-markus
ps.: I’ve created 230 test cases. Besides the WIP topic .datetime(), the collation issue is the only one I found.
Excellentwork. Down to the SQLSTATEs. For sure the most complete and correct SQL/JSON path implementation I've seen. 


Вложения

Re: SQL/JSON path: collation for comparisons, minor typos in docs

От
Alexander Korotkov
Дата:
Hi!

On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:
> I was playing around with JSON path quite a bit and might have found one case where the current implementation
doesn’tfollow the standard. 
>
> The functionality in question are the comparison operators except ==. They use the database default collation rather
thenthe standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in
firstparagraph). 

Thank you for pointing!  Nikita is about to write a patch fixing that.

> I also found two minor typos in the docs. Patch attached.

Pushed, thanks.

> -markus
> ps.: I’ve created 230 test cases. Besides the WIP topic .datetime(), the collation issue is the only one I found.
Excellentwork. Down to the SQLSTATEs. For sure the most complete and correct SQL/JSON path implementation I've seen. 

Thank you!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: SQL/JSON path: collation for comparisons, minor typos in docs

От
Alexander Korotkov
Дата:
On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:
> > I was playing around with JSON path quite a bit and might have found one case where the current implementation
doesn’tfollow the standard. 
> >
> > The functionality in question are the comparison operators except ==. They use the database default collation
ratherthen the standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last
sentencein first paragraph). 
>
> Thank you for pointing!  Nikita is about to write a patch fixing that.

Please, see the attached patch.

Our idea is to not sacrifice "==" operator performance for standard
conformance.  So, "==" remains per-byte comparison.  For consistency
in other operators we compare code points first, then do per-byte
comparison.  In some edge cases, when same Unicode codepoints have
different binary representations in database encoding, this behavior
diverges standard.  In future we can implement strict standard
conformance by normalization of input JSON strings.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: SQL/JSON path: collation for comparisons, minor typos in docs

От
Alexander Korotkov
Дата:
On Thu, Aug 8, 2019 at 12:55 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:
> > > I was playing around with JSON path quite a bit and might have found one case where the current implementation
doesn’tfollow the standard. 
> > >
> > > The functionality in question are the comparison operators except ==. They use the database default collation
ratherthen the standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last
sentencein first paragraph). 
> >
> > Thank you for pointing!  Nikita is about to write a patch fixing that.
>
> Please, see the attached patch.
>
> Our idea is to not sacrifice "==" operator performance for standard
> conformance.  So, "==" remains per-byte comparison.  For consistency
> in other operators we compare code points first, then do per-byte
> comparison.  In some edge cases, when same Unicode codepoints have
> different binary representations in database encoding, this behavior
> diverges standard.  In future we can implement strict standard
> conformance by normalization of input JSON strings.

Previous version of patch has buggy implementation of
compareStrings().  Revised version is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: SQL/JSON path: collation for comparisons, minor typos in docs

От
Alexander Korotkov
Дата:
On Thu, Aug 8, 2019 at 3:05 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Thu, Aug 8, 2019 at 12:55 AM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
> > <a.korotkov@postgrespro.ru> wrote:
> > > On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:
> > > > I was playing around with JSON path quite a bit and might have found one case where the current implementation
doesn’tfollow the standard. 
> > > >
> > > > The functionality in question are the comparison operators except ==. They use the database default collation
ratherthen the standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last
sentencein first paragraph). 
> > >
> > > Thank you for pointing!  Nikita is about to write a patch fixing that.
> >
> > Please, see the attached patch.
> >
> > Our idea is to not sacrifice "==" operator performance for standard
> > conformance.  So, "==" remains per-byte comparison.  For consistency
> > in other operators we compare code points first, then do per-byte
> > comparison.  In some edge cases, when same Unicode codepoints have
> > different binary representations in database encoding, this behavior
> > diverges standard.  In future we can implement strict standard
> > conformance by normalization of input JSON strings.
>
> Previous version of patch has buggy implementation of
> compareStrings().  Revised version is attached.

Nikita pointed me that for UTF-8 strings per-byte comparison result
matches codepoints comparison result.  That allows simplify patch a
lot.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: SQL/JSON path: collation for comparisons, minor typos in docs

От
Markus Winand
Дата:
Hi!

The patch makes my tests pass.

I wonder about a few things:

- Isn’t there any code that could be re-used for that (the one triggered by ‘a’ < ‘A’ COLLATE ucs_basic)?

- For object key members, the standard also refers to unicode code point collation (SQL-2:2016 4.46.3, last paragraph).
- I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard.

My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8” I cannot discover potential issues there. I haven’t played around with nondeterministic ICU collations yet :(

-markus
ps.: for me, testing the regular expression dialect of like_regex is out of scope


On 8 Aug 2019, at 02:27, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Thu, Aug 8, 2019 at 3:05 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
On Thu, Aug 8, 2019 at 12:55 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:
I was playing around with JSON path quite a bit and might have found one case where the current implementation doesn’t follow the standard.

The functionality in question are the comparison operators except ==. They use the database default collation rather then the standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in first paragraph).

Thank you for pointing!  Nikita is about to write a patch fixing that.

Please, see the attached patch.

Our idea is to not sacrifice "==" operator performance for standard
conformance.  So, "==" remains per-byte comparison.  For consistency
in other operators we compare code points first, then do per-byte
comparison.  In some edge cases, when same Unicode codepoints have
different binary representations in database encoding, this behavior
diverges standard.  In future we can implement strict standard
conformance by normalization of input JSON strings.

Previous version of patch has buggy implementation of
compareStrings().  Revised version is attached.

Nikita pointed me that for UTF-8 strings per-byte comparison result
matches codepoints comparison result.  That allows simplify patch a
lot.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
<0001-Use-Unicode-codepoint-collation-in-jsonpath-4.patch>

Re: SQL/JSON path: collation for comparisons, minor typos in docs

От
Alexander Korotkov
Дата:
Hi, Markus!

On Thu, Aug 8, 2019 at 11:53 AM Markus Winand <markus.winand@winand.at> wrote:
> The patch makes my tests pass.

Cool.

> I wonder about a few things:
>
> - Isn’t there any code that could be re-used for that (the one triggered by ‘a’ < ‘A’ COLLATE ucs_basic)?

PostgreSQL supports ucs_basic, but it's alias to C collation and works
only for utf-8.  Jsonpath code may work in different encodings.  New
string comparison code can work in different encodings.

> - For object key members, the standard also refers to unicode code point collation (SQL-2:2016 4.46.3, last
paragraph).
> - I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard.

For object keys we don't actually care about whether strings are less
or greater.  We only search for equal keys.  So, per-byte comparison
we currently use should be fine.  The same states for "starts with"
predicate.

> My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8” I cannot
discoverpotential issues there. I haven’t played around with nondeterministic ICU collations yet :( 

That's OK. There should be other beta testers around :)


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: SQL/JSON path: collation for comparisons, minor typos in docs

От
Alexander Korotkov
Дата:
On Thu, Aug 8, 2019 at 11:30 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Thu, Aug 8, 2019 at 11:53 AM Markus Winand <markus.winand@winand.at> wrote:
> > The patch makes my tests pass.
>
> Cool.
>
> > I wonder about a few things:
> >
> > - Isn’t there any code that could be re-used for that (the one triggered by ‘a’ < ‘A’ COLLATE ucs_basic)?
>
> PostgreSQL supports ucs_basic, but it's alias to C collation and works
> only for utf-8.  Jsonpath code may work in different encodings.  New
> string comparison code can work in different encodings.
>
> > - For object key members, the standard also refers to unicode code point collation (SQL-2:2016 4.46.3, last
paragraph).
> > - I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard.
>
> For object keys we don't actually care about whether strings are less
> or greater.  We only search for equal keys.  So, per-byte comparison
> we currently use should be fine.  The same states for "starts with"
> predicate.
>
> > My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8” I cannot
discoverpotential issues there. I haven’t played around with nondeterministic ICU collations yet :( 
>
> That's OK. There should be other beta testers around :)

So, I'm going to push this if no objections.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: SQL/JSON path: collation for comparisons, minor typos in docs

От
Alexander Korotkov
Дата:
On Fri, Aug 9, 2019 at 5:27 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Thu, Aug 8, 2019 at 11:30 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > On Thu, Aug 8, 2019 at 11:53 AM Markus Winand <markus.winand@winand.at> wrote:
> > > The patch makes my tests pass.
> >
> > Cool.
> >
> > > I wonder about a few things:
> > >
> > > - Isn’t there any code that could be re-used for that (the one triggered by ‘a’ < ‘A’ COLLATE ucs_basic)?
> >
> > PostgreSQL supports ucs_basic, but it's alias to C collation and works
> > only for utf-8.  Jsonpath code may work in different encodings.  New
> > string comparison code can work in different encodings.
> >
> > > - For object key members, the standard also refers to unicode code point collation (SQL-2:2016 4.46.3, last
paragraph).
> > > - I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the
standard.
> >
> > For object keys we don't actually care about whether strings are less
> > or greater.  We only search for equal keys.  So, per-byte comparison
> > we currently use should be fine.  The same states for "starts with"
> > predicate.
> >
> > > My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8” I
cannotdiscover potential issues there. I haven’t played around with nondeterministic ICU collations yet :( 
> >
> > That's OK. There should be other beta testers around :)
>
> So, I'm going to push this if no objections.

So, pushed.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company