Обсуждение: Define jsonpath functions as stable

Поиск
Список
Период
Сортировка

Define jsonpath functions as stable

От
Alexander Korotkov
Дата:
Hi!

During my work on bringing jsonpath patchset to commit, I was always
keeping in mind that we need to make jsonb_path_*() functions
immutable.  Having these functions immutable, users can build
expression indexes over them.  Naturally, in majority of cases one
doesn't need to index whole json documents, but only some parts of
them.  jsonpath provide great facilities to extract indexable parts of
document, much more powerful than our current operator set.

However, we've spotted some deviations between standard and our implementation.
 * like_regex predicate uses our regular expression engine, which
deviates from standard.
 * We always do numeric computations using numeric datatype.  Even if
user explicitly calls .double() method.  Probably, our current
implementation still fits standard.  But in future we may like to use
floating point computation in some cases for performance optimization.

These deviations don't look critical by itself.  But immutable
functions make problematic fixing them in future.  Also, I'm not sure
this is complete list of deviations we have.  We might have, for
example, hidden deviations in handling strict/lax modes, which are
hard to detect and understand.

Therefore, I'm going to mark jsonb_path_*() functions stable, not
immutable.  Nevertheless users will still have multiple options for
indexing:
1) jsonb_path_ops supports jsonpath matching operators in some cases.
2) One can wrap jsonb_path_*() in pl/* function and mark it as
immutable on his own risk.  This approach is widely used to build
indexes over to_date()/to_timestamp().
3) We're going to provide support of jsonpath operators in jsquery
extension before release of PostgreSQL 12.

I'd like to note I don't mean we wouldn't ever have immutable
functions for jsonpath evaluation.  I think once we sure enough that
we know immutable subset of jsonpath, we may define immutable
functions for its evaluation.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: Define jsonpath functions as stable

От
Tom Lane
Дата:
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> During my work on bringing jsonpath patchset to commit, I was always
> keeping in mind that we need to make jsonb_path_*() functions
> immutable.  Having these functions immutable, users can build
> expression indexes over them.

Right.

> However, we've spotted some deviations between standard and our implementation.
>  * like_regex predicate uses our regular expression engine, which
> deviates from standard.
>  * We always do numeric computations using numeric datatype.  Even if
> user explicitly calls .double() method.  Probably, our current
> implementation still fits standard.  But in future we may like to use
> floating point computation in some cases for performance optimization.
> ...
> Therefore, I'm going to mark jsonb_path_*() functions stable, not
> immutable.

I dunno, I think you are applying a far more rigorous definition of
"immutable" than we ever have in the past.  The possibility that we
might change the implementation in the future should not be enough
to disqualify a function from being immutable --- if that were the
criterion, nothing more complex than int4pl could be immutable.

Wouldn't it be better that, in the hypothetical major version where
we change the implementation, we tell users that they must reindex
any affected indexes?

As a comparison point, we allow people to build indexes on tsvector
results, which are *easy* to change just by adjusting configuration
files.  The fact that this might force the need for reindexing hasn't
made it unworkable.

            regards, tom lane



Re: Define jsonpath functions as stable

От
Chapman Flack
Дата:
Hi,

On 7/29/19 10:25 AM, Alexander Korotkov wrote:

>  * like_regex predicate uses our regular expression engine, which
> deviates from standard.

I still favor adding some element to the syntax (like a 'posix' or 'pg'
keyword in the grammar for like_regex) that identifies it as using
a  different regexp flavor, so the way forward to a possible compliant
version later is not needlessly blocked (or consigned to a
standard_conforming_strings-like experience).

That would also resolve much of the case against calling that
predicate immutable.

It looks as if, in my first implementation of XQuery regexps, there
will have to be a "not-quite-standard" flag for those too, because
it turns out the SQL committee made some tweaks to XQuery regexps[1],
whereas any XQuery library one relies on is going to provide untweaked
XQuery regexps out of the box. (The differences only affect ^ $ . \s \S)

Regards,
-Chap


[1]
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions



Re: Define jsonpath functions as stable

От
Alexander Korotkov
Дата:
On Mon, Jul 29, 2019 at 5:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > However, we've spotted some deviations between standard and our implementation.
> >  * like_regex predicate uses our regular expression engine, which
> > deviates from standard.
> >  * We always do numeric computations using numeric datatype.  Even if
> > user explicitly calls .double() method.  Probably, our current
> > implementation still fits standard.  But in future we may like to use
> > floating point computation in some cases for performance optimization.
> > ...
> > Therefore, I'm going to mark jsonb_path_*() functions stable, not
> > immutable.
>
> I dunno, I think you are applying a far more rigorous definition of
> "immutable" than we ever have in the past.  The possibility that we
> might change the implementation in the future should not be enough
> to disqualify a function from being immutable --- if that were the
> criterion, nothing more complex than int4pl could be immutable.
>
> Wouldn't it be better that, in the hypothetical major version where
> we change the implementation, we tell users that they must reindex
> any affected indexes?
>
> As a comparison point, we allow people to build indexes on tsvector
> results, which are *easy* to change just by adjusting configuration
> files.  The fact that this might force the need for reindexing hasn't
> made it unworkable.

Thank you for the explanation.  Given that there is no need to mark
existing json_path_*() functions as stable.  We can just advise users
to rebuild their indexes if we have incompatible changes.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Define jsonpath functions as stable

От
Alexander Korotkov
Дата:
On Mon, Jul 29, 2019 at 5:55 PM Chapman Flack <chap@anastigmatix.net> wrote:
> On 7/29/19 10:25 AM, Alexander Korotkov wrote:
>
> >  * like_regex predicate uses our regular expression engine, which
> > deviates from standard.
>
> I still favor adding some element to the syntax (like a 'posix' or 'pg'
> keyword in the grammar for like_regex) that identifies it as using
> a  different regexp flavor, so the way forward to a possible compliant
> version later is not needlessly blocked (or consigned to a
> standard_conforming_strings-like experience).

What do you think about renaming existing operator from like_regex to
pg_like_regex?  Or introducing special flag indicating that PostgreSQL
regex engine is used ('p' for instance)?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Define jsonpath functions as stable

От
Chapman Flack
Дата:
On 07/29/19 18:27, Alexander Korotkov wrote:

> What do you think about renaming existing operator from like_regex to
> pg_like_regex?  Or introducing special flag indicating that PostgreSQL
> regex engine is used ('p' for instance)?

Renaming the operator is simple and certainly solves the problem.

I don't have a strong technical argument for or against any of:


$.** ? (@ pg_like_regex "O(w|v)" flag "i")
$.** ? (@ pg_like_regex "O(w|v)")


$.** ? (@ like_regex "O(w|v)" pg flag "i")
$.** ? (@ like_regex "O(w|v)" pg)


$.** ? (@ like_regex "O(w|v)" flag "ip")
$.** ? (@ like_regex "O(w|v)" flag "p")


It seems more of an aesthetic judgment (on which I am no particular
authority).

I think I would be -0.3 on the third approach just because of the need
to still spell out ' flag "p"' when there is no other flag you want.

I assume the first two approaches would be about equally easy to
implement, assuming there's a parser that already has an optional
production for "flag" STRING.

Both of the first two seem pretty safe from colliding with a
future addition to the standard.

To my aesthetic sense, pg_like_regex feels like "another operator
to remember" while like_regex ... pg feels like "ok, a slight variant
on the operator from the spec".

Later on, if a conformant version is added, the grammar might be a bit
simpler with just one name and an optional pg.

Going with a flag, there is some question of the likelihood of
the chosen flag letter being usurped by the standard at some point.

I'm leaning toward a flag for now in my own effort to provide the five SQL
functions (like_regex, occurrences_regex, position_regex, substring_regex,
and translate_regex), as for the time being it will be as an extension,
so no custom grammar for me, and I don't really want to make five
pg_* variant function names, and have that expand to ten function names
someday if the real ones are implemented. (Hmm, I suppose I could add
an optional function argument, distinct from flags; that would be
analogous to adding a pg in the grammar ... avoids overloading the flags,
avoids renaming the functions.)

I see in the Saxon library there is already a convention where it
allows a few flags undefined by the standard, after a semicolon in the
flag string. That has no official status; the XQuery spec
defines [smixq] and requires an error for anything else. But it
does have the advantage that the flag string can just be chopped
at the semicolon to eliminate all but the standard flags, and the
advantage (?) that at least one thing is already doing it.

Regards,
-Chap



Re: Define jsonpath functions as stable

От
"Jonathan S. Katz"
Дата:
Hi,

On 7/29/19 8:33 PM, Chapman Flack wrote:
> On 07/29/19 18:27, Alexander Korotkov wrote:
>
>> What do you think about renaming existing operator from like_regex to
>> pg_like_regex?  Or introducing special flag indicating that PostgreSQL
>> regex engine is used ('p' for instance)?
>
> Renaming the operator is simple and certainly solves the problem.
>
> I don't have a strong technical argument for or against any of:
>
>
> $.** ? (@ pg_like_regex "O(w|v)" flag "i")
> $.** ? (@ pg_like_regex "O(w|v)")
>
>
> $.** ? (@ like_regex "O(w|v)" pg flag "i")
> $.** ? (@ like_regex "O(w|v)" pg)
>
>
> $.** ? (@ like_regex "O(w|v)" flag "ip")
> $.** ? (@ like_regex "O(w|v)" flag "p")
>
>
> It seems more of an aesthetic judgment (on which I am no particular
> authority).
>
> I think I would be -0.3 on the third approach just because of the need
> to still spell out ' flag "p"' when there is no other flag you want.
>
> I assume the first two approaches would be about equally easy to
> implement, assuming there's a parser that already has an optional
> production for "flag" STRING.
>
> Both of the first two seem pretty safe from colliding with a
> future addition to the standard.
>
> To my aesthetic sense, pg_like_regex feels like "another operator
> to remember" while like_regex ... pg feels like "ok, a slight variant
> on the operator from the spec".
>
> Later on, if a conformant version is added, the grammar might be a bit
> simpler with just one name and an optional pg.
>
> Going with a flag, there is some question of the likelihood of
> the chosen flag letter being usurped by the standard at some point.
>
> I'm leaning toward a flag for now in my own effort to provide the five SQL
> functions (like_regex, occurrences_regex, position_regex, substring_regex,
> and translate_regex), as for the time being it will be as an extension,
> so no custom grammar for me, and I don't really want to make five
> pg_* variant function names, and have that expand to ten function names
> someday if the real ones are implemented. (Hmm, I suppose I could add
> an optional function argument, distinct from flags; that would be
> analogous to adding a pg in the grammar ... avoids overloading the flags,
> avoids renaming the functions.)

Looking at this thread and[1] and the current state of open items[2], a
few thoughts:

It sounds like the easiest path to completion without potentially adding
futures headaches pushing back the release too far would be that, e.g.
these examples:

    $.** ? (@ like_regex "O(w|v)" pg flag "i")
    $.** ? (@ like_regex "O(w|v)" pg)

If it's using POSIX regexp, I would +1 using "posix" instead of "pg"

That said, from a user standpoint, it's slightly annoying to have to
include that keyword every time, and could potentially mean changing /
testing quite a bit of code once we do support XQuery regexps. Based on
how we currently handle regular expressions, we've already condition
user's to expect a certain behavior, and it would be inconsistent if we
do one thing in one place, and another thing here, so I would like for
us to be cognizant of that.

Reading the XQuery spec that Chapman provided[3], it sounds like there
are some challenges present if we were to try to implement XQuery-based
regexps.

I do agree with Alvaro's comment ("We have an opportunity to do
better")[4], but I think we have to weigh the likelihood of actually
supporting the XQuery behaviors before we add more burden to our users.
Based on what needs to be done, it does not sound like it is any time soon.

My first choice would be to leave it as is. We can make it abundantly
clear that if we make changes in a future version we advise our users on
what actions to take, and counsel on any behavior changes.

My second choice is to have a flag that makes it clear what kind of
regex's are being used, in which case "posix" -- this is abundantly
clearer to the user, but still default, at present, to using "posix"
expressions. If we ever do add the XQuery ones, we can debate whether we
default to the standard at that time, and if we do, we treat it like we
treat other deprecation issues and make abundantly clear what the
behavior is now.

Thanks,

Jonathan

[1]
https://www.postgresql.org/message-id/flat/5CF28EA0.80902%40anastigmatix.net
[2] https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items
[3]
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions
[4]
https://www.postgresql.org/message-id/20190618154907.GA6049%40alvherre.pgsql


Вложения

Re: Define jsonpath functions as stable

От
Tom Lane
Дата:
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> It sounds like the easiest path to completion without potentially adding
> futures headaches pushing back the release too far would be that, e.g.
> these examples:

>     $.** ? (@ like_regex "O(w|v)" pg flag "i")
>     $.** ? (@ like_regex "O(w|v)" pg)

> If it's using POSIX regexp, I would +1 using "posix" instead of "pg"

I agree that we'd be better off to say "POSIX".  However, having just
looked through the references Chapman provided, it seems to me that
the regex language Henry Spencer's library provides is awful darn
close to what XPath is asking for.  The main thing I see in the XML/XPath
specs that we don't have is a bunch of character class escapes that are
specifically tied to Unicode character properties.  We could possibly
add code to implement those, but I'm not sure how it'd work in non-UTF8
database encodings.  There may also be subtle differences in the behavior
of character class escapes that we do have in common, such as "\s" for
white space; but again I'm not sure that those are any different than
what you get naturally from encoding or locale variations.

I think we could possibly get away with not having any special marker
on regexes, but just explaining in the documentation that "features
so-and-so are not implemented".  Writing that text would require closer
analysis than I've seen in this thread as to exactly what the differences
are.

            regards, tom lane



Re: Define jsonpath functions as stable

От
"Jonathan S. Katz"
Дата:
On 9/16/19 11:20 AM, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>> It sounds like the easiest path to completion without potentially adding
>> futures headaches pushing back the release too far would be that, e.g.
>> these examples:
>
>>     $.** ? (@ like_regex "O(w|v)" pg flag "i")
>>     $.** ? (@ like_regex "O(w|v)" pg)
>
>> If it's using POSIX regexp, I would +1 using "posix" instead of "pg"
>
> I agree that we'd be better off to say "POSIX".  However, having just
> looked through the references Chapman provided, it seems to me that
> the regex language Henry Spencer's library provides is awful darn
> close to what XPath is asking for.  The main thing I see in the XML/XPath
> specs that we don't have is a bunch of character class escapes that are
> specifically tied to Unicode character properties.  We could possibly
> add code to implement those, but I'm not sure how it'd work in non-UTF8
> database encodings.

Maybe taking a page from the pg_saslprep implementation. For some cases
where the string in question would issue a "reject" under normal
SASLprep[1] considerations (really stringprep[2]), PostgreSQL just lets
the string passthrough to the next step, without alteration.

What's implied here is if the string is UTF-8, it goes through SASLprep,
but if not, it is just passed through.

So perhaps the answer is that if we implement XQuery, the escape for
UTF-8 character properties are only honored if the encoding is set to be
UTF-8, and ignored otherwise. We would have to document that said
escapes only work on UTF-8 encodings.

>  There may also be subtle differences in the behavior
> of character class escapes that we do have in common, such as "\s" for
> white space; but again I'm not sure that those are any different than
> what you get naturally from encoding or locale variations.
>
> I think we could possibly get away with not having any special marker
> on regexes, but just explaining in the documentation that "features
> so-and-so are not implemented".  Writing that text would require closer
> analysis than I've seen in this thread as to exactly what the differences
> are.

+1, and likely would need some example strings too that highlight the
difference in how they are processed.

And again, if we end up updating the behavior in the future, it becomes
a part of our standard deprecation notice at the beginning of the
release notes, though one that could require a lot of explanation.

Jonathan

[1] https://tools.ietf.org/html/rfc4013
[2] https://www.ietf.org/rfc/rfc3454.txt


Вложения

Re: Define jsonpath functions as stable

От
Tom Lane
Дата:
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> On 9/16/19 11:20 AM, Tom Lane wrote:
>> I think we could possibly get away with not having any special marker
>> on regexes, but just explaining in the documentation that "features
>> so-and-so are not implemented".  Writing that text would require closer
>> analysis than I've seen in this thread as to exactly what the differences
>> are.

> +1, and likely would need some example strings too that highlight the
> difference in how they are processed.

I spent an hour digging through these specs.  I was initially troubled
by the fact that XML Schema regexps are implicitly anchored, ie must
match the whole string; that's a huge difference from POSIX.  However,
19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
except for some weirdness around newline matching; and that spec
clearly says that fn:matches treats its pattern argument as NOT anchored.
So it looks like we end up in the same place as POSIX for this.

Otherwise, the pattern language differences I could find are all details
of character class expressions (bracket expressions, such as "[a-z0-9]")
and escapes that are character class shorthands:

* We don't have "character class subtraction".  I'd be pretty hesitant
to add that to our regexp language because it seems to change "-" into
a metacharacter, which would break an awful lot of regexps.  I might
be misunderstanding their syntax for it, because elsewhere that spec
explicitly claims that "-" is not a metacharacter.

* Character class elements can be #xNN (NN being hex digits), which seems
equivalent to POSIX \xNN as long as you're using UTF8 encoding.  Again,
the compatibility costs of allowing that don't seem attractive, since #
isn't a metacharacter today.

* Character class elements can be \p{UnicodeProperty} or
the complement \P{UnicodeProperty}, where there are a bunch of different
possible properties.  Perhaps we could add that someday; since there's no
reason to escape "p" or "P" today, this doesn't seem like it'd be a huge
compatibility hit.  But I'm content to document this as unimplemented
for now.

* XQuery adds character class shorthands \i (complement \I) for "initial
name characters" and \c (complement \C) for "NameChar".  Same as above;
maybe add someday, but no hurry.

* It looks like XQuery's \w class might allow more characters than our
interpretation does, and hence \W allows fewer.  But since \w devolves
to what libc thinks the "alnum" class is, it's at least possible that
some locales might do the same thing XQuery calls for.

* Likewise, any other discrepancies between the Unicode-centric character
class definitions in XQuery and what our stuff does are well within the
boundaries of locale variances.  So I don't feel too bad about that.

* The SQL-spec newline business mentioned above is a possible exception:
it appears to require that when '.' is allowed to match newlines, a
single '.' should match a '\r\n' Windows newline.  I think we can
document that and move on.

* The x flag in XQuery is defined as ignoring all whitespace in
the pattern except within character class expressions.  Spencer's
x flag does mostly that, but it thinks that "\ " means a literal space
whereas XQuery explicitly says that the space is ignored and the
backslash applies to the next non-space character.  (That's just
weird, in my book.)  Also, Spencer's x mode causes # to begin
a comment extending to EOL, which is a nice thing XQuery hasn't
got, and it says you can't put spaces within multi-character
symbols like "(?:", which presumably is allowed with XQuery's "x".

I feel a bit uncomfortable with these inconsistencies in x-flag
rules.  We could probably teach the regexp library to have an
alternate expanded mode that matches XQuery's rules, but that's
not a project to tackle for v12.  I tentatively recommend that
we remove the jsonpath "x" flag for the time being.

Also, I noted some things that seem to be flat out sloppiness
in the XQuery flag conversions:

* The newline-matching flags (m and s flags) can be mapped to
features of Spencer's library, but jsonpath_gram.y does so
incorrectly.

* XQuery says that the q flag overrides m, s, and x flags, which is
exactly the opposite of what our code does; besides which the code
is flag-order-sensitive which is just wrong.

These last two are simple to fix and we should just go do it.
Otherwise, I think we're okay with regarding Spencer's library
as being a sufficiently close approximation to LIKE_REGEX.
We need some documentation work though.

            regards, tom lane



Re: Define jsonpath functions as stable

От
"Jonathan S. Katz"
Дата:
On 9/16/19 5:10 PM, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>> On 9/16/19 11:20 AM, Tom Lane wrote:
>>> I think we could possibly get away with not having any special marker
>>> on regexes, but just explaining in the documentation that "features
>>> so-and-so are not implemented".  Writing that text would require closer
>>> analysis than I've seen in this thread as to exactly what the differences
>>> are.
>
>> +1, and likely would need some example strings too that highlight the
>> difference in how they are processed.
>
> I spent an hour digging through these specs.

Thanks! That sounds like quite the endeavor...

>  I was initially troubled
> by the fact that XML Schema regexps are implicitly anchored, ie must
> match the whole string; that's a huge difference from POSIX.  However,
> 19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
> predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
> defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
> except for some weirdness around newline matching; and that spec
> clearly says that fn:matches treats its pattern argument as NOT anchored.
> So it looks like we end up in the same place as POSIX for this.
>
> Otherwise, the pattern language differences I could find are all details
> of character class expressions (bracket expressions, such as "[a-z0-9]")
> and escapes that are character class shorthands:
>
> * We don't have "character class subtraction".  I'd be pretty hesitant
> to add that to our regexp language because it seems to change "-" into
> a metacharacter, which would break an awful lot of regexps.  I might
> be misunderstanding their syntax for it, because elsewhere that spec
> explicitly claims that "-" is not a metacharacter.

Using something I could understand[1] it looks like the syntax is like:

    [a-z-[aeiou]

e.g. all the consonants of the alphabet. I don't believe that would
break many, if any, regexps. I also don't know what kind of effort it
would take to add that in given I had not looked at the regexp code
until today (and only at some of the amusing comments in the header
file, which seemed like it wasn't expected the code would be read 20
years later), but it would likely not be a v12 problem.

> * Character class elements can be #xNN (NN being hex digits), which seems
> equivalent to POSIX \xNN as long as you're using UTF8 encoding.  Again,
> the compatibility costs of allowing that don't seem attractive, since #
> isn't a metacharacter today.

Seems reasonable.

> * Character class elements can be \p{UnicodeProperty} or
> the complement \P{UnicodeProperty}, where there are a bunch of different
> possible properties.  Perhaps we could add that someday; since there's no
> reason to escape "p" or "P" today, this doesn't seem like it'd be a huge
> compatibility hit.  But I'm content to document this as unimplemented
> for now.

+1.

> * XQuery adds character class shorthands \i (complement \I) for "initial
> name characters" and \c (complement \C) for "NameChar".  Same as above;
> maybe add someday, but no hurry.

+1.

> * It looks like XQuery's \w class might allow more characters than our
> interpretation does, and hence \W allows fewer.  But since \w devolves
> to what libc thinks the "alnum" class is, it's at least possible that
> some locales might do the same thing XQuery calls for.

I'd still add this to the "to document" list.

> * The SQL-spec newline business mentioned above is a possible exception:
> it appears to require that when '.' is allowed to match newlines, a
> single '.' should match a '\r\n' Windows newline.  I think we can
> document that and move on.

+1.

> * The x flag in XQuery is defined as ignoring all whitespace in
> the pattern except within character class expressions.  Spencer's
> x flag does mostly that, but it thinks that "\ " means a literal space
> whereas XQuery explicitly says that the space is ignored and the
> backslash applies to the next non-space character.  (That's just
> weird, in my book.)  Also, Spencer's x mode causes # to begin
> a comment extending to EOL, which is a nice thing XQuery hasn't
> got, and it says you can't put spaces within multi-character
> symbols like "(?:", which presumably is allowed with XQuery's "x".
>
> I feel a bit uncomfortable with these inconsistencies in x-flag
> rules.  We could probably teach the regexp library to have an
> alternate expanded mode that matches XQuery's rules, but that's
> not a project to tackle for v12.

That does not sound fun by any means. But likely that would be a part of
an overall effort to implement XQuery rules.

>  I tentatively recommend that
> we remove the jsonpath "x" flag for the time being.

I would add an alternative suggestion of just removing that "x" is
supported in the documentation...but likely better to just remove the
flag + docs.

> Also, I noted some things that seem to be flat out sloppiness
> in the XQuery flag conversions:
>
> * The newline-matching flags (m and s flags) can be mapped to
> features of Spencer's library, but jsonpath_gram.y does so
> incorrectly
> * XQuery says that the q flag overrides m, s, and x flags, which is
> exactly the opposite of what our code does; besides which the code
> is flag-order-sensitive which is just wrong.
>
> These last two are simple to fix and we should just go do it.

+1.

> Otherwise, I think we're okay with regarding Spencer's library
> as being a sufficiently close approximation to LIKE_REGEX.
> We need some documentation work though.

My main question is "where" -- I'm thinking somewhere in the JSON
path[2] section. After reading your email 3 times, I may have enough
knowledge to attempt some documentation on the regexp in JSON path.

Jonathan

[1] https://www.regular-expressions.info/charclasssubtract.html
[2]
https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH


Вложения

Re: Define jsonpath functions as stable

От
Chapman Flack
Дата:
On 09/16/19 17:10, Tom Lane wrote:

> I was initially troubled
> by the fact that XML Schema regexps are implicitly anchored, ie must
> match the whole string; that's a huge difference from POSIX.  However,
> 19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
> predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
> defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
> except for some weirdness around newline matching; and that spec
> clearly says that fn:matches treats its pattern argument as NOT anchored.

Yeah, it's a layer cake. XML Schema regexps[1] are implicitly anchored and
don't have any metacharacters devoted to anchoring.

XQuery regexps layer onto[2] XML Schema regexps, adding ^ and $ anchors,
rescinding the implicit anchored-ness, adding reluctant quantifiers,
capturing groups, and back-references, and defining flags.

Then ISO SQL adds a third layer changing the newline semantics, affecting
^, $, ., \s, and \S.

Regards,
-Chap


[1] https://www.w3.org/TR/xmlschema-2/#regexs
[2] https://www.w3.org/TR/xpath-functions-31/#regex-syntax



Re: Define jsonpath functions as stable

От
"Jonathan S. Katz"
Дата:
On 9/16/19 6:39 PM, Jonathan S. Katz wrote:

> My main question is "where" -- I'm thinking somewhere in the JSON
> path[2] section. After reading your email 3 times, I may have enough
> knowledge to attempt some documentation on the regexp in JSON path.

Here is said attempt to document. Notes:

- I centered it around the specification for LIKE_REGEX, which uses
XQuery, but primarily noted where our implementation of POSIX regex's
differs from what is specified for LIKE_REGEX vis-a-vis XQuery

- I put the pith of the documentation in a subsection off of "POSIX
regular expressions"

- I noted that LIKE_REGEX is specified in SQL:2008, which I read on the
Internet(tm) but was not able to confirm in the spec as I do not have a copy

- For my explanation about the "x" flag differences, I talked about how
we extended it, but I could not capture how Tom described the nuances above.

- From the SQL/JSON path docs, I added a section on regular expressions
stating what the behavior is, and referring back to the main regex docs

- I removed the "x" flag being supported for like_regex in JSON path

I also presume it needs a bit of wordsmithing / accuracy checks, but
hope it's a good start and does not require a massive rewrite.

Thanks,

Jonathan

Вложения

Re: Define jsonpath functions as stable

От
Erik Rijkers
Дата:
On 2019-09-17 17:38, Jonathan S. Katz wrote:
> On 9/16/19 6:39 PM, Jonathan S. Katz wrote:
> [regex.patch]

A few things/typos caught my eye:

1.
'implementation' seems the wrong word in sentence:

"Several other parts of the SQL standard
also define LIKE_REGEX equivalents that refer
to this implementation, including the
SQL/JSON path like_regex filter."

As I understand this text, 'concept' seems better.
I'd drop 'also', too.

2.
'whereas the POSIX will those'  should be
'whereas POSIX will regard those'
  or maybe 'read those'

3.
+     The SQL/JSON standard borrows its definition for how regular 
expressions
+     from the <literal>LIKE_REGEX</literal> operator, which in turns 
uses the
+     XQuery standard.
That sentence needs the verb 'work', no?  'for how regular expressions 
work [..]'
Or alternatively drop 'how'.


thanks,

Erik Rijkers





Re: Define jsonpath functions as stable

От
"Jonathan S. Katz"
Дата:
On 9/17/19 12:09 PM, Erik Rijkers wrote:
> On 2019-09-17 17:38, Jonathan S. Katz wrote:
>> [regex.patch]

Thanks for the review!

> "Several other parts of the SQL standard
> also define LIKE_REGEX equivalents that refer
> to this implementation, including the
> SQL/JSON path like_regex filter."
>
> As I understand this text, 'concept' seems better.
> I'd drop 'also', too.

I rewrote this to be:

"Several other parts of the SQL standard refer to the LIKE_REGEX
specification to define similar operations, including..."

> 2.
> 'whereas the POSIX will those'  should be
> 'whereas POSIX will regard those'
>  or maybe 'read those'

I used "treat those"

>
> 3.
> +     The SQL/JSON standard borrows its definition for how regular
> expressions
> +     from the <literal>LIKE_REGEX</literal> operator, which in turns
> uses the
> +     XQuery standard.
> That sentence needs the verb 'work', no?  'for how regular expressions
> work [..]'
> Or alternatively drop 'how'.

I dropped the "how".

v2 attached. Thanks!

Jonathan

Вложения

Re: Define jsonpath functions as stable

От
Tom Lane
Дата:
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> v2 attached. Thanks!

I whacked this around some (well, quite a bit actually); notably,
I thought we'd better describe things that are in our engine but
not XQuery, as well as vice-versa.

After a re-read of the XQuery spec, it seems to me that the character
entry form that they have and we don't is actually "&#NNNN;" like
HTML, rather than just "#NN".  Can anyone double-check that?  Does
it work outside bracket expressions, or only inside?

            regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2b4fe0c..32e6610 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5970,6 +5970,134 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');

 <!-- end re_syntax.n man page -->

+   <sect3 id="posix-vs-xquery">
+   <title>Differences From XQuery (<literal>LIKE_REGEX</literal>)</title>
+
+    <para>
+     Since SQL:2008, the SQL standard includes
+     a <literal>LIKE_REGEX</literal> operator that performs pattern
+     matching according to the XQuery regular expression
+     standard.  <productname>PostgreSQL</productname> does not yet
+     implement this operator, but you can get very similar behavior using
+     the <function>regexp_match()</function> function.
+    </para>
+
+    <para>
+     Notable differences between the existing POSIX-based
+     regular-expression feature and XQuery regular expressions include:
+
+     <itemizedlist>
+      <listitem>
+       <para>
+        XQuery character class subtraction is not supported.  An example of
+        this feature is using the following to match only English
+        consonants: <literal>[a-z-[aeiou]]</literal>.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        XQuery allows a literal character in the pattern to be written as
+        an HTML-style Unicode character reference, for
+        instance <literal>&#<replaceable>NNNN</replaceable>;</literal>.
+        This is not supported by POSIX, but you can get the same effect by
+        writing <literal>\u<replaceable>NNNN</replaceable></literal>.  (The
+        equivalence is only exact when the database encoding is UTF-8.)
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        The SQL standard (not XQuery itself) attempts to cater for more
+        variants of <quote>newline</quote> than POSIX does.  The
+        newline-sensitive matching options described above consider only
+        ASCII NL (<literal>\n</literal>) to be a newline, but SQL would have
+        us treat CR (<literal>\r</literal>), CRLF (<literal>\r\n</literal>)
+        (a Windows-style newline), and some Unicode-only characters like
+        LINE SEPARATOR (U+2028) as newlines as well.
+        Notably, <literal>.</literal> and <literal>\s</literal> should
+        count <literal>\r\n</literal> as one character not two according to
+        SQL.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        XQuery character class shorthands <literal>\c</literal>,
+        <literal>\C</literal>, <literal>\i</literal>,
+        and <literal>\I</literal> are not supported.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        XQuery character class elements
+        using <literal>\p{UnicodeProperty}</literal> or the
+        inverse <literal>\P{UnicodeProperty}</literal> are not supported.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        POSIX interprets character classes such as <literal>\w</literal>
+        (see <xref linkend="posix-class-shorthand-escapes-table"/>)
+        according to the prevailing locale (which you can control by
+        attaching a <literal>COLLATE</literal> clause to the operator or
+        function).  XQuery specifies these classes by reference to Unicode
+        character properties, so equivalent behavior is obtained only with
+        a locale that follows the Unicode rules.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        Of the character-entry escapes described in
+        <xref linkend="posix-character-entry-escapes-table"/>,
+        XQuery supports only <literal>\n</literal>, <literal>\r</literal>,
+        and <literal>\t</literal>.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        XQuery does not support
+        the <literal>[:<replaceable>name</replaceable>:]</literal> syntax
+        for character classes within bracket expressions.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        XQuery does not have lookahead or lookbehind constraints,
+        nor any of the constraint escapes described in
+        <xref linkend="posix-constraint-escapes-table"/>.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        The metasyntax forms described in <xref linkend="posix-metasyntax"/>
+        do not exist in XQuery.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        The regular expression flag letters defined by XQuery are
+        related to but not the same as the option letters for POSIX
+        (<xref linkend="posix-embedded-options-table"/>).  While the
+        <literal>i</literal> and <literal>q</literal> options behave the
+        same, others do not.
+        XQuery's <literal>s</literal> (allow dot to match newline)
+        and <literal>m</literal> (allow <literal>^</literal>
+        and <literal>$</literal> to match at newlines) flags provide access
+        to the same behaviors as POSIX's <literal>n</literal>,
+        <literal>p</literal> and <literal>w</literal> flags, but
+        do <emphasis>not</emphasis> match the behavior of
+        POSIX's <literal>s</literal> and <literal>m</literal> flags.
+        Note in particular that dot-matches-newline is the default behavior
+        in POSIX but not XQuery.
+        Also, XQuery's <literal>x</literal> (ignore whitespace in pattern)
+        flag is noticeably different from POSIX's expanded-mode flag.
+        POSIX's <literal>x</literal> flag also allows <literal>#</literal> to
+        begin a comment in the pattern, and POSIX will not ignore a
+        whitespace character after a backslash.
+       </para>
+      </listitem>
+     </itemizedlist>
+    </para>
+
+   </sect3>
   </sect2>
  </sect1>

@@ -11793,6 +11921,14 @@ table2-mapping
 </programlisting>
     </para>
    </listitem>
+
+   <listitem>
+    <para>
+     There are minor differences in the interpretation of regular
+     expression patterns used in <literal>like_regex</literal> filters, as
+     described in <xref linkend="jsonpath-regular-expressions"/>.
+    </para>
+   </listitem>
   </itemizedlist>

    <sect3 id="strict-and-lax-modes">
@@ -11872,6 +12008,36 @@ table2-mapping

    </sect3>

+   <sect3 id="jsonpath-regular-expressions">
+    <title>Regular Expressions</title>
+
+    <para>
+     SQL/JSON path expressions allow matching text to a regular expression
+     with the <literal>like_regex</literal> filter.  For example, the
+     following SQL/JSON path query would case-insensitively match all
+     strings in an array that start with an English vowel:
+<programlisting>
+'$[*] ? (@ like_regex "^[aeiou]" flag "i")'
+</programlisting>
+    </para>
+
+    <para>
+     The SQL/JSON standard borrows its definition for regular expressions
+     from the <literal>LIKE_REGEX</literal> operator, which in turn uses the
+     XQuery standard.  PostgreSQL does not currently support the
+     <literal>LIKE_REGEX</literal> operator.  Therefore,
+     the <literal>like_regex</literal> filter is implemented using the
+     POSIX regular expression engine described in
+     <xref linkend="functions-posix-regexp"/>.  This leads to various minor
+     discrepancies from standard SQL/JSON behavior, which are cataloged in
+     <xref linkend="posix-vs-xquery"/>.
+     Note, however, that the flag-letter incompatibilities described there
+     do not apply to SQL/JSON, as it translates the XQuery flag letters to
+     match what the POSIX engine expects.
+    </para>
+
+   </sect3>
+
    <sect3 id="functions-sqljson-path-operators">
    <title>SQL/JSON Path Operators and Methods</title>

@@ -12113,10 +12279,13 @@ table2-mapping
        <row>
         <entry><literal>like_regex</literal></entry>
         <entry>
-          Tests pattern matching with POSIX regular expressions
-          (see <xref linkend="functions-posix-regexp"/>).  Supported flags
-          are <literal>i</literal>, <literal>s</literal>, <literal>m</literal>,
-          <literal>x</literal>, and <literal>q</literal>.</entry>
+          Tests whether the first operand matches the regular expression
+          given by the second operand (see
+          <xref linkend="jsonpath-regular-expressions"/>).
+          An optional <literal>flag</literal> string can be given.
+          Supported flags are <literal>i</literal>, <literal>m</literal>,
+          <literal>s</literal>, and <literal>q</literal>.
+        </entry>
         <entry><literal>["abc", "abd", "aBdC", "abdacb", "babc"]</literal></entry>
         <entry><literal>$[*] ? (@ like_regex "^ab.*c" flag "i")</literal></entry>
         <entry><literal>"abc", "aBdC", "abdacb"</literal></entry>

Re: Define jsonpath functions as stable

От
"Jonathan S. Katz"
Дата:
On 9/17/19 6:40 PM, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>> v2 attached. Thanks!
>
> I whacked this around some (well, quite a bit actually);

So I see :) Thanks.

> notably,
> I thought we'd better describe things that are in our engine but
> not XQuery, as well as vice-versa.

Yeah, that makes sense. Overall it reads really well. One question I had
in my head (and probably should have asked) was answered around the \w
character class wrt collation.

> After a re-read of the XQuery spec, it seems to me that the character
> entry form that they have and we don't is actually "&#NNNN;" like
> HTML, rather than just "#NN".  Can anyone double-check that?

Clicking through the XQuery spec eventual got me to here[1] (which warns
me that its out of date, but that is what its "current" specs linked me
to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;"
to specify characters (which I recognize as a character escape from
HTML, XML et al.).

So based on that, my answer is "yes."

>  Does
> it work outside bracket expressions, or only inside?

Looking at the parse tree (start with the "atom"[2]), I read it as being
able to use that syntax both inside and outside the bracket expressions.

Here is a v4. I added some more paragraphs the bullet point that
explains the different flags to make it feel a bit less dense.

Thanks,

Jonathan

[1] https://www.w3.org/TR/2000/WD-xml-2e-20000814#dt-charref
[2] https://www.w3.org/TR/xmlschema-2/#nt-atom

Вложения

Re: Define jsonpath functions as stable

От
Chapman Flack
Дата:
On 09/17/19 21:13, Jonathan S. Katz wrote:

> to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;"

Er, that is, "&#[0-9]+;" and "&#x[0-9a-fA-F]+;" (with x for the hex case).

>>  Does
>> it work outside bracket expressions, or only inside?
> 
> Looking at the parse tree (start with the "atom"[2]), I read it as being
> able to use that syntax both inside and outside the bracket expressions.

Maybe I can plug a really handy environment for messin'-around-in-XQuery,
BaseX:  http://basex.org/

All the buzzwords on the landing page make it seem as if it's going to be
some monstrous thing to download and set up, but on the downloads page,
the "Core Package" option is a single standalone 3.8 MB jar file:

  http://files.basex.org/releases/9.2.4/BaseX924.jar

"java -jar BaseX924.jar" is all it takes to start up, and wham, you're
in a nice IDE-like environment where the editor pane is syntax-aware
for XQuery and will run your code and show results with a click of the
go button.

Regards,
-Chap



Re: Define jsonpath functions as stable

От
"Jonathan S. Katz"
Дата:
On 9/17/19 10:00 PM, Chapman Flack wrote:
> On 09/17/19 21:13, Jonathan S. Katz wrote:
>
>> to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;"
>
> Er, that is, "&#[0-9]+;" and "&#x[0-9a-fA-F]+;" (with x for the hex case).

Correct, I missed the "x".

Thanks,

Jonathan


Вложения

Re: Define jsonpath functions as stable

От
Alexander Korotkov
Дата:
On Wed, Sep 18, 2019 at 4:13 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> Here is a v4. I added some more paragraphs the bullet point that
> explains the different flags to make it feel a bit less dense.

Sorry that I didn't participate this discussion till now.  FWIW, I
agree with selected approach to document differences with XQuery regex
and and forbid 'x' from jsonpath like_regex.  Patch also looks good
for me at the first glance.


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Define jsonpath functions as stable

От
Tom Lane
Дата:
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> On 9/17/19 6:40 PM, Tom Lane wrote:
>> After a re-read of the XQuery spec, it seems to me that the character
>> entry form that they have and we don't is actually "&#NNNN;" like
>> HTML, rather than just "#NN".  Can anyone double-check that?

> Clicking through the XQuery spec eventual got me to here[1] (which warns
> me that its out of date, but that is what its "current" specs linked me
> to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;"
> to specify characters (which I recognize as a character escape from
> HTML, XML et al.).

After further reading, it seems like what that text is talking about
is not actually a regex feature, but an outgrowth of the fact that
the regex pattern is being expressed as a string literal in a language
for which XML character entities are a native aspect of the string
literal syntax.  So it looks to me like the entities get folded to
raw characters in a string-literal parser before the regex engine
ever sees them.

As such, I think this doesn't apply to SQL/JSON.  The SQL/JSON spec
seems to defer to Javascript/ECMAscript for syntax details, and
in either of those languages you have backslash escape sequences
for writing weird characters, *not* XML entities.  You certainly
wouldn't have use of such entities in a native implementation of
LIKE_REGEX in SQL.

So now I'm thinking we can just remove the handwaving about entities.
On the other hand, this points up a large gap in our docs about
SQL/JSON, which is that nowhere does it even address the question of
what the string literal syntax is within a path expression.  Much
less point out that that syntax is nothing like native SQL strings.
Good luck finding out from the docs that you'd better double any
backslashes you'd like to have in your regex --- but a moment's
testing proves that that is the case in our code as it stands.
Have we misread the spec badly enough to get this wrong?

            regards, tom lane



Re: Define jsonpath functions as stable

От
Chapman Flack
Дата:
On 09/18/19 17:12, Tom Lane wrote:

> After further reading, it seems like what that text is talking about
> is not actually a regex feature, but an outgrowth of the fact that
> the regex pattern is being expressed as a string literal in a language
> for which XML character entities are a native aspect of the string
> literal syntax.  So it looks to me like the entities get folded to
> raw characters in a string-literal parser before the regex engine
> ever sees them.

Hmm. That occurred to me too, but I thought the explicit mention of
'character reference' in the section specific to regexes[1] might not
mean that. It certainly could have been clearer.

But you seem to have the practical agreement of both BaseX:

let $foo := codepoints-to-string((38,35,120,54,49,59))
return ($foo, matches('a', $foo))
------
a
false

and the Saxon-based pljava example:

select occurrences_regex('a', 'a', w3cNewlines => true);
 occurrences_regex
-------------------
                 0

> As such, I think this doesn't apply to SQL/JSON.  The SQL/JSON spec
> seems to defer to Javascript/ECMAscript for syntax details, and
> in either of those languages you have backslash escape sequences
> for writing weird characters, *not* XML entities.  You certainly
> wouldn't have use of such entities in a native implementation of
> LIKE_REGEX in SQL.

So yeah, that seems to be correct.

The upshot seems to be a two-parter:

1. Whatever string literal syntax is used in front of the regex engine
   had better have some way to represent any character you could want
   to match, and
2. There is only one way to literally match a character that is a regex
   metacharacter, namely, to precede it with a backslash (that the regex
   engine will see; therefore doubled if necessary). Whatever codepoint
   escape form might be available in the string literal syntax does not
   offer another way to do that, because it happens too early, before
   the regex engine can see it.

> So now I'm thinking we can just remove the handwaving about entities.
> On the other hand, this points up a large gap in our docs about
> SQL/JSON, which is that nowhere does it even address the question of
> what the string literal syntax is within a path expression.

That does seem like it ought to be covered.

Regards,
-Chap



Re: Define jsonpath functions as stable

От
Tom Lane
Дата:
Chapman Flack <chap@anastigmatix.net> writes:
> On 09/18/19 17:12, Tom Lane wrote:
>> As such, I think this doesn't apply to SQL/JSON.  The SQL/JSON spec
>> seems to defer to Javascript/ECMAscript for syntax details, and
>> in either of those languages you have backslash escape sequences
>> for writing weird characters, *not* XML entities.  You certainly
>> wouldn't have use of such entities in a native implementation of
>> LIKE_REGEX in SQL.

> So yeah, that seems to be correct.

Thanks for double-checking.  I removed that para from the patch.

>> So now I'm thinking we can just remove the handwaving about entities.
>> On the other hand, this points up a large gap in our docs about
>> SQL/JSON, which is that nowhere does it even address the question of
>> what the string literal syntax is within a path expression.

> That does seem like it ought to be covered.

I found a spot that seemed like a reasonable place, and added some
coverage of the point.  Updated patch attached.

It seems to me that there are some discrepancies between what the spec
says and what jsonpath_scan.l actually does, so maybe we should take a
hard look at that code too.  The biggest issue is that jsonpath_scan.l
seems to allow single- and double-quoted strings interchangeably, which is
OK per ECMAScript, but then the SQL/JSON spec seems to be saying that only
double-quoted strings are allowed.  I'd rather be conservative about this
than get out in front of the spec and use syntax space that they might do
something else with someday.

            regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2b4fe0c..16e41a6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5970,6 +5970,145 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');

 <!-- end re_syntax.n man page -->

+   <sect3 id="posix-vs-xquery">
+   <title>Differences From XQuery (<literal>LIKE_REGEX</literal>)</title>
+
+   <indexterm zone="posix-vs-xquery">
+    <primary><literal>LIKE_REGEX</literal></primary>
+   </indexterm>
+
+   <indexterm zone="posix-vs-xquery">
+    <primary>XQuery regular expressions</primary>
+   </indexterm>
+
+    <para>
+     Since SQL:2008, the SQL standard includes
+     a <literal>LIKE_REGEX</literal> operator that performs pattern
+     matching according to the XQuery regular expression
+     standard.  <productname>PostgreSQL</productname> does not yet
+     implement this operator, but you can get very similar behavior using
+     the <function>regexp_match()</function> function, since XQuery
+     regular expressions are quite close to the ARE syntax described above.
+    </para>
+
+    <para>
+     Notable differences between the existing POSIX-based
+     regular-expression feature and XQuery regular expressions include:
+
+     <itemizedlist>
+      <listitem>
+       <para>
+        XQuery character class subtraction is not supported.  An example of
+        this feature is using the following to match only English
+        consonants: <literal>[a-z-[aeiou]]</literal>.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        XQuery character class shorthands <literal>\c</literal>,
+        <literal>\C</literal>, <literal>\i</literal>,
+        and <literal>\I</literal> are not supported.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        XQuery character class elements
+        using <literal>\p{UnicodeProperty}</literal> or the
+        inverse <literal>\P{UnicodeProperty}</literal> are not supported.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        POSIX interprets character classes such as <literal>\w</literal>
+        (see <xref linkend="posix-class-shorthand-escapes-table"/>)
+        according to the prevailing locale (which you can control by
+        attaching a <literal>COLLATE</literal> clause to the operator or
+        function).  XQuery specifies these classes by reference to Unicode
+        character properties, so equivalent behavior is obtained only with
+        a locale that follows the Unicode rules.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        The SQL standard (not XQuery itself) attempts to cater for more
+        variants of <quote>newline</quote> than POSIX does.  The
+        newline-sensitive matching options described above consider only
+        ASCII NL (<literal>\n</literal>) to be a newline, but SQL would have
+        us treat CR (<literal>\r</literal>), CRLF (<literal>\r\n</literal>)
+        (a Windows-style newline), and some Unicode-only characters like
+        LINE SEPARATOR (U+2028) as newlines as well.
+        Notably, <literal>.</literal> and <literal>\s</literal> should
+        count <literal>\r\n</literal> as one character not two according to
+        SQL.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        Of the character-entry escapes described in
+        <xref linkend="posix-character-entry-escapes-table"/>,
+        XQuery supports only <literal>\n</literal>, <literal>\r</literal>,
+        and <literal>\t</literal>.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        XQuery does not support
+        the <literal>[:<replaceable>name</replaceable>:]</literal> syntax
+        for character classes within bracket expressions.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        XQuery does not have lookahead or lookbehind constraints,
+        nor any of the constraint escapes described in
+        <xref linkend="posix-constraint-escapes-table"/>.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        The metasyntax forms described in <xref linkend="posix-metasyntax"/>
+        do not exist in XQuery.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        The regular expression flag letters defined by XQuery are
+        related to but not the same as the option letters for POSIX
+        (<xref linkend="posix-embedded-options-table"/>).  While the
+        <literal>i</literal> and <literal>q</literal> options behave the
+        same, others do not:
+        <itemizedlist>
+         <listitem>
+          <para>
+           XQuery's <literal>s</literal> (allow dot to match newline)
+           and <literal>m</literal> (allow <literal>^</literal>
+           and <literal>$</literal> to match at newlines) flags provide
+           access to the same behaviors as
+           POSIX's <literal>n</literal>, <literal>p</literal>
+           and <literal>w</literal> flags, but they
+           do <emphasis>not</emphasis> match the behavior of
+           POSIX's <literal>s</literal> and <literal>m</literal> flags.
+           Note in particular that dot-matches-newline is the default
+           behavior in POSIX but not XQuery.
+          </para>
+         </listitem>
+         <listitem>
+          <para>
+           Also, XQuery's <literal>x</literal> (ignore whitespace in
+           pattern) flag is noticeably different from POSIX's expanded-mode
+           flag.  POSIX's <literal>x</literal> flag also
+           allows <literal>#</literal> to begin a comment in the pattern,
+           and POSIX will not ignore a whitespace character after a
+           backslash.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
+     </itemizedlist>
+    </para>
+
+   </sect3>
   </sect2>
  </sect1>

@@ -11793,6 +11932,14 @@ table2-mapping
 </programlisting>
     </para>
    </listitem>
+
+   <listitem>
+    <para>
+     There are minor differences in the interpretation of regular
+     expression patterns used in <literal>like_regex</literal> filters, as
+     described in <xref linkend="jsonpath-regular-expressions"/>.
+    </para>
+   </listitem>
   </itemizedlist>

    <sect3 id="strict-and-lax-modes">
@@ -11872,6 +12019,63 @@ table2-mapping

    </sect3>

+   <sect3 id="jsonpath-regular-expressions">
+    <title>Regular Expressions</title>
+
+    <indexterm zone="jsonpath-regular-expressions">
+     <primary><literal>LIKE_REGEX</literal></primary>
+     <secondary>in SQL/JSON</secondary>
+    </indexterm>
+
+    <para>
+     SQL/JSON path expressions allow matching text to a regular expression
+     with the <literal>like_regex</literal> filter.  For example, the
+     following SQL/JSON path query would case-insensitively match all
+     strings in an array that start with an English vowel:
+<programlisting>
+'$[*] ? (@ like_regex "^[aeiou]" flag "i")'
+</programlisting>
+    </para>
+
+    <para>
+     The optional <literal>flag</literal> string may include one or more of
+     the characters
+     <literal>i</literal> for case-insensitive match,
+     <literal>m</literal> to allow <literal>^</literal>
+     and <literal>$</literal> to match at newlines,
+     <literal>s</literal> to allow <literal>.</literal> to match a newline,
+     and <literal>q</literal> to quote the whole pattern (reducing the
+     behavior to a simple substring match).
+    </para>
+
+    <para>
+     The SQL/JSON standard borrows its definition for regular expressions
+     from the <literal>LIKE_REGEX</literal> operator, which in turn uses the
+     XQuery standard.  PostgreSQL does not currently support the
+     <literal>LIKE_REGEX</literal> operator.  Therefore,
+     the <literal>like_regex</literal> filter is implemented using the
+     POSIX regular expression engine described in
+     <xref linkend="functions-posix-regexp"/>.  This leads to various minor
+     discrepancies from standard SQL/JSON behavior, which are cataloged in
+     <xref linkend="posix-vs-xquery"/>.
+     Note, however, that the flag-letter incompatibilities described there
+     do not apply to SQL/JSON, as it translates the XQuery flag letters to
+     match what the POSIX engine expects.
+    </para>
+
+    <para>
+     Keep in mind that the pattern argument of <literal>like_regex</literal>
+     is a JSON path string literal, written according to the rules given in
+     <xref linkend="datatype-jsonpath"/>.  This means in particular that any
+     backslashes you want to use in the regular expression must be doubled.
+     For example, to match strings that contain only digits:
+<programlisting>
+'$ ? (@ like_regex "^\\d+$")'
+</programlisting>
+    </para>
+
+   </sect3>
+
    <sect3 id="functions-sqljson-path-operators">
    <title>SQL/JSON Path Operators and Methods</title>

@@ -12113,10 +12317,11 @@ table2-mapping
        <row>
         <entry><literal>like_regex</literal></entry>
         <entry>
-          Tests pattern matching with POSIX regular expressions
-          (see <xref linkend="functions-posix-regexp"/>).  Supported flags
-          are <literal>i</literal>, <literal>s</literal>, <literal>m</literal>,
-          <literal>x</literal>, and <literal>q</literal>.</entry>
+          Tests whether the first operand matches the regular expression
+          given by the second operand, optionally with modifications
+          described by a string of <literal>flag</literal> characters (see
+          <xref linkend="jsonpath-regular-expressions"/>)
+        </entry>
         <entry><literal>["abc", "abd", "aBdC", "abdacb", "babc"]</literal></entry>
         <entry><literal>$[*] ? (@ like_regex "^ab.*c" flag "i")</literal></entry>
         <entry><literal>"abc", "aBdC", "abdacb"</literal></entry>
diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 4f566a4..45b22b6 100644
--- a/doc/src/sgml/json.sgml
+++ b/doc/src/sgml/json.sgml
@@ -666,13 +666,32 @@ SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qu
   </itemizedlist>

   <para>
-   An SQL/JSON path expression is an SQL character string literal,
-   so it must be enclosed in single quotes when passed to an SQL/JSON
-   query function. Following the JavaScript
-   conventions, character string literals within the path expression
-   must be enclosed in double quotes. Any single quotes within this
-   character string literal must be escaped with a single quote
-   by the SQL convention.
+   An SQL/JSON path expression is typically written in an SQL query as an
+   SQL character string literal, so it must be enclosed in single quotes,
+   and any single quotes desired within the value must be doubled
+   (see <xref linkend="sql-syntax-strings"/>).
+   Some forms of path expressions require string literals within them.
+   These embedded string literals follow JavaScript/ECMAScript conventions:
+   they must be surrounded by double quotes, and backslash escapes may be
+   used within them to represent otherwise-hard-to-type characters.
+   In particular, the way to write a double quote within an embedded string
+   literal is <literal>\"</literal>, and to write a backslash itself, you
+   must write <literal>\\</literal>.  Other special backslash sequences
+   include those recognized in JSON strings:
+   <literal>\b</literal>,
+   <literal>\f</literal>,
+   <literal>\n</literal>,
+   <literal>\r</literal>,
+   <literal>\t</literal>,
+   <literal>\v</literal>
+   for various ASCII control characters, and
+   <literal>\u<replaceable>NNNN</replaceable></literal> for a Unicode
+   character identified by its 4-hex-digit code point.  The backslash
+   syntax also includes two cases not allowed by JSON:
+   <literal>\x<replaceable>NN</replaceable></literal> for a character code
+   written with only two hex digits, and
+   <literal>\u{<replaceable>N...</replaceable>}</literal> for a character
+   code written with 1 to 6 hex digits.
   </para>

   <para>

Re: Define jsonpath functions as stable

От
Tom Lane
Дата:
I wrote:
> I found a spot that seemed like a reasonable place, and added some
> coverage of the point.  Updated patch attached.

Doc patch pushed.

> It seems to me that there are some discrepancies between what the spec
> says and what jsonpath_scan.l actually does, so maybe we should take a
> hard look at that code too.  The biggest issue is that jsonpath_scan.l
> seems to allow single- and double-quoted strings interchangeably, which is
> OK per ECMAScript, but then the SQL/JSON spec seems to be saying that only
> double-quoted strings are allowed.  I'd rather be conservative about this
> than get out in front of the spec and use syntax space that they might do
> something else with someday.

The attached proposed patch makes these changes:

1. Remove support for single-quoted literals in jsonpath.

2. Treat an unrecognized escape (e.g., "\z") as meaning the escaped
   character, rather than throwing an error.

3. A few cosmetic adjustments to make the jsonpath_scan code shorter and
   clearer (IMHO).

As for #1, although the SQL/JSON tech report does reference ECMAScript
which allows both single- and double-quoted strings, it seems to me
that their intent is to allow only the double-quoted variant.  They
specifically reference JSON string literals at one point, and of course
JSON only allows double-quoted.  Also, all of their discussion and
examples use double-quoted.  Plus you'd have to be pretty nuts to want
to use single-quoted when writing a jsonpath string literal inside a SQL
literal (and the tech report seems to contemplate that jsonpaths MUST be
string literals, though of course our implementation does not require
that).

As for #2, the existing code throws an error, but this is contrary
to clear statements in every single one of the relevant standards.

            regards, tom lane

diff --git a/src/backend/utils/adt/jsonpath_scan.l b/src/backend/utils/adt/jsonpath_scan.l
index 2165ffc..245255f 100644
--- a/src/backend/utils/adt/jsonpath_scan.l
+++ b/src/backend/utils/adt/jsonpath_scan.l
@@ -59,24 +59,23 @@ fprintf_to_ereport(const char *fmt, const char *msg)
 %option noyyfree

 /*
- * We use exclusive states for quoted, signle-quoted and non-quoted strings,
- * quoted variable names and C-tyle comments.
+ * We use exclusive states for quoted and non-quoted strings,
+ * quoted variable names and C-style comments.
  * Exclusive states:
  *  <xq> - quoted strings
  *  <xnq> - non-quoted strings
  *  <xvq> - quoted variable names
- *  <xsq> - single-quoted strings
  *  <xc> - C-style comment
  */

 %x xq
 %x xnq
 %x xvq
-%x xsq
 %x xc

-special         [\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/]
-any            [^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\"\' \t\n\r\f]
+/* "other" means anything that's not special, blank, or '\' or '"' */
+special        [\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/]
+other        [^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\" \t\n\r\f]
 blank        [ \t\n\r\f]

 digit        [0-9]
@@ -95,7 +94,7 @@ hex_fail    \\x{hex_dig}{0,1}

 %%

-<xnq>{any}+                        {
+<xnq>{other}+                    {
                                     addstring(false, yytext, yyleng);
                                 }

@@ -105,13 +104,12 @@ hex_fail    \\x{hex_dig}{0,1}
                                     return checkKeyword();
                                 }

-
 <xnq>\/\*                        {
                                     yylval->str = scanstring;
                                     BEGIN xc;
                                 }

-<xnq>({special}|\"|\')            {
+<xnq>({special}|\")                {
                                     yylval->str = scanstring;
                                     yyless(0);
                                     BEGIN INITIAL;
@@ -124,39 +122,37 @@ hex_fail    \\x{hex_dig}{0,1}
                                     return checkKeyword();
                                 }

-<xnq,xq,xvq,xsq>\\[\"\'\\]        { addchar(false, yytext[1]); }
-
-<xnq,xq,xvq,xsq>\\b                { addchar(false, '\b'); }
+<xnq,xq,xvq>\\b                { addchar(false, '\b'); }

-<xnq,xq,xvq,xsq>\\f                { addchar(false, '\f'); }
+<xnq,xq,xvq>\\f                { addchar(false, '\f'); }

-<xnq,xq,xvq,xsq>\\n                { addchar(false, '\n'); }
+<xnq,xq,xvq>\\n                { addchar(false, '\n'); }

-<xnq,xq,xvq,xsq>\\r                { addchar(false, '\r'); }
+<xnq,xq,xvq>\\r                { addchar(false, '\r'); }

-<xnq,xq,xvq,xsq>\\t                { addchar(false, '\t'); }
+<xnq,xq,xvq>\\t                { addchar(false, '\t'); }

-<xnq,xq,xvq,xsq>\\v                { addchar(false, '\v'); }
+<xnq,xq,xvq>\\v                { addchar(false, '\v'); }

-<xnq,xq,xvq,xsq>{unicode}+        { parseUnicode(yytext, yyleng); }
+<xnq,xq,xvq>{unicode}+        { parseUnicode(yytext, yyleng); }

-<xnq,xq,xvq,xsq>{hex_char}        { parseHexChar(yytext); }
+<xnq,xq,xvq>{hex_char}        { parseHexChar(yytext); }

-<xnq,xq,xvq,xsq>{unicode}*{unicodefail}    { yyerror(NULL, "invalid unicode sequence"); }
+<xnq,xq,xvq>{unicode}*{unicodefail}    { yyerror(NULL, "invalid unicode sequence"); }

-<xnq,xq,xvq,xsq>{hex_fail}        { yyerror(NULL, "invalid hex character sequence"); }
+<xnq,xq,xvq>{hex_fail}        { yyerror(NULL, "invalid hex character sequence"); }

-<xnq,xq,xvq,xsq>{unicode}+\\    {
-                                    /* throw back the \\, and treat as unicode */
-                                    yyless(yyleng - 1);
-                                    parseUnicode(yytext, yyleng);
-                                }
+<xnq,xq,xvq>{unicode}+\\    {
+                                /* throw back the \\, and treat as unicode */
+                                yyless(yyleng - 1);
+                                parseUnicode(yytext, yyleng);
+                            }

-<xnq,xq,xvq,xsq>\\.                { yyerror(NULL, "escape sequence is invalid"); }
+<xnq,xq,xvq>\\.                { addchar(false, yytext[1]); }

-<xnq,xq,xvq,xsq>\\                { yyerror(NULL, "unexpected end after backslash"); }
+<xnq,xq,xvq>\\                { yyerror(NULL, "unexpected end after backslash"); }

-<xq,xvq,xsq><<EOF>>                { yyerror(NULL, "unexpected end of quoted string"); }
+<xq,xvq><<EOF>>                { yyerror(NULL, "unexpected end of quoted string"); }

 <xq>\"                            {
                                     yylval->str = scanstring;
@@ -170,16 +166,8 @@ hex_fail    \\x{hex_dig}{0,1}
                                     return VARIABLE_P;
                                 }

-<xsq>\'                            {
-                                    yylval->str = scanstring;
-                                    BEGIN INITIAL;
-                                    return STRING_P;
-                                }
-
 <xq,xvq>[^\\\"]+                { addstring(false, yytext, yyleng); }

-<xsq>[^\\\']+                    { addstring(false, yytext, yyleng); }
-
 <xc>\*\/                        { BEGIN INITIAL; }

 <xc>[^\*]+                        { }
@@ -210,7 +198,7 @@ hex_fail    \\x{hex_dig}{0,1}

 \>                                { return GREATER_P; }

-\${any}+                        {
+\${other}+                        {
                                     addstring(true, yytext + 1, yyleng - 1);
                                     addchar(false, '\0');
                                     yylval->str = scanstring;
@@ -263,27 +251,22 @@ hex_fail    \\x{hex_dig}{0,1}

 ({realfail1}|{realfail2})        { yyerror(NULL, "invalid floating point number"); }

-{any}+                            {
-                                    addstring(true, yytext, yyleng);
-                                    BEGIN xnq;
-                                }
-
 \"                                {
                                     addchar(true, '\0');
                                     BEGIN xq;
                                 }

-\'                                {
-                                    addchar(true, '\0');
-                                    BEGIN xsq;
-                                }
-
 \\                                {
                                     yyless(0);
                                     addchar(true, '\0');
                                     BEGIN xnq;
                                 }

+{other}+                        {
+                                    addstring(true, yytext, yyleng);
+                                    BEGIN xnq;
+                                }
+
 <<EOF>>                            { yyterminate(); }

 %%
diff --git a/src/test/regress/expected/jsonpath.out b/src/test/regress/expected/jsonpath.out
index ea42ae3..fc971dc 100644
--- a/src/test/regress/expected/jsonpath.out
+++ b/src/test/regress/expected/jsonpath.out
@@ -171,30 +171,24 @@ select '"\b\f\r\n\t\v\"\''\\"'::jsonpath;
  "\b\f\r\n\t\u000b\"'\\"
 (1 row)

-select '''\b\f\r\n\t\v\"\''\\'''::jsonpath;
-        jsonpath
--------------------------
- "\b\f\r\n\t\u000b\"'\\"
-(1 row)
-
 select '"\x50\u0067\u{53}\u{051}\u{00004C}"'::jsonpath;
  jsonpath
 ----------
  "PgSQL"
 (1 row)

-select '''\x50\u0067\u{53}\u{051}\u{00004C}'''::jsonpath;
- jsonpath
-----------
- "PgSQL"
-(1 row)
-
 select '$.foo\x50\u0067\u{53}\u{051}\u{00004C}\t\"bar'::jsonpath;
       jsonpath
 ---------------------
  $."fooPgSQL\t\"bar"
 (1 row)

+select '"\z"'::jsonpath;  -- unrecognized escape is just the literal char
+ jsonpath
+----------
+ "z"
+(1 row)
+
 select '$.g ? ($.a == 1)'::jsonpath;
       jsonpath
 --------------------
diff --git a/src/test/regress/expected/jsonpath_encoding.out b/src/test/regress/expected/jsonpath_encoding.out
index 8db6e47..ecffe09 100644
--- a/src/test/regress/expected/jsonpath_encoding.out
+++ b/src/test/regress/expected/jsonpath_encoding.out
@@ -81,84 +81,6 @@ select '"null \\u0000 escape"'::jsonpath as not_an_escape;
  "null \\u0000 escape"
 (1 row)

--- checks for single-quoted values
--- basic unicode input
-SELECT E'\'\u\''::jsonpath;        -- ERROR, incomplete escape
-ERROR:  invalid Unicode escape
-LINE 1: SELECT E'\'\u\''::jsonpath;
-               ^
-HINT:  Unicode escapes must be \uXXXX or \UXXXXXXXX.
-SELECT E'\'\u00\''::jsonpath;    -- ERROR, incomplete escape
-ERROR:  invalid Unicode escape
-LINE 1: SELECT E'\'\u00\''::jsonpath;
-               ^
-HINT:  Unicode escapes must be \uXXXX or \UXXXXXXXX.
-SELECT E'\'\u000g\''::jsonpath;    -- ERROR, g is not a hex digit
-ERROR:  invalid Unicode escape
-LINE 1: SELECT E'\'\u000g\''::jsonpath;
-               ^
-HINT:  Unicode escapes must be \uXXXX or \UXXXXXXXX.
-SELECT E'\'\u0000\''::jsonpath;    -- OK, legal escape
-ERROR:  invalid Unicode escape value at or near "E'\'\u0000"
-LINE 1: SELECT E'\'\u0000\''::jsonpath;
-               ^
-SELECT E'\'\uaBcD\''::jsonpath;    -- OK, uppercase and lower case both OK
- jsonpath
-----------
- "ꯍ"
-(1 row)
-
--- handling of unicode surrogate pairs
-select E'\'\ud83d\ude04\ud83d\udc36\''::jsonpath as correct_in_utf8;
- correct_in_utf8
------------------
- "😄🐶"
-(1 row)
-
-select E'\'\ud83d\ud83d\''::jsonpath; -- 2 high surrogates in a row
-ERROR:  invalid Unicode surrogate pair at or near "E'\'\ud83d\ud83d"
-LINE 1: select E'\'\ud83d\ud83d\''::jsonpath;
-               ^
-select E'\'\ude04\ud83d\''::jsonpath; -- surrogates in wrong order
-ERROR:  invalid Unicode surrogate pair at or near "E'\'\ude04"
-LINE 1: select E'\'\ude04\ud83d\''::jsonpath;
-               ^
-select E'\'\ud83dX\''::jsonpath; -- orphan high surrogate
-ERROR:  invalid Unicode surrogate pair at or near "E'\'\ud83dX"
-LINE 1: select E'\'\ud83dX\''::jsonpath;
-               ^
-select E'\'\ude04X\''::jsonpath; -- orphan low surrogate
-ERROR:  invalid Unicode surrogate pair at or near "E'\'\ude04"
-LINE 1: select E'\'\ude04X\''::jsonpath;
-               ^
---handling of simple unicode escapes
-select E'\'the Copyright \u00a9 sign\''::jsonpath as correct_in_utf8;
-    correct_in_utf8
-------------------------
- "the Copyright © sign"
-(1 row)
-
-select E'\'dollar \u0024 character\''::jsonpath as correct_everywhere;
-  correct_everywhere
-----------------------
- "dollar $ character"
-(1 row)
-
-select E'\'dollar \\u0024 character\''::jsonpath as not_an_escape;
-    not_an_escape
-----------------------
- "dollar $ character"
-(1 row)
-
-select E'\'null \u0000 escape\''::jsonpath as not_unescaped;
-ERROR:  invalid Unicode escape value at or near "E'\'null \u0000"
-LINE 1: select E'\'null \u0000 escape\''::jsonpath as not_unescaped;
-               ^
-select E'\'null \\u0000 escape\''::jsonpath as not_an_escape;
-ERROR:  unsupported Unicode escape sequence
-LINE 1: select E'\'null \\u0000 escape\''::jsonpath as not_an_escape...
-               ^
-DETAIL:  \u0000 cannot be converted to text.
 -- checks for quoted key names
 -- basic unicode input
 SELECT '$."\u"'::jsonpath;        -- ERROR, incomplete escape
diff --git a/src/test/regress/expected/jsonpath_encoding_1.out b/src/test/regress/expected/jsonpath_encoding_1.out
index e6dff25..c8cc217 100644
--- a/src/test/regress/expected/jsonpath_encoding_1.out
+++ b/src/test/regress/expected/jsonpath_encoding_1.out
@@ -78,78 +78,6 @@ select '"null \\u0000 escape"'::jsonpath as not_an_escape;
  "null \\u0000 escape"
 (1 row)

--- checks for single-quoted values
--- basic unicode input
-SELECT E'\'\u\''::jsonpath;        -- ERROR, incomplete escape
-ERROR:  invalid Unicode escape
-LINE 1: SELECT E'\'\u\''::jsonpath;
-               ^
-HINT:  Unicode escapes must be \uXXXX or \UXXXXXXXX.
-SELECT E'\'\u00\''::jsonpath;    -- ERROR, incomplete escape
-ERROR:  invalid Unicode escape
-LINE 1: SELECT E'\'\u00\''::jsonpath;
-               ^
-HINT:  Unicode escapes must be \uXXXX or \UXXXXXXXX.
-SELECT E'\'\u000g\''::jsonpath;    -- ERROR, g is not a hex digit
-ERROR:  invalid Unicode escape
-LINE 1: SELECT E'\'\u000g\''::jsonpath;
-               ^
-HINT:  Unicode escapes must be \uXXXX or \UXXXXXXXX.
-SELECT E'\'\u0000\''::jsonpath;    -- OK, legal escape
-ERROR:  invalid Unicode escape value at or near "E'\'\u0000"
-LINE 1: SELECT E'\'\u0000\''::jsonpath;
-               ^
-SELECT E'\'\uaBcD\''::jsonpath;    -- OK, uppercase and lower case both OK
-ERROR:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8 at
ornear "E'\'\uaBcD" 
-LINE 1: SELECT E'\'\uaBcD\''::jsonpath;
-               ^
--- handling of unicode surrogate pairs
-select E'\'\ud83d\ude04\ud83d\udc36\''::jsonpath as correct_in_utf8;
-ERROR:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8 at
ornear "E'\'\ud83d\ude04" 
-LINE 1: select E'\'\ud83d\ude04\ud83d\udc36\''::jsonpath as correct_...
-               ^
-select E'\'\ud83d\ud83d\''::jsonpath; -- 2 high surrogates in a row
-ERROR:  invalid Unicode surrogate pair at or near "E'\'\ud83d\ud83d"
-LINE 1: select E'\'\ud83d\ud83d\''::jsonpath;
-               ^
-select E'\'\ude04\ud83d\''::jsonpath; -- surrogates in wrong order
-ERROR:  invalid Unicode surrogate pair at or near "E'\'\ude04"
-LINE 1: select E'\'\ude04\ud83d\''::jsonpath;
-               ^
-select E'\'\ud83dX\''::jsonpath; -- orphan high surrogate
-ERROR:  invalid Unicode surrogate pair at or near "E'\'\ud83dX"
-LINE 1: select E'\'\ud83dX\''::jsonpath;
-               ^
-select E'\'\ude04X\''::jsonpath; -- orphan low surrogate
-ERROR:  invalid Unicode surrogate pair at or near "E'\'\ude04"
-LINE 1: select E'\'\ude04X\''::jsonpath;
-               ^
---handling of simple unicode escapes
-select E'\'the Copyright \u00a9 sign\''::jsonpath as correct_in_utf8;
-ERROR:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8 at
ornear "E'\'the Copyright \u00a9" 
-LINE 1: select E'\'the Copyright \u00a9 sign\''::jsonpath as correct...
-               ^
-select E'\'dollar \u0024 character\''::jsonpath as correct_everywhere;
-  correct_everywhere
-----------------------
- "dollar $ character"
-(1 row)
-
-select E'\'dollar \\u0024 character\''::jsonpath as not_an_escape;
-    not_an_escape
-----------------------
- "dollar $ character"
-(1 row)
-
-select E'\'null \u0000 escape\''::jsonpath as not_unescaped;
-ERROR:  invalid Unicode escape value at or near "E'\'null \u0000"
-LINE 1: select E'\'null \u0000 escape\''::jsonpath as not_unescaped;
-               ^
-select E'\'null \\u0000 escape\''::jsonpath as not_an_escape;
-ERROR:  unsupported Unicode escape sequence
-LINE 1: select E'\'null \\u0000 escape\''::jsonpath as not_an_escape...
-               ^
-DETAIL:  \u0000 cannot be converted to text.
 -- checks for quoted key names
 -- basic unicode input
 SELECT '$."\u"'::jsonpath;        -- ERROR, incomplete escape
diff --git a/src/test/regress/sql/jsonpath.sql b/src/test/regress/sql/jsonpath.sql
index 29ea77a..7afe252 100644
--- a/src/test/regress/sql/jsonpath.sql
+++ b/src/test/regress/sql/jsonpath.sql
@@ -30,10 +30,9 @@ select '$.a/+-1'::jsonpath;
 select '1 * 2 + 4 % -3 != false'::jsonpath;

 select '"\b\f\r\n\t\v\"\''\\"'::jsonpath;
-select '''\b\f\r\n\t\v\"\''\\'''::jsonpath;
 select '"\x50\u0067\u{53}\u{051}\u{00004C}"'::jsonpath;
-select '''\x50\u0067\u{53}\u{051}\u{00004C}'''::jsonpath;
 select '$.foo\x50\u0067\u{53}\u{051}\u{00004C}\t\"bar'::jsonpath;
+select '"\z"'::jsonpath;  -- unrecognized escape is just the literal char

 select '$.g ? ($.a == 1)'::jsonpath;
 select '$.g ? (@ == 1)'::jsonpath;
diff --git a/src/test/regress/sql/jsonpath_encoding.sql b/src/test/regress/sql/jsonpath_encoding.sql
index a3b5bc3..3a23b72 100644
--- a/src/test/regress/sql/jsonpath_encoding.sql
+++ b/src/test/regress/sql/jsonpath_encoding.sql
@@ -24,29 +24,6 @@ select '"dollar \\u0024 character"'::jsonpath as not_an_escape;
 select '"null \u0000 escape"'::jsonpath as not_unescaped;
 select '"null \\u0000 escape"'::jsonpath as not_an_escape;

--- checks for single-quoted values
-
--- basic unicode input
-SELECT E'\'\u\''::jsonpath;        -- ERROR, incomplete escape
-SELECT E'\'\u00\''::jsonpath;    -- ERROR, incomplete escape
-SELECT E'\'\u000g\''::jsonpath;    -- ERROR, g is not a hex digit
-SELECT E'\'\u0000\''::jsonpath;    -- OK, legal escape
-SELECT E'\'\uaBcD\''::jsonpath;    -- OK, uppercase and lower case both OK
-
--- handling of unicode surrogate pairs
-select E'\'\ud83d\ude04\ud83d\udc36\''::jsonpath as correct_in_utf8;
-select E'\'\ud83d\ud83d\''::jsonpath; -- 2 high surrogates in a row
-select E'\'\ude04\ud83d\''::jsonpath; -- surrogates in wrong order
-select E'\'\ud83dX\''::jsonpath; -- orphan high surrogate
-select E'\'\ude04X\''::jsonpath; -- orphan low surrogate
-
---handling of simple unicode escapes
-select E'\'the Copyright \u00a9 sign\''::jsonpath as correct_in_utf8;
-select E'\'dollar \u0024 character\''::jsonpath as correct_everywhere;
-select E'\'dollar \\u0024 character\''::jsonpath as not_an_escape;
-select E'\'null \u0000 escape\''::jsonpath as not_unescaped;
-select E'\'null \\u0000 escape\''::jsonpath as not_an_escape;
-
 -- checks for quoted key names

 -- basic unicode input

Re: Define jsonpath functions as stable

От
"Jonathan S. Katz"
Дата:
On 9/19/19 12:25 PM, Tom Lane wrote:
> I wrote:
>> I found a spot that seemed like a reasonable place, and added some
>> coverage of the point.  Updated patch attached.
>
> Doc patch pushed.

Thanks! I did not get to review them last night but upon review not too
long ago, they looked great.

>> It seems to me that there are some discrepancies between what the spec
>> says and what jsonpath_scan.l actually does, so maybe we should take a
>> hard look at that code too.  The biggest issue is that jsonpath_scan.l
>> seems to allow single- and double-quoted strings interchangeably, which is
>> OK per ECMAScript, but then the SQL/JSON spec seems to be saying that only
>> double-quoted strings are allowed.  I'd rather be conservative about this
>> than get out in front of the spec and use syntax space that they might do
>> something else with someday.

I agree with erring on the side of the spec vs. what ECMAScript does. In
JSON, strings, identifiers, etc. are double-quoted. Anything that is
single quoted with throw an error in a compliant JSON parser.

Looking at the user documentation for how some other databases with
SQL/JSON support, this seems to back up your analysis.

>
> The attached proposed patch makes these changes:
>
> 1. Remove support for single-quoted literals in jsonpath.
>
> 2. Treat an unrecognized escape (e.g., "\z") as meaning the escaped
>    character, rather than throwing an error.
>
> 3. A few cosmetic adjustments to make the jsonpath_scan code shorter and
>    clearer (IMHO).

If this refers to s/any/other/, yes I would agree it's clearer.

> As for #1, although the SQL/JSON tech report does reference ECMAScript
> which allows both single- and double-quoted strings, it seems to me
> that their intent is to allow only the double-quoted variant.  They
> specifically reference JSON string literals at one point, and of course
> JSON only allows double-quoted.  Also, all of their discussion and
> examples use double-quoted.  Plus you'd have to be pretty nuts to want
> to use single-quoted when writing a jsonpath string literal inside a SQL
> literal (and the tech report seems to contemplate that jsonpaths MUST be
> string literals, though of course our implementation does not require
> that).

I agree with the above (though wrt single-quoting and literals, I have
seen stranger things).

> As for #2, the existing code throws an error, but this is contrary
> to clear statements in every single one of the relevant standards.

Makes sense.

I looked at the patch, but did not test it. From what I can see, it
looks good, but perhaps we add a test in it to show that single-quoted
literals are unsupported?

Jonathan


Вложения

Re: Define jsonpath functions as stable

От
Tom Lane
Дата:
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> I looked at the patch, but did not test it. From what I can see, it
> looks good, but perhaps we add a test in it to show that single-quoted
> literals are unsupported?

I thought about that, but it seems like it'd be memorializing some
other weird behavior:

regression=# select '''foo'''::jsonpath;
ERROR:  syntax error, unexpected IDENT_P at end of jsonpath input
LINE 1: select '''foo'''::jsonpath;
               ^

regression=# select '''foo'' <= ''bar'''::jsonpath;
ERROR:  syntax error, unexpected IDENT_P at or near " " of jsonpath input
LINE 1: select '''foo'' <= ''bar'''::jsonpath;
               ^

There isn't anything I like about these error messages.  Seems like
the error handling in jsonpath_gram.y could use some cleanup too
... although I don't think it's a task to tackle while we're
rushing to get v12 shippable.

            regards, tom lane



Re: Define jsonpath functions as stable

От
"Jonathan S. Katz"
Дата:
On 9/19/19 3:48 PM, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>> I looked at the patch, but did not test it. From what I can see, it
>> looks good, but perhaps we add a test in it to show that single-quoted
>> literals are unsupported?
>
> I thought about that, but it seems like it'd be memorializing some
> other weird behavior:
>
> regression=# select '''foo'''::jsonpath;
> ERROR:  syntax error, unexpected IDENT_P at end of jsonpath input
> LINE 1: select '''foo'''::jsonpath;
>                ^
>
> regression=# select '''foo'' <= ''bar'''::jsonpath;
> ERROR:  syntax error, unexpected IDENT_P at or near " " of jsonpath input
> LINE 1: select '''foo'' <= ''bar'''::jsonpath;
>                ^

Ah yeah, those are some interesting errors.

> There isn't anything I like about these error messages.

Agreed. It would be nice to have tests around it, but yes, I think
looking at the regression outpout one may scratch their head.

>  Seems like
> the error handling in jsonpath_gram.y could use some cleanup too
> ... although I don't think it's a task to tackle while we're
> rushing to get v12 shippable.

IIRC if we want to change the contents of an error message we wait until
major releases. Is there anything we can do before 12 to avoid messages
like "unexpected IDENT_P" coming to a user? Would that be something
acceptable to fix as a 12.1 or would it have to wait until 13?

Jonathan


Вложения

Re: Define jsonpath functions as stable

От
Tom Lane
Дата:
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> On 9/19/19 3:48 PM, Tom Lane wrote:
>> Seems like
>> the error handling in jsonpath_gram.y could use some cleanup too
>> ... although I don't think it's a task to tackle while we're
>> rushing to get v12 shippable.

> IIRC if we want to change the contents of an error message we wait until
> major releases. Is there anything we can do before 12 to avoid messages
> like "unexpected IDENT_P" coming to a user? Would that be something
> acceptable to fix as a 12.1 or would it have to wait until 13?

I think these messages are sufficiently confusing that we could call
it a bug fix to improve them.  As long as we don't change the SQLSTATE
that's thrown, it's hard to claim that there's any real application
compatibility hazard from changing them.

I just don't want to call this point a release blocker.  It's not
about changing any semantics or the set of things that work.

            regards, tom lane



Re: Define jsonpath functions as stable

От
"Jonathan S. Katz"
Дата:
On 9/19/19 6:18 PM, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>> On 9/19/19 3:48 PM, Tom Lane wrote:
>>> Seems like
>>> the error handling in jsonpath_gram.y could use some cleanup too
>>> ... although I don't think it's a task to tackle while we're
>>> rushing to get v12 shippable.
>
>> IIRC if we want to change the contents of an error message we wait until
>> major releases. Is there anything we can do before 12 to avoid messages
>> like "unexpected IDENT_P" coming to a user? Would that be something
>> acceptable to fix as a 12.1 or would it have to wait until 13?
>
> I think these messages are sufficiently confusing that we could call
> it a bug fix to improve them.  As long as we don't change the SQLSTATE
> that's thrown, it's hard to claim that there's any real application
> compatibility hazard from changing them.

Great. +1 on that.

> I just don't want to call this point a release blocker.  It's not
> about changing any semantics or the set of things that work.

+100 on that.

Jonathan


Вложения