Discussion: Define jsonpath functions as stable
Hi!

During my work on bringing the jsonpath patchset to commit, I was always keeping in mind that we need to make the jsonb_path_*() functions immutable. Having these functions immutable, users can build expression indexes over them. Naturally, in the majority of cases one doesn't need to index whole json documents, but only some parts of them. jsonpath provides great facilities to extract indexable parts of a document, much more powerful than our current operator set.

However, we've spotted some deviations between the standard and our implementation.

* The like_regex predicate uses our regular expression engine, which deviates from the standard.
* We always do numeric computations using the numeric datatype, even if the user explicitly calls the .double() method. Probably our current implementation still fits the standard, but in the future we may want to use floating-point computation in some cases for performance optimization.

These deviations don't look critical by themselves, but immutable functions would make fixing them in the future problematic. Also, I'm not sure this is a complete list of the deviations we have. We might have, for example, hidden deviations in the handling of strict/lax modes, which are hard to detect and understand.

Therefore, I'm going to mark the jsonb_path_*() functions stable, not immutable. Nevertheless, users will still have multiple options for indexing:

1) jsonb_path_ops supports jsonpath matching operators in some cases.
2) One can wrap jsonb_path_*() in a pl/* function and mark it immutable at his own risk. This approach is widely used to build indexes over to_date()/to_timestamp().
3) We're going to provide support for jsonpath operators in the jsquery extension before the release of PostgreSQL 12.

I'd like to note that I don't mean we would never have immutable functions for jsonpath evaluation. I think that once we are sure enough that we know an immutable subset of jsonpath, we may define immutable functions for its evaluation.
------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
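As an illustration of option 2, a minimal sketch might look like the following (the table, column, and path here are hypothetical, and declaring the wrapper IMMUTABLE is at the user's own risk, as noted above):

```sql
-- Wrap the jsonpath evaluation in a SQL function declared IMMUTABLE,
-- then build an expression index over it.  If a future release changes
-- jsonpath semantics, this index would need a REINDEX.
CREATE FUNCTION doc_author(js jsonb) RETURNS jsonb
    LANGUAGE sql IMMUTABLE AS
$$ SELECT jsonb_path_query_first(js, '$.author') $$;

CREATE INDEX doc_author_idx ON doc (doc_author(js));
```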
Alexander Korotkov <a.korotkov@postgrespro.ru> writes: > During my work on bringing jsonpath patchset to commit, I was always > keeping in mind that we need to make jsonb_path_*() functions > immutable. Having these functions immutable, users can build > expression indexes over them. Right. > However, we've spotted some deviations between standard and our implementation. > * like_regex predicate uses our regular expression engine, which > deviates from standard. > * We always do numeric computations using numeric datatype. Even if > user explicitly calls .double() method. Probably, our current > implementation still fits standard. But in future we may like to use > floating point computation in some cases for performance optimization. > ... > Therefore, I'm going to mark jsonb_path_*() functions stable, not > immutable. I dunno, I think you are applying a far more rigorous definition of "immutable" than we ever have in the past. The possibility that we might change the implementation in the future should not be enough to disqualify a function from being immutable --- if that were the criterion, nothing more complex than int4pl could be immutable. Wouldn't it be better that, in the hypothetical major version where we change the implementation, we tell users that they must reindex any affected indexes? As a comparison point, we allow people to build indexes on tsvector results, which are *easy* to change just by adjusting configuration files. The fact that this might force the need for reindexing hasn't made it unworkable. regards, tom lane
Hi, On 7/29/19 10:25 AM, Alexander Korotkov wrote: > * like_regex predicate uses our regular expression engine, which > deviates from standard. I still favor adding some element to the syntax (like a 'posix' or 'pg' keyword in the grammar for like_regex) that identifies it as using a different regexp flavor, so the way forward to a possible compliant version later is not needlessly blocked (or consigned to a standard_conforming_strings-like experience). That would also resolve much of the case against calling that predicate immutable. It looks as if, in my first implementation of XQuery regexps, there will have to be a "not-quite-standard" flag for those too, because it turns out the SQL committee made some tweaks to XQuery regexps[1], whereas any XQuery library one relies on is going to provide untweaked XQuery regexps out of the box. (The differences only affect ^ $ . \s \S) Regards, -Chap [1] https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions
On Mon, Jul 29, 2019 at 5:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > However, we've spotted some deviations between standard and our implementation.
> > * like_regex predicate uses our regular expression engine, which
> > deviates from standard.
> > * We always do numeric computations using numeric datatype. Even if
> > user explicitly calls .double() method. Probably, our current
> > implementation still fits standard. But in future we may like to use
> > floating point computation in some cases for performance optimization.
> > ...
> > Therefore, I'm going to mark jsonb_path_*() functions stable, not
> > immutable.
>
> I dunno, I think you are applying a far more rigorous definition of
> "immutable" than we ever have in the past. The possibility that we
> might change the implementation in the future should not be enough
> to disqualify a function from being immutable --- if that were the
> criterion, nothing more complex than int4pl could be immutable.
>
> Wouldn't it be better that, in the hypothetical major version where
> we change the implementation, we tell users that they must reindex
> any affected indexes?
>
> As a comparison point, we allow people to build indexes on tsvector
> results, which are *easy* to change just by adjusting configuration
> files. The fact that this might force the need for reindexing hasn't
> made it unworkable.

Thank you for the explanation. Given that, there is no need to mark the existing jsonb_path_*() functions as stable. We can just advise users to rebuild their indexes if we make incompatible changes.

------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Mon, Jul 29, 2019 at 5:55 PM Chapman Flack <chap@anastigmatix.net> wrote: > On 7/29/19 10:25 AM, Alexander Korotkov wrote: > > > * like_regex predicate uses our regular expression engine, which > > deviates from standard. > > I still favor adding some element to the syntax (like a 'posix' or 'pg' > keyword in the grammar for like_regex) that identifies it as using > a different regexp flavor, so the way forward to a possible compliant > version later is not needlessly blocked (or consigned to a > standard_conforming_strings-like experience). What do you think about renaming existing operator from like_regex to pg_like_regex? Or introducing special flag indicating that PostgreSQL regex engine is used ('p' for instance)? ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 07/29/19 18:27, Alexander Korotkov wrote:
> What do you think about renaming existing operator from like_regex to
> pg_like_regex? Or introducing special flag indicating that PostgreSQL
> regex engine is used ('p' for instance)?

Renaming the operator is simple and certainly solves the problem.

I don't have a strong technical argument for or against any of:

$.** ? (@ pg_like_regex "O(w|v)" flag "i")
$.** ? (@ pg_like_regex "O(w|v)")

$.** ? (@ like_regex "O(w|v)" pg flag "i")
$.** ? (@ like_regex "O(w|v)" pg)

$.** ? (@ like_regex "O(w|v)" flag "ip")
$.** ? (@ like_regex "O(w|v)" flag "p")

It seems more of an aesthetic judgment (on which I am no particular authority).

I think I would be -0.3 on the third approach just because of the need to still spell out ' flag "p"' when there is no other flag you want.

I assume the first two approaches would be about equally easy to implement, assuming there's a parser that already has an optional production for "flag" STRING.

Both of the first two seem pretty safe from colliding with a future addition to the standard.

To my aesthetic sense, pg_like_regex feels like "another operator to remember" while like_regex ... pg feels like "ok, a slight variant on the operator from the spec".

Later on, if a conformant version is added, the grammar might be a bit simpler with just one name and an optional pg.

Going with a flag, there is some question of the likelihood of the chosen flag letter being usurped by the standard at some point.

I'm leaning toward a flag for now in my own effort to provide the five SQL functions (like_regex, occurrences_regex, position_regex, substring_regex, and translate_regex), as for the time being it will be as an extension, so no custom grammar for me, and I don't really want to make five pg_* variant function names, and have that expand to ten function names someday if the real ones are implemented.
(Hmm, I suppose I could add an optional function argument, distinct from flags; that would be analogous to adding a pg in the grammar ... avoids overloading the flags, avoids renaming the functions.) I see in the Saxon library there is already a convention where it allows a few flags undefined by the standard, after a semicolon in the flag string. That has no official status; the XQuery spec defines [smixq] and requires an error for anything else. But it does have the advantage that the flag string can just be chopped at the semicolon to eliminate all but the standard flags, and the advantage (?) that at least one thing is already doing it. Regards, -Chap
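The Saxon semicolon convention described above is easy to apply, since the flag string can simply be split at the semicolon. A sketch in SQL (illustrative only; the flag string is made up):

```sql
-- Everything before the semicolon is the standard XQuery flag set
-- [smixq]; everything after it is an implementation extension.
SELECT split_part('si;p', ';', 1) AS standard_flags,   -- 'si'
       split_part('si;p', ';', 2) AS extension_flags;  -- 'p'
```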
Hi, On 7/29/19 8:33 PM, Chapman Flack wrote: > On 07/29/19 18:27, Alexander Korotkov wrote: > >> What do you think about renaming existing operator from like_regex to >> pg_like_regex? Or introducing special flag indicating that PostgreSQL >> regex engine is used ('p' for instance)? > > Renaming the operator is simple and certainly solves the problem. > > I don't have a strong technical argument for or against any of: > > > $.** ? (@ pg_like_regex "O(w|v)" flag "i") > $.** ? (@ pg_like_regex "O(w|v)") > > > $.** ? (@ like_regex "O(w|v)" pg flag "i") > $.** ? (@ like_regex "O(w|v)" pg) > > > $.** ? (@ like_regex "O(w|v)" flag "ip") > $.** ? (@ like_regex "O(w|v)" flag "p") > > > It seems more of an aesthetic judgment (on which I am no particular > authority). > > I think I would be -0.3 on the third approach just because of the need > to still spell out ' flag "p"' when there is no other flag you want. > > I assume the first two approaches would be about equally easy to > implement, assuming there's a parser that already has an optional > production for "flag" STRING. > > Both of the first two seem pretty safe from colliding with a > future addition to the standard. > > To my aesthetic sense, pg_like_regex feels like "another operator > to remember" while like_regex ... pg feels like "ok, a slight variant > on the operator from the spec". > > Later on, if a conformant version is added, the grammar might be a bit > simpler with just one name and an optional pg. > > Going with a flag, there is some question of the likelihood of > the chosen flag letter being usurped by the standard at some point. 
> > I'm leaning toward a flag for now in my own effort to provide the five SQL > functions (like_regex, occurrences_regex, position_regex, substring_regex, > and translate_regex), as for the time being it will be as an extension, > so no custom grammar for me, and I don't really want to make five > pg_* variant function names, and have that expand to ten function names > someday if the real ones are implemented. (Hmm, I suppose I could add > an optional function argument, distinct from flags; that would be > analogous to adding a pg in the grammar ... avoids overloading the flags, > avoids renaming the functions.) Looking at this thread and[1] and the current state of open items[2], a few thoughts: It sounds like the easiest path to completion without potentially adding futures headaches pushing back the release too far would be that, e.g. these examples: $.** ? (@ like_regex "O(w|v)" pg flag "i") $.** ? (@ like_regex "O(w|v)" pg) If it's using POSIX regexp, I would +1 using "posix" instead of "pg" That said, from a user standpoint, it's slightly annoying to have to include that keyword every time, and could potentially mean changing / testing quite a bit of code once we do support XQuery regexps. Based on how we currently handle regular expressions, we've already condition user's to expect a certain behavior, and it would be inconsistent if we do one thing in one place, and another thing here, so I would like for us to be cognizant of that. Reading the XQuery spec that Chapman provided[3], it sounds like there are some challenges present if we were to try to implement XQuery-based regexps. I do agree with Alvaro's comment ("We have an opportunity to do better")[4], but I think we have to weigh the likelihood of actually supporting the XQuery behaviors before we add more burden to our users. Based on what needs to be done, it does not sound like it is any time soon. My first choice would be to leave it as is. 
We can make it abundantly clear that if we make changes in a future version we advise our users on what actions to take, and counsel on any behavior changes. My second choice is to have a flag that makes it clear what kind of regex's are being used, in which case "posix" -- this is abundantly clearer to the user, but still default, at present, to using "posix" expressions. If we ever do add the XQuery ones, we can debate whether we default to the standard at that time, and if we do, we treat it like we treat other deprecation issues and make abundantly clear what the behavior is now. Thanks, Jonathan [1] https://www.postgresql.org/message-id/flat/5CF28EA0.80902%40anastigmatix.net [2] https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items [3] https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions [4] https://www.postgresql.org/message-id/20190618154907.GA6049%40alvherre.pgsql
"Jonathan S. Katz" <jkatz@postgresql.org> writes: > It sounds like the easiest path to completion without potentially adding > futures headaches pushing back the release too far would be that, e.g. > these examples: > $.** ? (@ like_regex "O(w|v)" pg flag "i") > $.** ? (@ like_regex "O(w|v)" pg) > If it's using POSIX regexp, I would +1 using "posix" instead of "pg" I agree that we'd be better off to say "POSIX". However, having just looked through the references Chapman provided, it seems to me that the regex language Henry Spencer's library provides is awful darn close to what XPath is asking for. The main thing I see in the XML/XPath specs that we don't have is a bunch of character class escapes that are specifically tied to Unicode character properties. We could possibly add code to implement those, but I'm not sure how it'd work in non-UTF8 database encodings. There may also be subtle differences in the behavior of character class escapes that we do have in common, such as "\s" for white space; but again I'm not sure that those are any different than what you get naturally from encoding or locale variations. I think we could possibly get away with not having any special marker on regexes, but just explaining in the documentation that "features so-and-so are not implemented". Writing that text would require closer analysis than I've seen in this thread as to exactly what the differences are. regards, tom lane
On 9/16/19 11:20 AM, Tom Lane wrote: > "Jonathan S. Katz" <jkatz@postgresql.org> writes: >> It sounds like the easiest path to completion without potentially adding >> futures headaches pushing back the release too far would be that, e.g. >> these examples: > >> $.** ? (@ like_regex "O(w|v)" pg flag "i") >> $.** ? (@ like_regex "O(w|v)" pg) > >> If it's using POSIX regexp, I would +1 using "posix" instead of "pg" > > I agree that we'd be better off to say "POSIX". However, having just > looked through the references Chapman provided, it seems to me that > the regex language Henry Spencer's library provides is awful darn > close to what XPath is asking for. The main thing I see in the XML/XPath > specs that we don't have is a bunch of character class escapes that are > specifically tied to Unicode character properties. We could possibly > add code to implement those, but I'm not sure how it'd work in non-UTF8 > database encodings. Maybe taking a page from the pg_saslprep implementation. For some cases where the string in question would issue a "reject" under normal SASLprep[1] considerations (really stringprep[2]), PostgreSQL just lets the string passthrough to the next step, without alteration. What's implied here is if the string is UTF-8, it goes through SASLprep, but if not, it is just passed through. So perhaps the answer is that if we implement XQuery, the escape for UTF-8 character properties are only honored if the encoding is set to be UTF-8, and ignored otherwise. We would have to document that said escapes only work on UTF-8 encodings. > There may also be subtle differences in the behavior > of character class escapes that we do have in common, such as "\s" for > white space; but again I'm not sure that those are any different than > what you get naturally from encoding or locale variations. 
> > I think we could possibly get away with not having any special marker > on regexes, but just explaining in the documentation that "features > so-and-so are not implemented". Writing that text would require closer > analysis than I've seen in this thread as to exactly what the differences > are. +1, and likely would need some example strings too that highlight the difference in how they are processed. And again, if we end up updating the behavior in the future, it becomes a part of our standard deprecation notice at the beginning of the release notes, though one that could require a lot of explanation. Jonathan [1] https://tools.ietf.org/html/rfc4013 [2] https://www.ietf.org/rfc/rfc3454.txt
"Jonathan S. Katz" <jkatz@postgresql.org> writes: > On 9/16/19 11:20 AM, Tom Lane wrote: >> I think we could possibly get away with not having any special marker >> on regexes, but just explaining in the documentation that "features >> so-and-so are not implemented". Writing that text would require closer >> analysis than I've seen in this thread as to exactly what the differences >> are. > +1, and likely would need some example strings too that highlight the > difference in how they are processed. I spent an hour digging through these specs. I was initially troubled by the fact that XML Schema regexps are implicitly anchored, ie must match the whole string; that's a huge difference from POSIX. However, 19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching" defines LIKE_REGEX to work exactly like XQuery's fn:matches function, except for some weirdness around newline matching; and that spec clearly says that fn:matches treats its pattern argument as NOT anchored. So it looks like we end up in the same place as POSIX for this. Otherwise, the pattern language differences I could find are all details of character class expressions (bracket expressions, such as "[a-z0-9]") and escapes that are character class shorthands: * We don't have "character class subtraction". I'd be pretty hesitant to add that to our regexp language because it seems to change "-" into a metacharacter, which would break an awful lot of regexps. I might be misunderstanding their syntax for it, because elsewhere that spec explicitly claims that "-" is not a metacharacter. * Character class elements can be #xNN (NN being hex digits), which seems equivalent to POSIX \xNN as long as you're using UTF8 encoding. Again, the compatibility costs of allowing that don't seem attractive, since # isn't a metacharacter today. 
* Character class elements can be \p{UnicodeProperty} or the complement \P{UnicodeProperty}, where there are a bunch of different possible properties. Perhaps we could add that someday; since there's no reason to escape "p" or "P" today, this doesn't seem like it'd be a huge compatibility hit. But I'm content to document this as unimplemented for now. * XQuery adds character class shorthands \i (complement \I) for "initial name characters" and \c (complement \C) for "NameChar". Same as above; maybe add someday, but no hurry. * It looks like XQuery's \w class might allow more characters than our interpretation does, and hence \W allows fewer. But since \w devolves to what libc thinks the "alnum" class is, it's at least possible that some locales might do the same thing XQuery calls for. * Likewise, any other discrepancies between the Unicode-centric character class definitions in XQuery and what our stuff does are well within the boundaries of locale variances. So I don't feel too bad about that. * The SQL-spec newline business mentioned above is a possible exception: it appears to require that when '.' is allowed to match newlines, a single '.' should match a '\r\n' Windows newline. I think we can document that and move on. * The x flag in XQuery is defined as ignoring all whitespace in the pattern except within character class expressions. Spencer's x flag does mostly that, but it thinks that "\ " means a literal space whereas XQuery explicitly says that the space is ignored and the backslash applies to the next non-space character. (That's just weird, in my book.) Also, Spencer's x mode causes # to begin a comment extending to EOL, which is a nice thing XQuery hasn't got, and it says you can't put spaces within multi-character symbols like "(?:", which presumably is allowed with XQuery's "x". I feel a bit uncomfortable with these inconsistencies in x-flag rules. 
We could probably teach the regexp library to have an alternate expanded mode that matches XQuery's rules, but that's not a project to tackle for v12. I tentatively recommend that we remove the jsonpath "x" flag for the time being. Also, I noted some things that seem to be flat out sloppiness in the XQuery flag conversions: * The newline-matching flags (m and s flags) can be mapped to features of Spencer's library, but jsonpath_gram.y does so incorrectly. * XQuery says that the q flag overrides m, s, and x flags, which is exactly the opposite of what our code does; besides which the code is flag-order-sensitive which is just wrong. These last two are simple to fix and we should just go do it. Otherwise, I think we're okay with regarding Spencer's library as being a sufficiently close approximation to LIKE_REGEX. We need some documentation work though. regards, tom lane
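For readers following along, the q-flag semantics Tom describes can be checked from psql once fixed: q makes the pattern a literal string and must win over m, s, and x no matter where it appears in the flag string. (The expected results below follow the spec's q semantics; they are what the corrected code should produce.)

```sql
-- With flag "q", "." is an ordinary character, not a metacharacter:
SELECT jsonb_path_exists('"a.b"', '$ ? (@ like_regex "a.b" flag "q")');
-- should be true: the literal substring "a.b" is present
SELECT jsonb_path_exists('"axb"', '$ ? (@ like_regex "a.b" flag "q")');
-- should be false: "." must not match "x" under q
```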
On 9/16/19 5:10 PM, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>> On 9/16/19 11:20 AM, Tom Lane wrote:
>>> I think we could possibly get away with not having any special marker
>>> on regexes, but just explaining in the documentation that "features
>>> so-and-so are not implemented". Writing that text would require closer
>>> analysis than I've seen in this thread as to exactly what the differences
>>> are.
>
>> +1, and likely would need some example strings too that highlight the
>> difference in how they are processed.
>
> I spent an hour digging through these specs.

Thanks! That sounds like quite the endeavor...

> I was initially troubled
> by the fact that XML Schema regexps are implicitly anchored, ie must
> match the whole string; that's a huge difference from POSIX. However,
> 19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
> predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
> defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
> except for some weirdness around newline matching; and that spec
> clearly says that fn:matches treats its pattern argument as NOT anchored.
> So it looks like we end up in the same place as POSIX for this.
>
> Otherwise, the pattern language differences I could find are all details
> of character class expressions (bracket expressions, such as "[a-z0-9]")
> and escapes that are character class shorthands:
>
> * We don't have "character class subtraction". I'd be pretty hesitant
> to add that to our regexp language because it seems to change "-" into
> a metacharacter, which would break an awful lot of regexps. I might
> be misunderstanding their syntax for it, because elsewhere that spec
> explicitly claims that "-" is not a metacharacter.

Using something I could understand[1] it looks like the syntax is like:

[a-z-[aeiou]]

e.g. all the consonants of the alphabet. I don't believe that would break many, if any, regexps.
I also don't know what kind of effort it would take to add that in given I had not looked at the regexp code until today (and only at some of the amusing comments in the header file, which seemed like it wasn't expected the code would be read 20 years later), but it would likely not be a v12 problem. > * Character class elements can be #xNN (NN being hex digits), which seems > equivalent to POSIX \xNN as long as you're using UTF8 encoding. Again, > the compatibility costs of allowing that don't seem attractive, since # > isn't a metacharacter today. Seems reasonable. > * Character class elements can be \p{UnicodeProperty} or > the complement \P{UnicodeProperty}, where there are a bunch of different > possible properties. Perhaps we could add that someday; since there's no > reason to escape "p" or "P" today, this doesn't seem like it'd be a huge > compatibility hit. But I'm content to document this as unimplemented > for now. +1. > * XQuery adds character class shorthands \i (complement \I) for "initial > name characters" and \c (complement \C) for "NameChar". Same as above; > maybe add someday, but no hurry. +1. > * It looks like XQuery's \w class might allow more characters than our > interpretation does, and hence \W allows fewer. But since \w devolves > to what libc thinks the "alnum" class is, it's at least possible that > some locales might do the same thing XQuery calls for. I'd still add this to the "to document" list. > * The SQL-spec newline business mentioned above is a possible exception: > it appears to require that when '.' is allowed to match newlines, a > single '.' should match a '\r\n' Windows newline. I think we can > document that and move on. +1. > * The x flag in XQuery is defined as ignoring all whitespace in > the pattern except within character class expressions. 
Spencer's > x flag does mostly that, but it thinks that "\ " means a literal space > whereas XQuery explicitly says that the space is ignored and the > backslash applies to the next non-space character. (That's just > weird, in my book.) Also, Spencer's x mode causes # to begin > a comment extending to EOL, which is a nice thing XQuery hasn't > got, and it says you can't put spaces within multi-character > symbols like "(?:", which presumably is allowed with XQuery's "x". > > I feel a bit uncomfortable with these inconsistencies in x-flag > rules. We could probably teach the regexp library to have an > alternate expanded mode that matches XQuery's rules, but that's > not a project to tackle for v12. That does not sound fun by any means. But likely that would be a part of an overall effort to implement XQuery rules. > I tentatively recommend that > we remove the jsonpath "x" flag for the time being. I would add an alternative suggestion of just removing that "x" is supported in the documentation...but likely better to just remove the flag + docs. > Also, I noted some things that seem to be flat out sloppiness > in the XQuery flag conversions: > > * The newline-matching flags (m and s flags) can be mapped to > features of Spencer's library, but jsonpath_gram.y does so > incorrectly > * XQuery says that the q flag overrides m, s, and x flags, which is > exactly the opposite of what our code does; besides which the code > is flag-order-sensitive which is just wrong. > > These last two are simple to fix and we should just go do it. +1. > Otherwise, I think we're okay with regarding Spencer's library > as being a sufficiently close approximation to LIKE_REGEX. > We need some documentation work though. My main question is "where" -- I'm thinking somewhere in the JSON path[2] section. After reading your email 3 times, I may have enough knowledge to attempt some documentation on the regexp in JSON path. 
Jonathan [1] https://www.regular-expressions.info/charclasssubtract.html [2] https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH
On 09/16/19 17:10, Tom Lane wrote: > I was initially troubled > by the fact that XML Schema regexps are implicitly anchored, ie must > match the whole string; that's a huge difference from POSIX. However, > 19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX > predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching" > defines LIKE_REGEX to work exactly like XQuery's fn:matches function, > except for some weirdness around newline matching; and that spec > clearly says that fn:matches treats its pattern argument as NOT anchored. Yeah, it's a layer cake. XML Schema regexps[1] are implicitly anchored and don't have any metacharacters devoted to anchoring. XQuery regexps layer onto[2] XML Schema regexps, adding ^ and $ anchors, rescinding the implicit anchored-ness, adding reluctant quantifiers, capturing groups, and back-references, and defining flags. Then ISO SQL adds a third layer changing the newline semantics, affecting ^, $, ., \s, and \S. Regards, -Chap [1] https://www.w3.org/TR/xmlschema-2/#regexs [2] https://www.w3.org/TR/xpath-functions-31/#regex-syntax
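The anchoring difference between the layers is easy to demonstrate with PostgreSQL's own POSIX operator, which behaves like the unanchored fn:matches layer: a substring match suffices unless anchors are written explicitly, whereas the XML Schema layer implicitly requires a whole-string match.

```sql
-- POSIX ~ (like fn:matches / LIKE_REGEX) is unanchored:
SELECT 'Owl' ~ 'O(w|v)'    AS substring_match,      -- true
       'Owl' ~ '^O(w|v)$'  AS whole_string_match;   -- false: "Ow" is
                                                    -- not all of "Owl"
```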
On 9/16/19 6:39 PM, Jonathan S. Katz wrote:
> My main question is "where" -- I'm thinking somewhere in the JSON
> path[2] section. After reading your email 3 times, I may have enough
> knowledge to attempt some documentation on the regexp in JSON path.

Here is said attempt to document. Notes:

- I centered it around the specification for LIKE_REGEX, which uses XQuery, but primarily noted where our implementation of POSIX regex's differs from what is specified for LIKE_REGEX vis-a-vis XQuery
- I put the pith of the documentation in a subsection off of "POSIX regular expressions"
- I noted that LIKE_REGEX is specified in SQL:2008, which I read on the Internet(tm) but was not able to confirm in the spec as I do not have a copy
- For my explanation about the "x" flag differences, I talked about how we extended it, but I could not capture how Tom described the nuances above.
- From the SQL/JSON path docs, I added a section on regular expressions stating what the behavior is, and referring back to the main regex docs
- I removed the "x" flag being supported for like_regex in JSON path

I also presume it needs a bit of wordsmithing / accuracy checks, but hope it's a good start and does not require a massive rewrite.

Thanks,

Jonathan
On 2019-09-17 17:38, Jonathan S. Katz wrote:
> On 9/16/19 6:39 PM, Jonathan S. Katz wrote:
> [regex.patch]

A few things/typos caught my eye:

1. 'implementation' seems the wrong word in sentence:

"Several other parts of the SQL standard
also define LIKE_REGEX equivalents that refer
to this implementation, including the
SQL/JSON path like_regex filter."

As I understand this text, 'concept' seems better. I'd drop 'also', too.

2. 'whereas the POSIX will those' should be 'whereas POSIX will regard those' or maybe 'read those'

3.

+ The SQL/JSON standard borrows its definition for how regular expressions
+ from the <literal>LIKE_REGEX</literal> operator, which in turns uses the
+ XQuery standard.

That sentence needs the verb 'work', no? 'for how regular expressions work [..]' Or alternatively drop 'how'.

thanks, Erik Rijkers
On 9/17/19 12:09 PM, Erik Rijkers wrote: > On 2019-09-17 17:38, Jonathan S. Katz wrote: >> [regex.patch] Thanks for the review! > "Several other parts of the SQL standard > also define LIKE_REGEX equivalents that refer > to this implementation, including the > SQL/JSON path like_regex filter." > > As I understand this text, 'concept' seems better. > I'd drop 'also', too. I rewrote this to be: "Several other parts of the SQL standard refer to the LIKE_REGEX specification to define similar operations, including..." > 2. > 'whereas the POSIX will those' should be > 'whereas POSIX will regard those' > or maybe 'read those' I used "treat those" > > 3. > + The SQL/JSON standard borrows its definition for how regular > expressions > + from the <literal>LIKE_REGEX</literal> operator, which in turns > uses the > + XQuery standard. > That sentence needs the verb 'work', no? 'for how regular expressions > work [..]' > Or alternatively drop 'how'. I dropped the "how". v2 attached. Thanks! Jonathan
Attachments
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> v2 attached. Thanks!

I whacked this around some (well, quite a bit actually); notably, I thought we'd better describe things that are in our engine but not XQuery, as well as vice-versa.

After a re-read of the XQuery spec, it seems to me that the character entry form that they have and we don't is actually "&#NNNN;" like HTML, rather than just "#NN". Can anyone double-check that? Does it work outside bracket expressions, or only inside?

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2b4fe0c..32e6610 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -5970,6 +5970,134 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); <!-- end re_syntax.n man page --> + <sect3 id="posix-vs-xquery"> + <title>Differences From XQuery (<literal>LIKE_REGEX</literal>)</title> + + <para> + Since SQL:2008, the SQL standard includes + a <literal>LIKE_REGEX</literal> operator that performs pattern + matching according to the XQuery regular expression + standard. <productname>PostgreSQL</productname> does not yet + implement this operator, but you can get very similar behavior using + the <function>regexp_match()</function> function. + </para> + + <para> + Notable differences between the existing POSIX-based + regular-expression feature and XQuery regular expressions include: + + <itemizedlist> + <listitem> + <para> + XQuery character class subtraction is not supported. An example of + this feature is using the following to match only English + consonants: <literal>[a-z-[aeiou]]</literal>. + </para> + </listitem> + <listitem> + <para> + XQuery allows a literal character in the pattern to be written as + an HTML-style Unicode character reference, for + instance <literal>&amp;#<replaceable>NNNN</replaceable>;</literal>. + This is not supported by POSIX, but you can get the same effect by + writing <literal>\u<replaceable>NNNN</replaceable></literal>. 
(The + equivalence is only exact when the database encoding is UTF-8.) + </para> + </listitem> + <listitem> + <para> + The SQL standard (not XQuery itself) attempts to cater for more + variants of <quote>newline</quote> than POSIX does. The + newline-sensitive matching options described above consider only + ASCII NL (<literal>\n</literal>) to be a newline, but SQL would have + us treat CR (<literal>\r</literal>), CRLF (<literal>\r\n</literal>) + (a Windows-style newline), and some Unicode-only characters like + LINE SEPARATOR (U+2028) as newlines as well. + Notably, <literal>.</literal> and <literal>\s</literal> should + count <literal>\r\n</literal> as one character not two according to + SQL. + </para> + </listitem> + <listitem> + <para> + XQuery character class shorthands <literal>\c</literal>, + <literal>\C</literal>, <literal>\i</literal>, + and <literal>\I</literal> are not supported. + </para> + </listitem> + <listitem> + <para> + XQuery character class elements + using <literal>\p{UnicodeProperty}</literal> or the + inverse <literal>\P{UnicodeProperty}</literal> are not supported. + </para> + </listitem> + <listitem> + <para> + POSIX interprets character classes such as <literal>\w</literal> + (see <xref linkend="posix-class-shorthand-escapes-table"/>) + according to the prevailing locale (which you can control by + attaching a <literal>COLLATE</literal> clause to the operator or + function). XQuery specifies these classes by reference to Unicode + character properties, so equivalent behavior is obtained only with + a locale that follows the Unicode rules. + </para> + </listitem> + <listitem> + <para> + Of the character-entry escapes described in + <xref linkend="posix-character-entry-escapes-table"/>, + XQuery supports only <literal>\n</literal>, <literal>\r</literal>, + and <literal>\t</literal>. 
+ </para> + </listitem> + <listitem> + <para> + XQuery does not support + the <literal>[:<replaceable>name</replaceable>:]</literal> syntax + for character classes within bracket expressions. + </para> + </listitem> + <listitem> + <para> + XQuery does not have lookahead or lookbehind constraints, + nor any of the constraint escapes described in + <xref linkend="posix-constraint-escapes-table"/>. + </para> + </listitem> + <listitem> + <para> + The metasyntax forms described in <xref linkend="posix-metasyntax"/> + do not exist in XQuery. + </para> + </listitem> + <listitem> + <para> + The regular expression flag letters defined by XQuery are + related to but not the same as the option letters for POSIX + (<xref linkend="posix-embedded-options-table"/>). While the + <literal>i</literal> and <literal>q</literal> options behave the + same, others do not. + XQuery's <literal>s</literal> (allow dot to match newline) + and <literal>m</literal> (allow <literal>^</literal> + and <literal>$</literal> to match at newlines) flags provide access + to the same behaviors as POSIX's <literal>n</literal>, + <literal>p</literal> and <literal>w</literal> flags, but + do <emphasis>not</emphasis> match the behavior of + POSIX's <literal>s</literal> and <literal>m</literal> flags. + Note in particular that dot-matches-newline is the default behavior + in POSIX but not XQuery. + Also, XQuery's <literal>x</literal> (ignore whitespace in pattern) + flag is noticeably different from POSIX's expanded-mode flag. + POSIX's <literal>x</literal> flag also allows <literal>#</literal> to + begin a comment in the pattern, and POSIX will not ignore a + whitespace character after a backslash. 
+ </para> + </listitem> + </itemizedlist> + </para> + + </sect3> </sect2> </sect1> @@ -11793,6 +11921,14 @@ table2-mapping </programlisting> </para> </listitem> + + <listitem> + <para> + There are minor differences in the interpretation of regular + expression patterns used in <literal>like_regex</literal> filters, as + described in <xref linkend="jsonpath-regular-expressions"/>. + </para> + </listitem> </itemizedlist> <sect3 id="strict-and-lax-modes"> @@ -11872,6 +12008,36 @@ table2-mapping </sect3> + <sect3 id="jsonpath-regular-expressions"> + <title>Regular Expressions</title> + + <para> + SQL/JSON path expressions allow matching text to a regular expression + with the <literal>like_regex</literal> filter. For example, the + following SQL/JSON path query would case-insensitively match all + strings in an array that start with an English vowel: +<programlisting> +'$[*] ? (@ like_regex "^[aeiou]" flag "i")' +</programlisting> + </para> + + <para> + The SQL/JSON standard borrows its definition for regular expressions + from the <literal>LIKE_REGEX</literal> operator, which in turn uses the + XQuery standard. PostgreSQL does not currently support the + <literal>LIKE_REGEX</literal> operator. Therefore, + the <literal>like_regex</literal> filter is implemented using the + POSIX regular expression engine described in + <xref linkend="functions-posix-regexp"/>. This leads to various minor + discrepancies from standard SQL/JSON behavior, which are cataloged in + <xref linkend="posix-vs-xquery"/>. + Note, however, that the flag-letter incompatibilities described there + do not apply to SQL/JSON, as it translates the XQuery flag letters to + match what the POSIX engine expects. 
+ </para> + + </sect3> + <sect3 id="functions-sqljson-path-operators"> <title>SQL/JSON Path Operators and Methods</title> @@ -12113,10 +12279,13 @@ table2-mapping <row> <entry><literal>like_regex</literal></entry> <entry> - Tests pattern matching with POSIX regular expressions - (see <xref linkend="functions-posix-regexp"/>). Supported flags - are <literal>i</literal>, <literal>s</literal>, <literal>m</literal>, - <literal>x</literal>, and <literal>q</literal>.</entry> + Tests whether the first operand matches the regular expression + given by the second operand (see + <xref linkend="jsonpath-regular-expressions"/>). + An optional <literal>flag</literal> string can be given. + Supported flags are <literal>i</literal>, <literal>m</literal>, + <literal>s</literal>, and <literal>q</literal>. + </entry> <entry><literal>["abc", "abd", "aBdC", "abdacb", "babc"]</literal></entry> <entry><literal>$[*] ? (@ like_regex "^ab.*c" flag "i")</literal></entry> <entry><literal>"abc", "aBdC", "abdacb"</literal></entry>
On 9/17/19 6:40 PM, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>> v2 attached. Thanks!
>
> I whacked this around some (well, quite a bit actually);

So I see :) Thanks.

> notably, I thought we'd better describe things that are in our engine but
> not XQuery, as well as vice-versa.

Yeah, that makes sense. Overall it reads really well. One question I had in my head (and probably should have asked) was answered around the \w character class wrt collation.

> After a re-read of the XQuery spec, it seems to me that the character
> entry form that they have and we don't is actually "&#NNNN;" like
> HTML, rather than just "#NN". Can anyone double-check that?

Clicking through the XQuery spec eventually got me to here[1] (which warns me that it's out of date, but that is what its "current" specs linked me to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;" to specify characters (which I recognize as a character escape from HTML, XML et al.). So based on that, my answer is "yes."

> Does it work outside bracket expressions, or only inside?

Looking at the parse tree (start with the "atom"[2]), I read it as being able to use that syntax both inside and outside the bracket expressions.

Here is a v4. I added some more paragraphs to the bullet point that explains the different flags to make it feel a bit less dense.

Thanks,

Jonathan

[1] https://www.w3.org/TR/2000/WD-xml-2e-20000814#dt-charref
[2] https://www.w3.org/TR/xmlschema-2/#nt-atom
Attachments
On 09/17/19 21:13, Jonathan S. Katz wrote:
> to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;"

Er, that is, "&#[0-9]+;" and "&#x[0-9a-fA-F]+;" (with x for the hex case).

>> Does it work outside bracket expressions, or only inside?
>
> Looking at the parse tree (start with the "atom"[2]), I read it as being
> able to use that syntax both inside and outside the bracket expressions.

Maybe I can plug a really handy environment for messin'-around-in-XQuery, BaseX: http://basex.org/

All the buzzwords on the landing page make it seem as if it's going to be some monstrous thing to download and set up, but on the downloads page, the "Core Package" option is a single standalone 3.8 MB jar file: http://files.basex.org/releases/9.2.4/BaseX924.jar

"java -jar BaseX924.jar" is all it takes to start up, and wham, you're in a nice IDE-like environment where the editor pane is syntax-aware for XQuery and will run your code and show results with a click of the go button.

Regards,
-Chap
On 9/17/19 10:00 PM, Chapman Flack wrote:
> On 09/17/19 21:13, Jonathan S. Katz wrote:
>
>> to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;"
>
> Er, that is, "&#[0-9]+;" and "&#x[0-9a-fA-F]+;" (with x for the hex case).

Correct, I missed the "x".

Thanks,

Jonathan
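For anyone wanting to check what the decimal and hex character-reference forms under discussion expand to, Python's html module resolves the same `&#NNNN;` / `&#xNN;` syntax (this is just an illustration of the reference notation — in XQuery it is the string-literal layer, not the regex engine, that resolves them, as the thread goes on to establish):

```python
from html import unescape

# Decimal and hex character references, per the XML spec cited above.
assert unescape("&#97;") == "a"
assert unescape("&#x61;") == "a"

# Works for non-ASCII code points too, e.g. LINE SEPARATOR (U+2028),
# one of the SQL-standard "newline" characters mentioned earlier.
assert unescape("&#x2028;") == "\u2028"
```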
Attachments
On Wed, Sep 18, 2019 at 4:13 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> Here is a v4. I added some more paragraphs to the bullet point that
> explains the different flags to make it feel a bit less dense.

Sorry that I didn't participate in this discussion till now. FWIW, I agree with the selected approach: document the differences from XQuery regexes and forbid 'x' in jsonpath like_regex. The patch also looks good to me at first glance.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> On 9/17/19 6:40 PM, Tom Lane wrote:
>> After a re-read of the XQuery spec, it seems to me that the character
>> entry form that they have and we don't is actually "&#NNNN;" like
>> HTML, rather than just "#NN". Can anyone double-check that?

> Clicking through the XQuery spec eventually got me to here[1] (which warns
> me that it's out of date, but that is what its "current" specs linked me
> to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;"
> to specify characters (which I recognize as a character escape from
> HTML, XML et al.).

After further reading, it seems like what that text is talking about is not actually a regex feature, but an outgrowth of the fact that the regex pattern is being expressed as a string literal in a language for which XML character entities are a native aspect of the string literal syntax. So it looks to me like the entities get folded to raw characters in a string-literal parser before the regex engine ever sees them.

As such, I think this doesn't apply to SQL/JSON. The SQL/JSON spec seems to defer to Javascript/ECMAscript for syntax details, and in either of those languages you have backslash escape sequences for writing weird characters, *not* XML entities. You certainly wouldn't have use of such entities in a native implementation of LIKE_REGEX in SQL. So now I'm thinking we can just remove the handwaving about entities.

On the other hand, this points up a large gap in our docs about SQL/JSON, which is that nowhere does it even address the question of what the string literal syntax is within a path expression. Much less point out that that syntax is nothing like native SQL strings. Good luck finding out from the docs that you'd better double any backslashes you'd like to have in your regex --- but a moment's testing proves that that is the case in our code as it stands. Have we misread the spec badly enough to get this wrong?

regards, tom lane
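The doubled-backslash point comes down to two layers of string-literal parsing before the regex engine sees anything: SQL parses its literal, then the jsonpath string-literal parser consumes one more layer of backslashes. A rough Python sketch of that second layer (a model for illustration, not PostgreSQL code, and handling only the `\\` case):

```python
import re

# The quoted literal as the jsonpath parser receives it: the characters "^\\d+$"
jsonpath_literal = '"^\\\\d+$"'

# The jsonpath string-literal parser consumes one layer of backslashes,
# much as a JSON parser would: \\ collapses to \.
regex_source = jsonpath_literal.strip('"').replace("\\\\", "\\")
assert regex_source == r"^\d+$"

# Only now does the regex engine see \d as a character-class escape.
assert re.fullmatch(regex_source, "12345") is not None
assert re.fullmatch(regex_source, "12a45") is None
```

This is why the docs patch later in the thread spells out that backslashes intended for the regex must be doubled inside a jsonpath string literal.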
On 09/18/19 17:12, Tom Lane wrote:
> After further reading, it seems like what that text is talking about
> is not actually a regex feature, but an outgrowth of the fact that
> the regex pattern is being expressed as a string literal in a language
> for which XML character entities are a native aspect of the string
> literal syntax. So it looks to me like the entities get folded to
> raw characters in a string-literal parser before the regex engine
> ever sees them.

Hmm. That occurred to me too, but I thought the explicit mention of 'character reference' in the section specific to regexes[1] might not mean that. It certainly could have been clearer. But you seem to have the practical agreement of both BaseX:

let $foo := codepoints-to-string((38,35,120,54,49,59))
return ($foo, matches('a', $foo))
------
&#x61; false

and the Saxon-based pljava example:

select occurrences_regex('&#x61;', 'a', w3cNewlines => true);
 occurrences_regex
-------------------
                 0

> As such, I think this doesn't apply to SQL/JSON. The SQL/JSON spec
> seems to defer to Javascript/ECMAscript for syntax details, and
> in either of those languages you have backslash escape sequences
> for writing weird characters, *not* XML entities. You certainly
> wouldn't have use of such entities in a native implementation of
> LIKE_REGEX in SQL.

So yeah, that seems to be correct. The upshot seems to be a two-parter:

1. Whatever string literal syntax is used in front of the regex engine had better have some way to represent any character you could want to match, and

2. There is only one way to literally match a character that is a regex metacharacter, namely, to precede it with a backslash (that the regex engine will see; therefore doubled if necessary). Whatever codepoint escape form might be available in the string literal syntax does not offer another way to do that, because it happens too early, before the regex engine can see it.

> So now I'm thinking we can just remove the handwaving about entities. 
> On the other hand, this points up a large gap in our docs about > SQL/JSON, which is that nowhere does it even address the question of > what the string literal syntax is within a path expression. That does seem like it ought to be covered. Regards, -Chap
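The second half of that two-parter — a regex-level backslash being the one way to match a metacharacter literally — can be demonstrated with any regex engine; here is an illustration in Python's re module (an analogy, not PostgreSQL's engine):

```python
import re

# "." is a metacharacter: it matches any character, not just a dot.
assert re.fullmatch("a.b", "axb") is not None

# A backslash seen by the regex engine is the one way to match it literally.
assert re.fullmatch(r"a\.b", "axb") is None
assert re.fullmatch(r"a\.b", "a.b") is not None

# re.escape automates this escaping for a whole string.
assert re.escape("a.b") == r"a\.b"
```

No string-literal-level codepoint escape can substitute for that regex-level backslash, because it is resolved to a raw character before the regex engine runs.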
Chapman Flack <chap@anastigmatix.net> writes: > On 09/18/19 17:12, Tom Lane wrote: >> As such, I think this doesn't apply to SQL/JSON. The SQL/JSON spec >> seems to defer to Javascript/ECMAscript for syntax details, and >> in either of those languages you have backslash escape sequences >> for writing weird characters, *not* XML entities. You certainly >> wouldn't have use of such entities in a native implementation of >> LIKE_REGEX in SQL. > So yeah, that seems to be correct. Thanks for double-checking. I removed that para from the patch. >> So now I'm thinking we can just remove the handwaving about entities. >> On the other hand, this points up a large gap in our docs about >> SQL/JSON, which is that nowhere does it even address the question of >> what the string literal syntax is within a path expression. > That does seem like it ought to be covered. I found a spot that seemed like a reasonable place, and added some coverage of the point. Updated patch attached. It seems to me that there are some discrepancies between what the spec says and what jsonpath_scan.l actually does, so maybe we should take a hard look at that code too. The biggest issue is that jsonpath_scan.l seems to allow single- and double-quoted strings interchangeably, which is OK per ECMAScript, but then the SQL/JSON spec seems to be saying that only double-quoted strings are allowed. I'd rather be conservative about this than get out in front of the spec and use syntax space that they might do something else with someday. 
regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2b4fe0c..16e41a6 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -5970,6 +5970,145 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); <!-- end re_syntax.n man page --> + <sect3 id="posix-vs-xquery"> + <title>Differences From XQuery (<literal>LIKE_REGEX</literal>)</title> + + <indexterm zone="posix-vs-xquery"> + <primary><literal>LIKE_REGEX</literal></primary> + </indexterm> + + <indexterm zone="posix-vs-xquery"> + <primary>XQuery regular expressions</primary> + </indexterm> + + <para> + Since SQL:2008, the SQL standard includes + a <literal>LIKE_REGEX</literal> operator that performs pattern + matching according to the XQuery regular expression + standard. <productname>PostgreSQL</productname> does not yet + implement this operator, but you can get very similar behavior using + the <function>regexp_match()</function> function, since XQuery + regular expressions are quite close to the ARE syntax described above. + </para> + + <para> + Notable differences between the existing POSIX-based + regular-expression feature and XQuery regular expressions include: + + <itemizedlist> + <listitem> + <para> + XQuery character class subtraction is not supported. An example of + this feature is using the following to match only English + consonants: <literal>[a-z-[aeiou]]</literal>. + </para> + </listitem> + <listitem> + <para> + XQuery character class shorthands <literal>\c</literal>, + <literal>\C</literal>, <literal>\i</literal>, + and <literal>\I</literal> are not supported. + </para> + </listitem> + <listitem> + <para> + XQuery character class elements + using <literal>\p{UnicodeProperty}</literal> or the + inverse <literal>\P{UnicodeProperty}</literal> are not supported. 
+ </para> + </listitem> + <listitem> + <para> + POSIX interprets character classes such as <literal>\w</literal> + (see <xref linkend="posix-class-shorthand-escapes-table"/>) + according to the prevailing locale (which you can control by + attaching a <literal>COLLATE</literal> clause to the operator or + function). XQuery specifies these classes by reference to Unicode + character properties, so equivalent behavior is obtained only with + a locale that follows the Unicode rules. + </para> + </listitem> + <listitem> + <para> + The SQL standard (not XQuery itself) attempts to cater for more + variants of <quote>newline</quote> than POSIX does. The + newline-sensitive matching options described above consider only + ASCII NL (<literal>\n</literal>) to be a newline, but SQL would have + us treat CR (<literal>\r</literal>), CRLF (<literal>\r\n</literal>) + (a Windows-style newline), and some Unicode-only characters like + LINE SEPARATOR (U+2028) as newlines as well. + Notably, <literal>.</literal> and <literal>\s</literal> should + count <literal>\r\n</literal> as one character not two according to + SQL. + </para> + </listitem> + <listitem> + <para> + Of the character-entry escapes described in + <xref linkend="posix-character-entry-escapes-table"/>, + XQuery supports only <literal>\n</literal>, <literal>\r</literal>, + and <literal>\t</literal>. + </para> + </listitem> + <listitem> + <para> + XQuery does not support + the <literal>[:<replaceable>name</replaceable>:]</literal> syntax + for character classes within bracket expressions. + </para> + </listitem> + <listitem> + <para> + XQuery does not have lookahead or lookbehind constraints, + nor any of the constraint escapes described in + <xref linkend="posix-constraint-escapes-table"/>. + </para> + </listitem> + <listitem> + <para> + The metasyntax forms described in <xref linkend="posix-metasyntax"/> + do not exist in XQuery. 
+ </para> + </listitem> + <listitem> + <para> + The regular expression flag letters defined by XQuery are + related to but not the same as the option letters for POSIX + (<xref linkend="posix-embedded-options-table"/>). While the + <literal>i</literal> and <literal>q</literal> options behave the + same, others do not: + <itemizedlist> + <listitem> + <para> + XQuery's <literal>s</literal> (allow dot to match newline) + and <literal>m</literal> (allow <literal>^</literal> + and <literal>$</literal> to match at newlines) flags provide + access to the same behaviors as + POSIX's <literal>n</literal>, <literal>p</literal> + and <literal>w</literal> flags, but they + do <emphasis>not</emphasis> match the behavior of + POSIX's <literal>s</literal> and <literal>m</literal> flags. + Note in particular that dot-matches-newline is the default + behavior in POSIX but not XQuery. + </para> + </listitem> + <listitem> + <para> + Also, XQuery's <literal>x</literal> (ignore whitespace in + pattern) flag is noticeably different from POSIX's expanded-mode + flag. POSIX's <literal>x</literal> flag also + allows <literal>#</literal> to begin a comment in the pattern, + and POSIX will not ignore a whitespace character after a + backslash. + </para> + </listitem> + </itemizedlist> + </para> + </listitem> + </itemizedlist> + </para> + + </sect3> </sect2> </sect1> @@ -11793,6 +11932,14 @@ table2-mapping </programlisting> </para> </listitem> + + <listitem> + <para> + There are minor differences in the interpretation of regular + expression patterns used in <literal>like_regex</literal> filters, as + described in <xref linkend="jsonpath-regular-expressions"/>. 
+ </para> + </listitem> </itemizedlist> <sect3 id="strict-and-lax-modes"> @@ -11872,6 +12019,63 @@ table2-mapping </sect3> + <sect3 id="jsonpath-regular-expressions"> + <title>Regular Expressions</title> + + <indexterm zone="jsonpath-regular-expressions"> + <primary><literal>LIKE_REGEX</literal></primary> + <secondary>in SQL/JSON</secondary> + </indexterm> + + <para> + SQL/JSON path expressions allow matching text to a regular expression + with the <literal>like_regex</literal> filter. For example, the + following SQL/JSON path query would case-insensitively match all + strings in an array that start with an English vowel: +<programlisting> +'$[*] ? (@ like_regex "^[aeiou]" flag "i")' +</programlisting> + </para> + + <para> + The optional <literal>flag</literal> string may include one or more of + the characters + <literal>i</literal> for case-insensitive match, + <literal>m</literal> to allow <literal>^</literal> + and <literal>$</literal> to match at newlines, + <literal>s</literal> to allow <literal>.</literal> to match a newline, + and <literal>q</literal> to quote the whole pattern (reducing the + behavior to a simple substring match). + </para> + + <para> + The SQL/JSON standard borrows its definition for regular expressions + from the <literal>LIKE_REGEX</literal> operator, which in turn uses the + XQuery standard. PostgreSQL does not currently support the + <literal>LIKE_REGEX</literal> operator. Therefore, + the <literal>like_regex</literal> filter is implemented using the + POSIX regular expression engine described in + <xref linkend="functions-posix-regexp"/>. This leads to various minor + discrepancies from standard SQL/JSON behavior, which are cataloged in + <xref linkend="posix-vs-xquery"/>. + Note, however, that the flag-letter incompatibilities described there + do not apply to SQL/JSON, as it translates the XQuery flag letters to + match what the POSIX engine expects. 
+ </para> + + <para> + Keep in mind that the pattern argument of <literal>like_regex</literal> + is a JSON path string literal, written according to the rules given in + <xref linkend="datatype-jsonpath"/>. This means in particular that any + backslashes you want to use in the regular expression must be doubled. + For example, to match strings that contain only digits: +<programlisting> +'$ ? (@ like_regex "^\\d+$")' +</programlisting> + </para> + + </sect3> + <sect3 id="functions-sqljson-path-operators"> <title>SQL/JSON Path Operators and Methods</title> @@ -12113,10 +12317,11 @@ table2-mapping <row> <entry><literal>like_regex</literal></entry> <entry> - Tests pattern matching with POSIX regular expressions - (see <xref linkend="functions-posix-regexp"/>). Supported flags - are <literal>i</literal>, <literal>s</literal>, <literal>m</literal>, - <literal>x</literal>, and <literal>q</literal>.</entry> + Tests whether the first operand matches the regular expression + given by the second operand, optionally with modifications + described by a string of <literal>flag</literal> characters (see + <xref linkend="jsonpath-regular-expressions"/>) + </entry> <entry><literal>["abc", "abd", "aBdC", "abdacb", "babc"]</literal></entry> <entry><literal>$[*] ? (@ like_regex "^ab.*c" flag "i")</literal></entry> <entry><literal>"abc", "aBdC", "abdacb"</literal></entry> diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml index 4f566a4..45b22b6 100644 --- a/doc/src/sgml/json.sgml +++ b/doc/src/sgml/json.sgml @@ -666,13 +666,32 @@ SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qu </itemizedlist> <para> - An SQL/JSON path expression is an SQL character string literal, - so it must be enclosed in single quotes when passed to an SQL/JSON - query function. Following the JavaScript - conventions, character string literals within the path expression - must be enclosed in double quotes. 
Any single quotes within this - character string literal must be escaped with a single quote - by the SQL convention. + An SQL/JSON path expression is typically written in an SQL query as an + SQL character string literal, so it must be enclosed in single quotes, + and any single quotes desired within the value must be doubled + (see <xref linkend="sql-syntax-strings"/>). + Some forms of path expressions require string literals within them. + These embedded string literals follow JavaScript/ECMAScript conventions: + they must be surrounded by double quotes, and backslash escapes may be + used within them to represent otherwise-hard-to-type characters. + In particular, the way to write a double quote within an embedded string + literal is <literal>\"</literal>, and to write a backslash itself, you + must write <literal>\\</literal>. Other special backslash sequences + include those recognized in JSON strings: + <literal>\b</literal>, + <literal>\f</literal>, + <literal>\n</literal>, + <literal>\r</literal>, + <literal>\t</literal>, + <literal>\v</literal> + for various ASCII control characters, and + <literal>\u<replaceable>NNNN</replaceable></literal> for a Unicode + character identified by its 4-hex-digit code point. The backslash + syntax also includes two cases not allowed by JSON: + <literal>\x<replaceable>NN</replaceable></literal> for a character code + written with only two hex digits, and + <literal>\u{<replaceable>N...</replaceable>}</literal> for a character + code written with 1 to 6 hex digits. </para> <para>
I wrote: > I found a spot that seemed like a reasonable place, and added some > coverage of the point. Updated patch attached. Doc patch pushed. > It seems to me that there are some discrepancies between what the spec > says and what jsonpath_scan.l actually does, so maybe we should take a > hard look at that code too. The biggest issue is that jsonpath_scan.l > seems to allow single- and double-quoted strings interchangeably, which is > OK per ECMAScript, but then the SQL/JSON spec seems to be saying that only > double-quoted strings are allowed. I'd rather be conservative about this > than get out in front of the spec and use syntax space that they might do > something else with someday. The attached proposed patch makes these changes: 1. Remove support for single-quoted literals in jsonpath. 2. Treat an unrecognized escape (e.g., "\z") as meaning the escaped character, rather than throwing an error. 3. A few cosmetic adjustments to make the jsonpath_scan code shorter and clearer (IMHO). As for #1, although the SQL/JSON tech report does reference ECMAScript which allows both single- and double-quoted strings, it seems to me that their intent is to allow only the double-quoted variant. They specifically reference JSON string literals at one point, and of course JSON only allows double-quoted. Also, all of their discussion and examples use double-quoted. Plus you'd have to be pretty nuts to want to use single-quoted when writing a jsonpath string literal inside a SQL literal (and the tech report seems to contemplate that jsonpaths MUST be string literals, though of course our implementation does not require that). As for #2, the existing code throws an error, but this is contrary to clear statements in every single one of the relevant standards. 
regards, tom lane diff --git a/src/backend/utils/adt/jsonpath_scan.l b/src/backend/utils/adt/jsonpath_scan.l index 2165ffc..245255f 100644 --- a/src/backend/utils/adt/jsonpath_scan.l +++ b/src/backend/utils/adt/jsonpath_scan.l @@ -59,24 +59,23 @@ fprintf_to_ereport(const char *fmt, const char *msg) %option noyyfree /* - * We use exclusive states for quoted, signle-quoted and non-quoted strings, - * quoted variable names and C-tyle comments. + * We use exclusive states for quoted and non-quoted strings, + * quoted variable names and C-style comments. * Exclusive states: * <xq> - quoted strings * <xnq> - non-quoted strings * <xvq> - quoted variable names - * <xsq> - single-quoted strings * <xc> - C-style comment */ %x xq %x xnq %x xvq -%x xsq %x xc -special [\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/] -any [^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\"\' \t\n\r\f] +/* "other" means anything that's not special, blank, or '\' or '"' */ +special [\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/] +other [^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\" \t\n\r\f] blank [ \t\n\r\f] digit [0-9] @@ -95,7 +94,7 @@ hex_fail \\x{hex_dig}{0,1} %% -<xnq>{any}+ { +<xnq>{other}+ { addstring(false, yytext, yyleng); } @@ -105,13 +104,12 @@ hex_fail \\x{hex_dig}{0,1} return checkKeyword(); } - <xnq>\/\* { yylval->str = scanstring; BEGIN xc; } -<xnq>({special}|\"|\') { +<xnq>({special}|\") { yylval->str = scanstring; yyless(0); BEGIN INITIAL; @@ -124,39 +122,37 @@ hex_fail \\x{hex_dig}{0,1} return checkKeyword(); } -<xnq,xq,xvq,xsq>\\[\"\'\\] { addchar(false, yytext[1]); } - -<xnq,xq,xvq,xsq>\\b { addchar(false, '\b'); } +<xnq,xq,xvq>\\b { addchar(false, '\b'); } -<xnq,xq,xvq,xsq>\\f { addchar(false, '\f'); } +<xnq,xq,xvq>\\f { addchar(false, '\f'); } -<xnq,xq,xvq,xsq>\\n { addchar(false, '\n'); } +<xnq,xq,xvq>\\n { addchar(false, '\n'); } -<xnq,xq,xvq,xsq>\\r { addchar(false, '\r'); } +<xnq,xq,xvq>\\r { addchar(false, '\r'); } -<xnq,xq,xvq,xsq>\\t { addchar(false, 
'\t'); } +<xnq,xq,xvq>\\t { addchar(false, '\t'); } -<xnq,xq,xvq,xsq>\\v { addchar(false, '\v'); } +<xnq,xq,xvq>\\v { addchar(false, '\v'); } -<xnq,xq,xvq,xsq>{unicode}+ { parseUnicode(yytext, yyleng); } +<xnq,xq,xvq>{unicode}+ { parseUnicode(yytext, yyleng); } -<xnq,xq,xvq,xsq>{hex_char} { parseHexChar(yytext); } +<xnq,xq,xvq>{hex_char} { parseHexChar(yytext); } -<xnq,xq,xvq,xsq>{unicode}*{unicodefail} { yyerror(NULL, "invalid unicode sequence"); } +<xnq,xq,xvq>{unicode}*{unicodefail} { yyerror(NULL, "invalid unicode sequence"); } -<xnq,xq,xvq,xsq>{hex_fail} { yyerror(NULL, "invalid hex character sequence"); } +<xnq,xq,xvq>{hex_fail} { yyerror(NULL, "invalid hex character sequence"); } -<xnq,xq,xvq,xsq>{unicode}+\\ { - /* throw back the \\, and treat as unicode */ - yyless(yyleng - 1); - parseUnicode(yytext, yyleng); - } +<xnq,xq,xvq>{unicode}+\\ { + /* throw back the \\, and treat as unicode */ + yyless(yyleng - 1); + parseUnicode(yytext, yyleng); + } -<xnq,xq,xvq,xsq>\\. { yyerror(NULL, "escape sequence is invalid"); } +<xnq,xq,xvq>\\. 
{ addchar(false, yytext[1]); } -<xnq,xq,xvq,xsq>\\ { yyerror(NULL, "unexpected end after backslash"); } +<xnq,xq,xvq>\\ { yyerror(NULL, "unexpected end after backslash"); } -<xq,xvq,xsq><<EOF>> { yyerror(NULL, "unexpected end of quoted string"); } +<xq,xvq><<EOF>> { yyerror(NULL, "unexpected end of quoted string"); } <xq>\" { yylval->str = scanstring; @@ -170,16 +166,8 @@ hex_fail \\x{hex_dig}{0,1} return VARIABLE_P; } -<xsq>\' { - yylval->str = scanstring; - BEGIN INITIAL; - return STRING_P; - } - <xq,xvq>[^\\\"]+ { addstring(false, yytext, yyleng); } -<xsq>[^\\\']+ { addstring(false, yytext, yyleng); } - <xc>\*\/ { BEGIN INITIAL; } <xc>[^\*]+ { } @@ -210,7 +198,7 @@ hex_fail \\x{hex_dig}{0,1} \> { return GREATER_P; } -\${any}+ { +\${other}+ { addstring(true, yytext + 1, yyleng - 1); addchar(false, '\0'); yylval->str = scanstring; @@ -263,27 +251,22 @@ hex_fail \\x{hex_dig}{0,1} ({realfail1}|{realfail2}) { yyerror(NULL, "invalid floating point number"); } -{any}+ { - addstring(true, yytext, yyleng); - BEGIN xnq; - } - \" { addchar(true, '\0'); BEGIN xq; } -\' { - addchar(true, '\0'); - BEGIN xsq; - } - \\ { yyless(0); addchar(true, '\0'); BEGIN xnq; } +{other}+ { + addstring(true, yytext, yyleng); + BEGIN xnq; + } + <<EOF>> { yyterminate(); } %% diff --git a/src/test/regress/expected/jsonpath.out b/src/test/regress/expected/jsonpath.out index ea42ae3..fc971dc 100644 --- a/src/test/regress/expected/jsonpath.out +++ b/src/test/regress/expected/jsonpath.out @@ -171,30 +171,24 @@ select '"\b\f\r\n\t\v\"\''\\"'::jsonpath; "\b\f\r\n\t\u000b\"'\\" (1 row) -select '''\b\f\r\n\t\v\"\''\\'''::jsonpath; - jsonpath -------------------------- - "\b\f\r\n\t\u000b\"'\\" -(1 row) - select '"\x50\u0067\u{53}\u{051}\u{00004C}"'::jsonpath; jsonpath ---------- "PgSQL" (1 row) -select '''\x50\u0067\u{53}\u{051}\u{00004C}'''::jsonpath; - jsonpath ----------- - "PgSQL" -(1 row) - select '$.foo\x50\u0067\u{53}\u{051}\u{00004C}\t\"bar'::jsonpath; jsonpath --------------------- 
$."fooPgSQL\t\"bar" (1 row) +select '"\z"'::jsonpath; -- unrecognized escape is just the literal char + jsonpath +---------- + "z" +(1 row) + select '$.g ? ($.a == 1)'::jsonpath; jsonpath -------------------- diff --git a/src/test/regress/expected/jsonpath_encoding.out b/src/test/regress/expected/jsonpath_encoding.out index 8db6e47..ecffe09 100644 --- a/src/test/regress/expected/jsonpath_encoding.out +++ b/src/test/regress/expected/jsonpath_encoding.out @@ -81,84 +81,6 @@ select '"null \\u0000 escape"'::jsonpath as not_an_escape; "null \\u0000 escape" (1 row) --- checks for single-quoted values --- basic unicode input -SELECT E'\'\u\''::jsonpath; -- ERROR, incomplete escape -ERROR: invalid Unicode escape -LINE 1: SELECT E'\'\u\''::jsonpath; - ^ -HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX. -SELECT E'\'\u00\''::jsonpath; -- ERROR, incomplete escape -ERROR: invalid Unicode escape -LINE 1: SELECT E'\'\u00\''::jsonpath; - ^ -HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX. -SELECT E'\'\u000g\''::jsonpath; -- ERROR, g is not a hex digit -ERROR: invalid Unicode escape -LINE 1: SELECT E'\'\u000g\''::jsonpath; - ^ -HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX. 
-SELECT E'\'\u0000\''::jsonpath; -- OK, legal escape -ERROR: invalid Unicode escape value at or near "E'\'\u0000" -LINE 1: SELECT E'\'\u0000\''::jsonpath; - ^ -SELECT E'\'\uaBcD\''::jsonpath; -- OK, uppercase and lower case both OK - jsonpath ----------- - "ꯍ" -(1 row) - --- handling of unicode surrogate pairs -select E'\'\ud83d\ude04\ud83d\udc36\''::jsonpath as correct_in_utf8; - correct_in_utf8 ------------------ - "😄🐶" -(1 row) - -select E'\'\ud83d\ud83d\''::jsonpath; -- 2 high surrogates in a row -ERROR: invalid Unicode surrogate pair at or near "E'\'\ud83d\ud83d" -LINE 1: select E'\'\ud83d\ud83d\''::jsonpath; - ^ -select E'\'\ude04\ud83d\''::jsonpath; -- surrogates in wrong order -ERROR: invalid Unicode surrogate pair at or near "E'\'\ude04" -LINE 1: select E'\'\ude04\ud83d\''::jsonpath; - ^ -select E'\'\ud83dX\''::jsonpath; -- orphan high surrogate -ERROR: invalid Unicode surrogate pair at or near "E'\'\ud83dX" -LINE 1: select E'\'\ud83dX\''::jsonpath; - ^ -select E'\'\ude04X\''::jsonpath; -- orphan low surrogate -ERROR: invalid Unicode surrogate pair at or near "E'\'\ude04" -LINE 1: select E'\'\ude04X\''::jsonpath; - ^ ---handling of simple unicode escapes -select E'\'the Copyright \u00a9 sign\''::jsonpath as correct_in_utf8; - correct_in_utf8 ------------------------- - "the Copyright © sign" -(1 row) - -select E'\'dollar \u0024 character\''::jsonpath as correct_everywhere; - correct_everywhere ----------------------- - "dollar $ character" -(1 row) - -select E'\'dollar \\u0024 character\''::jsonpath as not_an_escape; - not_an_escape ----------------------- - "dollar $ character" -(1 row) - -select E'\'null \u0000 escape\''::jsonpath as not_unescaped; -ERROR: invalid Unicode escape value at or near "E'\'null \u0000" -LINE 1: select E'\'null \u0000 escape\''::jsonpath as not_unescaped; - ^ -select E'\'null \\u0000 escape\''::jsonpath as not_an_escape; -ERROR: unsupported Unicode escape sequence -LINE 1: select E'\'null \\u0000 escape\''::jsonpath as 
not_an_escape... - ^ -DETAIL: \u0000 cannot be converted to text. -- checks for quoted key names -- basic unicode input SELECT '$."\u"'::jsonpath; -- ERROR, incomplete escape diff --git a/src/test/regress/expected/jsonpath_encoding_1.out b/src/test/regress/expected/jsonpath_encoding_1.out index e6dff25..c8cc217 100644 --- a/src/test/regress/expected/jsonpath_encoding_1.out +++ b/src/test/regress/expected/jsonpath_encoding_1.out @@ -78,78 +78,6 @@ select '"null \\u0000 escape"'::jsonpath as not_an_escape; "null \\u0000 escape" (1 row) --- checks for single-quoted values --- basic unicode input -SELECT E'\'\u\''::jsonpath; -- ERROR, incomplete escape -ERROR: invalid Unicode escape -LINE 1: SELECT E'\'\u\''::jsonpath; - ^ -HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX. -SELECT E'\'\u00\''::jsonpath; -- ERROR, incomplete escape -ERROR: invalid Unicode escape -LINE 1: SELECT E'\'\u00\''::jsonpath; - ^ -HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX. -SELECT E'\'\u000g\''::jsonpath; -- ERROR, g is not a hex digit -ERROR: invalid Unicode escape -LINE 1: SELECT E'\'\u000g\''::jsonpath; - ^ -HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX. -SELECT E'\'\u0000\''::jsonpath; -- OK, legal escape -ERROR: invalid Unicode escape value at or near "E'\'\u0000" -LINE 1: SELECT E'\'\u0000\''::jsonpath; - ^ -SELECT E'\'\uaBcD\''::jsonpath; -- OK, uppercase and lower case both OK -ERROR: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8 at ornear "E'\'\uaBcD" -LINE 1: SELECT E'\'\uaBcD\''::jsonpath; - ^ --- handling of unicode surrogate pairs -select E'\'\ud83d\ude04\ud83d\udc36\''::jsonpath as correct_in_utf8; -ERROR: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8 at ornear "E'\'\ud83d\ude04" -LINE 1: select E'\'\ud83d\ude04\ud83d\udc36\''::jsonpath as correct_... 
- ^ -select E'\'\ud83d\ud83d\''::jsonpath; -- 2 high surrogates in a row -ERROR: invalid Unicode surrogate pair at or near "E'\'\ud83d\ud83d" -LINE 1: select E'\'\ud83d\ud83d\''::jsonpath; - ^ -select E'\'\ude04\ud83d\''::jsonpath; -- surrogates in wrong order -ERROR: invalid Unicode surrogate pair at or near "E'\'\ude04" -LINE 1: select E'\'\ude04\ud83d\''::jsonpath; - ^ -select E'\'\ud83dX\''::jsonpath; -- orphan high surrogate -ERROR: invalid Unicode surrogate pair at or near "E'\'\ud83dX" -LINE 1: select E'\'\ud83dX\''::jsonpath; - ^ -select E'\'\ude04X\''::jsonpath; -- orphan low surrogate -ERROR: invalid Unicode surrogate pair at or near "E'\'\ude04" -LINE 1: select E'\'\ude04X\''::jsonpath; - ^ ---handling of simple unicode escapes -select E'\'the Copyright \u00a9 sign\''::jsonpath as correct_in_utf8; -ERROR: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8 at ornear "E'\'the Copyright \u00a9" -LINE 1: select E'\'the Copyright \u00a9 sign\''::jsonpath as correct... - ^ -select E'\'dollar \u0024 character\''::jsonpath as correct_everywhere; - correct_everywhere ----------------------- - "dollar $ character" -(1 row) - -select E'\'dollar \\u0024 character\''::jsonpath as not_an_escape; - not_an_escape ----------------------- - "dollar $ character" -(1 row) - -select E'\'null \u0000 escape\''::jsonpath as not_unescaped; -ERROR: invalid Unicode escape value at or near "E'\'null \u0000" -LINE 1: select E'\'null \u0000 escape\''::jsonpath as not_unescaped; - ^ -select E'\'null \\u0000 escape\''::jsonpath as not_an_escape; -ERROR: unsupported Unicode escape sequence -LINE 1: select E'\'null \\u0000 escape\''::jsonpath as not_an_escape... - ^ -DETAIL: \u0000 cannot be converted to text. 
-- checks for quoted key names -- basic unicode input SELECT '$."\u"'::jsonpath; -- ERROR, incomplete escape diff --git a/src/test/regress/sql/jsonpath.sql b/src/test/regress/sql/jsonpath.sql index 29ea77a..7afe252 100644 --- a/src/test/regress/sql/jsonpath.sql +++ b/src/test/regress/sql/jsonpath.sql @@ -30,10 +30,9 @@ select '$.a/+-1'::jsonpath; select '1 * 2 + 4 % -3 != false'::jsonpath; select '"\b\f\r\n\t\v\"\''\\"'::jsonpath; -select '''\b\f\r\n\t\v\"\''\\'''::jsonpath; select '"\x50\u0067\u{53}\u{051}\u{00004C}"'::jsonpath; -select '''\x50\u0067\u{53}\u{051}\u{00004C}'''::jsonpath; select '$.foo\x50\u0067\u{53}\u{051}\u{00004C}\t\"bar'::jsonpath; +select '"\z"'::jsonpath; -- unrecognized escape is just the literal char select '$.g ? ($.a == 1)'::jsonpath; select '$.g ? (@ == 1)'::jsonpath; diff --git a/src/test/regress/sql/jsonpath_encoding.sql b/src/test/regress/sql/jsonpath_encoding.sql index a3b5bc3..3a23b72 100644 --- a/src/test/regress/sql/jsonpath_encoding.sql +++ b/src/test/regress/sql/jsonpath_encoding.sql @@ -24,29 +24,6 @@ select '"dollar \\u0024 character"'::jsonpath as not_an_escape; select '"null \u0000 escape"'::jsonpath as not_unescaped; select '"null \\u0000 escape"'::jsonpath as not_an_escape; --- checks for single-quoted values - --- basic unicode input -SELECT E'\'\u\''::jsonpath; -- ERROR, incomplete escape -SELECT E'\'\u00\''::jsonpath; -- ERROR, incomplete escape -SELECT E'\'\u000g\''::jsonpath; -- ERROR, g is not a hex digit -SELECT E'\'\u0000\''::jsonpath; -- OK, legal escape -SELECT E'\'\uaBcD\''::jsonpath; -- OK, uppercase and lower case both OK - --- handling of unicode surrogate pairs -select E'\'\ud83d\ude04\ud83d\udc36\''::jsonpath as correct_in_utf8; -select E'\'\ud83d\ud83d\''::jsonpath; -- 2 high surrogates in a row -select E'\'\ude04\ud83d\''::jsonpath; -- surrogates in wrong order -select E'\'\ud83dX\''::jsonpath; -- orphan high surrogate -select E'\'\ude04X\''::jsonpath; -- orphan low surrogate - ---handling of simple 
unicode escapes -select E'\'the Copyright \u00a9 sign\''::jsonpath as correct_in_utf8; -select E'\'dollar \u0024 character\''::jsonpath as correct_everywhere; -select E'\'dollar \\u0024 character\''::jsonpath as not_an_escape; -select E'\'null \u0000 escape\''::jsonpath as not_unescaped; -select E'\'null \\u0000 escape\''::jsonpath as not_an_escape; - -- checks for quoted key names -- basic unicode input
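The escape-handling change in the patch above (point #2: an unrecognized escape such as `\z` now stands for the escaped character itself, instead of raising an error) can be sketched in Python. This is a simplified illustration under my own assumptions — the `unescape` helper and its escape table are hypothetical and cover only the single-character escapes and 4-digit `\uXXXX` forms, not the actual flex scanner rules:

```python
import re

# Simplified sketch of the jsonpath string-escape rules after the patch:
# known escapes map to control characters, \uXXXX decodes a code point,
# and an unrecognized escape (e.g. \z) yields the literal character.
KNOWN = {'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t', 'v': '\v'}

def unescape(s: str) -> str:
    out = []
    i = 0
    while i < len(s):
        c = s[i]
        if c != '\\':
            out.append(c)
            i += 1
            continue
        if i + 1 >= len(s):
            raise ValueError("unexpected end after backslash")
        nxt = s[i + 1]
        if nxt in KNOWN:
            out.append(KNOWN[nxt])
            i += 2
        elif nxt == 'u':
            m = re.match(r'u([0-9A-Fa-f]{4})', s[i + 1:])
            if not m:
                raise ValueError("invalid unicode sequence")
            out.append(chr(int(m.group(1), 16)))
            i += 6
        else:
            # unrecognized escape is just the literal character
            out.append(nxt)
            i += 2
    return ''.join(out)

print(unescape(r'\z'))        # 'z' -- matches the new '"\z"' regression test
print(unescape(r'\u0024'))    # '$'
```

Under the pre-patch behavior, the final `else` branch would instead raise "escape sequence is invalid".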
On 9/19/19 12:25 PM, Tom Lane wrote:
> I wrote:
>> I found a spot that seemed like a reasonable place, and added some
>> coverage of the point. Updated patch attached.
>
> Doc patch pushed.

Thanks! I did not get to review them last night, but upon reviewing them
just now, they looked great.

>> It seems to me that there are some discrepancies between what the spec
>> says and what jsonpath_scan.l actually does, so maybe we should take a
>> hard look at that code too. The biggest issue is that jsonpath_scan.l
>> seems to allow single- and double-quoted strings interchangeably, which is
>> OK per ECMAScript, but then the SQL/JSON spec seems to be saying that only
>> double-quoted strings are allowed. I'd rather be conservative about this
>> than get out in front of the spec and use syntax space that they might do
>> something else with someday.

I agree with erring on the side of the spec vs. what ECMAScript does. In
JSON, strings, identifiers, etc. are double-quoted. Anything that is
single-quoted will throw an error in a compliant JSON parser. Looking at
the user documentation for some other databases with SQL/JSON support,
this seems to back up your analysis.

> The attached proposed patch makes these changes:
>
> 1. Remove support for single-quoted literals in jsonpath.
>
> 2. Treat an unrecognized escape (e.g., "\z") as meaning the escaped
> character, rather than throwing an error.
>
> 3. A few cosmetic adjustments to make the jsonpath_scan code shorter and
> clearer (IMHO).

If this refers to s/any/other/, yes, I would agree it's clearer.

> As for #1, although the SQL/JSON tech report does reference ECMAScript
> which allows both single- and double-quoted strings, it seems to me
> that their intent is to allow only the double-quoted variant. They
> specifically reference JSON string literals at one point, and of course
> JSON only allows double-quoted. Also, all of their discussion and
> examples use double-quoted. Plus you'd have to be pretty nuts to want
> to use single-quoted when writing a jsonpath string literal inside a SQL
> literal (and the tech report seems to contemplate that jsonpaths MUST be
> string literals, though of course our implementation does not require
> that).

I agree with the above (though wrt single-quoting and literals, I have
seen stranger things).

> As for #2, the existing code throws an error, but this is contrary
> to clear statements in every single one of the relevant standards.

Makes sense.

I looked at the patch, but did not test it. From what I can see, it
looks good, but perhaps we could add a test to show that single-quoted
literals are unsupported?

Jonathan
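The claim above — that a compliant JSON parser rejects anything single-quoted — is easy to check directly. For instance, with Python's standard `json` module (used here only as one example of an RFC 8259-compliant parser):

```python
import json

# Per RFC 8259, JSON strings must be double-quoted; a compliant parser
# rejects single-quoted strings outright.
print(json.loads('{"key": "value"}'))   # parses fine: {'key': 'value'}

try:
    json.loads("{'key': 'value'}")      # single quotes: not valid JSON
except json.JSONDecodeError as exc:
    print("rejected:", exc)
```

This is the behavior the double-quote-only reading of the SQL/JSON spec lines up with, as opposed to ECMAScript's more permissive string syntax.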
Attachments
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> I looked at the patch, but did not test it. From what I can see, it
> looks good, but perhaps we could add a test to show that single-quoted
> literals are unsupported?

I thought about that, but it seems like it'd be memorializing some
other weird behavior:

regression=# select '''foo'''::jsonpath;
ERROR: syntax error, unexpected IDENT_P at end of jsonpath input
LINE 1: select '''foo'''::jsonpath;
               ^

regression=# select '''foo'' <= ''bar'''::jsonpath;
ERROR: syntax error, unexpected IDENT_P at or near " " of jsonpath input
LINE 1: select '''foo'' <= ''bar'''::jsonpath;
               ^

There isn't anything I like about these error messages. Seems like the
error handling in jsonpath_gram.y could use some cleanup too ...
although I don't think it's a task to tackle while we're rushing to get
v12 shippable.

			regards, tom lane
On 9/19/19 3:48 PM, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>> I looked at the patch, but did not test it. From what I can see, it
>> looks good, but perhaps we could add a test to show that single-quoted
>> literals are unsupported?
>
> I thought about that, but it seems like it'd be memorializing some
> other weird behavior:
>
> regression=# select '''foo'''::jsonpath;
> ERROR: syntax error, unexpected IDENT_P at end of jsonpath input
> LINE 1: select '''foo'''::jsonpath;
>                ^
>
> regression=# select '''foo'' <= ''bar'''::jsonpath;
> ERROR: syntax error, unexpected IDENT_P at or near " " of jsonpath input
> LINE 1: select '''foo'' <= ''bar'''::jsonpath;
>                ^

Ah yeah, those are some interesting errors.

> There isn't anything I like about these error messages.

Agreed. It would be nice to have tests around it, but yes, looking at
the regression output, one may scratch their head.

> Seems like the error handling in jsonpath_gram.y could use some
> cleanup too ... although I don't think it's a task to tackle while
> we're rushing to get v12 shippable.

IIRC, if we want to change the contents of an error message, we wait
until major releases. Is there anything we can do before 12 to avoid
messages like "unexpected IDENT_P" coming to a user? Would that be
something acceptable to fix in 12.1, or would it have to wait until 13?

Jonathan
Attachments
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> On 9/19/19 3:48 PM, Tom Lane wrote:
>> Seems like the error handling in jsonpath_gram.y could use some
>> cleanup too ... although I don't think it's a task to tackle while
>> we're rushing to get v12 shippable.

> IIRC, if we want to change the contents of an error message, we wait
> until major releases. Is there anything we can do before 12 to avoid
> messages like "unexpected IDENT_P" coming to a user? Would that be
> something acceptable to fix in 12.1, or would it have to wait until 13?

I think these messages are sufficiently confusing that we could call it
a bug fix to improve them. As long as we don't change the SQLSTATE
that's thrown, it's hard to claim that there's any real application
compatibility hazard from changing them.

I just don't want to call this point a release blocker. It's not about
changing any semantics or the set of things that work.

			regards, tom lane
On 9/19/19 6:18 PM, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>> On 9/19/19 3:48 PM, Tom Lane wrote:
>>> Seems like the error handling in jsonpath_gram.y could use some
>>> cleanup too ... although I don't think it's a task to tackle while
>>> we're rushing to get v12 shippable.
>
>> IIRC, if we want to change the contents of an error message, we wait
>> until major releases. Is there anything we can do before 12 to avoid
>> messages like "unexpected IDENT_P" coming to a user? Would that be
>> something acceptable to fix in 12.1, or would it have to wait until 13?
>
> I think these messages are sufficiently confusing that we could call
> it a bug fix to improve them. As long as we don't change the SQLSTATE
> that's thrown, it's hard to claim that there's any real application
> compatibility hazard from changing them.

Great. +1 on that.

> I just don't want to call this point a release blocker. It's not
> about changing any semantics or the set of things that work.

+100 on that.

Jonathan