Re: Define jsonpath functions as stable

From Tom Lane
Subject Re: Define jsonpath functions as stable
Date
Msg-id 31931.1568668225@sss.pgh.pa.us
In response to Re: Define jsonpath functions as stable  ("Jonathan S. Katz" <jkatz@postgresql.org>)
Responses Re: Define jsonpath functions as stable  ("Jonathan S. Katz" <jkatz@postgresql.org>)
Re: Define jsonpath functions as stable  (Chapman Flack <chap@anastigmatix.net>)
List pgsql-hackers
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> On 9/16/19 11:20 AM, Tom Lane wrote:
>> I think we could possibly get away with not having any special marker
>> on regexes, but just explaining in the documentation that "features
>> so-and-so are not implemented".  Writing that text would require closer
>> analysis than I've seen in this thread as to exactly what the differences
>> are.

> +1, and likely would need some example strings too that highlight the
> difference in how they are processed.

I spent an hour digging through these specs.  I was initially troubled
by the fact that XML Schema regexps are implicitly anchored, i.e. must
match the whole string; that's a huge difference from POSIX.  However,
19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
except for some weirdness around newline matching; and that spec
clearly says that fn:matches treats its pattern argument as NOT anchored.
So it looks like we end up in the same place as POSIX for this.
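
The anchoring distinction can be illustrated with a small sketch. It uses
Python's re module purely as a stand-in (it is neither Spencer's library
nor an XQuery engine): re.fullmatch behaves like an implicitly anchored
XML Schema regexp, while re.search behaves like unanchored fn:matches or
POSIX matching. The pattern and subject string are made up for
illustration.

```python
import re

pattern = r"b+"
subject = "abbbc"

# Implicitly anchored (XML Schema style): the pattern must cover the whole string.
anchored = re.fullmatch(pattern, subject) is not None

# Unanchored (fn:matches / POSIX style): a match anywhere in the string suffices.
unanchored = re.search(pattern, subject) is not None

print(anchored, unanchored)
```

Under the anchored reading the string is rejected because of the leading
"a" and trailing "c"; under the unanchored reading it is accepted.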

Otherwise, the pattern language differences I could find are all details
of character class expressions (bracket expressions, such as "[a-z0-9]")
and escapes that are character class shorthands:

* We don't have "character class subtraction".  I'd be pretty hesitant
to add that to our regexp language because it seems to change "-" into
a metacharacter, which would break an awful lot of regexps.  I might
be misunderstanding their syntax for it, because elsewhere that spec
explicitly claims that "-" is not a metacharacter.

* Character class elements can be #xNN (NN being hex digits), which seems
equivalent to POSIX \xNN as long as you're using UTF8 encoding.  Again,
the compatibility costs of allowing that don't seem attractive, since #
isn't a metacharacter today.

* Character class elements can be \p{UnicodeProperty} or
the complement \P{UnicodeProperty}, where there are a bunch of different
possible properties.  Perhaps we could add that someday; since there's no
reason to escape "p" or "P" today, this doesn't seem like it'd be a huge
compatibility hit.  But I'm content to document this as unimplemented
for now.

* XQuery adds character class shorthands \i (complement \I) for "initial
name characters" and \c (complement \C) for "NameChar".  Same as above;
maybe add someday, but no hurry.

* It looks like XQuery's \w class might allow more characters than our
interpretation does, and hence \W allows fewer.  But since \w devolves
to what libc thinks the "alnum" class is, it's at least possible that
some locales might do the same thing XQuery calls for.

* Likewise, any other discrepancies between the Unicode-centric character
class definitions in XQuery and what our stuff does are well within the
boundaries of locale variances.  So I don't feel too bad about that.

* The SQL-spec newline business mentioned above is a possible exception:
it appears to require that when '.' is allowed to match newlines, a
single '.' should match a '\r\n' Windows newline.  I think we can
document that and move on.

* The x flag in XQuery is defined as ignoring all whitespace in
the pattern except within character class expressions.  Spencer's
x flag does mostly that, but it thinks that "\ " means a literal space
whereas XQuery explicitly says that the space is ignored and the
backslash applies to the next non-space character.  (That's just
weird, in my book.)  Also, Spencer's x mode causes # to begin
a comment extending to EOL, which is a nice thing XQuery hasn't
got, and it says you can't put spaces within multi-character
symbols like "(?:", which presumably is allowed with XQuery's "x".

I feel a bit uncomfortable with these inconsistencies in x-flag
rules.  We could probably teach the regexp library to have an
alternate expanded mode that matches XQuery's rules, but that's
not a project to tackle for v12.  I tentatively recommend that
we remove the jsonpath "x" flag for the time being.
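
The "\ " discrepancy described above can be sketched as follows. This is
only an illustration under stated assumptions: strip_xquery_whitespace is
a hypothetical helper, not code from PostgreSQL or any XQuery engine, and
Python's re.VERBOSE mode stands in for Spencer's x flag only in the one
respect that it also treats a backslash followed by a space as a literal
space.

```python
import re

XQUERY_WHITESPACE = "\t\n\r "  # whitespace characters assumed to be stripped


def strip_xquery_whitespace(pattern: str) -> str:
    """Hypothetical helper: drop whitespace outside character class expressions."""
    out = []
    depth = 0        # nesting depth of [...] expressions
    escaped = False  # True while a backslash is waiting for its operand
    for ch in pattern:
        if depth == 0 and ch in XQUERY_WHITESPACE:
            # Removed before escapes are interpreted, so a pending
            # backslash carries over to the next non-space character.
            continue
        out.append(ch)
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
    return "".join(out)


pattern = r"a\ .b"
xquery_style = strip_xquery_whitespace(pattern)  # backslash now escapes the dot

# XQuery-style reading: "a", a literal ".", "b".
print(bool(re.fullmatch(xquery_style, "a.b")))
# Literal-space reading: "a", a literal space, any character, "b".
print(bool(re.fullmatch(pattern, "a xb", re.VERBOSE)))
```

The same source pattern therefore accepts different strings depending on
which set of expanded-mode rules is applied.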

Also, I noted some things that seem to be flat out sloppiness
in the XQuery flag conversions:

* The newline-matching flags (m and s flags) can be mapped to
features of Spencer's library, but jsonpath_gram.y does so
incorrectly.

* XQuery says that the q flag overrides m, s, and x flags, which is
exactly the opposite of what our code does; besides which, the code
is flag-order-sensitive, which is just wrong.
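
One way to get order-independent flag handling is to collect all flags
first and apply the override afterwards. The sketch below shows that
idea only; effective_flags is a hypothetical name, this is not the
jsonpath_gram.y code, and leaving the i flag untouched when q is present
is an assumption of the sketch.

```python
def effective_flags(flags: str) -> frozenset:
    """Hypothetical helper: resolve regex flags independent of their order."""
    allowed = set("imsxq")
    seen = set()
    for f in flags:
        if f not in allowed:
            raise ValueError(f"unrecognized flag: {f!r}")
        seen.add(f)
    # q (literal pattern) overrides m, s and x, wherever it appeared.
    if "q" in seen:
        seen -= {"m", "s", "x"}
    return frozenset(seen)


print(sorted(effective_flags("xq")), sorted(effective_flags("qx")))
```

Because the override is applied only after the whole flag string has been
read, "xq" and "qx" resolve to the same set.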

These last two are simple to fix and we should just go do it.
Otherwise, I think we're okay with regarding Spencer's library
as being a sufficiently close approximation to LIKE_REGEX.
We need some documentation work though.

            regards, tom lane


