Fix parsing of identifiers in jsonpath

From Nikita Glukhov
Subject Fix parsing of identifiers in jsonpath
Msg-id 3603843f-b5fc-d8d9-c842-ccde2288b967@postgrespro.ru
Replies Re: Fix parsing of identifiers in jsonpath  (Chapman Flack <chap@anastigmatix.net>)
Re: Fix parsing of identifiers in jsonpath  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Fix parsing of identifiers in jsonpath  (Nikita Glukhov <n.gluhov@postgrespro.ru>)
List pgsql-hackers
Hi!

Unfortunately, the jsonpath lexer, in contrast to the jsonpath parser, was
written by Teodor and me without proper attention to the standard.  The JSON
path lexical grammar is borrowed from the external ECMAScript standard [1],
and we did not study it carefully.

There were numerous deviations from the ECMAScript standard in our jsonpath
implementation, most of which are fixed in the attached patch:

1. Identifiers (unquoted JSON key names) should start with one of the
   following (see [2]):
   - a Unicode symbol having the Unicode property "ID_Start" (see [3])
   - a Unicode escape sequence '\uXXXX' or '\u{X...}'
   - '$'
   - '_'

   And they should continue with one of:
   - a Unicode symbol having the Unicode property "ID_Continue" (see [3])
   - a Unicode escape sequence
   - '$'
   - ZWNJ
   - ZWJ

2. '$' is also allowed inside identifiers, so it is possible to write
   something like '$.a$$b'.

3. Variable references '$var' are regular identifiers that simply start with
   the '$' sign, and there is no syntax like '$"var"', because quotes are not
   allowed in identifiers.

4. Even if the Unicode escape sequence '\uXXXX' is used, it cannot produce
   special symbols or whitespace, because identifiers are displayed without
   quoting (e.g. '$\u{20}' cannot be displayed as '$" "', much less as the
   string '"$ "').

5. All codepoints in '\u{XXXXXX}' greater than 0x10FFFF should be forbidden.

6. The six single-character escape sequences (\b \t \r \f \n \v) should only
   be supported inside quoted strings.
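
To illustrate items 4-6, here is a minimal sketch in Python (it is not the
patch, and the function name decode_identifier_escapes is made up for this
example).  It decodes '\uXXXX' and '\u{X...}' inside an unquoted identifier,
rejects code points above 0x10FFFF, rejects escapes that would produce
whitespace, and rejects every other backslash sequence.  It assumes surrogate
pairs do not need to be combined, and a complete check would also test each
decoded character against the identifier character classes from item 1.

```python
import re

MAX_CODEPOINT = 0x10FFFF

# \uXXXX (exactly four hex digits) or \u{X...} (one or more hex digits)
_ESCAPE_RE = re.compile(r"\\u(?:([0-9A-Fa-f]{4})|\{([0-9A-Fa-f]+)\})")


def decode_identifier_escapes(text: str) -> str:
    """Replace Unicode escapes in an unquoted identifier with their characters.

    Hypothetical helper for illustration only; raises ValueError on input
    that the rules above forbid.
    """
    out = []
    pos = 0
    while pos < len(text):
        if text[pos] != "\\":
            out.append(text[pos])
            pos += 1
            continue
        match = _ESCAPE_RE.match(text, pos)
        if match is None:
            # Also covers \b \t \r \f \n \v, which belong in quoted strings only.
            raise ValueError(f"invalid escape in identifier at offset {pos}")
        codepoint = int(match.group(1) or match.group(2), 16)
        if codepoint > MAX_CODEPOINT:
            raise ValueError(f"code point out of range: {codepoint:#x}")
        ch = chr(codepoint)
        if ch.isspace():
            # The identifier is printed unquoted, so whitespace cannot appear.
            raise ValueError(f"escape produces whitespace: U+{codepoint:04X}")
        out.append(ch)
        pos = match.end()
    return "".join(out)
```

With this sketch, 'a\u0062c' decodes to 'abc', while '\u{110000}', '\u{20}'
and 'a\tb' (written with a literal backslash) are all rejected.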


I don't know whether it is possible to check the Unicode properties "ID_Start"
and "ID_Continue" in Postgres, or what ZWNJ/ZWJ are.  For now, the set of
characters that can start an identifier is simply determined by excluding all
recognized special characters.
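
For comparison, the rules from item 1 can be approximated outside Postgres
with Unicode general categories.  The sketch below is Python, not something
the patch does, and all function names in it are made up.  It treats letters
and letter numbers (L*, Nl) as "ID_Start" and adds Mn, Mc, Nd and Pc for
"ID_Continue".  That is only an approximation: the exact definitions in [3]
also involve Other_ID_Start, Other_ID_Continue and the Pattern_Syntax and
Pattern_White_Space exclusions.  ZWNJ and ZWJ are taken to be U+200C (zero
width non-joiner) and U+200D (zero width joiner).

```python
import unicodedata

ZWNJ = "\u200c"  # zero width non-joiner
ZWJ = "\u200d"   # zero width joiner

# Approximation of ID_Start: letters and letter numbers.
_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Approximation of ID_Continue: ID_Start plus marks, digits, connectors.
_CONTINUE_CATEGORIES = _START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_start(ch: str) -> bool:
    """True if ch may begin an identifier under the approximated rules."""
    return ch in ("$", "_") or unicodedata.category(ch) in _START_CATEGORIES


def is_identifier_part(ch: str) -> bool:
    """True if ch may follow the first character of an identifier."""
    # '_' needs no special case here: its category is Pc.
    return ch in ("$", ZWNJ, ZWJ) or unicodedata.category(ch) in _CONTINUE_CATEGORIES


def is_identifier(name: str) -> bool:
    """Check an already-decoded (escape-free) identifier."""
    return (
        bool(name)
        and is_identifier_start(name[0])
        and all(is_identifier_part(ch) for ch in name[1:])
    )
```

Under these assumptions 'a$$b' and '$var' are accepted, while '1abc', 'a b'
and '$"var"' are not.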


The patch is not trivial, but I believe that it is not too late to fix this
in v12.


[1] https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar
[2] https://www.ecma-international.org/ecma-262/10.0/index.html#sec-names-and-keywords
[3] https://unicode.org/reports/tr31/
--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company