Re: BUG #15273: Lexer bug with UESCAPE

From Tom Lane
Subject Re: BUG #15273: Lexer bug with UESCAPE
Date
Msg-id 18384.1531343105@sss.pgh.pa.us
In reply to Re: BUG #15273: Lexer bug with UESCAPE  (Andrew Gierth <andrew@tao11.riddles.org.uk>)
Responses Re: BUG #15273: Lexer bug with UESCAPE  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> Also, I'm going to push back on the claim that allowing comments
>  Tom> there is required by the SQL spec.

> These are the rules that (as far as I can see) apply to that case:
> 5.2 <token> and <separator>
>   7) Any <token> may be followed by a <separator>.

Right, but that only gets the result you claim if you suppose that a
<Unicode character string literal> consists of more than one <token>.
I don't think so, because the start of the section states

<token> ::=
    <nondelimiter token>
  | <delimiter token>

<nondelimiter token> ::=
    <regular identifier>
  | <key word>
  | <unsigned numeric literal>
  | <national character string literal>
  | <binary string literal>
  | <large object length token>
  | <Unicode delimited identifier>
  | <Unicode character string literal>
  | <SQL language identifier>

and then reading down, we have

<Unicode character string literal> ::=
  [ <introducer> <character set specification> ]
      U <ampersand> <quote> [ <Unicode representation>... ] <quote>
      [ { <separator> <quote> [ <Unicode representation>... ] <quote> }... ]
      <Unicode escape specifier>

There isn't anything here that equates <token> with any sub-part
of a <Unicode character string literal>.  Unless you want to argue
that <ampersand> and <quote> can be derived from <delimiter token>,
but if you take that path you're left to explain why whitespace
at the start of the literal contents is data and not a <separator>.

Of course, there's certainly an argument to be made that the intent
is that the U&'...' part be one token and then UESCAPE and the escape
string are two more, but the SQL committee just can't specify their
way out of a paper bag.  We knew that already.

Anyway, as I said before, I can't see that we would want to fix this
by extending the existing implementation --- you'd need a bunch more
exclusive lexer states which would be a pain in the rear, and possibly
a performance problem too.  I do wonder though why Peter did it like that.
You could imagine returning three tokens to the grammar and letting the
grammar merge them, which'd make the lexer aspect of this far simpler
and perhaps not complicate the grammar too much.
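
To make the three-token idea concrete, here is a minimal sketch in Python rather than in the actual flex/bison sources. Everything in it is an assumption made for illustration: the token kinds (USCONST, UESCAPE, SCONST, STRING), the function names, and the simplified decoding, which ignores surrogate pairs and leaves malformed escapes untouched. It only shows that once the lexer hands back the U&'...' body, the UESCAPE key word and the escape string as separate tokens, whitespace and comments between them are skipped like any other separator, and a later pass can merge them.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text); the kinds are hypothetical names

TOKEN_RE = re.compile(r"""
    (?P<WS>\s+|--[^\n]*|/\*.*?\*/)      # separators: whitespace and comments
  | (?P<USCONST>[Uu]&'(?:[^']|'')*')    # U&'...' body, left undecoded
  | (?P<SCONST>'(?:[^']|'')*')          # ordinary quoted string
  | (?P<WORD>[A-Za-z_][A-Za-z_0-9]*)    # identifier or key word
  | (?P<OTHER>.)                        # anything else, one character
""", re.VERBOSE | re.DOTALL)


def lex(sql: str) -> List[Token]:
    """Return tokens, dropping separators and tagging the UESCAPE key word."""
    tokens = []
    for m in TOKEN_RE.finditer(sql):
        kind, text = m.lastgroup, m.group()
        if kind == "WS":
            continue
        if kind == "WORD" and text.lower() == "uescape":
            kind = "UESCAPE"
        tokens.append((kind, text))
    return tokens


def decode(body: str, esc: str) -> str:
    """Expand esc+XXXX, esc+'+'+XXXXXX and doubled esc (simplified)."""
    e = re.escape(esc)
    pattern = re.compile(
        e + r"(?:(" + e + r")|([0-9A-Fa-f]{4})|\+([0-9A-Fa-f]{6}))")

    def repl(m):
        if m.group(1):
            return esc  # doubled escape character stands for itself
        return chr(int(m.group(2) or m.group(3), 16))

    return pattern.sub(repl, body)


def merge_uescape(tokens: List[Token]) -> List[Token]:
    """Grammar-like pass: fold USCONST [UESCAPE SCONST] into one STRING."""
    out, i = [], 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "USCONST":
            esc = "\\"  # default escape character
            if i + 1 < len(tokens) and tokens[i + 1][0] == "UESCAPE":
                if i + 2 >= len(tokens) or tokens[i + 2][0] != "SCONST":
                    raise SyntaxError("UESCAPE must be followed by a string")
                esc = tokens[i + 2][1][1:-1].replace("''", "'")
                if len(esc) != 1:
                    raise ValueError(
                        "Unicode escape specifier must be a single character")
                i += 2
            body = text[3:-1].replace("''", "'")  # strip U&' and closing '
            out.append(("STRING", decode(body, esc)))
        else:
            out.append((kind, text))
        i += 1
    return out
```

With this sketch, merge_uescape(lex("select U&'d!0061t!+000061' /* c */ uescape '!'")) yields the tokens for select followed by a STRING whose value is data, with the comment between the literal and UESCAPE simply skipped. The cost is that the merge pass, not the lexer, now has to know about the escape character.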

Another thing I noticed about the existing implementation is that it's
very unfriendly if you write an invalid escape specifier:

postgres=# select U&'foo' uescape 'bar';
ERROR:  syntax error at or near "'bar'"
LINE 1: select U&'foo' uescape 'bar';
                               ^

It'd be much nicer to say something along the lines of "Unicode escape
specifier must be a single character", but shoehorning that into the
lexer-based implementation would be a giant pain.
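
As an illustration of the kind of check meant here, the following Python sketch validates an escape specifier after its quotes have been stripped. The function name and the wording of the second message are hypothetical. The set of excluded characters follows the PostgreSQL documentation for U& strings (no hexadecimal digit, plus sign, single quote, double quote, or whitespace). It is not how the server implements the check.

```python
def check_escape_specifier(spec: str) -> str:
    """Return spec if it is usable as a Unicode escape character.

    `spec` is the contents of the string after UESCAPE, quotes removed.
    Hypothetical helper for illustration only.
    """
    if len(spec) != 1:
        raise ValueError(
            "Unicode escape specifier must be a single character")
    # These characters would be ambiguous with the escape syntax itself.
    if spec in "0123456789abcdefABCDEF+'\"" or spec.isspace():
        raise ValueError(f"invalid Unicode escape character: {spec!r}")
    return spec
```

Calling check_escape_specifier("bar") raises the single-character message instead of a bare syntax error, while check_escape_specifier("!") returns the character unchanged.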

I'm not excited enough about any of this to spend more time on it,
but maybe somebody else is.

            regards, tom lane

