Обсуждение: A really subtle lexer bug

Поиск
Список
Период
Сортировка

A really subtle lexer bug

От
Andrew Gierth
Дата:
Currently in PG, the precedence of = and <> is supposed to be equal, and
the precedence of unary - is very high.

So, (true=1 between 1 and 1) parses as ((true)=(1 between 1 and 1)),
and (true=-1 between 1 and 1) parses as ((true)=((-1) between 1 and 1)).
All good so far.

(true<>-1 between 1 and 1) parses as ((true<>(-1)) between 1 and 1). ???

The fault here is in the lexer. The input "<>-" is being lexed as an Op
followed by '-' rather than as NOT_EQUAL followed by '-' because it
looks like a match for a multi-character operator, with the - being
thrown back into the input afterwards. So the precedence of <> gets
inflated to that of Op, which is higher than BETWEEN.

More seriously, this also breaks named arguments:

create function f(a integer) returns integer language sql
 as $$ select a; $$;

select f(a => 1);  -- works
select f(a => -1);  -- works
select f(a =>-1);  -- ERROR:  column "a" does not exist

I guess the fix is to extend the existing special case code that checks
for one character left after removing trailing [+-] and also check for
the two-character ops "<>" ">=" "<=" "=>" "!=". 

-- 
Andrew (irc:RhodiumToad)


Re: A really subtle lexer bug

От
Andrew Gierth
Дата:
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

 Andrew> select f(a =>-1);  -- ERROR:  column "a" does not exist

 Andrew> I guess the fix is to extend the existing special case code
 Andrew> that checks for one character left after removing trailing [+-]
 Andrew> and also check for the two-character ops "<>" ">=" "<=" "=>"
 Andrew> "!=".

Patch attached.

This fixes two bugs: first the mis-lexing of two-char ops as mentioned
originally; second, the O(N^3) lexing time of strings of - or +
characters is reduced to O(N^2) (in practice it's better than O(N^2)
once N gets large because the bison stack gets blown out, ending the
loop early).

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index 0cd782827a..cc855ffb1c 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -885,20 +885,34 @@ other            .
                      * to forbid operator names like '?-' that could not be
                      * sequences of SQL operators.
                      */
-                    while (nchars > 1 &&
-                           (yytext[nchars - 1] == '+' ||
-                            yytext[nchars - 1] == '-'))
+                    if (nchars > 1 &&
+                        (yytext[nchars - 1] == '+' ||
+                         yytext[nchars - 1] == '-'))
                     {
                         int            ic;
 
                         for (ic = nchars - 2; ic >= 0; ic--)
                         {
-                            if (strchr("~!@#^&|`?%", yytext[ic]))
+                            char c = yytext[ic];
+                            if (c == '~' || c == '!' || c == '@' ||
+                                c == '#' || c == '^' || c == '&' ||
+                                c == '|' || c == '`' || c == '?' ||
+                                c == '%')
                                 break;
                         }
-                        if (ic >= 0)
-                            break; /* found a char that makes it OK */
-                        nchars--; /* else remove the +/-, and check again */
+                        if (ic < 0)
+                        {
+                            /*
+                             * didn't find a qualifying character, so remove
+                             * all trailing [+-]
+                             */
+                            for (nchars--;
+                                 nchars > 1 &&
+                                     (yytext[nchars - 1] == '+' ||
+                                      yytext[nchars - 1] == '-');
+                                 nchars--)
+                                continue;
+                        }
                     }
 
                     SET_YYLLOC();
@@ -916,6 +930,25 @@ other            .
                         if (nchars == 1 &&
                             strchr(",()[].;:+-*/%^<>=", yytext[0]))
                             return yytext[0];
+                        /*
+                         * Likewise, if what we have left is two chars, and
+                         * those match the tokens ">=", "<=", "=>", "<>" or
+                         * "!=", then we must return the sppropriate token
+                         * rather than the generic Op.
+                         */
+                        if (nchars == 2)
+                        {
+                            if (yytext[0] == '=' && yytext[1] == '>')
+                                return EQUALS_GREATER;
+                            if (yytext[0] == '>' && yytext[1] == '=')
+                                return GREATER_EQUALS;
+                            if (yytext[0] == '<' && yytext[1] == '=')
+                                return LESS_EQUALS;
+                            if (yytext[0] == '<' && yytext[1] == '>')
+                                return NOT_EQUALS;
+                            if (yytext[0] == '!' && yytext[1] == '=')
+                                return NOT_EQUALS;
+                        }
                     }
 
                     /*

Re: A really subtle lexer bug

От
Andrew Gierth
Дата:
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

 Andrew> Patch attached.

 Andrew> This fixes two bugs:

I'm going to split this into two patches, since the O(N^3) fix can be
backpatched further than the operator token fix; the latter bug was
introduced in 9.5 when operator precedences were corrected, while the
former has been there forever.

(But don't wait for me to post those before commenting on either issue)

-- 
Andrew (irc:RhodiumToad)


Re: A really subtle lexer bug

От
Tom Lane
Дата:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>  Andrew> I guess the fix is to extend the existing special case code
>  Andrew> that checks for one character left after removing trailing [+-]
>  Andrew> and also check for the two-character ops "<>" ">=" "<=" "=>"
>  Andrew> "!=".

> Patch attached.
> This fixes two bugs: first the mis-lexing of two-char ops as mentioned
> originally; second, the O(N^3) lexing time of strings of - or +
> characters is reduced to O(N^2) (in practice it's better than O(N^2)
> once N gets large because the bison stack gets blown out, ending the
> loop early).

Looks reasonable offhand (didn't test).  A couple of thoughts:

* Some regression tests exercising these code paths might be a good thing.

* There should likely be a comment near where EQUALS_GREATER and
friends are defined, pointing out that if we add any more multi-character
operators with special precedences, this code has to be taught about them.

            regards, tom lane


Re: A really subtle lexer bug

От
Andrew Gierth
Дата:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 >> Patch attached.
 >> This fixes two bugs: first the mis-lexing of two-char ops as mentioned
 >> originally; second, the O(N^3) lexing time of strings of - or +
 >> characters is reduced to O(N^2) (in practice it's better than O(N^2)
 >> once N gets large because the bison stack gets blown out, ending the
 >> loop early).

 Tom> Looks reasonable offhand (didn't test).  A couple of thoughts:

 Tom> * Some regression tests exercising these code paths might be a
 Tom> good thing.

Agreed. Any preferences where they should go?

 Tom> * There should likely be a comment near where EQUALS_GREATER and
 Tom> friends are defined, pointing out that if we add any more
 Tom> multi-character operators with special precedences, this code has
 Tom> to be taught about them.

Agreed; will do this.

-- 
Andrew (irc:RhodiumToad)


Re: A really subtle lexer bug

От
Tom Lane
Дата:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> * Some regression tests exercising these code paths might be a
>  Tom> good thing.

> Agreed. Any preferences where they should go?

There's not really a "lexer/grammar" test script.  Maybe you could
drop it into create_operator.sql, especially if you need to make any
funny operators to test with?

            regards, tom lane


Re: A really subtle lexer bug

От
Andrew Gierth
Дата:
Here's what I will push unless there's something important I missed.

I split the regression tests between create_operator.sql and
polymorphism.sql, because => is really "named argument syntax" rather
than an operator as such, and polymorphism.sql is where the existing
tests were for that.

I tried to keep the three lexers as closely matched as possible, even
though psqlscan.l doesn't actually need some of the changes.

-- 
Andrew (irc:RhodiumToad)


Вложения

Re: A really subtle lexer bug

От
Tom Lane
Дата:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Here's what I will push unless there's something important I missed.

Stylistic nitpick: I don't especially like "continue" as the body of
a for-loop.  How about instead of this:

                for (nchars--;
                     nchars > 1 &&
                     (yytext[nchars - 1] == '+' ||
                      yytext[nchars - 1] == '-');
                     nchars--)
                    continue;

do this:

                do {
                    nchars--;
                } while (nchars > 1 &&
                         (yytext[nchars - 1] == '+' ||
                          yytext[nchars - 1] == '-'));

That's a clearer separation between loop action and loop test, and
it makes it more obvious that you're relying on the loop condition
to be true at the start.

Also, I'm not entirely convinced that replacing the strchr() with
a handmade equivalent is a good idea.  Some versions of strchr()
are pretty quick.

No objections beyond that.

            regards, tom lane