Обсуждение: A really subtle lexer bug
Currently in PG, the precedence of = and <> is supposed to be equal, and the precedence of unary - is very high. So, (true=1 between 1 and 1) parses as ((true)=(1 between 1 and 1)), and (true=-1 between 1 and 1) parses as ((true)=((-1) between 1 and 1)). All good so far. (true<>-1 between 1 and 1) parses as ((true<>(-1)) between 1 and 1). ??? The fault here is in the lexer. The input "<>-" is being lexed as an Op followed by '-' rather than as NOT_EQUAL followed by '-' because it looks like a match for a multi-character operator, with the - being thrown back into the input afterwards. So the precedence of <> gets inflated to that of Op, which is higher than BETWEEN. More seriously, this also breaks named arguments: create function f(a integer) returns integer language sql as $$ select a; $$; select f(a => 1); -- works select f(a => -1); -- works select f(a =>-1); -- ERROR: column "a" does not exist I guess the fix is to extend the existing special case code that checks for one character left after removing trailing [+-] and also check for the two-character ops "<>" ">=" "<=" "=>" "!=". -- Andrew (irc:RhodiumToad)
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes: Andrew> select f(a =>-1); -- ERROR: column "a" does not exist Andrew> I guess the fix is to extend the existing special case code Andrew> that checks for one character left after removing trailing [+-] Andrew> and also check for the two-character ops "<>" ">=" "<=" "=>" Andrew> "!=". Patch attached. This fixes two bugs: first the mis-lexing of two-char ops as mentioned originally; second, the O(N^3) lexing time of strings of - or + characters is reduced to O(N^2) (in practice it's better than O(N^2) once N gets large because the bison stack gets blown out, ending the loop early). -- Andrew (irc:RhodiumToad) diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l index 0cd782827a..cc855ffb1c 100644 --- a/src/backend/parser/scan.l +++ b/src/backend/parser/scan.l @@ -885,20 +885,34 @@ other . * to forbid operator names like '?-' that could not be * sequences of SQL operators. */ - while (nchars > 1 && - (yytext[nchars - 1] == '+' || - yytext[nchars - 1] == '-')) + if (nchars > 1 && + (yytext[nchars - 1] == '+' || + yytext[nchars - 1] == '-')) { int ic; for (ic = nchars - 2; ic >= 0; ic--) { - if (strchr("~!@#^&|`?%", yytext[ic])) + char c = yytext[ic]; + if (c == '~' || c == '!' || c == '@' || + c == '#' || c == '^' || c == '&' || + c == '|' || c == '`' || c == '?' || + c == '%') break; } - if (ic >= 0) - break; /* found a char that makes it OK */ - nchars--; /* else remove the +/-, and check again */ + if (ic < 0) + { + /* + * didn't find a qualifying character, so remove + * all trailing [+-] + */ + for (nchars--; + nchars > 1 && + (yytext[nchars - 1] == '+' || + yytext[nchars - 1] == '-'); + nchars--) + continue; + } } SET_YYLLOC(); @@ -916,6 +930,25 @@ other . if (nchars == 1 && strchr(",()[].;:+-*/%^<>=", yytext[0])) return yytext[0]; + /* + * Likewise, if what we have left is two chars, and + * those match the tokens ">=", "<=", "=>", "<>" or + * "!=", then we must return the sppropriate token + * rather than the generic Op. + */ + if (nchars == 2) + { + if (yytext[0] == '=' && yytext[1] == '>') + return EQUALS_GREATER; + if (yytext[0] == '>' && yytext[1] == '=') + return GREATER_EQUALS; + if (yytext[0] == '<' && yytext[1] == '=') + return LESS_EQUALS; + if (yytext[0] == '<' && yytext[1] == '>') + return NOT_EQUALS; + if (yytext[0] == '!' && yytext[1] == '=') + return NOT_EQUALS; + } } /*
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes: Andrew> Patch attached. Andrew> This fixes two bugs: I'm going to split this into two patches, since the O(N^3) fix can be backpatched further than the operator token fix; the latter bug was introduced in 9.5 when operator precedences were corrected, while the former has been there forever. (But don't wait for me to post those before commenting on either issue) -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > Andrew> I guess the fix is to extend the existing special case code > Andrew> that checks for one character left after removing trailing [+-] > Andrew> and also check for the two-character ops "<>" ">=" "<=" "=>" > Andrew> "!=". > Patch attached. > This fixes two bugs: first the mis-lexing of two-char ops as mentioned > originally; second, the O(N^3) lexing time of strings of - or + > characters is reduced to O(N^2) (in practice it's better than O(N^2) > once N gets large because the bison stack gets blown out, ending the > loop early). Looks reasonable offhand (didn't test). A couple of thoughts: * Some regression tests exercising these code paths might be a good thing. * There should likely be a comment near where EQUALS_GREATER and friends are defined, pointing out that if we add any more multi-character operators with special precedences, this code has to be taught about them. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: >> Patch attached. >> This fixes two bugs: first the mis-lexing of two-char ops as mentioned >> originally; second, the O(N^3) lexing time of strings of - or + >> characters is reduced to O(N^2) (in practice it's better than O(N^2) >> once N gets large because the bison stack gets blown out, ending the >> loop early). Tom> Looks reasonable offhand (didn't test). A couple of thoughts: Tom> * Some regression tests exercising these code paths might be a Tom> good thing. Agreed. Any preferences where they should go? Tom> * There should likely be a comment near where EQUALS_GREATER and Tom> friends are defined, pointing out that if we add any more Tom> multi-character operators with special precedences, this code has Tom> to be taught about them. Agreed; will do this. -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Tom> * Some regression tests exercising these code paths might be a > Tom> good thing. > Agreed. Any preferences where they should go? There's not really a "lexer/grammar" test script. Maybe you could drop it into create_operator.sql, especially if you need to make any funny operators to test with? regards, tom lane
Here's what I will push unless there's something important I missed. I split the regression tests between create_operator.sql and polymorphism.sql, because => is really "named argument syntax" rather than an operator as such, and polymorphism.sql is where the existing tests were for that. I tried to keep the three lexers as closely matched as possible, even though psqlscan.l doesn't actually need some of the changes. -- Andrew (irc:RhodiumToad)
Вложения
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > Here's what I will push unless there's something important I missed. Stylistic nitpick: I don't especially like "continue" as the body of a for-loop. How about instead of this: for (nchars--; nchars > 1 && (yytext[nchars - 1] == '+' || yytext[nchars - 1] == '-'); nchars--) continue; do this: do { nchars--; } while (nchars > 1 && (yytext[nchars - 1] == '+' || yytext[nchars - 1] == '-')); That's a clearer separation between loop action and loop test, and it makes it more obvious that you're relying on the loop condition to be true at the start. Also, I'm not entirely convinced that replacing the strchr() with a handmade equivalent is a good idea. Some versions of strchr() are pretty quick. No objections beyond that. regards, tom lane