Обсуждение: Status report: getting plpgsql to use the core lexer

Поиск
Список
Период
Сортировка

Status report: getting plpgsql to use the core lexer

От
Tom Lane
Дата:
I've spent the past several days working on the project I suggested here:
http://archives.postgresql.org/message-id/18653.1239741426@sss.pgh.pa.us
of getting rid of plpgsql's private lexer and having it use the core
lexer instead.  I've run out of time for that and need to go focus on
commitfest reviewing, but I thought I'd make a brain dump first.
I have a patch that sort of works (attached for amusement purposes);
but it doesn't quite pass all the regression tests, and I've concluded
that there are a couple of basic mistakes in the way I approached it.

One problem that wasn't obvious when I started is that if you are trying
to use a reentrant lexer, Bison insists on including its YYSTYPE union
in the call signature of the lexer.  Of course, YYSTYPE means different
things to the core grammar and plpgsql's grammar.  I tried to work around
that by having an interface layer that would (among other duties)
translate as needed.  It turned out to be a real PITA, not least because
you can't include both definitions into the same C file.  The scheme
I have has more or less failed --- I think I'd need *two* interface layers
to make it work without unmaintainable kluges.  It would probably be
better to try to adjust the core lexer's API some more so that it does
not depend on the core YYSTYPE, but I'm not sure yet how to get Bison
to play along without injecting an interface layer (and hence wasted
cycles) into the core grammar/lexer interface.

Another pretty serious issue is that the current plpgsql lexer treats
various sorts of qualified names as single tokens.  I had thought this
could be worked around in the interface layer by doing more lookahead.
You can do that, and it mostly works, but it's mighty tedious.  The big
problem is that "yytext" gets out of step --- it will point at the last
token the core lexer has processed, and there's no good way to back it up
after lookahead.  I spent a fair amount of time trying to work around that
by eliminating uses of "yytext" in plpgsql, and mostly succeeded, but
there are still some left.  (Some of the remaining regression failures are
error messages that point at the wrong token because they rely on yytext.)

Now, having name lookup happen at the lexical level is pretty bogus
anyhow.  The long-term solution here is probably to avoid doing lookup in
the plpgsql lexer and move it into some sort of callback hook in the main
parser, as we've discussed before.  I didn't want to get into that right
away, but I'm now thinking it has to happen before not after refactoring
the lexer code.  One issue that has to be surmounted before that can
happen is that plpgsql currently throws away all knowledge of syntactic
scope after initial processing of a function --- the "name stack" is no
longer available when we want to parse individual SQL commands.  We can
probably rearrange that design but it's another bit of work I don't have
time for right now.

The plpgsql code also currently has a bunch of "unreserved keywords"
that are only significant in one or two contexts, and are implemented
by doing pg_strcasecmp(yytext, "keyword") tests instead of letting the
lexer know they exist.  This is one of the things that got broken by
the above yytext-synchronization issue.  What the patch is doing is
checking for identifiers spelled to match, but it's really wrong because
it can't distinguish quoted and unquoted occurrences of a keyword.
I'm inclined to think the right fix is to make an honest-to-goodness
concept of reserved and unreserved keywords in plpgsql, as we do in
the core grammar.  A lot of the words that currently are reserved words
probably wouldn't need to be, if we had that infrastructure.

Also, the more I messed with this the more I thought it was time to throw
away plpgsql's error location tracking and start over.  That was all
written before we had decent location tracking in the core lexer/parser,
and we could do a much better and more consistent job if we relied on
the core facilities instead of rolling our own.

So there's a lot left to do here, and it has to go on the back burner
for now.

            regards, tom lane


Вложения

Re: Status report: getting plpgsql to use the core lexer

От
Alvaro Herrera
Дата:
Tom Lane wrote:

> Another pretty serious issue is that the current plpgsql lexer treats
> various sorts of qualified names as single tokens.  I had thought this
> could be worked around in the interface layer by doing more lookahead.
> You can do that, and it mostly works, but it's mighty tedious.  The big
> problem is that "yytext" gets out of step --- it will point at the last
> token the core lexer has processed, and there's no good way to back it up
> after lookahead.  I spent a fair amount of time trying to work around that
> by eliminating uses of "yytext" in plpgsql, and mostly succeeded, but
> there are still some left.  (Some of the remaining regression failures are
> error messages that point at the wrong token because they rely on yytext.)

Just wondering if there are additional regressions not detected due to
pg_regress using the ignore-whitespace option to diff.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Status report: getting plpgsql to use the core lexer

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> ...  I spent a fair amount of time trying to work around that
>> by eliminating uses of "yytext" in plpgsql, and mostly succeeded, but
>> there are still some left.  (Some of the remaining regression failures are
>> error messages that point at the wrong token because they rely on yytext.)

> Just wondering if there are additional regressions not detected due to
> pg_regress using the ignore-whitespace option to diff.

Good question, but I doubt it --- those messages all use double quotes
around the yytext string, and I believe that eg. "foo" and " foo" are
different even under --ignore-whitespace.

I just finished wiping all that stuff from my work directory, or I'd
be able to give you a non-guesswork answer :-( ... but it's not
worth regenerating the build for.
        regards, tom lane


Re: Status report: getting plpgsql to use the core lexer

От
"Kevin Grittner"
Дата:
Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> One problem that wasn't obvious when I started is that if you are
> trying to use a reentrant lexer, Bison insists on including its
> YYSTYPE union in the call signature of the lexer.  Of course,
> YYSTYPE means different things to the core grammar and plpgsql's
> grammar.  I tried to work around that by having an interface layer
> that would (among other duties) translate as needed.  It turned out
> to be a real PITA, not least because you can't include both
> definitions into the same C file.  The scheme I have has more or
> less failed --- I think I'd need *two* interface layers to make it
> work without unmaintainable kluges.  It would probably be better to
> try to adjust the core lexer's API some more so that it does not
> depend on the core YYSTYPE, but I'm not sure yet how to get Bison to
> play along without injecting an interface layer (and hence wasted
> cycles) into the core grammar/lexer interface.
> 
> Another pretty serious issue is that the current plpgsql lexer
> treats various sorts of qualified names as single tokens.  I had
> thought this could be worked around in the interface layer by doing
> more lookahead.  You can do that, and it mostly works, but it's
> mighty tedious.  The big problem is that "yytext" gets out of step
> --- it will point at the last token the core lexer has processed,
> and there's no good way to back it up after lookahead.  I spent a
> fair amount of time trying to work around that by eliminating uses
> of "yytext" in plpgsql, and mostly succeeded, but there are still
> some left.  (Some of the remaining regression failures are error
> messages that point at the wrong token because they rely on yytext.)
> 
> Now, having name lookup happen at the lexical level is pretty bogus
> anyhow.  The long-term solution here is probably to avoid doing
> lookup in the plpgsql lexer and move it into some sort of callback
> hook in the main parser, as we've discussed before.  I didn't want
> to get into that right away, but I'm now thinking it has to happen
> before not after refactoring the lexer code.  One issue that has to
> be surmounted before that can happen is that plpgsql currently
> throws away all knowledge of syntactic scope after initial
> processing of a function --- the "name stack" is no longer available
> when we want to parse individual SQL commands.  We can probably
> rearrange that design but it's another bit of work I don't have time
> for right now.
All of this sounds pretty familiar to me.  As you may recall, our
framework includes a SQL parser which parses the subset of standard
SQL we feel is portable enough, and generates Java classes to
implement the code in "lowest common denominator" SQL with all
procedural code for triggers and stored procedures handled in Java
(which runs in our middle tier database service).  We use ANTLR, and
initially had a three-phase process: lexer, parser, and tree-walkers
to generate code.  We were doing way too much in the parser phase --
checking for table names, column names, data types, etc.  The syntax
of SQL forced us to do a lot of scanning forward and remembering where
we were (especially to get the FROM clause information so we could
process columns in the result list).
We were able to get to much cleaner code by rewriting the parser to
have a "dumb" phase to get the overall structure into an AST, and then
use a tree-walker phase to do all the lookups and type resolution
after we had the rough structure, writing another AST to walk for code
generation.  Besides making the code cleaner and easier to maintain,
it helped us give better error messages pointing more accurately to
the source of the problem.  I don't know if a similar approach is
feasible in flex/bison, but if it is, refactoring for an extra pass
might be worth the trouble.
-Kevin


Re: Status report: getting plpgsql to use the core lexer

От
Tom Lane
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> ...
> We were able to get to much cleaner code by rewriting the parser to
> have a "dumb" phase to get the overall structure into an AST, and then
> use a tree-walker phase to do all the lookups and type resolution
> after we had the rough structure, writing another AST to walk for code
> generation.  Besides making the code cleaner and easier to maintain,
> it helped us give better error messages pointing more accurately to
> the source of the problem.  I don't know if a similar approach is
> feasible in flex/bison, but if it is, refactoring for an extra pass
> might be worth the trouble.

That's actually what we have in the core parser.  plpgsql is trying to
take shortcuts, and this whole project is exactly about weaning it away
from that.  The bottom line is I tried to tackle the sub-projects in the
wrong order...
        regards, tom lane