Обсуждение: Doing psql's lexing with flex
I got interested enough in the psql-with-flex problem to go off and solve it. Attached is a working patch, which I'm now debating whether to apply. Comments solicited... The patch removes about 200 lines of very spaghetti-ish code in mainloop.c. However, it adds an 875-line flex source file, which might be thought a bad tradeoff :-(. One bright spot is that about half of that total is a direct copy of the main backend lexer, so it's not really as much new, separately maintainable code as all that. Also, Andrew Dunstan's patch for supporting dollar-quoting would add about 100 lines to mainloop.c, versus only a dozen or so lines in the flex implementation. Once that's taken into account I don't think there is a lot of difference in effective SLOC to maintain. I'm also of the opinion that the new C code in psqlscan.l is much more straightforward than the code removed from mainloop.c, though having just written it, I'm no doubt pretty biased. Bruce was asking about speed. On normal-size queries I cannot measure any difference at all. For testing purposes I made up a file containing a single 750K query (just a "SELECT big-honking-string-constant", with the string literal broken into lines of 75 bytes). The client-side (psql) CPU time to run this file looks about like this on my machine: PGCLIENTENCODING UNICODE SJIS CVS tip 1.57 1.82 flex implementation 0.93 2.33 The flex implementation is consistently faster than CVS tip when dealing with backend-compatible encodings (such as UTF-8). It's consistently slower when it has to deal with a non-backend-safe encoding such as SJIS or Big5. But for real-world cases the differential is down in the noise either way. I'm inclined to apply this but I can see where a person not comfortable with flex might feel differently. Opinions? regards, tom lane
Вложения
Tom Lane wrote: > I got interested enough in the psql-with-flex problem to go off and > solve it. Attached is a working patch, which I'm now debating whether > to apply. Comments solicited... > > The patch removes about 200 lines of very spaghetti-ish code in > mainloop.c. However, it adds an 875-line flex source file, which > might be thought a bad tradeoff :-(. One bright spot is that about > half of that total is a direct copy of the main backend lexer, so > it's not really as much new, separately maintainable code as all that. > Also, Andrew Dunstan's patch for supporting dollar-quoting would add > about 100 lines to mainloop.c, versus only a dozen or so lines in the > flex implementation. Once that's taken into account I don't think there > is a lot of difference in effective SLOC to maintain. I'm also of the > opinion that the new C code in psqlscan.l is much more straightforward > than the code removed from mainloop.c, though having just written it, > I'm no doubt pretty biased. > > Bruce was asking about speed. On normal-size queries I cannot measure > any difference at all. For testing purposes I made up a file containing > a single 750K query (just a "SELECT big-honking-string-constant", with > the string literal broken into lines of 75 bytes). The client-side > (psql) CPU time to run this file looks about like this on my machine: > > PGCLIENTENCODING > UNICODE SJIS > > CVS tip 1.57 1.82 > > flex implementation 0.93 2.33 Looks good. I withdraw my performance concerns. Thanks for testing. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane <tgl@sss.pgh.pa.us> writes: > I'm inclined to apply this but I can see where a person not comfortable > with flex might feel differently. Opinions? Looks good to me. The psql cleanup is nice, and ISTM that much of the flex code is comments or flex boilerplate anyway, so the actual LOC increase isn't too bad. -Neil
Neil Conway <neilc@samurai.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> I'm inclined to apply this but I can see where a person not comfortable >> with flex might feel differently. Opinions? > Looks good to me. The psql cleanup is nice, and ISTM that much of the > flex code is comments or flex boilerplate anyway, so the actual LOC > increase isn't too bad. I just realized that something I had planned to look at later is actually an essential step if this is to be applied. I had wanted to see if lexing of backslash command tokens could be folded into the flex lexer as well, but thought I could leave it for later. The case where this doesn't preserve the previous behavior is if the expansion of a psql variable includes both a backslash command and some part of its arguments. As patched, HandleSlashCommand will only look to the original input string and not see the contents of any psql variables (because those are only seen inside the lexer, I'm not physically inserting them into the line buffer as the old code did). Imagine for example \set foo '\c mydb' blah :foo bar The existing code would interpret this as blah \c mydb bar but my patch as it stands would behave very strangely --- the \c command would see bar as its argument and then 'mydb' would be regurgitated after HandleSlashCommand finishes. The existing code is pretty darn inconsistent on this point anyway when you look at it carefully. AFAICS a variable reference is handled quite differently in the arguments of a backslash command than elsewhere; it can't supply general text but only a single option value. The same variable expanded before control reaches HandleSlashCommand is going to look indistinguishable from hand-entered text. If we did translate it into a flex thing the behavior would be different in corner cases like this. Peter, any comments on this? Is the current behavior actually intentional, or did it just fall out of the implementation techniques? regards, tom lane
Tom Lane wrote: >I got interested enough in the psql-with-flex problem to go off and >solve it. Attached is a working patch, which I'm now debating whether >to apply. Comments solicited... > >The patch removes about 200 lines of very spaghetti-ish code in >mainloop.c. However, it adds an 875-line flex source file, which >might be thought a bad tradeoff :-(. One bright spot is that about >half of that total is a direct copy of the main backend lexer, so >it's not really as much new, separately maintainable code as all that. >Also, Andrew Dunstan's patch for supporting dollar-quoting would add >about 100 lines to mainloop.c, versus only a dozen or so lines in the >flex implementation. > Am I missing something, or is dollar quoting not in this patch? Perhaps you have a followup patch? >Once that's taken into account I don't think there >is a lot of difference in effective SLOC to maintain. I'm also of the >opinion that the new C code in psqlscan.l is much more straightforward >than the code removed from mainloop.c, though having just written it, >I'm no doubt pretty biased. > >Bruce was asking about speed. On normal-size queries I cannot measure >any difference at all. For testing purposes I made up a file containing >a single 750K query (just a "SELECT big-honking-string-constant", with >the string literal broken into lines of 75 bytes). The client-side >(psql) CPU time to run this file looks about like this on my machine: > > PGCLIENTENCODING > UNICODE SJIS > >CVS tip 1.57 1.82 > >flex implementation 0.93 2.33 > >The flex implementation is consistently faster than CVS tip when dealing >with backend-compatible encodings (such as UTF-8). It's consistently >slower when it has to deal with a non-backend-safe encoding such as SJIS >or Big5. But for real-world cases the differential is down in the noise >either way. > >I'm inclined to apply this but I can see where a person not comfortable >with flex might feel differently. Opinions? > > In principle this is the Right Thing (tm). We use flex elsewhere, of course, so the fact that it is a flex scanner doesn't seem to me to be any sort of strong argument. If we are going to do this we need to get it in ASAP, so it gets plenty of beating on. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Am I missing something, or is dollar quoting not in this patch? It is not. If we go this way, then we'd add essentially identical flex patches to backend and psql to implement dollar quoting (plus perhaps a few more lines in psql to support signaling dollar quoting in the prompt, but that's pretty trivial). My intent with the given patch was just to replicate existing functionality with flex. regards, tom lane
Tom Lane wrote: > I got interested enough in the psql-with-flex problem to go off and > solve it. Attached is a working patch, which I'm now debating > whether to apply. Comments solicited... That should teach me a lesson not to leave random comments lying around in the source code. Years later someone might take them seriously. :)
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>Am I missing something, or is dollar quoting not in this patch? >> >> > >It is not. If we go this way, then we'd add essentially identical >flex patches to backend and psql to implement dollar quoting (plus >perhaps a few more lines in psql to support signaling dollar quoting >in the prompt, but that's pretty trivial). My intent with the given >patch was just to replicate existing functionality with flex. > > > > OK. I vote for applying this ASAP, then, so we can get on with the dollar quoting work. I think we should put big warning signs on both the backend's and psql's .l files saying they must be kept in sync. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I think we should put big warning signs on both the backend's and psql's > .l files saying they must be kept in sync. Right. I was planning to do that, and also to make some trivial reformatting in the backend's scan.l to make it easier to compare the two files by diff. For example, consistently write pattern { action; } not pattern { action; } so that the diffs are confined to the action lines. regards, tom lane
Tom Lane wrote: > Imagine for example > \set foo '\c mydb' > blah :foo bar > The existing code would interpret this as > blah \c mydb bar > but my patch as it stands would behave very strangely --- the \c > command would see bar as its argument and then 'mydb' would be > regurgitated after HandleSlashCommand finishes. Feel free (or encouraged) to change the old behavior. (Of course, the new behavior needs to adjustments, too.) Generally, I think a variable should just stand for a data value and should not result in a slash command being generated.
Peter Eisentraut <peter_e@gmx.net> writes: > Feel free (or encouraged) to change the old behavior. (Of course, the > new behavior needs to adjustments, too.) Generally, I think a variable > should just stand for a data value and should not result in a slash > command being generated. That seems workable for :foo references in the body of a backslash command --- you can just substitute the value but not allow it to be split into multiple argument tokens. However, wouldn't you still want to rescan it for internal :bar references to other variables? The psql man page has some arm-waving about how you can do indirection, and seems like that ought to work in this context too. Also, this seems inconsistent with the non-backslash-command case. Seems like you have to scan what goes into the query buffer to keep the lexer state in sync, you can't just treat it as an uninterpreted token. So not triggering on backslashes seems inconsistent. We can certainly implement different behavior for the two cases, I'm just unconvinced we should... regards, tom lane