Discussion: Have we out-grown Flex?
Quite apart from the practical difficulties that we have with Flex (the fact that the authors are non-responsive and possibly retired, that annoying compiler warning, and the fact that we are forced to maintain our own Windows binaries of 2.5.35), it has some notable disadvantages. I am aware that the MySQL people use their own hand-coded lexical analyzer, sql_lex.cc, which provides a yacc interface while avoiding any lexical analyzer generator whatsoever. They can't have done this just for fun, and no doubt this goes some way to explaining their continued performance advantage for very simple queries. I have heard people complain about Postgres parser overhead for "pgbench -S" style use-cases, where it is unreasonably high, and I've heard them do so more than once.

I suspect that we pay a high price for our lexer's table-driven design, which keeps all of its transition information in large lookup tables, and the MySQL guys do not. It seems exceedingly likely that our lexer is far more sophisticated than theirs, which probably makes emulating their approach infeasible, and certainly makes it unattractive. However, I think there might be a better way.

There is an LGPL-licensed project named Quex (hey, GNU Bison is GPL), which can generate C (and C++) code, and I am in contact with its primary author, Frank-Rene Schäfer:

http://quex.sourceforge.net/

While it fits the bill as a replacement for Flex if we were ever compelled to drop Flex, it also has a number of interesting advantages over it that warrant further investigation. As Wikipedia puts it:

"Quex uses traditional steps of Thompson construction to create non-deterministic finite automatons from regular expressions, conversion to a deterministic finite automaton and then Hopcroft optimization to reduce the number of states to a minimum. Those mechanisms, though, have been adapted to deal with character sets rather than single characters. By means of this the calculation time can be significantly reduced. Since the Unicode character set consists of many more code points than plain ASCII, those optimizations are necessary in order to produce lexical analysers in a reasonable amount of time."

Apparently, in practical terms, reductions in total lexing time of more than 1/2, and even as high as 2/3, have been observed when switching from Flex. Quex uses Flex's syntax for describing regular expressions, which makes a conversion research project seem like a worthwhile undertaking. I am told that it will probably take less than a month, and perhaps as little as two weeks, given the broad compatibility of Flex and Quex. The undertaking has the enthusiastic support of Frank-Rene, which is in refreshing contrast to the complete inaccessibility of the Flex developers and the dead silence on the flex-users mailing list.

I'm certainly not suggesting that this isn't something that, in order to be adopted, will have to be carefully considered from many angles, of which the performance of the resulting lexer is only one, but at the same time I dare say that this approach is the "low-hanging fruit" of optimising for the use-case where parser overhead is unreasonably high. Am I barking up the wrong tree? At the very least, hopefully these simple observations will provoke thought: Flex/Lex is not the only tool for generating yacc-compatible lexical analysers, and it may not be the best one for our current needs.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
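To make the character-set point from that Wikipedia passage concrete, here is a minimal C sketch contrasting the two ways of encoding DFA transitions. It is not taken from either tool's generated output, and every name in it is invented:

#include <stdio.h>
#include <stddef.h>

/* Per-character transitions: one table cell per (state, byte), as in a
 * classic table-driven scanner. Zero-filled here; a real scanner would
 * populate it, with zero meaning "reject". */
static const unsigned char next_state[4][256];

static int
step_per_char(int state, unsigned char ch)
{
    return next_state[state][ch];
}

/* Interval-based transitions: one entry per contiguous range of code
 * points, so a rule spanning a large Unicode set stays small. */
struct edge
{
    unsigned int lo;            /* inclusive lower bound */
    unsigned int hi;            /* inclusive upper bound */
    int          target;        /* next state */
};

static const struct edge state0_edges[] = {
    {'0', '9', 1},              /* digits */
    {'A', 'Z', 2},              /* ASCII letters... */
    {'a', 'z', 2},
    {0x00C0, 0x024F, 2},        /* ...plus a swathe of accented Latin */
};

static int
step_intervals(unsigned int cp)
{
    size_t      i;

    for (i = 0; i < sizeof(state0_edges) / sizeof(state0_edges[0]); i++)
        if (cp >= state0_edges[i].lo && cp <= state0_edges[i].hi)
            return state0_edges[i].target;
    return -1;                  /* no transition */
}

int
main(void)
{
    printf("per-char:  %d\n", step_per_char(0, 'a'));   /* 0: empty table */
    printf("interval:  %d\n", step_intervals(0x00E9));  /* 2: U+00E9 in range */
    return 0;
}

A per-code-point table covering all of Unicode would need rows 0x110000 cells wide, which is why constructing and minimizing the automaton over character sets, rather than individual characters, is what makes Unicode-capable generated lexers tractable.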
Peter Geoghegan <peter@2ndquadrant.com> writes:
> ... I have heard people complain about Postgres parser overhead for
> "pgbench -S" style use-cases, where it is unreasonably high, and I've
> heard them do so more than once.

I was under the impression that those cycles were mostly in bison, not flex...

        regards, tom lane
Peter Geoghegan <peter@2ndquadrant.com> writes:
> Quite apart from the practical difficulties that we have with Flex
> (the fact that the authors are non-responsive and possibly retired,

btw, as far as that goes, a look into their sourceforge SCM shows evidence of continued activity. I dunno why it's been so long since their last formal release, but they're not dead.

> that annoying compiler warning,

http://flex.cvs.sourceforge.net/viewvc/flex/flex/flex.skl?r1=2.213&r2=2.214

        regards, tom lane
On Tue, May 1, 2012 at 8:53 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> I'm certainly not suggesting that this isn't something that, in order
> to be adopted, will have to be carefully considered from many angles,
> of which the performance of the resulting lexer is only one, but at
> the same time I dare say that this approach is the "low-hanging fruit"
> of optimising for the use-case where parser overhead is unreasonably
> high. Am I barking up the wrong tree? At the very least, hopefully
> these simple observations will provoke thought: Flex/Lex is not the
> only tool for generating yacc-compatible lexical analysers, and it may
> not be the best one for our current needs.

I think Tom's question of whether the parser or lexer is the problem is something that ought to be investigated. Personally, I suspect that our tendency to use lists everywhere, for everything, an artifact of our proud LISP heritage, may be costing us dearly in terms of parse time. However, there's a difference between suspicion and proof, and I certainly have none of the latter.

One fairly major obstacle to using quex for anything is that it doesn't appear to be packaged on either Fedora or Ubuntu (I tried apt-cache search quex on my Ubuntu laptop, and yum install quex on an F16 machine). 'port install quex' on my MacOS X box doesn't find it either. This means that if we were to switch to quex, the barrier to compiling PostgreSQL would go up very significantly. I realize there's a chicken-and-egg problem there, since if quex is awesome and nobody uses it because it isn't packaged, then it will never become popular enough for anyone to consider packaging it. Nonetheless, I'm not sure I want to be the guinea pig. Even if the actual patch-as-committed-to-git to make the switch to quex were not that large, the amount of follow-on work we'd be creating for every developer and packager of PostgreSQL would be quite formidable. We shouldn't do that unless it's pretty much awesome.

One possible way forward would be to try to be compatible with both quex and flex. We'd need a buildfarm critter or two running with quex to detect breakage there, though, since a lot of people would probably continue to run flex. But I'm not entirely sure it's worth doing even that much unless we can show a worthwhile performance bump. The core scanner doesn't change much, so "more features" isn't too compelling as a reason for change, especially if we need to remain flex-compatible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> One fairly major obstacle to using quex for anything is that it
> doesn't appear to be packaged on either Fedora or Ubuntu (I tried
> apt-cache search quex on my Ubuntu laptop, and yum install quex on an
> F16 machine). 'port install quex' on my MacOS X box doesn't find it
> either. This means that if we were to switch to quex, the barrier to
> compiling PostgreSQL would go up very significantly. I realize
> there's a chicken-and-egg problem there, since if quex is awesome and
> nobody uses it because it isn't packaged, then it will never become
> popular enough for anyone to consider packaging it. Nonetheless, I'm
> not sure I want to be the guinea pig.

I was trying to think of a delicate way to put that, but since you beat me to it ... yeah, exactly. It's possible that quex is so spectacular that we'd be willing to absorb the conversion costs and the penalties of being a pioneer user of a new tool, but the bar seems pretty high.

> Even if the actual
> patch-as-committed-to-git to make the switch to quex were not that
> large, the amount of follow-on work we'd be creating for every
> developer and packager of PostgreSQL would be quite formidable.

FWIW, I think only developers, not packagers, would really be taking such a hit. I assume we'd continue to ship prebuilt lexer output in tarballs, so there'd seldom be a reason for a packager to need to run the tool. Given the extremely slow rate of churn of the lexer, it might not be necessary for most developers to have the tool installed, either, if we were willing to put the derived file into git.

Having said all that, if the amount of effort needed to make a trial port is within reason, by all means let's try it and see what the performance difference is, rather than just speculating.

        regards, tom lane
On 2 May 2012 04:24, Robert Haas <robertmhaas@gmail.com> wrote:
> I think Tom's question of whether the parser or lexer is the problem
> is something that ought to be investigated. Personally, I suspect
> that our tendency to use lists everywhere, for everything, an artifact
> of our proud LISP heritage, may be costing us dearly in terms of parse
> time. However, there's a difference between suspicion and proof, and
> I certainly have none of the latter.

It's funny that you should say that, because I actually had a discussion with Greg Stark over drinks about this recently. Jon Bentley has described an experiment that undermines many traditional beliefs about the trade-offs represented by using a linked list rather than an array. The test is: using a modern computer, generate N integers at random and insert them in order into a sequence. Then, randomly remove integers from the sequence, one at a time. What is the performance at different sizes of N when the sequence is a doubly-linked list, and when it is an array?

If you graph the two, the results are perhaps rather surprising. I think his graph went up to N = 100,000. The initial result shows the line representing the array down near the bottom of the graph, while the list's line looks exponential. Even if you use memory pooling so that the list doesn't have to allocate memory as it goes, the array still roundly beats the list, albeit not quite so convincingly, and without the list hitting the roof near 100,000. I don't think that I need to point out that this is for inserting and deleting, which is exactly when you're supposed to use lists.

The point is that on modern architectures, with many layers of cache, the cost of the linear search to find the insertion point completely dominates; this is without the array availing of a binary search, in the interest of fairness. CPU caches happen to do a very good job of moving, on average, half of the array when inserting elements at random points. The list is much larger than the array, with two extra pointers per element (yes, I know that we use singly-linked lists, but we have other disadvantages compared to typical C lists), which matters. Predictable usage patterns, that is, temporal and spatial locality resulting in good use of the memory hierarchy, matter a lot. We're not talking about a small difference, either: I think the difference in the published test was something like the array being 50 to 100 times faster. The list results in far more cache misses than the array.

So, I'm right there with you: using lists everywhere is bad news.

As for the question of Flex/Quex, I'm not in an immediate position to sink any more time into it, but it remains on my list of things to pursue for 9.3, though it's only about number 3 on that list right now.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
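For anyone who wants to reproduce the experiment, here is a rough, self-contained C sketch of the benchmark as described above; the size N, the use of rand(), and all of the names are my own choices rather than Bentley's:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 100000

/* Ordered insertion into an array: linear search for the slot, then
 * shift the tail over by one; random removals shift it back. */
static void
array_test(void)
{
    int *a = malloc(N * sizeof(int));
    int  len = 0;

    for (int i = 0; i < N; i++)
    {
        int v = rand();
        int pos = 0;

        while (pos < len && a[pos] < v)
            pos++;
        memmove(&a[pos + 1], &a[pos], (len - pos) * sizeof(int));
        a[pos] = v;
        len++;
    }
    while (len > 0)
    {
        int pos = rand() % len;

        memmove(&a[pos], &a[pos + 1], (len - pos - 1) * sizeof(int));
        len--;
    }
    free(a);
}

/* The same workload against a doubly-linked list. */
typedef struct node
{
    int          val;
    struct node *prev, *next;
} node;

static void
list_test(void)
{
    node *head = NULL;
    int   len = 0;

    for (int i = 0; i < N; i++)
    {
        node  *n = malloc(sizeof(node));
        node **link = &head;
        node  *prev = NULL;

        n->val = rand();
        while (*link && (*link)->val < n->val)
        {
            prev = *link;
            link = &(*link)->next;
        }
        n->next = *link;
        n->prev = prev;
        if (*link)
            (*link)->prev = n;
        *link = n;
        len++;
    }
    while (len > 0)
    {
        node *n = head;

        for (int steps = rand() % len; steps > 0; steps--)
            n = n->next;
        if (n->prev)
            n->prev->next = n->next;
        else
            head = n->next;
        if (n->next)
            n->next->prev = n->prev;
        free(n);
        len--;
    }
}

int
main(void)
{
    clock_t t;

    srand(42);
    t = clock();
    array_test();
    printf("array: %.2fs\n", (double) (clock() - t) / CLOCKS_PER_SEC);

    srand(42);
    t = clock();
    list_test();
    printf("list:  %.2fs\n", (double) (clock() - t) / CLOCKS_PER_SEC);
    return 0;
}

Both versions do the same linear search for the insertion point, so the difference that shows up is almost entirely locality: scanning contiguous memory is close to free, while chasing pointers through scattered nodes can miss the cache at every hop.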
On 2 May 2012 04:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, I think only developers, not packagers, would really be taking
> such a hit. I assume we'd continue to ship prebuilt lexer output in
> tarballs, so there'd seldom be a reason for a packager to need to run
> the tool. Given the extremely slow rate of churn of the lexer, it might
> not be necessary for most developers to have the tool installed, either,
> if we were willing to put the derived file into git.

Incidentally, I had an unrelated conversation with someone (I think it might have been Heikki) a while back, where it was suggested that Flex and Bison could be run through web services. This might actually make hacking Postgres on Windows far easier, because the last time I tried to do that the hard way, I gave up, suspecting that there must be some kind of Winflex bug that selectively manifests itself; certainly, the population of Windows hackers is small enough that such a bug could go unnoticed for quite a while.

Such an approach could be part of the solution to this problem (although, incidentally, Quex maintains Visual Studio support quite well, and even has graphical instructions here: http://quex.sourceforge.net/doc/html/intro/visual_studio_trouble.html ). It might be the case that some kind of virtualisation and/or authentication (a Postgres community account being required) could make this approach practical; it just isn't the path of least resistance right now. Visual Studio builds would be far easier if we did this, which might encourage more hackers to venture into Windows land.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Tue, May 1, 2012 at 5:53 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Quite apart from the practical difficulties that we have with Flex
> (the fact that the authors are non-responsive and possibly retired,
> that annoying compiler warning, and the fact that we are forced to
> maintain our own Windows binaries of 2.5.35), it has some notable
> disadvantages. I am aware that the MySQL people use their own
> hand-coded lexical analyzer, sql_lex.cc, which provides a yacc
> interface while avoiding any lexical analyzer generator whatsoever.
> They can't have done this just for fun, and no doubt this goes some
> way to explaining their continued performance advantage for very
> simple queries. I have heard people complain about Postgres parser
> overhead for "pgbench -S" style use-cases, where it is unreasonably
> high, and I've heard them do so more than once.

For -S -M simple, the time spent planning is 5 times more than the time spent parsing. It may be worthwhile to reduce the time spent parsing, but if the goal is parity with MySQL it probably isn't the place to start. (If you use a bottom-up profiler, the time spent in planning is scattered over so many different functions that none of them look very important individually.)

Cheers,

Jeff
On 2 May 2012 17:20, Jeff Janes <jeff.janes@gmail.com> wrote:
> For -S -M simple, the time spent planning is 5 times more than the
> time spent parsing. It may be worthwhile to reduce the time spent
> parsing, but if the goal is parity with MySQL it probably isn't the
> place to start.

Could you please share your figures and methodology? I've heard of far larger proportions than that.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, May 2, 2012 at 9:31 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 2 May 2012 17:20, Jeff Janes <jeff.janes@gmail.com> wrote:
>> For -S -M simple, the time spent planning is 5 times more than the
>> time spent parsing. It may be worthwhile to reduce the time spent
>> parsing, but if the goal is parity with MySQL it probably isn't the
>> place to start.
>
> Could you please share your figures and methodology? I've heard of far
> larger proportions than that.

I used two methods. One was to hack exec_simple_query so that, under the control of a new GUC, it would do things such as run pg_parse_query on the query 101 times, throwing away the results of the first 100 before proceeding to use the 101st parse result as normal. Then I just run pgbench under both settings, take the difference in 1/TPS between them, and divide by 100 to get the seconds per parse (and multiply by 1e6 to get usec/parse).

I did the same thing for pg_analyze_and_rewrite, and for pg_analyze_and_rewrite plus pg_plan_queries (pg_plan_queries scribbles on the structure produced by pg_analyze_and_rewrite, so you have to repeat both as a unit, and then subtract the pg_analyze_and_rewrite timings off afterwards to isolate just the planning).

On my current laptop, rebased to git HEAD, I got 2 usec/pg_parse_query, 2 usec/pg_analyze_and_rewrite, and 12 usec/pg_plan_queries. Since my laptop is dual core, I did this with -c2 -j2. Back when I originally implemented and tested this, on much older hardware and an approximately one-year-older pgsql code base, the absolute values of usec/action were several fold higher, but the ratios of 1:1:6 were about the same.

This does risk overlooking caching effects by repeating the same thing in a tight loop; i.e., parses 2 through 101 might be much faster than parse 1 was. Also, it risks that I simply don't know what I'm doing and my attempts to throw away the results of a parse are misbegotten: I just overwrote the old pointer with the new one and assumed the memory context would clean up the resulting orphans.

I could try to clean up and post the patch that implements this if you want.

The second method was just to do --enable-profiling on a stock build and look at the call graph section of gprof output. It attributed 50% to pg_plan_queries and children and about 10% to each of pg_parse_query and pg_analyze_and_rewrite (including their respective children). I don't put tremendous faith in gprof's call graph, but the fact that the results are in agreement with the other method gives me more confidence in both.

Cheers,

Jeff
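To make the first method concrete, the hack Jeff describes would look roughly like the fragment below, placed in exec_simple_query() in src/backend/tcop/postgres.c. This is a reconstruction of his description, not his actual patch, and the GUC name is invented:

/* Hypothetical GUC, default off; registered via the usual guc.c tables. */
int debug_parse_repeats = 0;

/*
 * In exec_simple_query(), at the point of the single pg_parse_query()
 * call: parse the query debug_parse_repeats extra times and discard
 * the results, relying on the per-message memory context to reclaim
 * the orphaned parse trees when it is reset.
 */
int i;

for (i = 0; i < debug_parse_repeats; i++)
    (void) pg_parse_query(query_string);

parsetree_list = pg_parse_query(query_string);

Running pgbench -S once with the GUC at 100 and once at 0, the per-parse cost falls out as (1/tps_on - 1/tps_off) / 100, which is how figures like 2 usec/pg_parse_query are obtained.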
On Wed, May 2, 2012 at 3:33 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 2 May 2012 04:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> FWIW, I think only developers, not packagers, would really be taking
>> such a hit. I assume we'd continue to ship prebuilt lexer output in
>> tarballs, so there'd seldom be a reason for a packager to need to run
>> the tool. Given the extremely slow rate of churn of the lexer, it might
>> not be necessary for most developers to have the tool installed, either,
>> if we were willing to put the derived file into git.
>
> Incidentally, I had an unrelated conversation with someone (I think it
> might have been Heikki) a while back, where it was suggested that Flex
> and Bison could be run through web services. This might actually make

Might've been me; I've been doing that for a long time to work around winflex issues. But I never got around to doing anything like access control, so I just ran it on a hidden IP on a random port, with a simple curl call on the Windows box.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Wed, May 02, 2012 at 10:37:58AM -0700, Jeff Janes wrote:
> I could try to clean up and post the patch that implements this if you want.
>
> The second method was just to do --enable-profiling on a stock build
> and look at the call graph section of gprof output. It attributed 50%
> to pg_plan_queries and children and about 10% to each of
> pg_parse_query and pg_analyze_and_rewrite (including their respective
> children). I don't put tremendous faith in gprof's call graph, but
> the fact that the results are in agreement with the other method gives
> me more confidence in both.

Those are the ratios I (and I think Tom) expected to see.

--
Bruce Momjian <bruce@momjian.us>    http://momjian.us
EnterpriseDB                        http://enterprisedb.com

+ It's impossible for everything to be true. +
I haven't tried quex, but I have tried lemon (which can be broken out of SQLite), re2c, and ragel.

I like ragel and lemon; from memory, the combination supports a push-parser style. Many tools are inconvenient unless you are prepared to suck in a whole message before parsing, or let the parser drive a pull loop, or use a coroutine structure.

Could go all trendy and use a PEG tool like, er, peg (http://piumarta.com/software/peg/). (I haven't tried them, though.)

James
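To show what the push-parser style means in practice, here is a minimal sketch against the interface lemon generates. The ParseAlloc/Parse/ParseFree names are lemon's defaults, but the token codes, the token value type (whatever %token_type declares in the grammar), and the input are all invented:

#include <stdlib.h>

/* Declarations for the functions lemon emits into the generated parser. */
void *ParseAlloc(void *(*mallocProc)(size_t));
void  Parse(void *parser, int tokenCode, const char *tokenText);
void  ParseFree(void *parser, void (*freeProc)(void *));

/* Hypothetical token codes from the generated grammar header. */
#define TK_SELECT 1
#define TK_NAME   2
#define TK_SEMI   3

/*
 * Push style: the caller feeds tokens to the parser as the scanner
 * produces them, so the whole input never has to be in memory and the
 * parser never drives a pull loop.
 */
static void
parse_fragment(void)
{
    void *p = ParseAlloc(malloc);

    Parse(p, TK_SELECT, "select");
    Parse(p, TK_NAME, "foo");
    Parse(p, TK_SEMI, ";");
    Parse(p, 0, NULL);          /* token code 0 signals end of input */

    ParseFree(p, free);
}

This inverts the usual flex/bison arrangement, where the generated parser pulls by calling yylex(); with a push parser, a ragel scanner (or a network message handler) stays in control and feeds tokens as data arrives.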
Doesn't that imply that a plan cache might be worthwhile?

But no matter: didn't the OP really take issue with packaging and Windows support? There are a lot of Windows users, and in general there are many Windows devs; making it easier for them to contribute has to be good, doesn't it?
On Thu, May 3, 2012 at 12:51 PM, james <james@mansionfamily.plus.com> wrote:
> I haven't tried quex, but I have tried lemon (which can be broken out
> of SQLite), re2c, and ragel.
>
> I like ragel and lemon; from memory, the combination supports a
> push-parser style. Many tools are inconvenient unless you are prepared
> to suck in a whole message before parsing, or let the parser drive a
> pull loop, or use a coroutine structure.
>
> Could go all trendy and use a PEG tool like, er, peg
> (http://piumarta.com/software/peg/). (I haven't tried them, though.)

I think the goal is not trendy, nor easy to use (but easy to maintain, at least...), but faster, and even then there is some doubt whether any amount of lexer optimization could possibly matter given everything else that needs to happen to execute a query.

Better error messages (with position information) might be a functional enhancement that I'd like, but I don't think flex is limiting in that regard; rather, a lot more information already exposed by flex would have to be passed through the semantic analyzer.

Provided it could matter, are these tools faster than flex? My limited understanding is "probably not".

--
fdr
I believe there are tools that are significantly faster than flex; re2c, for instance, generates code that is faster. But the key thing is to test, probably, or perhaps to ask around. I'm out of touch, but from memory flex wasn't the be-all and end-all.

Lemon is definitely easy to maintain/port and the result is pretty nice, too (I know Bison/Yacc wasn't the focus here).
On Thu, May 3, 2012 at 12:53 PM, james <james@mansionfamily.plus.com> wrote:
> Doesn't that imply that a plan cache might be worthwhile?

PG has one, if you use prepared statements. Some people are either unable or unwilling to use prepared statements, though.

Cheers,

Jeff
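As an illustration, here is roughly what using that cache from libpq looks like; the connection string is a placeholder, and the query assumes a database initialized by pgbench -i:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=pgbench");   /* placeholder */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    /* Parsed and analyzed once, server-side, under the name "sel". */
    res = PQprepare(conn, "sel",
                    "SELECT abalance FROM pgbench_accounts WHERE aid = $1",
                    1, NULL);
    PQclear(res);

    /* Each execution skips raw parsing entirely; after a few runs the
     * server will typically settle on a cached generic plan as well. */
    for (int i = 0; i < 3; i++)
    {
        char        aid[32];
        const char *params[1] = {aid};

        snprintf(aid, sizeof(aid), "%d", i + 1);
        res = PQexecPrepared(conn, "sel", 1, params, NULL, NULL, 0);
        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
            printf("aid %s -> %s\n", aid, PQgetvalue(res, 0, 0));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}

pgbench's -M prepared option does essentially this, which also makes it a convenient way to measure how much of the -S overhead the parse and plan phases account for.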