Re: [HACKERS] What about LIMIT in SELECT ?

From: Bruce Momjian
Subject: Re: [HACKERS] What about LIMIT in SELECT ?
Date:
Msg-id: 199810160534.BAA29942@candle.pha.pa.us
In reply to: Re: [HACKERS] What about LIMIT in SELECT ?  (jwieck@debis.com (Jan Wieck))
Responses: RE: [HACKERS] What about LIMIT in SELECT ?
List: pgsql-hackers
OK, I have had my day of thinking, and will address this specific
posting first, because it is the most fundamental concerning the future
direction of the optimization.

>
>     And I got the time to hack around about this.
>
>     I  hacked  in  a little check into the planner, that compares
>     the sortClause against the key field list of  an  index  scan
>     and  just  suppresses  the sort node if it exactly matches and
>     all sort operators are "<".
>
>     I tested with a 10k row table where key is a text field.  The
>     base query is a
>
>         SELECT ... WHERE key > 'val' ORDER BY key;
>
>     The  used 'val' is always a key that is close to the first of
>     all keys in the table ('' on the first  query  and  the  last
>     selected value on subsequent ones).
>
>     Scenario  1  (S1)  uses exactly the above query but processes
>     only the first 20 rows  from  the  result  buffer.  Thus  the
>     frontend receives nearly the whole table.

OK.

>
>     Scenario  2  (S2)  uses a cursor and FETCH 20. But closes the
>     cursor and creates a new one for  the  next  selection  (only
>     with another 'val') as it would occur in a web application.
>
>     If  there  is  no index on key, the backend will always do a
>     Sort->SeqScan and due  to  the  'val'  close  to  the  lowest
>     existing  key  nearly all tuples get scanned and put into the
>     sort. S1 here runs about 10 seconds and S2 about  6  seconds.
>     The  speedup  in  S2  results  from  the  reduced overhead of
>     sending unwanted tuples into the frontend.

Makes sense.  All rows are processed, but not sent to client.
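
For reference, the S2 pattern looks roughly like this (table, column, and
cursor names are made up here; only the shape of the cursor usage matters):

    BEGIN;
    DECLARE next_page CURSOR FOR
        SELECT * FROM tenk WHERE key > 'val' ORDER BY key;
    FETCH 20 IN next_page;
    CLOSE next_page;
    COMMIT;

Each round closes the cursor and opens a new one with the last key sent as
the new 'val', so only 20 rows ever travel to the frontend.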

>
>     Now with a btree index  on  key  and  an  unpatched  backend.
>     Produced  plan  is  always  a  Sort->IndexScan.   S1 needs 16
>     seconds and S2 needs 12 seconds. Again nearly all data is put
>     into  the  sort but this time over the index scan and that is
>     slower.

VACUUM ANALYZE could affect this.  Because the optimizer had no stats, it
thought index use would be faster, but in fact, because 'val' was near the
lowest value, it was selecting 90% of the table and would have been better
off with a sequential scan.  pg_statistics's low/high values for a column
could have told the optimizer that.

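Spelling out that statistics step (the table name here is just a stand-in
for Jan's 10k-row test table):

    -- populate the per-column statistics (including low/high values)
    -- that the optimizer consults when costing the index scan
    VACUUM ANALYZE tenk;
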
I know the good part of the posting is coming.

>     Last with the btree index on key  and  the  patched  backend.
>     This  time the plan is a plain IndexScan because the ORDER BY
>     clause exactly matches the sort order of the  chosen  index.
>     S1  needs  13  seconds  and  S2 less than 0.2!  This dramatic
>     speedup comes from the fact, that this time the index scan is
>     the  toplevel  executor  node and the executor run is stopped
>     after 20 tuples have been selected.

OK, seems like in the S1 case, the use of the psort/ORDER BY code on top
of the index scan was taking an extra 3 seconds, about 23% over the plain
index scan.  That is a lot more than I thought for the psort code, and
shows we could gain a lot by removing unneeded sorts from queries that are
already using matching indexes.

Just for clarity, I have added this to the TODO list.  I think everyone is
clear on this one, and its magnitude is a surprise to me:

  * Prevent psort() usage when query already using index matching ORDER BY

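To make the TODO item concrete, this is the shape of query it targets
(table and index names are illustrative):

    CREATE INDEX tenk_key_idx ON tenk USING btree (key);

    SELECT * FROM tenk WHERE key > 'val' ORDER BY key;

    -- unpatched planner: Sort on top of an IndexScan using tenk_key_idx
    -- with Jan's patch:  plain IndexScan, since the index already hands
    --                    back tuples in ORDER BY order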

>     Analysis of the above timings:
>
>     If there is an ORDER BY clause, using an index  scan  is  the
>     clever  way  if  the  indexqual  dramatically reduces the
>     amount of data selected and sorted.   I  think  this  is  the
>     normal case (who really selects nearly all rows from a 5M row
>     table?). So choosing the index path  is  correct.  This  will
>     hurt if someone really selects most of the rows and the index
>     scan jumps over the disc.  But here the programmer should use
>     an  unqualified  query  to  perform  a  seqscan  and  do  the
>     qualification in the frontend application.

Fortunately, the optimizer already does the index selection for us, and
guesses pretty well whether the index or sequential scan is better.  Once
we implement the above removal of psort(), we will have to change the
timings, because then you have to compare an index scan against a
sequential scan AND psort(); in the index scan case you don't need the
psort(), assuming the ORDER BY matches the index exactly.

>     The speedup for the cursor/fetch scenario  is  so  impressive
>     that  I'll  create  a  post 6.4 patch. I don't want it in 6.4
>     because there is absolutely no query in the whole  regression
>     test,  where  it  suppresses  the  sort  node.   So  we  have
>     absolutely no check that it doesn't break anything.
>
>     For a web application, that can use a unique  key  to  select
>     the next amount of rows, it will be a big win.

OK, I think the reason the regression test did not show your code being
used is important.

First, most of the tables are small in the regression test, so sequential
scans are faster.  Second, most queries using indexes are either joins,
which do the entire table, or equality tests, like col = 3, where there
is no matching ORDER BY because all the col values are 3.  Again, your
code can't help with these.

The only regression-type code that would use it would be a 'col > 3'
qualification with a col ORDER BY, and there aren't many of those.

However, if we think of the actual application you are addressing, it is
a major win.  If we are going after only one row of the index, it is
fast.  If we are going after the entire table, it is faster to
sequential scan and psort().  Your big win is with the partial queries,
where you end up doing a full sequential scan or index scan, then an
ORDER BY, while you really only need a few rows from the query, and if
you deal directly with the index, you can prevent many rows from being
processed.  It is the ability to skip processing those extra rows that
makes it a big win, not so much the removal of the ORDER BY, though that
helps too.

Your solution really is tailored for this 'partial' query application,
and I think it fills a big need for certain applications that can't use
cursors, like web apps.  Most other apps have long-lived connections to
the database, and are better off with cursors.

I did profiling to improve startup time, because the database
requirements of web apps are different from normal db apps, and we have
to adjust to that.

So, to reiterate, full queries do not benefit as much from the new code,
either because sequential scan/psort is faster, or because the index only
retrieves a small number of rows when the qualification of values is very
specific.

Those open-ended, 'give me the rows from 100 to 199' queries really need
your modifications.

OK, we have QUERY_LIMIT, and that allows us to throw any query at the
system, and it will return only the first rows of the ORDER BY, up to that
limit.  No fancy stuff required.  If we can get a matching index, we may
be able to remove the requirement of scanning all the rows (with Jan's
patch), and that is a big win.  If not, we at least prevent the extra rows
from being returned to the client.
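
Written with the LIMIT clause syntax (rather than the SET variable) purely
for illustration, the 'first rows' case is simply:

    SELECT * FROM tenk
    WHERE  key > 'val'
    ORDER  BY key
    LIMIT  20;

    -- with a matching index and Jan's patch the executor can stop
    -- after the 20th tuple; without it, we still avoid shipping the
    -- unwanted tuples to the client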

However, there is the OFFSET issue.  This is really a case where the
user wants to _restart_ the query where they left off.  That is a
different problem.  All of a sudden, we need to evaluate more of the
query, and return a segment from the middle of the result set.

I think we need to decide how to handle such a restart.  Do we
re-evaluate the entire query, skipping all the rows up to OFFSET, and
return the number of rows they requested after OFFSET?  I would think we
don't want to do that, do we?  It would be much easier to code, though.
If it is a single table, skipping forward has to be done anyway, because
we can't just _jump_ to the 100th entry in the index, unless we pass some
_tid_ to the user and expect them to pass that back to restart the query.
I don't think we want to do that.  It is ugly, and the row may have moved
since we started.  So, for a single table, adding a QUERY_OFFSET would do
exactly what we need, with Jan's patches.
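
The two restart styles look like this, again with illustrative names, and
with the clause syntax standing in for QUERY_LIMIT/QUERY_OFFSET variables:

    -- restart by position: re-run the query and skip the first 100 rows
    SELECT * FROM tenk ORDER BY key LIMIT 20 OFFSET 100;

    -- restart by value: remember the last key sent and continue past it
    SELECT * FROM tenk WHERE key > 'last_key_sent' ORDER BY key LIMIT 20;

The second form is the one the patched index scan handles so cheaply,
because it never touches the skipped rows.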

For a joined query, I think you will have to do the entire _join_ before
returning anything.

You can't just process all the joins up to the OFFSET location, and you
can't just jump to the 100th index location, because you don't know that
the 100th index location produced the 100th result just returned to the
user.  You have to process the whole query, because with the join you
don't know which data row from each table is going to make which entry
in the final result.  If you are really crafty, and the ORDER BY table is
in the outer part of the join loop, you could start processing the table
that is part of the outer loop in _index_ order, because you know that
the rows processed in index order are going to produce the output in
result order.  You then could process and throw away the results up to
OFFSET, and generate the needed rows and stop.
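
A joined sketch of the problem (hypothetical tables and columns):

    SELECT o.key, d.info
    FROM   tenk o, detail d
    WHERE  d.tenk_id = o.id
    ORDER  BY o.key
    LIMIT  20 OFFSET 100;

    -- in general the join has to be evaluated before we know which
    -- result rows are numbers 101..120; but if the plan walks tenk's
    -- key index as the outer side of the join loop, results come out
    -- already in o.key order, so the executor could discard the first
    -- 100 joined rows and then stop after producing 20 more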

The other way of doing it is to specify a query limit based on specific
index entries, so you say 'I want the query results produced by the first
20 index entries matching the ORDER BY, or by entries 100-199', and the
query is limited to using only those entries in the index.  In that case,
though, in joins, you could return more or fewer rows in the result
depending on the other tables, and that may be unacceptable.  However,
for this case, the advantage is that you don't need to process the rows
from 1 to 99 because you have been told the user only wants rows from
certain index slots.  If the user requests rows 50000-50100, this would
be much faster because you don't have to process the 50000 rows before
returning any data.  However, I question how often people grab stuff
from the center of large data sets.  Seems the QUERY_OFFSET idea may be
easier for users.

I will be commenting on the rest of the optimization postings tomorrow.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
