Re: FETCH FIRST clause PERCENT option

From: Surafel Temesgen
Subject: Re: FETCH FIRST clause PERCENT option
Date:
Msg-id: CALAY4q_4DwEJMp=474C8a1vRAtx9S=iW7g44MRQxX1J=mPg3aw@mail.gmail.com
In response to: Re: FETCH FIRST clause PERCENT option  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: FETCH FIRST clause PERCENT option  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
Hi

On Tue, Jan 1, 2019 at 10:08 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

I've been looking at this patch today, and I think there's a couple of
issues with the costing and execution. Consider a simple table with 1M rows:

  create table t as select i from generate_series(1,1000000) s(i);

and these two queries, that both select 1% of rows

  select * from t fetch first 10000 rows only;

  select * from t fetch first 1 percent rows only;

which are costed like this

  test=# explain select * from t fetch first 10000 rows only;
                             QUERY PLAN
  -----------------------------------------------------------------
   Limit  (cost=0.00..144.25 rows=10000 width=4)
     ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)
  (2 rows)

  test=# explain select * from t fetch first 1 percent rows only;
                             QUERY PLAN
  -----------------------------------------------------------------
   Limit  (cost=0.00..14425.00 rows=1000000 width=4)
     ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)
  (2 rows)

There are two issues with the costs in the "percent" case.

Firstly, the estimated number of rows should be about the same (10k) in
both cases, but the "percent" case incorrectly uses the total row count
(i.e. as if 100% rows were returned).

Ok, I will correct it.


It's correct that the total cost for the "percent" case is based on 100%
of rows, of course, because the code has to fetch everything and stash
it into the tuplestore in LIMIT_INITIAL phase.

But that implies the startup cost for the "percent" case can't be 0,
because all this work has to be done before emitting the first row.


Correct: the startup cost must be equal to the total cost of the source data path, and the total cost has to be slightly greater than the startup cost. I am planning to increase the total cost by limitCount * 0.1 (similar to parallel_tuple_cost), because fetching a tuple from the tuplestore is an operation very similar to passing a tuple from a worker to the master backend.
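As a rough illustration (not the patch's actual code), the proposed costing can be sketched like this, plugging in the numbers from the EXPLAIN output above; the 0.1 per-tuple factor here simply mirrors the default parallel_tuple_cost, as suggested:

```python
# Toy model of the proposed costing for FETCH FIRST ... PERCENT.
# Assumption: a flat per-tuple cost of 0.1 (the parallel_tuple_cost
# default) for reading rows back out of the tuplestore.

def percent_limit_costs(input_total, input_rows, percent,
                        tuple_fetch_cost=0.1):
    """Return (startup_cost, total_cost, row_estimate) for a PERCENT limit."""
    rows = int(input_rows * percent / 100.0)  # 1% of 1M rows -> 10000
    # All input rows must be stashed in the tuplestore before the first
    # row can be emitted, so startup covers the whole input scan.
    startup = input_total
    # Reading the limited rows back adds a small per-tuple cost on top.
    total = startup + rows * tuple_fetch_cost
    return startup, total, rows

print(percent_limit_costs(14425.00, 1000000, 1))
# -> (14425.0, 15425.0, 10000)
```

With these numbers the node would show startup cost 14425.00, total cost 15425.00, and a 10000-row estimate, instead of the zero startup cost and 1M-row estimate in the EXPLAIN output above.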

So these costing aspects need fixing, IMHO.


The execution part of the patch seems to be working correctly, but I
think there's an improvement - we don't need to execute the outer plan
to completion before emitting the first row. For example, let's say the
outer plan produces 10000 rows in total and we're supposed to return the
first 1% of those rows. We can emit the first row after fetching the
first 100 rows, we don't have to wait for fetching all 10k rows.
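The idea above can be sketched as an invariant (a toy model, not the patch's code; rounding down is an assumption here — the limit can only grow as more rows arrive, so rows within the floor are always safe):

```python
# Sketch: after buffering `fetched` input rows, it is already safe to
# emit floor(fetched * percent / 100) rows, because the final limit is
# non-decreasing in the (still unknown) total row count.

def safe_to_emit(fetched, percent):
    """Number of rows that may be emitted after `fetched` rows are buffered."""
    return fetched * percent // 100

# With 1% requested, the first row can go out after 100 fetched rows,
# long before the full input (e.g. 10000 rows) is exhausted.
print(safe_to_emit(100, 1))    # -> 1
print(safe_to_emit(10000, 1))  # -> 100
```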


But the total row count is not known in advance, so how can we determine when it is safe to return a row?

Regards
Surafel

 
