Re: Fix incorrect start up costs for WindowAgg paths (bug #17862)

Поиск
Список
Период
Сортировка
От David Rowley
Тема Re: Fix incorrect start up costs for WindowAgg paths (bug #17862)
Дата
Msg-id CAApHDvq58bLcA2x1-HXaCZ-yKfhLiovBwzgUO=G7JSSXqjsouQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Fix incorrect start up costs for WindowAgg paths (bug #17862)  (Andy Fan <zhihui.fan1213@gmail.com>)
Ответы Re: Fix incorrect start up costs for WindowAgg paths (bug #17862)  (Andy Fan <zhihui.fan1213@gmail.com>)
Список pgsql-hackers
Thanks for having a look at this.

On Thu, 3 Aug 2023 at 18:49, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> 1. ORDER BY or PARTITION BY
>
> select *, count(two) over (order by unique1) from tenk1 limit 1;
> DEBUG:  startup_tuples = 1
> DEBUG:  startup_tuples = 1
>
> select *, count(two) over (partition by unique1) from tenk1 limit 1;
> DEBUG:  startup_tuples = 1
> DEBUG:  startup_tuples = 1
>
> Due to the Executor of nodeWindowAgg, we have to fetch the next tuple
> until it mismatches with the current one, then we can calculate the
> WindowAgg function. In the current patch, we didn't count the
> mismatched tuple. I verified my thought with 'break at IndexNext'
> function and see IndexNext is called twice, so in the above case the
> startup_tuples should be 2?

You're probably right here. I'd considered that it wasn't that
critical and aimed to attempt to keep the estimate close to the number
of rows that'll be aggregated. I think that's probably not the best
thing to do as if you consider the EXCLUDE options, those just exclude
tuples from aggregation, it does not mean we read fewer tuples from
the subnode. I've updated the patch accordingly.

> 2. ORDER BY and PARTITION BY
>
> select two, hundred,
> count(two) over (partition by ten order by hundred)
> from tenk1 limit 1;
>
> DEBUG:  startup_tuples = 10
>  two | hundred | count
> -----+---------+-------
>    0 |       0 |   100
>
> If we consider the mismatched tuples, it should be 101?

I don't really see how we could do better with the current level of
statistics.  The stats don't know that there are only 10 distinct
"hundred" values for rows which have ten=1. All we have is n_distinct
on tenk1.hundred, which is 100.

> 3. As we can see the log for startup_tuples is logged twice sometimes,
> the reason is because it is used in cost_windowagg, so it is calculated
> for every create_one_window_path. I think the startup_tuples should be
> independent with the physical path, maybe we can cache it somewhere to
> save some planning cycles?

I wondered about that too but I ended up writing off the idea of
caching because the input_tuple count comes from the Path and the
extra calls are coming from other Paths, which could well have some
completely different value for input_tuples.

David

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: Use of additional index columns in rows filtering
Следующее
От: Melih Mutlu
Дата:
Сообщение: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication