Обсуждение: Improve rowcount estimate for UNNEST(column)

Поиск
Список
Период
Сортировка

Improve rowcount estimate for UNNEST(column)

От
Paul A Jungwirth
Дата:
Hello,

Here is a patch to improve rowcount estimates for
`UNNEST(some_array_column)`. Today we hard code this to 10, but we
have statistics about array size, so it's easy to use them.

I've seen plans where this would make a difference. If the array has
only 1 or 2 elements, then overestimating the rowcount by 10 leads to
unnecessary seqscans downstream. I can see how an underestimate would
cause issues too.

This patch builds on a391ff3c3d41 which allowed set-returning
functions like UNNEST to include a support function to estimate their
result count. (There is a nice writeup at
https://www.cybertec-postgresql.com/en/optimizer-support-functions/)
But that patch only changes UNNEST if it has a Const or ArrayExpr
argument.

The statistic I'm using is the last value in the DECHIST array, which
is the average number of distinct elements in the array. Using the
plain (non-distinct) element count would be more accurate, but we
don't have that, and using distinct elements is still better than a
hardcoded 10.

The real change is in estimate_array_length, which has several callers
besides array_unnest_support, but I think this change should give more
accurate estimates for all of them.

There is a comment that estimate_array_length must agree with
scalararraysel. I don't think this commit introduces any
discrepancies. The most relevant case there is `scalar = ANY/ALL
(array)`, which also consults DECHIST (and/or MCELEM).

I wasn't sure where to put a test. I finally settled on arrays.sql
since (1) that has other UNNEST tests (2) array_unnest_support is in
util/adt/arrayfuncs.c (3) I couldn't find a place devoted to
rowcount/selectivity estimates. I'm happy to move it if someone has a
better idea!

Based on 712dc2338b23.

Yours,

--
Paul              ~{:-)
pj@illuminatedcomputing.com

Вложения

Re: Improve rowcount estimate for UNNEST(column)

От
Laurenz Albe
Дата:
On Sat, 2023-11-25 at 09:19 -0800, Paul A Jungwirth wrote:
> Here is a patch to improve rowcount estimates for
> `UNNEST(some_array_column)`. Today we hard code this to 10, but we
> have statistics about array size, so it's easy to use them.
>
> I've seen plans where this would make a difference. If the array has
> only 1 or 2 elements, then overestimating the rowcount by 10 leads to
> unnecessary seqscans downstream. I can see how an underestimate would
> cause issues too.

The idea sounds good to me.
I didn't test or scrutinize the code, but I noticed that you use
EXPLAIN in the regression tests.  I think that makes the tests vulnerable
to changes in the parameters or in the block size.
Perhaps you can write a function that runs EXPLAIN and extracts just the
row count.  That should be stable enough.

Yours,
Laurenz Albe



Re: Improve rowcount estimate for UNNEST(column)

От
Tom Lane
Дата:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Sat, 2023-11-25 at 09:19 -0800, Paul A Jungwirth wrote:
>> Here is a patch to improve rowcount estimates for
>> `UNNEST(some_array_column)`. Today we hard code this to 10, but we
>> have statistics about array size, so it's easy to use them.

> The idea sounds good to me.

I didn't read the patch either yet, but it seems like a reasonable idea.

> I didn't test or scrutinize the code, but I noticed that you use
> EXPLAIN in the regression tests.  I think that makes the tests vulnerable
> to changes in the parameters or in the block size.

Yes, this regression test is entirely unacceptable; the numbers will
not be stable enough.  Even aside from the different-settings issue,
you can't rely on ANALYZE deriving exactly the same stats every time.
Usually what we try to do is devise a query where the plan shape
changes because of the better estimate.  That typically will provide
some insulation against small changes in the numerical estimates.

            regards, tom lane



Re: Improve rowcount estimate for UNNEST(column)

От
jian he
Дата:
Hi.
Since both array_op_test, arrest both are not dropped at the end of
src/test/regress/sql/arrays.sql.
I found using table array_op_test test more convincing.

select
        reltuples * 10 as original,
        reltuples * (select
floor(elem_count_histogram[array_length(elem_count_histogram,1)])
        from    pg_stats
        where   tablename = 'array_op_test' and         attname = 'i')
as with_patch
        ,(select (elem_count_histogram[array_length(elem_count_histogram,1)])
        from    pg_stats
        where   tablename = 'array_op_test' and         attname = 'i')
as elem_count_histogram_last_element
from pg_class where relname = 'array_op_test';
 original | with_patch | elem_count_histogram_last_element
----------+------------+-----------------------------------
     1030 |        412 |                         4.7843137
(1 row)

without patch:
explain select unnest(i)  from array_op_test;
                              QUERY PLAN
----------------------------------------------------------------------
 ProjectSet  (cost=0.00..9.95 rows=1030 width=4)
   ->  Seq Scan on array_op_test  (cost=0.00..4.03 rows=103 width=40)
(2 rows)

with patch:
 explain select unnest(i)  from array_op_test;
                              QUERY PLAN
----------------------------------------------------------------------
 ProjectSet  (cost=0.00..6.86 rows=412 width=4)
   ->  Seq Scan on array_op_test  (cost=0.00..4.03 rows=103 width=40)
(2 rows)
--------
because, in the estimate_array_length function, `nelem =
sslot.numbers[sslot.nnumbers - 1];` will round  4.7843137 to 4.
so with patch estimated row 412 = 103 *4. without patch estimated rows
= 103 * 10.



Re: Improve rowcount estimate for UNNEST(column)

От
jian he
Дата:
On Mon, Nov 27, 2023 at 3:05 PM jian he <jian.universality@gmail.com> wrote:
>
> Hi.
> Since both array_op_test, arrest both are not dropped at the end of
> src/test/regress/sql/arrays.sql.
> I found using table array_op_test test more convincing.
>
> select
>         reltuples * 10 as original,
>         reltuples * (select
> floor(elem_count_histogram[array_length(elem_count_histogram,1)])
>         from    pg_stats
>         where   tablename = 'array_op_test' and         attname = 'i')
> as with_patch
>         ,(select (elem_count_histogram[array_length(elem_count_histogram,1)])
>         from    pg_stats
>         where   tablename = 'array_op_test' and         attname = 'i')
> as elem_count_histogram_last_element
> from pg_class where relname = 'array_op_test';
>  original | with_patch | elem_count_histogram_last_element
> ----------+------------+-----------------------------------
>      1030 |        412 |                         4.7843137
> (1 row)
>
> without patch:
> explain select unnest(i)  from array_op_test;
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  ProjectSet  (cost=0.00..9.95 rows=1030 width=4)
>    ->  Seq Scan on array_op_test  (cost=0.00..4.03 rows=103 width=40)
> (2 rows)
>
> with patch:
>  explain select unnest(i)  from array_op_test;
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  ProjectSet  (cost=0.00..6.86 rows=412 width=4)
>    ->  Seq Scan on array_op_test  (cost=0.00..4.03 rows=103 width=40)
> (2 rows)
> --------

Hi.
I did a minor change. change estimate_array_length return type to
double,  cost_tidscan function inside `int ntuples` to `double
ntuples`.

 `clamp_row_est(get_function_rows(root, expr->funcid, clause));` will
round 4.7843137 to 5.
so with your patch and my refactor, the rows will be 103 * 5 = 515.

 explain select unnest(i)  from array_op_test;
                              QUERY PLAN
----------------------------------------------------------------------
 ProjectSet  (cost=0.00..7.38 rows=515 width=4)
   ->  Seq Scan on array_op_test  (cost=0.00..4.03 rows=103 width=40)
(2 rows)

Вложения

Re: Improve rowcount estimate for UNNEST(column)

От
Paul Jungwirth
Дата:
Hello,

On 11/26/23 12:22, Tom Lane wrote:
 > Yes, this regression test is entirely unacceptable; the numbers will
 > not be stable enough.  Even aside from the different-settings issue,
 > you can't rely on ANALYZE deriving exactly the same stats every time.
 > Usually what we try to do is devise a query where the plan shape
 > changes because of the better estimate.

Here is a patch with an improved test. With the old "10" estimate we get a Merge Join, but now that 
the planner can see there are only ~4 elements per array, we get a Nested Loop.

It was actually hard to get a new plan, since all our regress tables' arrays have around 5 elements 
average, which isn't so far from 10. Adding a table with 1- or 2- element arrays would be more 
dramatic. So I resorted to tuning the query with `WHERE seqno <= 50`. Hopefully that's not cheating 
too much.

I thought about also adding a test where the old code *underestimates*, but then I'd have to add a 
new table with big arrays. If it's worth it let me know.

On 11/26/23 23:05, jian he wrote:
 > I found using table array_op_test test more convincing.

True, arrtest is pretty small. The new test uses array_op_test instead.

On 11/29/23 20:35, jian he wrote:
 > I did a minor change. change estimate_array_length return type to double

I'm not sure I want to change estimate_array_length from returning ints to returning doubles, since 
it's called in many places. I can see how that might give plans that are more accurate yet, so maybe 
it's worth it? It feels like it ought to be a separate patch to me, but if others want me to include 
it here please let me know.

I did add clamp_row_est since it's essentially free and maybe gives some safety.

Rebased onto d43bd090a8.

Yours,

-- 
Paul              ~{:-)
pj@illuminatedcomputing.com
Вложения

Re: Improve rowcount estimate for UNNEST(column)

От
Tom Lane
Дата:
Paul Jungwirth <pj@illuminatedcomputing.com> writes:
> Here is a patch with an improved test. With the old "10" estimate we get a Merge Join, but now that
> the planner can see there are only ~4 elements per array, we get a Nested Loop.

Pushed with minor editorialization.  I ended up not using the test
case, because I was afraid it wouldn't be all that stable, and
code coverage shows that we are exercising the added code path
even without a bespoke test case.

> On 11/29/23 20:35, jian he wrote:
>>> I did a minor change. change estimate_array_length return type to double

> I'm not sure I want to change estimate_array_length from returning
> ints to returning doubles, since it's called in many places.

But your patch forces every one of those places to be touched anyway,
as a consequence of adding the "root" argument.  I looked at the
callers and saw that every single one of them (in core anyway) ends up
using the result in a "double" rowcount calculation, so we're really
not buying anything by converting to integer and back again.  There's
also a question of whether the number from DECHIST could be big enough
to overflow an int.  Perhaps not given the current calculation method,
but since it's a float4 there's at least a theoretical risk.  Hence,
I adopted Jian's suggestion.

One other point is that examine_variable is capable of succeeding
on things that aren't Vars, so I thought the restriction to Vars
was inappropriate.

            regards, tom lane