Re: Use extended statistics to estimate (Var op Var) clauses

From: Tomas Vondra
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Date:
Msg-id: ecc0b08a-518d-7ad6-17ed-a5e962fc4f5f@enterprisedb.com
In response to: Re: Use extended statistics to estimate (Var op Var) clauses  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses: Re: Use extended statistics to estimate (Var op Var) clauses  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List: pgsql-hackers
On 7/9/22 14:04, Dean Rasheed wrote:
> On Mon, 13 Dec 2021 at 02:21, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> I finally got around to this patch again, focusing mostly on the first
>> part that simply returns either 1.0 or 0.0 for Var op Var conditions
>> (i.e. the part not really using extended statistics).
>>
> 
> Just starting to look at this again, starting with 0001 ...
> 
> This needs to account for nullfrac, since x = x is only true if x is not null.
> 

Right, I forgot to account for nullfrac.
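
For illustration, here is a minimal SQL sketch of the issue (the table
and data are made up for this message, not taken from the patch or its
tests):

  CREATE TABLE nulltest (x int);
  -- half the rows are NULL
  INSERT INTO nulltest
    SELECT CASE WHEN i % 2 = 0 THEN i END
      FROM generate_series(1, 1000) i;
  ANALYZE nulltest;
  -- (x = x) is not true for NULL rows, so the estimate should be about
  -- rows * (1 - nullfrac) = 500, not the full 1000
  EXPLAIN SELECT * FROM nulltest WHERE x = x;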

> I don't like how matching_restriction_variables() is adding a
> non-trivial amount of overhead to each of these selectivity functions,
> by calling examine_variable() for each side, duplicating what
> get_restriction_variable() does later on.
> 

But matching_restriction_variables() does not use examine_variable()
anymore. It did originally, but 0002 removed all of that. Now it just
calls pull_varnos() and that's it. I admit leaving those bits unsquashed
was a bit confusing, but the first part was really 0001+0002+0003.


> I think it would probably be better to make the changes in
> var_eq_non_const(), where all of this information is available, and
> get rid of matching_restriction_variables() (it doesn't look like the
> new scalarineqsel_wrapper() code really needs
> matching_restriction_variables() - it can just use what
> get_restriction_variable() already returns).
> 

I'm not sure how that could help. var_eq_non_const() only really
applies to eqsel_internal(), so we'd still need to deal with various
other places for the inequality operators.


Attached is a rebased and somewhat cleaned up version of the patch
series, addressing the review comments so far and squashing the bits I
previously kept separate to showcase the changes.

I've also added a bunch of regression tests - queries with (Var op Var)
clauses of varying complexity, to demonstrate the effect of each patch.
I added them as 0001, so it's clear how the individual patches affect
the results.
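
To give an idea what those look like, here's a sketch of the same
flavor of query (this exact table and statistics object are
illustrative, not the actual regression tests):

  CREATE TABLE t (a int, b int);
  INSERT INTO t
    SELECT i / 10, i / 100 FROM generate_series(1, 10000) i;
  -- extended statistics on the column pair, used by the later patches
  CREATE STATISTICS t_stats (mcv) ON a, b FROM t;
  ANALYZE t;
  -- a (Var op Var) clause, previously estimated with a default
  -- selectivity constant
  EXPLAIN SELECT * FROM t WHERE a <= b;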

I've also written a generator that produces both data and queries with
(Var op Var) clauses, and then runs them on multiple connections to
compare the estimates. Running it requires some effort (setting up the
clusters, ...), but it shouldn't be too hard to get working.

The results are pretty massive (thousands of queries), but a simple
summary showing percentiles (0.25, 0.5, 0.75, 0.9) of the estimation
error, calculated as (the +1 in the denominator deals with 0 actual rows)

  abs(estimated - actual) / (actual + 1)

looks like this:

  clauses | stats |           master          |        patched
 ---------+-------+---------------------------+------------------------
        1 | no    | {0.39, 0.67, 4.17, 10000} | {0.29, 0.67, 1, 10000}
        2 | no    | {0.38, 0.79, 50, 9950}    | {0.26, 0.73, 1, 3333}
        3 | no    | {0.3, 0.84, 50, 3317}     | {0.22, 0.78, 1, 1111}
        4 | no    | {0.24, 0.84, 25, 1852}    | {0.14, 0.78, 1, 370}
        5 | no    | {0.2, 0.85, 17, 1100}     | {0.11, 0.78, 1, 50}
        1 | yes   | {0.39, 0.67, 4.17, 10000} | {0, 0.14, 1, 1}
        2 | yes   | {0.38, 0.79, 50, 9950}    | {0, 0.15, 1, 1}
        3 | yes   | {0.3, 0.84, 50, 3317}     | {0, 0.15, 1, 1}
        4 | yes   | {0.24, 0.84, 25, 1852}    | {0, 0.17, 1, 1}
        5 | yes   | {0.2, 0.85, 17, 1100}     | {0, 0.14, 1, 1}

This seems pretty good, IMO. Without extended stats on the columns,
there's only so much we can do, but even then the errors got much
smaller. With the stats it's clearly way better.
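
FWIW a summary like the above can be computed with a query along these
lines (assuming a hypothetical "results" table with one row per query -
not part of the posted script):

  SELECT clauses, stats,
         percentile_cont(ARRAY[0.25, 0.5, 0.75, 0.9]::float8[])
           WITHIN GROUP (ORDER BY abs(estimated - actual) / (actual + 1.0))
    FROM results
   GROUP BY clauses, stats
   ORDER BY stats, clauses;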

Of course, take this with a grain of salt - these are randomly
generated synthetic queries, with all queries considered equally
"likely". Clearly some queries are more likely to appear in real
applications, and those are the more important ones to estimate well.
However, the point here was to check whether there are classes of
queries that would get much worse, and I haven't found anything like
that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company