Re: planner chooses incremental but not the best one

From: ywgrit
Subject: Re: planner chooses incremental but not the best one
Date:
Msg-id: CAPt2h2YMkSOKUJH0DtPWNOa0ywY3=nVEqbFv0yqruT0YmeqqhA@mail.gmail.com
In reply to: Re: planner chooses incremental but not the best one  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: planner chooses incremental but not the best one  (ywgrit <yw987194828@gmail.com>)
List: pgsql-hackers

The one scenario I can come up with a possible solution for so far is the case where the query's predicate columns and grouping columns all belong to a single table.

For a query that contains a WHERE clause, num_groups could perhaps be estimated according to the following formula.

num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause * ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2, ... sort_col_m) / ndistinct(pred_col_1, pred_col_2, ... pred_col_n).

ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause = ndistinct(pred_col_1, pred_col_2, ... pred_col_n) * selectivity of rel.

pred_col_1 ... pred_col_n are the columns involved in the WHERE clause, and sort_col_1 ... sort_col_m are the columns involved in the GROUP BY clause.

The reason for multiplying directly by the selectivity of rel is that the selectivity depends only on the pred_cols, not on the sort_cols. The formula above can therefore be simplified as follows.

num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2, ... sort_col_m) * selectivity of rel.
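As a worked example, the full and simplified forms of the formula agree (all statistics below are hypothetical numbers chosen for illustration, not values from the thread):

```python
# Hypothetical statistics for a table with predicate columns and
# grouping columns; none of these numbers come from pg_statistic.
ndistinct_pred = 50          # ndistinct(pred_col_1 ... pred_col_n)
ndistinct_pred_sort = 1500   # ndistinct(pred cols + sort cols)
selectivity = 0.02           # selectivity of the WHERE clause on rel

# Full formula: the filtered predicate ndistinct, scaled by the
# number of groups per distinct predicate value.
ndistinct_pred_filtered = ndistinct_pred * selectivity
num_groups = ndistinct_pred_filtered * ndistinct_pred_sort / ndistinct_pred

# Simplified form: the ndistinct_pred terms cancel.
num_groups_simplified = ndistinct_pred_sort * selectivity

print(num_groups, num_groups_simplified)  # both evaluate to 30.0
```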

The correctness of the above formula depends on the following conditions.

  • The ndistinct(pred_col_1, pred_col_2, ... pred_col_n) and ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2, ... sort_col_m) statistics must already exist, and they need to be accurate.

  • Both the pred_cols and the sort_cols must be uniformly distributed; if not, statistics such as MCV lists are needed for correction.

  • The tuple count of rel must be the total number of tuples in the table, not the number of filtered tuples.

After experimentation in the scenario mentioned in the previous thread, the estimated num_groups is 3. The accuracy of this result relies strongly on the uniform distribution of b, which allows ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with the where clause applied to be estimated accurately.

I'd like to hear your opinions.

Regards.

ywgrit.


Tomas Vondra <tomas.vondra@enterprisedb.com> wrote on Mon, Dec 18, 2023 at 20:53:


On 12/18/23 11:40, Richard Guo wrote:
>
> On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
>
>     Oh! Now I see what you meant by using the new formula in 84f9a35e3
>     depending on how we sum tuples. I agree that seems like the right thing.
>
>     I'm not sure it'll actually help with the issue, though - if I apply the
>     patch, the plan does not actually change (and the cost changes just a
>     little bit).
>
>     I looked at this only very briefly, but I believe it's due to the
>     assumption of independence I mentioned earlier - we end up using the new
>     formula introduced in 84f9a35e3, but it assumes the
>     selectivity and number of groups are independent. But that's not the
>     case here, because the groups are very clearly correlated (with the
>     condition on "b").
>
>
> You're right.  The patch allows us to adjust the estimate of distinct
> values for appendrels using the new formula introduced in 84f9a35e3.
> But if the restrictions are correlated with the grouping expressions,
> the new formula does not behave well.  That's why the patch does not
> help in case [1], where 'b' and 'c' are correlated.
>
> OTOH, if the restrictions are not correlated with the grouping
> expressions, the new formula would perform quite well.  And in this case
> the patch would help a lot, as shown in [2] where estimate_num_groups()
> gives a much more accurate estimate with the help of this patch.
>
> So this patch could be useful in certain situations.  I'm wondering if
> we should at least have this patch (if it is right).
>

I do agree the patch seems to do the right thing, and it's worth pushing
on its own.

>
>     If that's the case, I'm not sure how to fix this :-(
>
>
> The commit message of 84f9a35e3 says
>
>     This could possibly be improved upon in the future by identifying
>     correlated restrictions and using a hybrid of the old and new
>     formulae.
>
> Maybe this is something we can consider trying.  But anyhow this is not
> an easy task I suppose.

Yeah, if it was easy, it'd have been done in 84f9a35e3 already ;-)

The challenge is where to get usable information about correlation
between columns. I only have a couple of very rough ideas of what we might
try. For example, if we have multi-column ndistinct statistics, we might
look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from

    ndistinct(b,c,d) / ndistinct(b,c)

If we know how many distinct values we have for the predicate column, we
could then estimate the number of groups. I mean, we know that for the
restriction "WHERE b = 3" we only have 1 distinct value, so we could
estimate the number of groups as

    1 * ndistinct(b,c)

I'm well aware this is only a very trivial example, and for more complex
examples it's likely way more complicated. But hopefully it illustrates
the general idea.
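The ndistinct-ratio idea could be sketched numerically as follows (all statistics are hypothetical, and the uniformity assumption is made explicit; this is not actual planner code):

```python
# Hypothetical multi-column ndistinct statistics (not from the thread).
ndistinct_b = 10     # ndistinct(b)
ndistinct_bc = 100   # ndistinct(b,c)
ndistinct_bcd = 400  # ndistinct(b,c,d)

# Correlation hint: how many distinct d values each (b,c) group carries.
d_per_bc = ndistinct_bcd / ndistinct_bc  # 4.0

# "WHERE b = 3" leaves 1 distinct b value; assuming uniformity, each
# b value maps to ndistinct(b,c) / ndistinct(b) distinct (b,c) groups.
surviving_b_values = 1
est_groups = surviving_b_values * ndistinct_bc / ndistinct_b

print(d_per_bc, est_groups)  # 4.0 10.0
```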

The other idea would be to maybe look at multi-column MCV, and try using
it to deduce cross-column correlation too (it could be more accurate for
arbitrary predicates).

And finally, we might try being a bit more pessimistic and look at what
the "worst case" behavior could be. So for example instead of trying to
estimate the real number of groups, we'd ask "What's the minimum number
of groups we're likely to get?". And use that to cost the incremental
sort. But I don't think we do this elsewhere, and I'm not sure we want
to start with this risk-based approach here. It might be OK, because we
usually expect the incremental sort to be much cheaper, ...
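The risk-based approach could amount to taking the smallest of several candidate group estimates and costing the incremental sort against that (candidate names and numbers below are hypothetical placeholders):

```python
# Sketch of a pessimistic group estimate: rather than one point estimate,
# collect candidate estimates and keep the minimum. Fewer presorted groups
# means less benefit from incremental sort, so costing with the minimum
# guards against over-optimistic plans. All values are hypothetical.
candidate_estimates = {
    "independence_formula": 30.0,  # selectivity * ndistinct(pred + sort cols)
    "ndistinct_ratio": 10.0,       # multi-column ndistinct based estimate
    "mcv_corrected": 3.0,          # estimate corrected with multi-column MCV
}

# "What's the minimum number of groups we're likely to get?"
worst_case_groups = min(candidate_estimates.values())
print(worst_case_groups)  # 3.0
```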

Is this something you would be interested in exploring? I don't have
capacity to work on this myself, but I'd be available for reviews,
feedback and so on.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

