Re: MergeJoin beats HashJoin in the case of multiple hash clauses
From | Andrei Lepikhov |
---|---|
Subject | Re: MergeJoin beats HashJoin in the case of multiple hash clauses |
Date | |
Msg-id | 8750fa3f-43b6-40db-803f-d6ae471384ef@gmail.com |
In response to | MergeJoin beats HashJoin in the case of multiple hash clauses (Andrey Lepikhov <a.lepikhov@postgrespro.ru>) |
Responses | Re: MergeJoin beats HashJoin in the case of multiple hash clauses |
List | pgsql-hackers |
On 17/2/2025 01:34, Alexander Korotkov wrote:
> Hi, Andrei!
>
> On Tue, Oct 8, 2024 at 8:00 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
> Thank you for your work on this subject. I agree with the general
> direction. While everyone has used conservative estimates for a long
> time, it's better to change them only when we're sure about it.
> However, I'm still not sure I get the conservatism.
>
> if (innerbucketsize > thisbucketsize)
>     innerbucketsize = thisbucketsize;
> if (innermcvfreq > thismcvfreq)
>     innermcvfreq = thismcvfreq;
>
> AFAICS, even in the worst case (all columns are totally correlated),
> the overall bucket size should be the smallest bucket size among
> clauses (not the largest). And the same is true of MCV. As a mental
> experiment, we can add a new clause to hash join, which is always true
> because columns on both sides have the same value. In fact, it would
> have almost no influence except for the cost of extracting additional
> columns and the cost of executing additional operators. But in the
> current model, this additional clause would completely ruin
> thisbucketsize and thismcvfreq, making hash join extremely
> unappealing. Should we still revise this to calculate minimum instead
> of maximum?

I agree with your point. But I think the code works precisely the way you
have described.

> I've slightly revised the patch. I've run pg_indent and renamed
> s/saveList/origin_rinfos/g for better readability.

Thank you!

> Also, the patch badly needs regression test coverage. We can't
> include costs in expected outputs. But that could be some plans,
> which previously were reliably merge joins but now become reliable
> hash joins.

I added one test here. Writing more tests on this feature is hard, but
feature [1] may provide us with additional tools to reveal extended stat
internals. I have also thought about injection points, but that seems
like an over-complication.

[1] Showing applied extended statistics in explain Part 2
https://www.postgresql.org/message-id/flat/TYYPR01MB82310B308BA8770838F681619E5E2%40TYYPR01MB8231.jpnprd01.prod.outlook.com

--
regards, Andrei Lepikhov
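
To make the point about the quoted comparison concrete, here is a minimal standalone sketch (not the planner code) of how those two `if` statements behave when folded over several hash clauses. The `ClauseEstimate` struct and `combine_estimates` function are hypothetical names introduced only for illustration, and the starting value of 1.0 is an assumption standing in for "worst case". Only the four comparison lines are taken from the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-clause estimates; this is not a PostgreSQL struct. */
typedef struct ClauseEstimate
{
    double bucketsize;  /* estimated fraction of inner rows in one bucket */
    double mcvfreq;     /* estimated frequency of the most common value */
} ClauseEstimate;

/*
 * Fold per-clause estimates with the comparison quoted in the thread.
 * Starting from 1.0 (an assumed worst case), a clause can only lower the
 * running value, so the result is the minimum across all clauses.
 */
void
combine_estimates(const ClauseEstimate *clauses, size_t nclauses,
                  double *innerbucketsize, double *innermcvfreq)
{
    assert(clauses != NULL || nclauses == 0);

    *innerbucketsize = 1.0;
    *innermcvfreq = 1.0;

    for (size_t i = 0; i < nclauses; i++)
    {
        double thisbucketsize = clauses[i].bucketsize;
        double thismcvfreq = clauses[i].mcvfreq;

        if (*innerbucketsize > thisbucketsize)
            *innerbucketsize = thisbucketsize;
        if (*innermcvfreq > thismcvfreq)
            *innermcvfreq = thismcvfreq;
    }
}
```

Under these assumptions, appending a clause that does not discriminate at all (bucket size and MCV frequency both 1.0) leaves the combined values unchanged, which matches the reading that the loop keeps the smallest estimate rather than the largest.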