Re: A wrong index choose issue because of inaccurate statistics

Поиск
Список
Период
Сортировка
От Andy Fan
Тема Re: A wrong index choose issue because of inaccurate statistics
Дата
Msg-id CAKU4AWp=PNpm-AjsCop0b-sNyOg7P1Q3EpJ7WuhRDp5rzQNr0Q@mail.gmail.com
обсуждение исходный текст
Ответ на Re: A wrong index choose issue because of inaccurate statistics  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Список pgsql-hackers


On Mon, Jun 8, 2020 at 10:16 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
I know one project where they used PostgreSQL code base to detect
"robust plans". https://dsl.cds.iisc.ac.in/projects/PICASSO/. Some of
the papers cited in https://www.vldb.org/pvldb/vldb2010/papers/D01.pdf
describe the idea. 
 
In short, the idea is to annotate a plan with a "bandwidth" i.e. how
does the plan fair with degradation of statistics. A plan which has a
slightly higher cost which doesn't degrade much with degradation of
statistics is preferred over a low cost plan whose cost rises sharply
with degradation of statistics. This is similar to what David is
suggesting.


Great to know them,  thank you for sharing it, links have been bookmarked.  

On Fri, Jun 5, 2020 at 12:00 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>
>
> pá 5. 6. 2020 v 8:19 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
>>
>> On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>> > The one-line fix describe the exact idea in my mind:
>> >
>> > +++ b/src/backend/optimizer/path/costsize.c
>> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>> >
>> >         cpu_run_cost += cpu_per_tuple * tuples_fetched;
>> >
>> > +       /*
>> > +        * To make the planner more robust to handle some inaccurate statistics
>> > +        * issue, we will add a extra cost to qpquals so that the less qpquals
>> > +        * the lower cost it has.
>> > +        */
>> > +       cpu_run_cost += 0.01 * list_length(qpquals);
>> > +
>> >
>> > This change do fix the issue above, but will it make some other cases worse? My
>> > answer is no based on my current knowledge, and this is most important place
>> > I want to get advised. The mainly impact I can figure out is: it not only
>> > change the cost difference between (a, b) and (a, c) it also cause the cost
>> > difference between Index scan on (a, c) and seq scan.  However the
>> > cost different between index scan and seq scan are huge by practice so
>> > I don't think this impact is harmful.
>>
>> Didn't that idea already get shot down in the final paragraph on [1]?
>>
>> I understand that you wish to increase the cost by some seemingly
>> innocent constant to fix your problem case.  Here are my thoughts
>> about that: Telling lies is not really that easy to pull off. Bad
>> liers think it's easy and good ones know it's hard. The problem is
>> that the lies can start small, but then at some point the future you
>> must fashion some more lies to account for your initial lie. Rinse and
>> repeat that a few times and before you know it, your course is set
>> well away from the point of truth.  I feel the part about "rinse and
>> repeat" applies reasonably well to how join costing works.  The lie is
>> likely to be amplified as the join level gets deeper.
>>
>> I think you need to think of a more generic solution and propose that
>> instead.  There are plenty of other quirks in the planner that can
>> cause suffering due to inaccurate or non-existing statistics. For
>> example, due to how we multiply individual selectivity estimates,
>> having a few correlated columns in a join condition can cause the
>> number of join rows to be underestimated. Subsequent joins can then
>> end up making bad choices on which join operator to use based on those
>> inaccurate row estimates.  There's also a problem with WHERE <x> ORDER
>> BY col LIMIT n; sometimes choosing an index that provides pre-sorted
>> input to the ORDER BY but cannot use <x> as an indexable condition.
>> We don't record any stats to make better choices there, maybe we
>> should, but either way, we're taking a bit risk there as all the rows
>> matching <x> might be right at the end of the index and we might need
>> to scan the entire thing before hitting the LIMIT. For now, we just
>> assume completely even distribution of rows. i.e. If there are 50 rows
>> estimated in the path and the limit is for 5 rows, then we'll assume
>> we need to read 10% of those before finding all the ones we need. In
>> reality, everything matching <x> might be 95% through the index and we
>> could end up reading 100% of rows. That particular problem is not just
>> caused by the uneven distribution of rows in the index, but also from
>> selectivity underestimation.
>>
>> I'd more recently wondered if we shouldn't have some sort of "risk"
>> factor to the cost model. I really don't have ideas on how exactly we
>> would calculate the risk factor in all cases, but in this case,  say
>> the risk factor always starts out as 1. We could multiply that risk
>> factor by some >1 const each time we find another index filter qual.
>> add_path() can prefer lower risk paths when the costs are similar.
>> Unsure what the exact add_path logic would be. Perhaps a GUC would
>> need to assist with the decision there.   Likewise, with
>> NestedLoopPaths which have a large number of join quals, the risk
>> factor could go up a bit with those so that we take a stronger
>> consideration for hash or merge joins instead.
>>
>
> I thought about these ideas too. And I am not alone.
>
> https://hal.archives-ouvertes.fr/hal-01316823/document
>
> Regards
>
> Pavel
>
>> Anyway, it's pretty much a large research subject which would take a
>> lot of work to iron out even just the design. It's likely not a
>> perfect idea, but I think it has a bit more merit that trying to
>> introduce lies to the cost modal to account for a single case where
>> there is a problem.
>>
>> David
>>
>> [1] https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development
>>
>>


--
Best Wishes,
Ashutosh Bapat


--
Best Regards
Andy Fan

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: Add -Wold-style-definition to CFLAGS?
Следующее
От: Vianello Fabio
Дата:
Сообщение: RE: BUG #16481: Stored Procedure Triggered by Logical Replication isUnable to use Notification Events