
A wrong index choose issue because of inaccurate statistics

From:
Andy Fan
Date:
This thread is a follow-up to thread [1], where I didn't manage to describe the
issue and the solution I have in mind clearly. So I'm starting this thread to fix
that, and also to enrich the topic with the advice from Ashutosh, Tomas and Tom.

Inaccurate statistics are unavoidable and can cause lots of issues; I don't
think we can fix most of them. The issue I want to talk about here is this:
given "select * from t where a = x and b = y" and indexes on (a, b) and (a, c),
the current implementation may end up choosing (a, c), while I think we should
always choose index (a, b) over (a, c).

Why would (a, c) be chosen? If the planner thinks a = x returns only 1 row, it
costs index (a, b) as "an index access checking 2 quals and returning 1 row,
plus a heap fetch for that row", and index (a, c) as "an index access checking
1 qual and returning 1 row, plus a heap fetch for that row with the other qual
applied as a filter". There is no cost difference between evaluating a qual
during the index scan and during the heap fetch, so the final costs are exactly
the same. When the costs are exactly the same, whichever path is found first is
the one chosen in the end. You can use the attached reproduced.sql to reproduce
this issue.
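
For illustration, here is a minimal sketch of the kind of setup that can show
the behaviour (the attached reproduced.sql is the authoritative repro; the
table name, data distribution and constants below are illustrative assumptions):

-- Sketch only: names and data here are assumptions, not the attachment.
create table t (a int, b int, c int);
insert into t select 1, i, i from generate_series(1, 100000) i;
create index t_a_b on t (a, b);
create index t_a_c on t (a, c);
analyze t;
-- These rows arrive after ANALYZE, so the statistics believe a = 2 matches
-- at most one row and both index paths get the same estimated cost.
insert into t select 2, i % 100, i % 100 from generate_series(1, 10000) i;
explain select * from t where a = 2 and b = 3;
-- Depending on which path is generated first, the plan may use t_a_c with
-- "Filter: (b = 3)" even though t_a_b can satisfy both conditions.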

The solution in my mind is simply to adjust the cost model so that a qual
evaluated as a filter during the heap fetch is treated as slightly more
expensive, so that index (a, b) is always chosen. The planner will still think
only 1 row is returned, which may be wrong, but that is not something I can fix
here and it does not affect the final index choice anyway.

The one-line fix below describes the exact idea in my mind:

+++ b/src/backend/optimizer/path/costsize.c
@@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,

        cpu_run_cost += cpu_per_tuple * tuples_fetched;

+       /*
+        * To make the planner more robust against inaccurate statistics,
+        * add an extra cost per qpqual so that the fewer qpquals a path
+        * has, the lower its cost.
+        */
+       cpu_run_cost += 0.01 * list_length(qpquals);
+

This change does fix the issue above, but will it make some other cases worse?
My answer is no, based on my current knowledge, and this is the main place
where I want advice. The main impact I can see is that it not only changes the
cost difference between (a, b) and (a, c), it also changes the cost difference
between an index scan on (a, c) and a seq scan. However, the cost difference
between an index scan and a seq scan is huge in practice, so I don't think this
impact is harmful.

I tested TPC-H to see whether this change causes any unexpected behavior; the
data and indexes are set up based on [2]. The attached normal.log contains the
plans without this patch and patched.log the plans with this patch. In summary,
I didn't find any unexpected behavior because of the change. All the plans
whose cost changed have the following pattern, which is expected:

Index Scan ...
   Index Cond: ...
   Filter: ...

Any thoughts?

Attachments

Re: A wrong index choose issue because of inaccurate statistics

From:
Andy Fan
Date:


> Why would (a, c) be chosen? If the planner thinks a = x returns only 1 row ..

I did some more research and found that the statement above is not accurate;
the root cause of this situation is that IndexSelectivity = 0. Even so, I don't
think we can fix anything on that side, since IndexSelectivity is calculated
from statistics and we don't know whether it is 1% wrong, 10% wrong or more,
just as Tomas said.

The way to fix it is just to add a "small" extra cost for the qpquals of an
index scan; that cost should be small enough not to have an impact on other
parts. I will discuss how small is small in detail later; for now we can treat
it as just a GUC variable.

+++ b/src/backend/optimizer/path/costsize.c
@@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,

        cpu_run_cost += cpu_per_tuple * tuples_fetched;

+       /*
+        * To make the planner more robust against inaccurate statistics,
+        * add an extra cost per qpqual so that the fewer qpquals a path
+        * has, the lower its cost.
+        */
+       cpu_run_cost += stat_stale_cost * list_length(qpquals);
+

If we want to reduce the impact of this change further, we could add it only
when IndexSelectivity == 0.

How should the value of stat_stale_cost be set? Since the minimum cost for a
query should be one cpu_tuple_cost, which is 0.01 by default, adding a 0.01
cost for each qpqual of an index scan should not make a big difference.
However, we may sometimes want to set it to 0.13 if we allow for
index->tree_height being estimated wrongly by 1 level (one level of btree
descent is costed at 50 * cpu_operator_cost = 50 * 0.0025 = 0.125). I don't
know how that happens, but it does seem to happen in production environments.
At the same time, it is unlikely that index->tree_height is estimated wrongly
by 2 or more. So basically we can set this value to 0 (totally disable this
feature), 0.01 (should be OK for most cases), or 0.13 (a bit aggressive).
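
For illustration, hypothetical usage might look like the following
(stat_stale_cost does not exist in PostgreSQL today; these statements only
mirror the three values discussed above):

-- Hypothetical: assumes the proposed stat_stale_cost GUC were added.
SET stat_stale_cost = 0;     -- disable the extra per-qual charge entirely
SET stat_stale_cost = 0.01;  -- one cpu_tuple_cost per index filter qual
SET stat_stale_cost = 0.13;  -- also absorb a tree_height misestimate of 1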

A wrong estimation of IndexSelectivity = 0 might be a common case, and if
people have just two related indexes like (A, B) and (A, C), we have a 50%
chance of making the wrong decision, so I would say this case is worth the
trouble. My current implementation doesn't look great, so any suggestion for
further research is very welcome.

--
Best Regards
Andy Fan

Re: A wrong index choose issue because of inaccurate statistics

From:
David Rowley
Date:
On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> The one-line fix below describes the exact idea in my mind:
>
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>
>         cpu_run_cost += cpu_per_tuple * tuples_fetched;
>
> +       /*
> +        * To make the planner more robust against inaccurate statistics,
> +        * add an extra cost per qpqual so that the fewer qpquals a path
> +        * has, the lower its cost.
> +        */
> +       cpu_run_cost += 0.01 * list_length(qpquals);
> +
>
> This change does fix the issue above, but will it make some other cases worse?
> My answer is no, based on my current knowledge, and this is the main place
> where I want advice. The main impact I can see is that it not only changes the
> cost difference between (a, b) and (a, c), it also changes the cost difference
> between an index scan on (a, c) and a seq scan. However, the cost difference
> between an index scan and a seq scan is huge in practice, so I don't think this
> impact is harmful.

Didn't that idea already get shot down in the final paragraph on [1]?

I understand that you wish to increase the cost by some seemingly
innocent constant to fix your problem case.  Here are my thoughts
about that: Telling lies is not really that easy to pull off. Bad
liars think it's easy and good ones know it's hard. The problem is
that the lies can start small, but then at some point the future you
must fashion some more lies to account for your initial lie. Rinse and
repeat that a few times and before you know it, your course is set
well away from the point of truth.  I feel the part about "rinse and
repeat" applies reasonably well to how join costing works.  The lie is
likely to be amplified as the join level gets deeper.

I think you need to think of a more generic solution and propose that
instead.  There are plenty of other quirks in the planner that can
cause suffering due to inaccurate or non-existing statistics. For
example, due to how we multiply individual selectivity estimates,
having a few correlated columns in a join condition can cause the
number of join rows to be underestimated. Subsequent joins can then
end up making bad choices on which join operator to use based on those
inaccurate row estimates.  There's also a problem with WHERE <x> ORDER
BY col LIMIT n; sometimes choosing an index that provides pre-sorted
input to the ORDER BY but cannot use <x> as an indexable condition.
We don't record any stats to make better choices there, maybe we
should, but either way, we're taking a bit of a risk there as all the rows
matching <x> might be right at the end of the index and we might need
to scan the entire thing before hitting the LIMIT. For now, we just
assume completely even distribution of rows. i.e. If there are 50 rows
estimated in the path and the limit is for 5 rows, then we'll assume
we need to read 10% of those before finding all the ones we need. In
reality, everything matching <x> might be 95% through the index and we
could end up reading 100% of rows. That particular problem is not just
caused by the uneven distribution of rows in the index, but also from
selectivity underestimation.
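
For illustration, a minimal sketch of that kind of query (the table and column
names are assumptions, not taken from the thread):

-- Hypothetical example of the risk described above: the index on (col)
-- provides pre-sorted input for the ORDER BY, but x = 5 is not an indexable
-- condition for it, so the planner may walk the index in col order,
-- filtering on x and hoping to satisfy the LIMIT early.
create table t2 (x int, col int);
create index t2_col_idx on t2 (col);
select * from t2 where x = 5 order by col limit 5;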

I'd more recently wondered if we shouldn't have some sort of "risk"
factor to the cost model. I really don't have ideas on how exactly we
would calculate the risk factor in all cases, but in this case,  say
the risk factor always starts out as 1. We could multiply that risk
factor by some >1 const each time we find another index filter qual.
add_path() can prefer lower risk paths when the costs are similar.
Unsure what the exact add_path logic would be. Perhaps a GUC would
need to assist with the decision there.   Likewise, with
NestedLoopPaths which have a large number of join quals, the risk
factor could go up a bit with those so that we take a stronger
consideration for hash or merge joins instead.

Anyway, it's pretty much a large research subject which would take a
lot of work to iron out even just the design. It's likely not a
perfect idea, but I think it has a bit more merit than trying to
introduce lies to the cost model to account for a single case where
there is a problem.

David

[1] https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development



Re: A wrong index choose issue because of inaccurate statistics

From:
Pavel Stehule
Date:


On Fri, 5 Jun 2020 at 8:19, David Rowley <dgrowleyml@gmail.com> wrote:
> I'd more recently wondered if we shouldn't have some sort of "risk"
> factor to the cost model. I really don't have ideas on how exactly we
> would calculate the risk factor in all cases, but in this case,  say
> the risk factor always starts out as 1. We could multiply that risk
> factor by some >1 const each time we find another index filter qual.
> add_path() can prefer lower risk paths when the costs are similar.
> Unsure what the exact add_path logic would be. Perhaps a GUC would
> need to assist with the decision there.   Likewise, with
> NestedLoopPaths which have a large number of join quals, the risk
> factor could go up a bit with those so that we take a stronger
> consideration for hash or merge joins instead.


I thought about these ideas too. And I am not alone.

https://hal.archives-ouvertes.fr/hal-01316823/document

Regards

Pavel



Re: A wrong index choose issue because of inaccurate statistics

From:
Ashutosh Bapat
Date:
I know of one project that used the PostgreSQL code base to detect
"robust plans": https://dsl.cds.iisc.ac.in/projects/PICASSO/. Some of
the papers cited in https://www.vldb.org/pvldb/vldb2010/papers/D01.pdf
describe the idea.

In short, the idea is to annotate a plan with a "bandwidth", i.e. how
the plan fares as the statistics degrade. A plan with a slightly higher
cost that doesn't degrade much as the statistics degrade is preferred
over a low-cost plan whose cost rises sharply as the statistics degrade.
This is similar to what David is suggesting.



--
Best Wishes,
Ashutosh Bapat



Re: A wrong index choose issue because of inaccurate statistics

From:
Andy Fan
Date:


On Fri, Jun 5, 2020 at 2:19 PM David Rowley <dgrowleyml@gmail.com> wrote:

> Didn't that idea already get shot down in the final paragraph on [1]?


Thanks for chiming in. My take is that I didn't describe my idea clearly
enough the first time, so both Tomas and Tom didn't spend much time reading it
(no offense; I understand they have lots of things to do every day), so I
re-summarized the issue here to make it easier to read.

In Tomas's reply, he raised concerns about how to fix the issue, since we
don't know by how much the estimate is off, 1%, 10% or more, so I emphasized
that I don't actually touch that part. The wrong estimate may still play a bad
role in later joins, but that is not the issue I am trying to fix here.


> I understand that you wish to increase the cost by some seemingly
> innocent constant to fix your problem case.  Here are my thoughts
> about that: Telling lies is not really that easy to pull off. Bad
> liars think it's easy and good ones know it's hard. The problem is
> that the lies can start small, but then at some point the future you
> must fashion some more lies to account for your initial lie. Rinse and
> repeat that a few times and before you know it, your course is set
> well away from the point of truth.  I feel the part about "rinse and
> repeat" applies reasonably well to how join costing works.  The lie is
> likely to be amplified as the join level gets deeper.

I agree with that to some extent. However, we can still provide more options
to users; at the same time, I believe we should add such options very
carefully.
 
> Unsure what the exact add_path logic would be. Perhaps a GUC would
> need to assist with the decision there.   Likewise, with
> NestedLoopPaths which have a large number of join quals, the risk
> factor could go up a bit with those so that we take a stronger
> consideration for hash or merge joins instead.

I probably underestimated the impact of a large number of join quals. This
looks like a weakness we can't comfortably ignore, so I checked the code
further; maybe we don't need a risk factor for this case.

For the query WHERE a = 1 AND b = 2, both Selectivity(a = 1) and
Selectivity(b = 2) are greater than 0 even when the statistics are quite
stale, so IndexSelectivity is greater than 0 as well. Based on this,
IndexSelectivity(A, B) should be less than IndexSelectivity(A, C) for the
above query. However, they still produce the same cost because of the code
below (from genericcostestimate and btcostestimate):

numIndexTuples = indexSelectivity * index->rel->tuples;
numIndexTuples = rint(numIndexTuples / num_sa_scans);

if (numIndexTuples < 1.0)
    numIndexTuples = 1.0;

numIndexTuples is later used to calculate the cost, and the code above eats
the difference in indexSelectivity.
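
To see how that happens, here is a toy calculation (the row count and
selectivities are made-up numbers, not planner output) that mirrors the
rint()-and-clamp logic above:

-- A 10x selectivity difference (1e-7 vs 1e-6) disappears once the estimate
-- is rounded and then clamped to a minimum of 1.0.
SELECT greatest(round(0.0000001 * 1000000), 1.0) AS num_index_tuples_a_b,
       greatest(round(0.000001 * 1000000), 1.0) AS num_index_tuples_a_c;
-- both columns come out as 1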

Many places say we need to "round to integer", but I still haven't figured
out why that is a must. If we "round to integer" only after the index cost is
calculated, the issue can be fixed as well.

The attached patch implements the idea in my mind. Since it may lower the
index cost when numIndexTuples < 1.0, it makes the optimizer prefer a nested
loop over a hash join when the loop is small, which is a common case in our
regression tests where we don't have much data, so there are several plan
changes like that.

Another impact I found in the regression tests is that the optimizer chooses
an IndexScan on a conditional index rather than an IndexOnlyScan on a normal
index. I checked the code and didn't find anything wrong, so I'd say that is
because the data set is too small.

I also tested the TPC-H workload: Query-12 changed from a hash join to a
nested loop, and the execution time went from 1605.118 ms to 699.032 ms.


> the root cause of this situation is that IndexSelectivity = 0.

This is wrong. I got the 0 from elog(INFO, "%f", IndexSelectivity), which
reported 0 while the value is actually just very small.

--
Best Regards
Andy Fan
Attachments

Re: A wrong index choose issue because of inaccurate statistics

From:
Andy Fan
Date:


On Fri, Jun 5, 2020 at 2:30 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:



> I thought about these ideas too. And I am not alone.
>
> https://hal.archives-ouvertes.fr/hal-01316823/document



Thanks for the document; after checking it, it looks more like Oracle's
statistics feedback [1]. Hope we can have it someday :)


--
Best Regards
Andy Fan

Re: A wrong index choose issue because of inaccurate statistics

From:
Andy Fan
Date:


On Mon, Jun 8, 2020 at 10:16 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> I know of one project that used the PostgreSQL code base to detect
> "robust plans": https://dsl.cds.iisc.ac.in/projects/PICASSO/. Some of
> the papers cited in https://www.vldb.org/pvldb/vldb2010/papers/D01.pdf
> describe the idea.
>
> In short, the idea is to annotate a plan with a "bandwidth", i.e. how
> the plan fares as the statistics degrade. A plan with a slightly higher
> cost that doesn't degrade much as the statistics degrade is preferred
> over a low-cost plan whose cost rises sharply as the statistics degrade.
> This is similar to what David is suggesting.


Great to know about them; thank you for sharing, the links have been bookmarked.



--
Best Regards
Andy Fan