Thread: Less selective index chosen unexpectedly
On Tue, May 18, 2021 at 1:36 PM James Coleman <jtc331@gmail.com> wrote:
> The attached test case demonstrates the full problem at least some of
> the time -- I've not been able to make it deterministic, but I'd say it
> shows the wrong plan choice (resulting in ~5k unnecessarily processed
> and filtered rows) roughly 2/3 of the time on my laptop against PG11
> (what we're running on in production). Against ~master I was able to
> reproduce the wrong plan choice, but not the large number of filtered
> rows, until I modified the second INSERT's case statement to use
> "n.i > 5000" instead of "n.i < 25000" -- I assume this is due to some
> combination of index deduplication/suffix truncation. Interestingly,
> with those changes it seems to be more deterministic against ~master
> than the original repro case against 11.

That's expected with 12, because the old "getting tired" behavior for dealing with pages full of duplicates in !heapkeyspace indexes adds non-determinism (to every index tuple insertion involving an incoming item that already has many duplicates in the index).

I actually relied on the determinism when I wrote my own index space utilization test suite for the Postgres 12 and Postgres 13 work. With a little care you can expect to get exactly the same index with a test case involving serial inserts.

--
Peter Geoghegan
James Coleman <jtc331@gmail.com> writes:
> Specifically we have a table (simplified repro case):
> create table items(d date, t text, fk integer);
> create index on items(d);
> create index on items(t, fk, d);
>
> For a query like:
> select * from items where d = '2021-05-18' and fk = 1 and t = 'type0' limit 1;
>
> It's possible to get either an index scan on items_d_idx with a filter
> on "fk" and "t" or an index scan on items_t_fk_d_idx without the need
> for a filter. Even if both plans estimate a low cost and a single row,
> it seems to me that the scan on the index containing more columns (and
> no filter) ought to be pretty strongly preferred unless the cost or
> estimated rows is dramatically higher.

Actually not. The multi-column index is going to be physically larger, which means that the estimated cost to descend into it is going to be larger just on the grounds of more I/O. The extra comparisons to include the additional columns in that search aren't free either. Since you've specified LIMIT 1, you've also given up much of any cost advantage that might accrue to scanning items after the first match. Hence, the only way that the multi-column index is going to win out is if the extra filter conditions are estimated to be selective enough (relative to the condition on "d") that we have to scan multiple entries in the d-only index before getting the first match.

Experimenting by adding

explain select * from items where d = '2021-05-18' limit 1;

(to see what the estimated selectivity of just that condition is) at each step of your script, I see that in the trouble cases the "d" condition is by itself estimated to match only one row. If that were accurate, the planner would be quite right to pick the smaller index.

The only thing I see that's really going wrong here is marginally inaccurate stats, especially right after a big insertion that's not reflected into the stats yet. I'm not sure there's much to improve there. You could increase the stats target some more, though of course that just pushes out the size of table where the issue will appear.

			regards, tom lane
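The asymmetry here -- a one-row estimate is cheap to believe but expensive to be wrong about -- can be sketched with a toy cost model. This is purely illustrative (it is not PostgreSQL's actual costing, and every number is made up):

```python
# Toy cost model (illustrative only): descending an index costs more the
# taller it is, and every row scanned past the first match pays a small
# per-row cost, plus a filter cost when quals must be rechecked.

def index_scan_cost(height, rows_scanned, filter_cost=0.0):
    return height * 1.0 + rows_scanned * (0.01 + filter_cost)

# Smaller d-only index (height 2): cheap if the one-row estimate for "d"
# is right, disastrous if ~5000 rows must be scanned and filtered first.
cost_small_estimated = index_scan_cost(2, rows_scanned=1, filter_cost=0.005)
cost_small_actual = index_scan_cost(2, rows_scanned=5000, filter_cost=0.005)

# Larger 3-column index (height 3): returns the first match directly.
cost_multi = index_scan_cost(3, rows_scanned=1)

assert cost_small_estimated < cost_multi    # planner's view: smaller index wins
assert cost_small_actual > 20 * cost_multi  # reality: it loses by a wide margin
```

The point of the sketch is only that the downside of the wrong choice is wildly asymmetric, while the estimated difference between the two plans is tiny.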
On 2021-May-18, Tom Lane wrote:
> The only thing I see that's really going wrong here is marginally
> inaccurate stats, especially right after a big insertion that's
> not reflected into the stats yet. I'm not sure there's much to
> improve there. You could increase the stats target some more,
> though of course that just pushes out the size of table where
> the issue will appear.

I think the real winner would be a mechanism to incrementally analyze tables, so that it updates the existing stats by sampling only blocks that have new data, and "somehow" merges that into the existing statistics. You could have such a process run much more frequently than standard analyze, because the cost is [supposed to be] smaller.

Of course, the big problem with this idea is how you would merge/update the stats at all in the first place. ... ah, it appears there have been attempts at this already:
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.63.5414&rep=rep1&type=pdf

--
Álvaro Herrera       Valdivia, Chile
"Ed is the standard text editor."
      http://groups.google.com/group/alt.religion.emacs/msg/8d94ddab6a9b0ad3
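To make the merging problem concrete, here is a rough sketch (my own toy illustration, not a proposal for the actual mechanism) of folding a value-frequency sample from new blocks into extrapolated whole-table counts. Note that it sidesteps exactly the hard parts mentioned above, such as n_distinct and histogram bounds:

```python
from collections import Counter

def merge_stats(old_sample, old_table_rows, new_block_counts):
    """Extrapolate the old sample to the whole table, then add the exact
    counts observed in newly inserted blocks.  Ignores n_distinct,
    histograms, and sampling error -- the genuinely hard parts."""
    sampled = sum(old_sample.values())
    scale = old_table_rows / sampled if sampled else 0
    merged = Counter({v: c * scale for v, c in old_sample.items()})
    merged.update(new_block_counts)
    return merged

old = Counter({"type0": 50, "type1": 50})    # 100-row sample of a 10k-row table
new = Counter({"type0": 900, "type2": 100})  # exact counts from new blocks only
merged = merge_stats(old, 10_000, new)

assert merged["type0"] == 5_900              # 5_000 extrapolated + 900 new
assert "type2" in merged                     # a brand-new value becomes visible
```

Even this trivial version shows the appeal: a big insertion shifts the frequency estimates immediately, without rescanning the old blocks.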
On Tue, May 18, 2021 at 2:50 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I think the real winner would be a mechanism to incrementally analyze
> tables, so that it updates the existing stats by sampling only blocks
> that have new data, and "somehow" merge that into the existing
> statistics. You could have such a process run much more frequently than
> standard analyze, because the cost is [supposed to be] smaller.

I wonder if there is a more general design that incorporates changes over time. That is, a design that has ANALYZE store old statistics for a table in order to give the system (and the DBA) a sense of how things change over time. This could enable autoanalyze behavior that more or less settles on an optimal frequency between ANALYZE operations -- frequent enough to get stable statistics with some margin of error, but not too frequent.

I also wonder if this general approach could enable a strategy that uses something like Bayesian inference to detect bad statistics, and/or to dampen the consequences of bad estimates.

--
Peter Geoghegan
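As a toy illustration of the dampening idea (entirely my own sketch -- the 0.5 weight and the sample numbers are invented, and real Bayesian inference would model uncertainty explicitly), keeping a history of per-ANALYZE estimates and blending each new one into a running value keeps a single post-bulk-load sample from swinging the estimate too hard:

```python
def blended_estimate(previous, new_sample, weight=0.5):
    # Exponentially weighted update: a new sample moves the estimate,
    # but the history dampens one-off outliers.
    if previous is None:
        return new_sample
    return (1 - weight) * previous + weight * new_sample

estimate = None
history = []
for sample in [100, 110, 105, 5000, 120]:  # the fourth ANALYZE hits a bulk load
    estimate = blended_estimate(estimate, sample)
    history.append(estimate)

assert history[3] < 5000        # the outlier is dampened...
assert history[3] > history[2]  # ...but still pulls the estimate upward
```

The stored history is also what would let the system (or the DBA) notice that one sample disagrees wildly with everything seen before.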
On Tue, May 18, 2021 at 5:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [snip]
>
> The only thing I see that's really going wrong here is marginally
> inaccurate stats, especially right after a big insertion that's
> not reflected into the stats yet. I'm not sure there's much to
> improve there. You could increase the stats target some more,
> though of course that just pushes out the size of table where
> the issue will appear.

There are two axes that matter here in costing (I should have communicated this better in my original email): 1) the correctness of the selectivity estimate, and 2) the relative risk of getting that selectivity wrong.

What I'm interested in here is that, at least in this case (and I'm theorizing it's more generalizable, but that's why I wanted to get discussion on it), the risk of getting the selectivity wrong is quite high. And it seems to me that a more specific index (though larger and costlier to descend) is generally lower risk, albeit higher cost in the best cases for the less specific index. In this particular example it wouldn't even matter which permutation of column ordering you have in the more specific index -- it's always guaranteed to return the first row that's found by descending the tree (excluding vacuumable rows).

James Coleman
On Thu, 20 May 2021 at 02:54, James Coleman <jtc331@gmail.com> wrote:
> What I'm interested in here is that, at least in this case (and I'm
> theorizing it's more generalizable, but that's why I wanted to get
> discussion on it) the risk of getting selectivity wrong is quite high.
> And it seems to me that a more specific index (though larger and
> costlier to descend) is generally lower risk, albeit higher cost in
> best cases for the less specific index. In this particular example it
> wouldn't even matter which permutation of column ordering you have in
> the more specific index -- it's always guaranteed to return the first
> row that's found by descending the tree (excluding vacuumable rows).

Unfortunately we don't currently include risk in our cost model. Today's planner is fairly happy to dance blindfolded at the top of tall cliffs. Unfortunately we fall off them a little too often, such as in cases like you describe. It would be good to one day give the planner some proper lessons in safety.

I have considered that we maybe should add some sort of risk factor to our cost model. I mentioned something in [1] about it, but I don't really have a great idea yet as to how the "risk" cost value would be calculated. Calculating that could be hard. It might make sense to bump up the risk factor for some paths, but as we go through and add a risk level to more paths, what we set the risk value to becomes much more important. It might be easy to say that the risk value should be what the total_cost would have to be multiplied by to get the cost for the worst possible case. However, if the worst possible case is exceedingly unlikely, then maybe that's not a great choice.

The planner does often take plenty of other risks. The situation you describe here is just one of many. Another example is that if we do a join on, say, 3 or more conditions, those conditions could be correlated. We currently always assume they're independent and just multiply the individual selectivities. We could end up thinking that the join will produce far fewer rows than it actually will. Problems can arise in subsequent joins: we may end up doing a subsequent join using a nested loop, thinking that only 1 row will be produced from the first join. That is likely to turn bad if the innermost join instead produces millions of rows. Unfortunately extended statistics don't yet help us with join selectivities.

We also take risks with LIMIT clauses, as we assume the non-filtered rows will be completely evenly distributed through the result. That ends badly when the rows we want are right at the end.

David

[1] https://www.postgresql.org/message-id/CAApHDvqBoYU8aES4a0t-J15wk1wPMFJDHcyafyfHj7JqJ+u9wg@mail.gmail.com
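The independence assumption described above is easy to demonstrate with synthetic data (a standalone sketch, not planner code): with two perfectly correlated columns, multiplying the marginal selectivities squares what should remain a single selectivity:

```python
# Two columns that are always equal: fully correlated.
rows = [(i % 10, i % 10) for i in range(10_000)]

sel_a = sum(1 for a, _ in rows if a == 3) / len(rows)  # 0.1
sel_b = sum(1 for _, b in rows if b == 3) / len(rows)  # 0.1

# Planner-style estimate under independence: 0.1 * 0.1 = 0.01.
independent_estimate = sel_a * sel_b

# True joint selectivity: a == 3 already implies b == 3, so still 0.1.
true_selectivity = sum(1 for a, b in rows if a == 3 and b == 3) / len(rows)

assert independent_estimate < true_selectivity / 5  # off by an order of magnitude
```

Each additional correlated condition multiplies in another factor, so the underestimate compounds, which is exactly why a downstream nested loop planned for "1 row" can end up fed millions.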
David Rowley <dgrowleyml@gmail.com> wrote:
> The planner does often take plenty of other risks. The situation you
> describe here is just one of many. Another example is that if we do a
> join on say 3 or more conditions that those conditions could be
> correlated. We currently always assume they're independent and just
> multiply individual selectivities. We could end up thinking that the
> join will produce far fewer rows than it actually will. Problems can
> arise in subsequent joins. We may end up doing a subsequent join using
> a nested loop thinking that only 1 row will be produced from the first
> join. That is likely to turn bad if the innermost join instead
> produces millions of rows.
The n_distinct case is another especially hard one. I go around and set those values by hand all the time. Or how about the gazillion fallback cases where the planner gives up on estimating any selectivity and just plugs in another constant?
We are still living in a world where even the simplest function can cause one of these "I don't know the selectivity... let's say X0%" cases.
And I am still surprised how well these broad estimates tend to work.
While we have progressed a lot on that front, and I am grateful to all the folks who have worked on the planner, the general problem is way too hard.
Getting to a point where I as a human cannot easily find a case where I can say "Ugh, that cliff is dangerous, don't go there..." seems impossible to me.
The underlying problem here is more severe: we make systematic errors, and the more complex the query gets, the more likely it is that we will find a cliff to jump off.
It's not just that we stumble upon one; no, we actively search for it. Think about predicate pushdown: that stuff can alter the estimated rows of a subquery.
From an optimizer's standpoint that's a very scary thought. It tells me how fragile my models are.
To be honest with you here, the simple one-relation index case doesn't interest me.
I have confirmed again and again that the planner is better at picking the correct index than I am. It wins in more than 80% of the cases.
We always notice the one case where it spectacularly doesn't.
But if benchmarking has taught me anything, it's how much better on average the optimizer is at choosing a good index than I am.
And with more indexes, the speed disparity tends to get bigger. It's not actively searching for a cliff there. It might be that I am just terrible at choosing from 50+ indexes. But I have learned that I am not that great at estimating whether using the more specific index is indeed the less risky thing to do. Spoiler alert: reading the more specific and rarely used index from disk is an actual risk for some workloads.
That's different from join orders, where the planner is trying hard to shoot itself in the foot, or from these LIMIT cases, where it lacks better statistics/knowledge of the problem.
I am at a loss as to what to do about this. I suspect there isn't one solution.
I dislike the idea of a risk score though, in part because of the sheer insane amount of complexity it would add to the beast the planner already is.
But I also suspect there is no way to reach a sensible consensus on what is indeed risky, because it depends on the kinds of queries and workloads you have.
Peter Geoghegan <pg@bowt.ie> wrote:
> I wonder if there is a more general design that incorporates changes
> over time. That is, a design that has ANALYZE store old statistics for
> a table in order to give the system (and the DBA) a sense of how
> things change over time. This could enable autoanalyze behavior that
> more or less settles on an optimal frequency between ANALYZE
> operations -- frequent enough to get stable statistics with some
> margin of error, but not too frequent.
I feel like the current design mainly views analyze as a little addition to autovacuum. And FWIW, syncing them up seems worthwhile to me in most cases.
I don't want to think too much about combining different partial samples into an incremental statistic. Even for the MCV list that seems impossible to do.
I find the idea interesting though, in particular because it shows potential to get rid of a lot of the analyze scripts I have running at timed intervals, some of which definitely run more frequently than necessary.
Do you have a particular idea in mind for predicting future changes? Forecasting problems are impossible to solve reliably.
If it breaks in some weird high-velocity cases, we could always allow for a configuration option to switch back to the old behavior.
Every change could ruin somebody's day. But beating the simple heuristic of y% of changed rows in most cases doesn't sound like rocket science. We can always drop in more complex AI models once we have figured out what breaks with some test databases.
What are your thoughts on this?
Regards
Arne