Discussion: Thinking About Correlated Columns (again)
Hi!

This has been a pain point for quite a while. While we've had several discussions in the area, it always seems to just kinda trail off and eventually vanish every time it comes up.

A really basic example of how bad the planner is here:

CREATE TABLE foo AS
SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b
  FROM generate_series(1, 1000000) a(id);

CREATE INDEX idx_foo_ab ON foo (col_a, col_b);

ANALYZE foo;

EXPLAIN ANALYZE
SELECT * FROM foo WHERE col_a = 50 AND col_b = 50;

Index Scan using idx_foo_ab on foo (cost=0.00..6.35 rows=1 width=12)
                                   (actual time=0.030..3.643 rows=1000 loops=1)
  Index Cond: ((col_a = 50) AND (col_b = 50))

Hey, look! The row estimate is off by a factor of 1000. This particular case doesn't suffer terribly from the mis-estimation, but others do. Boy, do they ever.

Since there really is no fix for this aside from completely rewriting the query or horribly misusing CTEs (to force sequential scans instead of bloated nested loops, for example), I recently tried to use some existing ALTER TABLE syntax:

ALTER TABLE foo ALTER col_a SET (n_distinct = 1);
ALTER TABLE foo ALTER col_b SET (n_distinct = 1);

The new explain output:

Index Scan using idx_foo_ab on foo (cost=0.00..9.37 rows=2 width=12)
                                   (actual time=0.013..0.871 rows=1000 loops=1)
  Index Cond: ((col_a = 50) AND (col_b = 50))

Well... 2 is closer to 1000 than 1 was, but it's pretty clear the histogram is still the basis for the cost estimates. I'm curious what benefit overriding the n_distinct pg_stats field actually gives us in practice.

I'm worried that without an easy answer for cases like this, more DBAs will abuse optimization fences to get what they want, and we'll end up in a scenario that's actually worse than query hints. Theoretically, query hints can be deprecated or have GUCs to remove their influence, but CTEs are forever, or until the next code refactor.

I've seen conversations on this since at least 2005. There were even proposed patches every once in a while, but never any consensus. Anyone care to comment?

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________
See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
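[Editor's note: the rows=1 estimate above falls straight out of the planner's attribute-independence assumption, which multiplies the per-column selectivities as if the clauses were unrelated. A minimal sketch of that arithmetic (plain Python, not planner code):]

```python
# Sketch of the independence assumption applied to Shaun's example.
# Each of col_a and col_b has 1000 distinct values, uniformly spread.
reltuples = 1_000_000

sel_a = 1 / 1000   # selectivity of col_a = 50
sel_b = 1 / 1000   # selectivity of col_b = 50

# Multiplying selectivities, as if the clauses were independent:
estimated_rows = round(reltuples * sel_a * sel_b)
print(estimated_rows)  # 1 -- matches the bad rows=1 estimate

# But col_a and col_b are perfectly correlated, so the combined
# condition is exactly as selective as either clause alone:
actual_rows = round(reltuples * sel_a)
print(actual_rows)     # 1000 -- matches the actual rows=1000
```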
On 15.05.2013 18:31, Shaun Thomas wrote:
> I've seen conversations on this since at least 2005. There were even
> proposed patches every once in a while, but never any consensus. Anyone
> care to comment?

Well, as you said, there has never been any consensus. There are basically two pieces to the puzzle:

1. What metric do you use to represent correlation between columns?
2. How do you collect that statistic?

Based on the prior discussions, collecting the stats seems to be tricky. It's not clear for which combinations of columns it should be collected (all possible combinations? That explodes quickly...), or how it can be collected without scanning the whole table.

I think it would be pretty straightforward to use such a statistic, once we have it. So perhaps we should get started by allowing the DBA to set a correlation metric manually, and use that in the planner.

- Heikki
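[Editor's note: to make the manual-override idea concrete, here is one way a DBA-supplied correlation coefficient could be folded into the combined selectivity. The linear-interpolation formula is purely illustrative (my assumption), not something proposed in the thread:]

```python
def combined_selectivity(sel_a: float, sel_b: float, rho: float) -> float:
    """Blend between the independence assumption (rho = 0) and full
    correlation (rho = 1) using a DBA-supplied coefficient.

    rho = 0: clauses are independent -> sel_a * sel_b
    rho = 1: clauses are redundant   -> min(sel_a, sel_b)
    The linear interpolation in between is an illustrative choice.
    """
    independent = sel_a * sel_b
    correlated = min(sel_a, sel_b)
    return independent + rho * (correlated - independent)

reltuples = 1_000_000
sel = 1 / 1000

# Independence assumption reproduces the bad rows=1 estimate:
print(round(reltuples * combined_selectivity(sel, sel, 0.0), 6))  # 1.0

# With a manually declared correlation of 1.0, the estimate is right:
print(round(reltuples * combined_selectivity(sel, sel, 1.0), 6))  # 1000.0
```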
On Wed, May 15, 2013 at 8:31 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> [Inefficient plans for correlated columns] has been a pain point for quite a while. While we've had several discussions in the area, it always seems to just kinda trail off and eventually vanish every time it comes up.
>
> ...
>
> Since there really is no fix for this aside from completely rewriting the query or horribly misusing CTEs (to force sequence scans instead of bloated nested loops, for example)...
>
> I'm worried that without an easy answer for cases like this, more DBAs will abuse optimization fences to get what they want and we'll end up in a scenario that's actually worse than query hints. Theoretically, query hints can be deprecated or have GUCs to remove their influence, but CTEs are forever, or until the next code refactor.
>
> I've seen conversations on this since at least 2005. There were even proposed patches every once in a while, but never any consensus. Anyone care to comment?
It's a very hard problem. There's no way you can keep statistics about all possible correlations since the number of possibilities is O(N^2) with the number of columns.
We have an external search engine (not part of Postgres) for a domain-specific problem that discovers correlations dynamically and does on-the-fly alterations to its search plan. Our queries are simple conjunctions (A & B & C ...), which makes "planning" easy: you can do indexes in any order you like. So, the system watches each index's performance:
1. An index that's rarely rejecting anything gets demoted and eventually is discarded.
2. An index that's rejecting lots of stuff gets promoted.
Rule 1: If two indexes are strongly correlated, whichever happens to be first in the plan will do all the work, so the second one will rarely reject anything. Demoting it to later in the plan allows more selective indexes to be examined first. If an index ends up rejecting almost nothing, it is discarded.
Rule 2: If an index rejects lots of stuff and you promote it to the front of the plan, then "anti-correlated" indexes (those that reject different things) tend to move to the front of the plan, resulting in very efficient indexing.
Typically, our query plans start with roughly 20-100 indexes. The nature of our domain-specific problem is that the indexes are highly correlated (but it's impossible to predict in advance; each query causes unique correlations). Within a few thousand rows, it usually has dropped most of them, and finishes the query with 5-10 indexes that are about 95% as selective as the original plan, but much faster.
But there are two caveats. First, our particular query is a simple conjunction (A & B & C ...). It's easy to shuffle the index orders around.
Second, databases tend to be non-homogeneous. There are clusters of similar rows. If you blindly optimize by going through a table in storage order, you may optimize strongly for a highly similar group of rows, then discover once you get out of that section of the database that your "optimization" threw out a bunch of indexes that would be selective in other sections of the database. To solve this, you have to add a mechanism that examines the database rows in random order. This tends to minimize the problem (although in theory it could still happen).
A more typical query (i.e. anything past a simple conjunction) would be much more difficult for a dynamic optimizer.
And I can't see how you can expect a static planner (one that doesn't do on-the-fly modifications to the plan) to find correlations.
Craig
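[Editor's note: Craig's promote/demote scheme lends itself to a short sketch. This is a toy reconstruction from his description, not his system's code; names and thresholds are my assumptions. Each predicate's rejection rate is tracked over rows visited in random order; big rejectors are promoted, and predicates that almost never reject are discarded, trading a sliver of selectivity for speed, as he describes:]

```python
import random

def adaptive_filter(rows, predicates, reorder_every=500,
                    drop_below=0.01, min_seen=30):
    """Toy reconstruction of the promote/demote scheme described above.

    Every `reorder_every` rows, predicates with high rejection rates
    move to the front (Rule 2) and predicates that almost never reject
    are discarded (Rule 1) -- accepting slightly less selective results
    in exchange for speed. Rows are visited in random order to avoid
    over-fitting to one homogeneous cluster of the table (the second
    caveat above). Returns (passing_rows, surviving_predicates).
    """
    stats = {p: {"seen": 0, "rejected": 0} for p in predicates}
    active = list(predicates)
    results = []

    order = list(rows)
    random.shuffle(order)                 # examine rows in random order

    def rate(p):
        s = stats[p]
        return s["rejected"] / s["seen"] if s["seen"] else 0.0

    for i, row in enumerate(order, 1):
        passed = True
        for p in active:
            stats[p]["seen"] += 1
            if not p(row):
                stats[p]["rejected"] += 1
                passed = False
                break                     # conjunction: first rejection wins
        if passed:
            results.append(row)

        if i % reorder_every == 0:
            active.sort(key=rate, reverse=True)        # promote rejectors
            survivors = [p for p in active
                         if rate(p) >= drop_below
                         or stats[p]["seen"] < min_seen]
            if survivors:                 # never drop the last predicate
                active = survivors
    return results, active

# Two perfectly correlated predicates: whichever runs first does all
# the work, so the redundant one is eventually discarded (Rule 1).
random.seed(42)
rows = [(i % 10, i % 10) for i in range(3000)]
pred_a = lambda r: r[0] == 0
pred_b = lambda r: r[1] == 0
results, survivors = adaptive_filter(rows, [pred_a, pred_b])
print(len(results), len(survivors))  # 300 1
```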
On 05/15/2013 10:52 AM, Heikki Linnakangas wrote:
> I think it would be pretty straightforward to use such a statistic,
> once we have it. So perhaps we should get started by allowing the DBA
> to set a correlation metric manually, and use that in the planner.

The planner already kinda does this with functional indexes. It would definitely be nice to override the stats with known correlations when possible.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
On 05/15/2013 12:23 PM, Craig James wrote:
> On Wed, May 15, 2013 at 8:31 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>> [Inefficient plans for correlated columns] has been a pain point
>> for quite a while. While we've had several discussions in the
>> area, it always seems to just kinda trail off and eventually
>> vanish every time it comes up.
>>
>> ...
>>
>> Since there really is no fix for this aside from completely
>> rewriting the query or horribly misusing CTEs (to force sequence
>> scans instead of bloated nested loops, for example)...
>>
>> I'm worried that without an easy answer for cases like this, more
>> DBAs will abuse optimization fences to get what they want and
>> we'll end up in a scenario that's actually worse than query hints.
>> Theoretically, query hints can be deprecated or have GUCs to
>> remove their influence, but CTEs are forever, or until the next
>> code refactor.
>>
>> I've seen conversations on this since at least 2005. There were
>> even proposed patches every once in a while, but never any
>> consensus. Anyone care to comment?
>
> It's a very hard problem. There's no way you can keep statistics
> about all possible correlations since the number of possibilities is
> O(N^2) with the number of columns.

I don't see why we couldn't allow the DBA to specify some small subset of the combinations of columns for which correlation stats would be needed. I suspect in most cases no more than a handful for any given table would be required.

cheers

andrew
On Wed, May 15, 2013 at 11:52 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
The option that always made the most sense to me was having folks ask Postgres to collect the statistic by running some command that marks two columns as correlated. This could at least be a starting point.
> On 15.05.2013 18:31, Shaun Thomas wrote:
>> I've seen conversations on this since at least 2005. There were even
>> proposed patches every once in a while, but never any consensus. Anyone
>> care to comment?
>
> Well, as you said, there has never been any consensus.
>
> There are basically two pieces to the puzzle:
>
> 1. What metric do you use to represent correlation between columns?
> 2. How do you collect that statistic?
Nik
On Wed, May 15, 2013 at 01:30:57PM -0400, Nikolas Everett wrote:
> The option that always made the most sense to me was having folks ask
> postgres to collect the statistic by running some command that marks two
> columns as correlated. This could at least be a starting point.

One suggestion made in the past was to calculate these stats (provided someone comes up with something worthwhile to calculate, that is) for all columns involved in a multicolumn index. That probably doesn't cover all the places we'd like the planner to know stuff like this, but it's a reasonable start.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com
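[Editor's note: one statistic that would fall naturally out of a multicolumn index is the distinct count of the indexed column combination, and even that alone fixes Shaun's example. A sketch (plain Python over the column pair; the roughly-uniform-groups assumption is mine):]

```python
# Estimating rows for "col_a = x AND col_b = y" from the distinct count
# of the (col_a, col_b) combination, as a multicolumn index (or a sample
# of it) could provide. Assumes the groups are roughly uniform in size.
reltuples = 1_000_000
table = [(i % 1000, i % 1000) for i in range(reltuples)]  # Shaun's foo

# Per-column distinct counts, as single-column ANALYZE sees them:
ndistinct_a = len({a for a, _ in table})          # 1000
ndistinct_b = len({b for _, b in table})          # 1000

# Independence assumption: 1e6 / (1000 * 1000) = 1 row. Off by 1000x.
print(reltuples / (ndistinct_a * ndistinct_b))    # 1.0

# Distinct count of the combination: only 1000 pairs actually occur.
ndistinct_ab = len(set(table))                    # 1000
print(reltuples / ndistinct_ab)                   # 1000.0
```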
On 16/05/13 04:23, Craig James wrote:
> On Wed, May 15, 2013 at 8:31 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>> [Inefficient plans for correlated columns] has been a pain point for quite a while. While we've had several discussions in the area, it always seems to just kinda trail off and eventually vanish every time it comes up.
>
> [...]
>
> It's a very hard problem. There's no way you can keep statistics about all possible correlations since the number of possibilities is O(N^2) with the number of columns.
>
> [...]

Actually far worse: N!/((N - K)! K!) summed over K = 1...N, assuming the order of columns in the correlation is unimportant (otherwise it is N factorial) - based on my hazy recollection of the relevant maths...
Cheers,
Gavin
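[Editor's note: Gavin's binomial-coefficient recollection checks out. Counting only multi-column subsets (k >= 2; his K = 1 start gives 2^N - 1), the number of combinations grows as 2^N, while only the pairwise count is O(N^2). A quick check in plain Python:]

```python
from math import comb

def column_combinations(n: int) -> int:
    """Number of order-insensitive multi-column combinations one could
    in principle keep correlation stats for: sum of C(n, k) over
    k = 2..n, which equals 2^n - n - 1."""
    return sum(comb(n, k) for k in range(2, n + 1))

# Pairs grow quadratically; all combinations explode exponentially.
for n in (5, 10, 20):
    print(n, comb(n, 2), column_combinations(n))
# 5  ->  10 pairs,        26 combinations
# 10 ->  45 pairs,      1013 combinations
# 20 -> 190 pairs, 1048555 combinations
```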
On 16/05/13 03:52, Heikki Linnakangas wrote:
> On 15.05.2013 18:31, Shaun Thomas wrote:
>> I've seen conversations on this since at least 2005. There were even
>> proposed patches every once in a while, but never any consensus. Anyone
>> care to comment?
>
> Well, as you said, there has never been any consensus.
>
> There are basically two pieces to the puzzle:
>
> 1. What metric do you use to represent correlation between columns?
> 2. How do you collect that statistic?
>
> Based on the prior discussions, collecting the stats seems to be tricky. It's not clear for which combinations of columns it should be collected (all possible combinations? That explodes quickly...), or how it can be collected without scanning the whole table.
>
> I think it would be pretty straightforward to use such a statistic, once we have it. So perhaps we should get started by allowing the DBA to set a correlation metric manually, and use that in the planner.
>
> - Heikki

How about pg comparing actual numbers of rows delivered with the predicted number - and if a specified threshold is reached, then maintaining statistics? There is obviously more to it, such as: is this a relevant query to consider, and the size of the tables (no point in attempting to optimise tables with only 10 rows, for example).
Cheers,
Gavin
Shaun Thomas wrote on 15.05.2013 17:31:
> Hi!
>
> This has been a pain point for quite a while. While we've had several
> discussions in the area, it always seems to just kinda trail off and
> eventually vanish every time it comes up.
>
> A really basic example of how bad the planner is here:
>
> CREATE TABLE foo AS
> SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b
> FROM generate_series(1, 1000000) a(id);
>
> CREATE INDEX idx_foo_ab ON foo (col_a, col_b);
>
> Index Scan using idx_foo_ab on foo (cost=0.00..6.35 rows=1 width=12)
> (actual time=0.030..3.643 rows=1000 loops=1)
> Index Cond: ((col_a = 50) AND (col_b = 50))
>
> Hey, look! The row estimate is off by a factor of 1000. This
> particular case doesn't suffer terribly from the mis-estimation, but
> others do. Boy, do they ever.

What happens if you create one index for each column? (instead of one combined index)

For your example it does not seem to improve the situation, but maybe things get better with the "bad" queries?
On Wed, May 15, 2013 at 1:15 PM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
Right ... I was only thinking of combinations for two columns.
Craig
On 05/15/2013 03:31 PM, Thomas Kellerer wrote:
> What happens if you create one index for each column? (instead of one
> combined index)

I just created the combined index to simplify the output. With two indexes, it does the usual bitmap index scan. Even then the row estimate is off by a factor of 1000, which is the actual problem.

The other reason I tried the combined index was that there was some noise a while back about having the planner check for possible column correlations on composite indexes. Clearly that ended up not being the case, but it was worth examination.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com