Discussion: Thinking About Correlated Columns (again)

From:
Shaun Thomas
Date:

Hi!

This has been a pain point for quite a while. While we've had several
discussions in the area, it always seems to just kinda trail off and
eventually vanish every time it comes up.

A really basic example of how bad the planner is here:

CREATE TABLE foo AS
SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b
   FROM generate_series(1, 1000000) a(id);

CREATE INDEX idx_foo_ab ON foo (col_a, col_b);

ANALYZE foo;

EXPLAIN ANALYZE
SELECT *
   FROM foo
  WHERE col_a = 50
    AND col_b = 50;

Index Scan using idx_foo_ab on foo  (cost=0.00..6.35 rows=1 width=12)
            (actual time=0.030..3.643 rows=1000 loops=1)
   Index Cond: ((col_a = 50) AND (col_b = 50))

Hey, look! The row estimate is off by a factor of 1000. This particular
case doesn't suffer terribly from the mis-estimation, but others do.
Boy, do they ever.
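For what it's worth, the off-by-1000 estimate falls straight out of the planner's independence assumption. A minimal sketch of the arithmetic (illustrative Python, not planner code):

```python
# With 1,000,000 rows and 1000 distinct values per column, the equality
# selectivity of each predicate is 1/1000.
n_rows = 1_000_000
n_distinct = 1000
sel_a = 1 / n_distinct  # selectivity of col_a = 50
sel_b = 1 / n_distinct  # selectivity of col_b = 50

# The planner multiplies selectivities as if the columns were independent:
est_rows = n_rows * sel_a * sel_b  # ~1 row

# But col_a and col_b are perfectly correlated (both are id % 1000),
# so the true combined selectivity is just 1/1000:
true_rows = n_rows * sel_a  # ~1000 rows
```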

Since there really is no fix for this aside from completely rewriting
the query or horribly misusing CTEs (to force sequential scans instead of
bloated nested loops, for example), I recently tried to use some
existing ALTER TABLE syntax:

ALTER TABLE foo ALTER col_a SET (n_distinct = 1);
ALTER TABLE foo ALTER col_b SET (n_distinct = 1);

The new explain output:

Index Scan using idx_foo_ab on foo  (cost=0.00..9.37 rows=2 width=12)
            (actual time=0.013..0.871 rows=1000 loops=1)
   Index Cond: ((col_a = 50) AND (col_b = 50))

Well... 2 is closer to 1000 than 1 was, but it's pretty clear the
histogram is still the basis for the cost estimates. I'm curious what
benefit overriding the n_distinct pg_stats field actually gives us in
practice.

I'm worried that without an easy answer for cases like this, more DBAs
will abuse optimization fences to get what they want and we'll end up in
a scenario that's actually worse than query hints. Theoretically, query
hints can be deprecated or have GUCs to remove their influence, but CTEs
are forever, or until the next code refactor.

I've seen conversations on this since at least 2005. There were even
proposed patches every once in a while, but never any consensus. Anyone
care to comment?

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870


______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


From:
Heikki Linnakangas
Date:

On 15.05.2013 18:31, Shaun Thomas wrote:
> I've seen conversations on this since at least 2005. There were even
> proposed patches every once in a while, but never any consensus. Anyone
> care to comment?

Well, as you said, there has never been any consensus.

There are basically two pieces to the puzzle:

1. What metric do you use to represent correlation between columns?

2. How do you collect that statistic?

Based on the prior discussions, collecting the stats seems to be tricky.
It's not clear for which combinations of columns it should be collected
(all possible combinations? That explodes quickly...), or how it can be
collected without scanning the whole table.

I think it would be pretty straightforward to use such a statistic, once
we have it. So perhaps we should get started by allowing the DBA to set
a correlation metric manually, and use that in the planner.
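As a purely illustrative sketch of how a manually supplied correlation metric might be folded into the combined selectivity (hypothetical code, not PostgreSQL internals; the linear interpolation is my own assumption, not a proposal from the thread):

```python
def combined_selectivity(sel_a, sel_b, corr):
    """Interpolate between the independence assumption (corr = 0) and
    full correlation (corr = 1), where the combined selectivity is the
    smaller of the two per-column selectivities."""
    independent = sel_a * sel_b
    fully_correlated = min(sel_a, sel_b)
    return independent + corr * (fully_correlated - independent)

# The example upthread: both selectivities are 1/1000 over 1M rows.
# corr = 0 reproduces the current estimate (~1 row);
# corr = 1 yields the true ~1000 rows.
rows_indep = 1_000_000 * combined_selectivity(0.001, 0.001, 0.0)
rows_corr = 1_000_000 * combined_selectivity(0.001, 0.001, 1.0)
```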

- Heikki


From:
Craig James
Date:

On Wed, May 15, 2013 at 8:31 AM, Shaun Thomas <> wrote:
> [Inefficient plans for correlated columns] has been a pain point for quite a while. While we've had several discussions in the area, it always seems to just kinda trail off and eventually vanish every time it comes up.
>
> ...
> Since there really is no fix for this aside from completely rewriting the query or horribly misusing CTEs (to force sequential scans instead of bloated nested loops, for example)...
> I'm worried that without an easy answer for cases like this, more DBAs will abuse optimization fences to get what they want and we'll end up in a scenario that's actually worse than query hints. Theoretically, query hints can be deprecated or have GUCs to remove their influence, but CTEs are forever, or until the next code refactor.
>
> I've seen conversations on this since at least 2005. There were even proposed patches every once in a while, but never any consensus. Anyone care to comment?

It's a very hard problem.  There's no way you can keep statistics about all possible correlations since the number of possibilities is O(N^2) with the number of columns.

We have an external search engine (not part of Postgres) for a domain-specific problem that discovers correlations dynamically and does on-the-fly alterations to its search plan.  Our queries are simple conjunctions (A & B & C ...), which makes "planning" easy: you can do indexes in any order you like.  So, the system watches each index's performance:

  1.  An index that's rarely rejecting anything gets demoted and eventually is discarded.
  2.  An index that's rejecting lots of stuff gets promoted.

Rule 1: If two indexes are strongly correlated, whichever happens to be first in the plan will do all the work, so the second one will rarely reject anything.  By demoting it to later in the plan, more selective indexes can be examined first.  If an index ends up rejecting almost nothing, it is discarded.

Rule 2: If an index rejects lots of stuff and you promote it to the front of the plan, then "anti-correlated" indexes (those that reject different things) tend to move to the front of the plan, resulting in very efficient indexing.

Typically, our query plans start with roughly 20-100 indexes.  The nature of our domain-specific problem is that the indexes are highly correlated (but it's impossible to predict in advance; each query causes unique correlations.)  Within a few thousand rows, it usually has dropped most of them, and finishes the query with 5-10 indexes that are about 95% as selective, but much faster, than the original plan.

But there are two caveats.  First, our particular query is a simple conjunction (A & B & C ...).  It's easy to shuffle the index orders around.

Second, databases tend to be non-homogeneous.  There are clusters of similar rows.  If you blindly optimize by going through a table in storage order, you may optimize strongly for a highly similar group of rows, then discover once you get out of that section of the database that your "optimization" threw out a bunch of indexes that would be selective in other sections of the database.  To solve this, you have to add a mechanism that examines the database rows in random order.  This tends to minimize the problem (although in theory it could still happen).
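As a toy illustration of the two rules plus the random-order caveat (hypothetical code, not the actual search engine described above):

```python
import random

def adaptive_filter_order(rows, predicates, reorder_every=1000, drop_below=0.01):
    """Evaluate a conjunction of predicates, periodically re-sorting them
    by observed rejection rate (rule 2) and discarding predicates that
    almost never reject anything (rule 1). Rows are visited in random
    order so the statistics aren't skewed by one homogeneous cluster.
    Dropping a predicate trades a little selectivity for speed."""
    order = list(predicates)
    tested = {p: 0 for p in predicates}
    rejected = {p: 0 for p in predicates}
    kept = []
    shuffled = list(rows)
    random.shuffle(shuffled)  # random visiting order, per the second caveat
    for i, row in enumerate(shuffled, 1):
        for p in order:
            tested[p] += 1
            if not p(row):
                rejected[p] += 1
                break
        else:
            kept.append(row)
        if i % reorder_every == 0 and len(order) > 1:
            rate = lambda p: rejected[p] / max(tested[p], 1)
            order = [p for p in order if rate(p) >= drop_below]  # rule 1: discard
            order.sort(key=rate, reverse=True)                   # rule 2: promote
    return kept

# The second predicate duplicates the first (perfect correlation), so it
# rejects nothing once the first has run, gets demoted, and is dropped
# without changing the result.
evens_under_5000 = adaptive_filter_order(
    list(range(10_000)),
    [lambda x: x % 2 == 0, lambda x: x % 2 == 0, lambda x: x < 5_000],
)
```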

A more typical query (i.e. anything past a simple conjunction) would be much more difficult for a dynamic optimizer.

And I can't see how you can expect a static planner (one that doesn't do on-the-fly modifications to the plan) to find correlations.

Craig
From:
Shaun Thomas
Date:

On 05/15/2013 10:52 AM, Heikki Linnakangas wrote:

> I think it would be pretty straightforward to use such a statistic,
> once we have it. So perhaps we should get started by allowing the DBA
> to set a correlation metric manually, and use that in the planner.

The planner already kinda does this with functional indexes. It would
definitely be nice to override the stats with known correlations when
possible.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870




From:
Andrew Dunstan
Date:

On 05/15/2013 12:23 PM, Craig James wrote:
> On Wed, May 15, 2013 at 8:31 AM, Shaun Thomas
> <> wrote:
>
>     [Inefficient plans for correlated columns] has been a pain point
>     for quite a while. While we've had several discussions in the
>     area, it always seems to just kinda trail off and eventually
>     vanish every time it comes up.
>
>     ...
>     Since there really is no fix for this aside from completely
>     rewriting the query or horribly misusing CTEs (to force sequential
>     scans instead of bloated nested loops, for example)...
>     I'm worried that without an easy answer for cases like this, more
>     DBAs will abuse optimization fences to get what they want and
>     we'll end up in a scenario that's actually worse than query hints.
>     Theoretically, query hints can be deprecated or have GUCs to
>     remove their influence, but CTEs are forever, or until the next
>     code refactor.
>
>     I've seen conversations on this since at least 2005. There were
>     even proposed patches every once in a while, but never any
>     consensus. Anyone care to comment?
>
>
> It's a very hard problem.  There's no way you can keep statistics
> about all possible correlations since the number of possibilities is
> O(N^2) with the number of columns.
>

I don't see why we couldn't allow the DBA to specify some small subset
of the combinations of columns for which correlation stats would be
needed. I suspect in most cases no more than a handful for any given
table would be required.

cheers

andrew


From:
Nikolas Everett
Date:

On Wed, May 15, 2013 at 11:52 AM, Heikki Linnakangas <> wrote:
> On 15.05.2013 18:31, Shaun Thomas wrote:
>> I've seen conversations on this since at least 2005. There were even
>> proposed patches every once in a while, but never any consensus. Anyone
>> care to comment?
>
> Well, as you said, there has never been any consensus.
>
> There are basically two pieces to the puzzle:
>
> 1. What metric do you use to represent correlation between columns?
>
> 2. How do you collect that statistic?
 
The option that always made the most sense to me was having folks ask postgres to collect the statistic by running some command that marks two columns as correlated.  This could at least be a starting point.

Nik 
From:
eggyknap
Date:

On Wed, May 15, 2013 at 01:30:57PM -0400, Nikolas Everett wrote:
> The option that always made the most sense to me was having folks ask
> postgres to collect the statistic by running some command that marks two
> columns as correlated.  This could at least be a starting point.

One suggestion made in the past was to calculate these stats (provided someone
comes up with something worthwhile to calculate, that is) for all columns
involved in a multicolumn index. That probably doesn't cover all the places
we'd like the planner to know stuff like this, but it's a reasonable start.
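For illustration, deriving the candidate column groups from multicolumn index definitions could look like this (a hedged sketch; `candidate_stat_groups` is hypothetical, not an existing API):

```python
def candidate_stat_groups(indexes):
    """Given multicolumn index definitions (lists of column names), return
    the distinct column groups that cross-column statistics could target.
    Single-column indexes are skipped; column order is normalized so that
    (col_a, col_b) and (col_b, col_a) count as one group."""
    groups = set()
    for cols in indexes:
        if len(cols) >= 2:
            groups.add(tuple(sorted(cols)))
    return sorted(groups)

# Example: only the two-column index from upthread qualifies.
groups = candidate_stat_groups([["col_a", "col_b"], ["id"], ["col_b", "col_a"]])
```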

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

From:
Gavin Flower
Date:

On 16/05/13 04:23, Craig James wrote:
> On Wed, May 15, 2013 at 8:31 AM, Shaun Thomas <> wrote:
>> [Inefficient plans for correlated columns] has been a pain point for quite a while. While we've had several discussions in the area, it always seems to just kinda trail off and eventually vanish every time it comes up.
>
> [...]
>
> It's a very hard problem.  There's no way you can keep statistics about all possible correlations since the number of possibilities is O(N^2) with the number of columns.

Actually far worse: N!/((N - K)!K!) summed over K=1...N, assuming the order of columns in the correlation is unimportant (otherwise it is N factorial) - based on my hazy recollection of the relevant maths...
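That sum can be checked directly: summing the binomial coefficients over K = 1..N gives 2^N - 1 column subsets (a quick illustrative check):

```python
from math import comb

def subsets(n):
    # Number of column combinations when order is unimportant:
    # sum of C(n, k) for k = 1..n, which equals 2**n - 1.
    return sum(comb(n, k) for k in range(1, n + 1))

# Even a modest 10-column table has 1023 candidate combinations.
ten_col = subsets(10)
```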

[...]

Cheers,
Gavin

From:
Gavin Flower
Date:

On 16/05/13 03:52, Heikki Linnakangas wrote:
> On 15.05.2013 18:31, Shaun Thomas wrote:
>> I've seen conversations on this since at least 2005. There were even
>> proposed patches every once in a while, but never any consensus. Anyone
>> care to comment?
>
> Well, as you said, there has never been any consensus.
>
> There are basically two pieces to the puzzle:
>
> 1. What metric do you use to represent correlation between columns?
>
> 2. How do you collect that statistic?
>
> Based on the prior discussions, collecting the stats seems to be tricky. It's not clear for which combinations of columns it should be collected (all possible combinations? That explodes quickly...), or how it can be collected without scanning the whole table.
>
> I think it would be pretty straightforward to use such a statistic, once we have it. So perhaps we should get started by allowing the DBA to set a correlation metric manually, and use that in the planner.
>
> - Heikki


How about pg comparing the actual number of rows delivered with the predicted number, and maintaining statistics once a specified threshold is exceeded? There is obviously more to it, such as: is this a relevant query to consider, and how large are the tables involved (no point in attempting to optimise tables with only 10 rows, for example).
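A sketch of the threshold test being described (illustrative only; the function name, parameters, and default values are hypothetical):

```python
def needs_cross_column_stats(estimated, actual, min_rows=1000, threshold=10.0):
    """Flag a query whose row estimate is off by more than `threshold`x,
    ignoring small row counts where a mis-estimate is harmless."""
    if max(estimated, actual) < min_rows:
        return False
    ratio = max(estimated, actual) / max(min(estimated, actual), 1)
    return ratio >= threshold

# The example upthread: estimated 1 row, actually delivered 1000.
flagged = needs_cross_column_stats(1, 1000)
```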


Cheers,
Gavin
From:
Thomas Kellerer
Date:

Shaun Thomas wrote on 15.05.2013 17:31:
> Hi!
>
> This has been a pain point for quite a while. While we've had several
> discussions in the area, it always seems to just kinda trail off and
> eventually vanish every time it comes up.
>
> A really basic example of how bad the planner is here:
>
> CREATE TABLE foo AS
> SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b
>    FROM generate_series(1, 1000000) a(id);
>
> CREATE INDEX idx_foo_ab ON foo (col_a, col_b);
>
> Index Scan using idx_foo_ab on foo  (cost=0.00..6.35 rows=1 width=12)
>             (actual time=0.030..3.643 rows=1000 loops=1)
>    Index Cond: ((col_a = 50) AND (col_b = 50))
>
> Hey, look! The row estimate is off by a factor of 1000. This
> particular case doesn't suffer terribly from the mis-estimation, but
> others do. Boy, do they ever.

What happens if you create one index for each column? (instead of one combined index)

For your example it does not seem to improve the situation, but maybe things get better with the "bad" queries?




From:
Craig James
Date:



On Wed, May 15, 2013 at 1:15 PM, Gavin Flower <> wrote:
> On 16/05/13 04:23, Craig James wrote:
>> On Wed, May 15, 2013 at 8:31 AM, Shaun Thomas <> wrote:
>>> [Inefficient plans for correlated columns] has been a pain point for quite a while. While we've had several discussions in the area, it always seems to just kinda trail off and eventually vanish every time it comes up.
>>
>> [...]
>>
>> It's a very hard problem.  There's no way you can keep statistics about all possible correlations since the number of possibilities is O(N^2) with the number of columns.
>
> Actually far worse: N!/((N - K)!K!) summed over K=1...N, assuming the order of columns in the correlation is unimportant (otherwise it is N factorial) - based on my hazy recollection of the relevant maths...

Right ... I was only thinking of combinations for two columns.

Craig
 

> [...]
>
> Cheers,
> Gavin


From:
Shaun Thomas
Date:

On 05/15/2013 03:31 PM, Thomas Kellerer wrote:

> What happens if you create one index for each column? (instead of one
>  combined index)

I just created the combined index to simplify the output. With two
indexes, it does the usual bitmap index scan. Even then the row estimate
is off by a factor of 1000, which is the actual problem.

The other reason I tried the combined index was that there was some
noise a while back about having the planner check for possible column
correlations on composite indexes. Clearly that ended up not being the
case, but it was worth examination.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870

