Thread: Yet another abort-early plan disaster on 9.3


Yet another abort-early plan disaster on 9.3

From:
Josh Berkus
Date:
Folks,

Just encountered another case of critical fail for abort-early query
plans.  In this case, it will completely prevent a user from upgrading
to 9.3; this is their most common query, and on 9.3 it takes 1000X longer.

Maybe we should think about removing abort-early plans from 9.5?
Clearly we don't understand them well enough for them to work for users.

Query:

SELECT "categories".* FROM "categories" WHERE "categories"."user_id" IN
( SELECT to_user_id FROM "tags" WHERE "tags"."from_user_id" = 53529975 )
ORDER BY recorded_on DESC LIMIT 20;

Here's the plan from 9.1:

 Limit  (cost=1613.10..1613.15 rows=20 width=194) (actual
time=0.503..0.509 rows=20 loops=1)
   ->  Sort  (cost=1613.10..1736.14 rows=49215 width=194) (actual
time=0.502..0.505 rows=20 loops=1)
         Sort Key: categories.recorded_on
         Sort Method: top-N heapsort  Memory: 30kB
         ->  Nested Loop  (cost=248.80..303.51 rows=49215 width=194)
(actual time=0.069..0.347 rows=81 loops=1)
               ->  HashAggregate  (cost=248.80..248.81 rows=1 width=4)
(actual time=0.050..0.054 rows=8 loops=1)
                     ->  Index Scan using unique_index_tags on tags
(cost=0.00..248.54 rows=103 width=4) (actual time=0.020..0.033 rows=8
loops=1)
                           Index Cond: (from_user_id = 53529975)
               ->  Index Scan using index_categories_on_user_id on
categories  (cost=0.00..54.34 rows=29 width=194) (actual
time=0.010..0.028 rows=10 loops=8)
                     Index Cond: (user_id = tags.to_user_id)
 Total runtime: 0.641 ms

And from 9.3:

 Limit  (cost=1.00..2641.10 rows=20 width=202) (actual
time=9.933..711.372 rows=20 loops=1)
   ->  Nested Loop Semi Join  (cost=1.00..9641758.39 rows=73041
width=202) (actual time=9.931..711.361 rows=20 loops=1)
         ->  Index Scan Backward using index_categories_on_recorded_on
on categories  (cost=0.43..406943.98 rows=4199200 width=202) (actual
time=0.018..275.020 rows=170995 loops=1)
         ->  Index Scan using unique_index_tags on tags
(cost=0.57..2.20 rows=1 width=4) (actual time=0.002..0.002 rows=0
loops=170995)
               Index Cond: ((from_user_id = 53529975) AND (to_user_id =
categories.user_id))
 Total runtime: 711.457 ms

So, here's what's happening here:

As usual, PostgreSQL is dramatically undercounting n_distinct: it shows
categories.user_id at 146,000 and the ratio of to_user_id:from_user_id as
being 1:105 (as opposed to 1:6, which is about the real ratio).  This
means that PostgreSQL thinks it can find the 20 rows within the first 2%
of the index ... whereas it actually needs to scan 50% of the index to
find them.
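
(For anyone who wants to see what the planner believes here, the estimates
in question are visible in pg_stats; the table and column names below are
the ones from the query above:)

SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE (tablename, attname) IN (('categories', 'user_id'),
                               ('tags', 'to_user_id'),
                               ('tags', 'from_user_id'));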

Removing LIMIT causes 9.3 to revert to the "good" plan, as expected.

This is the core issue with abort-early plans; they depend on our
statistics being extremely accurate, which we know they are not. And if
they're wrong, the execution time climbs by 1000X or more.  Abort-early
plans are inherently riskier than other types of query plans.

What I'm not clear on is why upgrading from 9.1 to 9.3 would bring about
this change.  The stats are no more than 10% different across the
version change.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Yet another abort-early plan disaster on 9.3

From:
Merlin Moncure
Date:
On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Folks,
>
> Just encountered another case of critical fail for abort-early query
> plans.  In this case, it will completely prevent a user from upgrading
> to 9.3; this is their most common query, and on 9.3 it takes 1000X longer.
>
> Maybe we should think about removing abort-early plans from 9.5?
> Clearly we don't understand them well enough for them to work for users.
>
> Query:
>
> SELECT "categories".* FROM "categories" WHERE "categories"."user_id" IN
> ( SELECT to_user_id FROM "tags" WHERE "tags"."from_user_id" = 53529975 )
> ORDER BY recorded_on DESC LIMIT 20;
>
> Here's the plan from 9.1:
>
>  Limit  (cost=1613.10..1613.15 rows=20 width=194) (actual
> time=0.503..0.509 rows=20 loops=1)
>    ->  Sort  (cost=1613.10..1736.14 rows=49215 width=194) (actual
> time=0.502..0.505 rows=20 loops=1)
>          Sort Key: categories.recorded_on
>          Sort Method: top-N heapsort  Memory: 30kB
>          ->  Nested Loop  (cost=248.80..303.51 rows=49215 width=194)
> (actual time=0.069..0.347 rows=81 loops=1)
>                ->  HashAggregate  (cost=248.80..248.81 rows=1 width=4)
> (actual time=0.050..0.054 rows=8 loops=1)
>                      ->  Index Scan using unique_index_tags on tags
> (cost=0.00..248.54 rows=103 width=4) (actual time=0.020..0.033 rows=8
> loops=1)
>                            Index Cond: (from_user_id = 53529975)
>                ->  Index Scan using index_categories_on_user_id on
> categories  (cost=0.00..54.34 rows=29 width=194) (actual
> time=0.010..0.028 rows=10 loops=8)
>                      Index Cond: (user_id = tags.to_user_id)
>  Total runtime: 0.641 ms
>
> And from 9.3:
>
>  Limit  (cost=1.00..2641.10 rows=20 width=202) (actual
> time=9.933..711.372 rows=20 loops=1)
>    ->  Nested Loop Semi Join  (cost=1.00..9641758.39 rows=73041
> width=202) (actual time=9.931..711.361 rows=20 loops=1)
>          ->  Index Scan Backward using index_categories_on_recorded_on
> on categories  (cost=0.43..406943.98 rows=4199200 width=202) (actual
> time=0.018..275.020 rows=170995 loops=1)
>          ->  Index Scan using unique_index_tags on tags
> (cost=0.57..2.20 rows=1 width=4) (actual time=0.002..0.002 rows=0
> loops=170995)
>                Index Cond: ((from_user_id = 53529975) AND (to_user_id =
> categories.user_id))
>  Total runtime: 711.457 ms
>
> So, here's what's happening here:
>
> As usual, PostgreSQL is dramatically undercounting n_distinct: it shows
> categories.user_id at 146,000 and the ratio of to_user_id:from_user_id as
> being 1:105 (as opposed to 1:6, which is about the real ratio).  This
> means that PostgreSQL thinks it can find the 20 rows within the first 2%
> of the index ... whereas it actually needs to scan 50% of the index to
> find them.
>
> Removing LIMIT causes 9.3 to revert to the "good" plan, as expected.
>
> This is the core issue with abort-early plans; they depend on our
> statistics being extremely accurate, which we know they are not. And if
> they're wrong, the execution time climbs by 1000X or more.  Abort-early
> plans are inherently riskier than other types of query plans.
>
> What I'm not clear on is why upgrading from 9.1 to 9.3 would bring about
> this change.  The stats are no more than 10% different across the
> version change.

Amusingly on-topic rant I happened to read immediately after this by chance:

http://wp.sigmod.org/?p=1075

Is there a canonical case of where 'abort early' plans help? (I'm new
to that term -- is it a recent planner innovation...got any handy
links?)

merlin


Re: Yet another abort-early plan disaster on 9.3

From:
Josh Berkus
Date:
On 09/19/2014 10:15 AM, Merlin Moncure wrote:
> On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> This is the core issue with abort-early plans; they depend on our
>> statistics being extremely accurate, which we know they are not. And if
>> they're wrong, the execution time climbs by 1000X or more.  Abort-early
>> plans are inherently riskier than other types of query plans.
>>
>> What I'm not clear on is why upgrading from 9.1 to 9.3 would bring about
>> this change.  The stats are no more than 10% different across the
>> version change.
>
> Amusingly on-topic rant I happened to read immediately after this by chance:
>
> http://wp.sigmod.org/?p=1075
>
> Is there a canonical case of where 'abort early' plans help? (I'm new
> to that term -- is it a recent planner innovation...got any handy
> links?)

Yeah, here's an example of the canonical case:

Table t1 ( a, b, c )

- "b" is low-cardinality
- "c" is high-cardinality
- There are separate indexes on both b and c.

SELECT a, b, c FROM t1
WHERE b = 2
ORDER BY c LIMIT 1;
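
(Spelled out, the setup being assumed is something like the following;
the index names are only illustrative:)

CREATE TABLE t1 (a int, b int, c int);
CREATE INDEX t1_b_idx ON t1 (b);   -- low-cardinality column
CREATE INDEX t1_c_idx ON t1 (c);   -- high-cardinality column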

In this case, the fastest plan is usually to use the index on C and
return the first row where the filter condition matches the filter on b.
 This can be an order of magnitude faster than using the index on b and
then resorting by c and taking the first row, if (b=2) happens to match
20% of the table.

This is called an "abort early" plan because we expect to never finish
the scan on the index on c.  We expect to scan the index on c, find the
first row that matches b=2 and exit.

The problem with such plans is that they are "risky".  As in, if we are
wrong about our (b=2) stats, then we've just adopted a query plan which
will be 10X to 1000X slower than the more conventional plan.

We can see this in the bad plan I posted:

 Limit  (cost=1.00..2641.10 rows=20 width=202) (actual
time=9.933..711.372 rows=20 loops=1)
   ->  Nested Loop Semi Join  (cost=1.00..9641758.39 rows=73041
width=202) (actual time=9.931..711.361 rows=20 loops=1)
         ->  Index Scan Backward using index_categories_on_recorded_on
on categories  (cost=0.43..406943.98 rows=4199200 width=202) (actual
time=0.018..275.020 rows=170995 loops=1)

Notice how the total cost of the plan is a fraction of the cost of the
two steps which preceded it?  This is an indication that the planner
expects to be able to abort the index scan and nestloop join before it's
more than 2% through it.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Yet another abort-early plan disaster on 9.3

From:
Greg Stark
Date:


On 19 Sep 2014 19:40, "Josh Berkus" <josh@agliodbs.com> wrote:
>
> On 09/19/2014 10:15 AM, Merlin Moncure wrote:
> > On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh@agliodbs.com> wrote:
> >> This is the core issue with abort-early plans; they depend on our
> >> statistics being extremely accurate, which we know they are not. And if
> >> they're wrong, the execution time climbs by 1000X or more.  Abort-early
> >> plans are inherently riskier than other types of query plans.

All plans are risky if the stats are wrong. It's one of the perennial digressions that many postgres newcomers make to track worst case costs and provide a knob for planner aggressiveness but it always breaks down when you try to quantify the level of risk because you discover that even such simple things as index scans versus sequential scans can be equally risky either way.

> >> What I'm not clear on is why upgrading from 9.1 to 9.3 would bring about
> >> this change.  The stats are no more than 10% different across the
> >> version change.

There's no difference. Postgres has been estimating LIMIT costs this way since before I came to postgres in 7.3.



> > Is there a canonical case of where 'abort early' plans help? (I'm new
> > to that term -- is it a recent planner innovation...got any handy
> > links?)
>
> Yeah, here's an example of the canonical case:
>
> Table t1 ( a, b, c )
>
> - "b" is low-cardinality
> - "c" is high-cardinality
> - There are separate indexes on both b and c.
>
> SELECT a, b, c FROM t1
> WHERE b = 2
> ORDER BY c LIMIT 1;

You badly want a partial index on c WHERE b=2 for each value of b which appears in your queries.
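
(Concretely, something along these lines, with 2 standing in for whichever
constants actually occur; illustrative only:)

CREATE INDEX t1_c_where_b2_idx ON t1 (c) WHERE b = 2;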

It would be neat to have an opclass which worked like that. Which would amount to having prefix compression perhaps.

What plan does 9.1 come up with?

Re: Yet another abort-early plan disaster on 9.3

From:
Claudio Freire
Date:
On Sat, Sep 20, 2014 at 3:38 AM, Greg Stark <stark@mit.edu> wrote:
>> > Is there a canonical case of where 'abort early' plans help? (I'm new
>> > to that term -- is it a recent planner innovation...got any handy
>> > links?)
>>
>> Yeah, here's an example of the canonical case:
>>
>> Table t1 ( a, b, c )
>>
>> - "b" is low-cardinality
>> - "c" is high-cardinality
>> - There are separate indexes on both b and c.
>>
>> SELECT a, b, c FROM t1
>> WHERE b = 2
>> ORDER BY c LIMIT 1;
>
> You badly want a partial index on c WHERE b=2 for each value of b which
> appears in your queries.
>
> It would be neat to have an opclass which worked like that. Which would
> amount to having prefix compression perhaps.

I've been looking at that exactly.

One complexity of it, is that splitting becomes much harder. As in, recursive.


Re: Yet another abort-early plan disaster on 9.3

From:
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On 19 Sep 2014 19:40, "Josh Berkus" <josh@agliodbs.com> wrote:
>> Yeah, here's an example of the canonical case:
>>
>> Table t1 ( a, b, c )
>>
>> - "b" is low-cardinality
>> - "c" is high-cardinality
>> - There are separate indexes on both b and c.
>>
>> SELECT a, b, c FROM t1
>> WHERE b = 2
>> ORDER BY c LIMIT 1;

> You badly want a partial index on c WHERE b=2 for each value of b which
> appears in your queries.

Well, if it's *only* b = 2 that you ever search for, then maybe a partial
index would be a good answer.  Personally I'd use a plain btree index on
(b, c).  The planner's been able to match this type of query to
multicolumn indexes for a long time.
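
That is, roughly (index name illustrative):

CREATE INDEX t1_b_c_idx ON t1 (b, c);
-- equality on b plus the index ordering on c satisfies
-- WHERE b = 2 ORDER BY c LIMIT 1 with a single short index scan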

            regards, tom lane


Re: Yet another abort-early plan disaster on 9.3

From:
Josh Berkus
Date:
On 09/19/2014 11:38 PM, Greg Stark wrote:
>
> On 19 Sep 2014 19:40, "Josh Berkus" <josh@agliodbs.com
> <mailto:josh@agliodbs.com>> wrote:
>>
>> On 09/19/2014 10:15 AM, Merlin Moncure wrote:
>> > On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh@agliodbs.com
> <mailto:josh@agliodbs.com>> wrote:
>> >> This is the core issue with abort-early plans; they depend on our
>> >> statistics being extremely accurate, which we know they are not. And if
>> >> they're wrong, the execution time climbs by 1000X or more.  Abort-early
>> >> plans are inherently riskier than other types of query plans.
>
> All plans are risky if the stats are wrong. It's one of the perennial
> digressions that many postgres newcomers make to track worst case costs
> and provide a knob for planner aggressiveness but it always breaks down
> when you try to quantify the level of risk because you discover that
> even such simple things as index scans versus sequential scans can be
> equally risky either way.

I've had a *wee* bit more experience with query plans than most Postgres
newcomers, Greg.

While all query plan changes can result in regressions if they're bad,
there are certain kinds of plans which depend more on accurate stats
than others.  Abort-early plans are the most extreme of these.  Most
likely we should adjust the cost model for abort-early plans to take in
our level of uncertainty, especially since we *know* that our n-distinct
estimation is crap.  For example, we could increase the estimated cost
for an abort-early index scan by 10X, to reflect our weak confidence in
its correctness.

We could also probably do the same for plans which depend on column
correlation estimates being accurate.

>
>> >> What I'm not clear on is why upgrading from 9.1 to 9.3 would bring
> about
>> >> this change.  The stats are no more than 10% different across the
>> >> version change.
>
> There's no difference. Postgres has been estimating LIMIT costs this way
> since before I came to postgres in 7.3.

Then why is the plan different in 9.1 and 9.3 with identical stats (I
tested)?

>
> It would be neat to have an opclass which worked like that. Which would
> amount to having prefix compression perhaps.
>
> What plan does 9.1 come up with?

That was the "good" plan from my original post.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Yet another abort-early plan disaster on 9.3

From:
Merlin Moncure
Date:
On Sat, Sep 20, 2014 at 1:33 PM, Josh Berkus <josh@agliodbs.com> wrote:
> For example, we could increase the estimated cost
> for an abort-early index scan by 10X, to reflect our weak confidence in
> its correctness.

Has any progress been made on the performance farm?  The problem with
suggestions like this (which seem pretty reasonable to me) is that
we've got no way of quantifying the downside.   I think this is one
example of a class of plans that are high risk.  Another one off the
top of my head is nestloop joins based on assumed selectivity of
multiple stacked quals.  About 90% of the time, my reflective
workaround to these types of problems is to 'disable_nestloop' which
works around 90% of the time and the result are solved with monkeying
around with 'OFFSET 0' etc.   In the past, a GUC controlling planner
risk has been much discussed -- maybe it's still worth considering?
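
(For concreteness, the workarounds referred to look like this; the GUC is
actually spelled enable_nestloop, and OFFSET 0 merely acts as an
optimization fence -- it keeps the sub-select from being flattened into the
outer query, it does not by itself guarantee a better plan. The query shape
below is only illustrative, reusing the tables from the start of the thread:)

SET enable_nestloop = off;   -- session-level: steer the planner away from nested loops

SELECT c.*
FROM categories c
JOIN (SELECT DISTINCT to_user_id
      FROM tags
      WHERE from_user_id = 53529975
      OFFSET 0) t              -- OFFSET 0: plan the sub-select on its own
  ON c.user_id = t.to_user_id
ORDER BY c.recorded_on DESC
LIMIT 20;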

merlin


Re: Yet another abort-early plan disaster on 9.3

From:
Josh Berkus
Date:
On 09/22/2014 06:55 AM, Merlin Moncure wrote:
> Has any progress been made on the performance farm?  The problem with
> suggestions like this (which seem pretty reasonable to me) is that
> we've got no way of quantifying the downside.

Yeah, that's certainly an issue. The problem is that we'd need a
benchmark which actually created complex query plans.  I believe that
Mark Wong is working on TPCH-based benchmarks, so maybe we'll get that.

>  I think this is one
> example of a class of plans that are high risk.  Another one off the
> top of my head is nestloop joins based on assumed selectivity of
> multiple stacked quals.

Yeah, that's another good example.

> About 90% of the time, my reflexive
> workaround to these types of problems is to 'disable_nestloop', which
> works around 90% of the time; the rest are solved with monkeying
> around with 'OFFSET 0' etc.   In the past, a GUC controlling planner
> risk has been much discussed -- maybe it's still worth considering?

We've hashed that out a bit, but frankly I think it's much more
profitable to pursue fixing the actual problem than providing a
workaround like "risk", such as:

a) fixing n_distinct estimation
b) estimating stacked quals using better math (i.e. not assuming total
randomness)
c) developing some kind of correlation stats

Otherwise we would be just providing users with another knob there's no
rational way to set.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Yet another abort-early plan disaster on 9.3

From:
Simon Riggs
Date:
On 23 September 2014 00:56, Josh Berkus <josh@agliodbs.com> wrote:

> We've hashed that out a bit, but frankly I think it's much more
> profitable to pursue fixing the actual problem than providing a
> workaround like "risk", such as:
>
> a) fixing n_distinct estimation
> b) estimating stacked quals using better math (i.e. not assuming total
> randomness)
> c) developing some kind of correlation stats
>
> Otherwise we would be just providing users with another knob there's no
> rational way to set.

I believe this is a serious issue for PostgreSQL users and one that
needs to be addressed.

n_distinct can be fixed manually, so that is less of an issue.
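
(For reference, "fixed manually" means the per-column storage parameter; a
negative value is read as a fraction of the row count, and it takes effect
at the next ANALYZE. The -0.05 below is only a placeholder value:)

ALTER TABLE categories ALTER COLUMN user_id SET (n_distinct = -0.05);
ANALYZE categories;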

The problem, as I see it, is different. We assume that if there are
100 distinct values and you use LIMIT 1 that you would only need to
scan 1% of rows. We assume that the data is arranged in the table in a
very homogenous layout. When data is not, and it seldom is, we get
problems.

Simply put, assuming that LIMIT will reduce the size of all scans is
just way wrong. I've seen many plans where increasing the LIMIT
dramatically improves the plan.

If we can at least agree it is a problem, we can try to move forwards.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Yet another abort-early plan disaster on 9.3

From:
Merlin Moncure
Date:
On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The problem, as I see it, is different. We assume that if there are
> 100 distinct values and you use LIMIT 1 that you would only need to
> scan 1% of rows. We assume that the data is arranged in the table in a
> very homogenous layout. When data is not, and it seldom is, we get
> problems.

Hm, good point -- 'data proximity'.  At least in theory, can't this be
measured and quantified?  For example, given a number of distinct
values, you could estimate the % of pages read (or maybe non
sequential seeks relative to the number of pages) you'd need to read
all instances of a particular value in the average (or perhaps the
worst) case.   One way of trying to calculate that would be to look at
proximity of values in sampled pages (and maybe a penalty assigned for
high update activity relative to table size).  Data proximity would
then become a cost coefficient to the benefits of LIMIT.

merlin


Re: Yet another abort-early plan disaster on 9.3

From:
Simon Riggs
Date:
On 29 September 2014 16:00, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The problem, as I see it, is different. We assume that if there are
>> 100 distinct values and you use LIMIT 1 that you would only need to
>> scan 1% of rows. We assume that the data is arranged in the table in a
>> very homogenous layout. When data is not, and it seldom is, we get
>> problems.
>
> Hm, good point -- 'data proximity'.  At least in theory, can't this be
> measured and quantified?  For example, given a number of distinct
> values, you could estimate the % of pages read (or maybe non
> sequential seeks relative to the number of pages) you'd need to read
> all instances of a particular value in the average (or perhaps the
> worst) case.   One way of trying to calculate that would be to look at
> proximity of values in sampled pages (and maybe a penalty assigned for
> high update activity relative to table size).  Data proximity would
> then become a cost coefficient to the benefits of LIMIT.

The necessary first step to this is to realise that we can't simply
apply the LIMIT as a reduction in query cost, in all cases.

The way I'm seeing it, you can't assume the LIMIT will apply to any
IndexScan that doesn't have an index condition. If it has just a
filter, or nothing at all, just an ordering then it could easily scan
the whole index if the stats are wrong.

So plans like this could be wrong, by assuming the scan will end
earlier because of the LIMIT than it actually will.

Limit
  IndexScan (no index cond)

Limit
  NestJoin
    IndexScan (no index cond)
    SomeScan

Limit
  NestJoin
    NestJoin
      IndexScan (no index cond)
      SomeScan
   SomeScan

and deeper...

I'm looking for a way to identify and exclude such plans, assuming
that this captures at least some of the problem plans.

Comments? Test Cases?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Yet another abort-early plan disaster on 9.3

From:
Josh Berkus
Date:
On 09/26/2014 01:06 AM, Simon Riggs wrote:
> On 23 September 2014 00:56, Josh Berkus <josh@agliodbs.com> wrote:
>
>> We've hashed that out a bit, but frankly I think it's much more
>> profitable to pursue fixing the actual problem than providing a
>> workaround like "risk", such as:
>>
>> a) fixing n_distinct estimation
>> b) estimating stacked quals using better math (i.e. not assuming total
>> randomness)
>> c) developing some kind of correlation stats
>>
>> Otherwise we would be just providing users with another knob there's no
>> rational way to set.
>
> I believe this is a serious issue for PostgreSQL users and one that
> needs to be addressed.
>
> n_distinct can be fixed manually, so that is less of an issue.

It's an issue for the 99.8% of our users who don't know what n_distinct
is, let alone how to calculate it.  Also, changing it requires an
exclusive lock on the table. Of course, you and I have been over this
issue before.

One thing I'm wondering is why our estimator creates n_distinct as a
% so seldom.  Really, any time n_distinct is over 10K we should be
estimating a % instead.  Now, estimating that % has its own issues, but
it does seem like a peculiar quirk of our stats model.

Anyway, in the particular case I posted fixing n_distinct to realistic
numbers (%) fixed the query plan.

>
> The problem, as I see it, is different. We assume that if there are
> 100 distinct values and you use LIMIT 1 that you would only need to
> scan 1% of rows. We assume that the data is arranged in the table in a
> very homogenous layout. When data is not, and it seldom is, we get
> problems.
>
> Simply put, assuming that LIMIT will reduce the size of all scans is
> just way wrong. I've seen many plans where increasing the LIMIT
> dramatically improves the plan.
>
> If we can at least agree it is a problem, we can try to move forwards.

That is certainly another problem.  Does correlation stat figure in the
LIMIT calculation at all, currently?  That's what correlation stat is
for, no?

Also, to be fair, physical correlation of rows can also lead to
abort-early plans being extra fast, if everything we want is towards the
beginning of the index.  Which means we'd need working multi-column
correlation, which is a known hard problem.

For example, consider the query:

SELECT id, updated_on FROM audit_log
WHERE updated_on < '2010-01-01'
ORDER BY id LIMIT 10;

In an append-only table, that query is liable to be very fast with an
abort-early plan scanning on an ID index (AEP from now on), since the
oldest rows are likely to correspond with the smallest IDs.  But then
the user does this:

SELECT id, updated_on FROM audit_log
WHERE updated_on < '2010-01-01'
ORDER BY id DESC LIMIT 10;

... and a completely different plan is called for, because using an AEP
will result in reverse scanning most of the index.  However, I'm not
sure the planner knows the difference, since it's only comparing the
estimated selectivity of (updated_on < '2010-01-01') and seeing that
it's 20% of rows.  I bet you'd get an AEP in the 2nd case too.
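
(The single-column physical-ordering correlation the planner does track is
easy to inspect, for what it's worth:

SELECT attname, correlation FROM pg_stats WHERE tablename = 'audit_log';

but it is one number per column, which is exactly why the multi-column case
above is hard.)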

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Yet another abort-early plan disaster on 9.3

From:
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> The way I'm seeing it, you can't assume the LIMIT will apply to any
> IndexScan that doesn't have an index condition. If it has just a
> filter, or nothing at all, just an ordering then it could easily scan
> the whole index if the stats are wrong.

That statement applies with equal force to *any* plan with a LIMIT;
it's not just index scans.

The real question is to what extent are the tuples satisfying the extra
filter condition randomly distributed with respect to the index order
(or physical order, if it's a seqscan).  The existing cost estimation
code effectively assumes that they're perfectly uniformly distributed;
which is a good average-case assumption but can be horribly wrong in
the worst case.

If we could settle on some other model for the probable distribution
of the matching tuples, we could adjust the cost estimates for LIMIT
accordingly.  I have not enough statistics background to know what a
realistic alternative would be.

Another possibility is to still assume a uniform distribution but estimate
for, say, a 90% probability instead of 50% probability that we'll find
enough tuples after scanning X amount of the table.  Again, I'm not too
sure what that translates to in terms of the actual math, but it sounds
like something a statistics person could do in their sleep.
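
(A rough sketch of that arithmetic, with notation that is mine rather than
the planner's: with LIMIT k and a fraction f of rows passing the filter, the
uniform model expects to stop after reading about

    rows_scanned ~= k / f

rows. A 90th-percentile estimate would instead take the 90th percentile of a
negative binomial distribution with k required successes and per-row success
probability f; for k = 1 and small f that is roughly ln(10), i.e. about 2.3,
times the mean.)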

I do not think we should estimate for the worst case though.  If we do,
we'll hear cries of anguish from a lot of people, including many of the
same ones complaining now, because the planner stopped picking fast-start
plans even for cases where they are orders of magnitude faster than the
alternatives.

            regards, tom lane


Re: Yet another abort-early plan disaster on 9.3

From:
Gavin Flower
Date:
On 30/09/14 12:00, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
The way I'm seeing it, you can't assume the LIMIT will apply to any
IndexScan that doesn't have an index condition. If it has just a
filter, or nothing at all, just an ordering then it could easily scan
the whole index if the stats are wrong.
That statement applies with equal force to *any* plan with a LIMIT;
it's not just index scans.

The real question is to what extent are the tuples satisfying the extra
filter condition randomly distributed with respect to the index order
(or physical order, if it's a seqscan).  The existing cost estimation
code effectively assumes that they're perfectly uniformly distributed;
which is a good average-case assumption but can be horribly wrong in
the worst case.

If we could settle on some other model for the probable distribution
of the matching tuples, we could adjust the cost estimates for LIMIT
accordingly.  I have not enough statistics background to know what a
realistic alternative would be.

Another possibility is to still assume a uniform distribution but estimate
for, say, a 90% probability instead of 50% probability that we'll find
enough tuples after scanning X amount of the table.  Again, I'm not too
sure what that translates to in terms of the actual math, but it sounds
like something a statistics person could do in their sleep.

I do not think we should estimate for the worst case though.  If we do,
we'll hear cries of anguish from a lot of people, including many of the
same ones complaining now, because the planner stopped picking fast-start
plans even for cases where they are orders of magnitude faster than the
alternatives.
		regards, tom lane


If you analyzed the tables in most production databases, you would find that they are almost invariably not uniformly distributed.

Most likely they will be clumped, and if you plotted the frequency of values of a given column in a given range against the number of blocks, you are likely to see one or more distinct peaks.  If the table has been CLUSTERed on that column, then they will very likely to be in one clump spanning contiguous blocks.

I suspect that there are two distinct populations: one relating to values present before the last VACUUM, and ones added since.

There are so many factors to consider, pattern of CRUD operations, range of values in query, ... that probably prevent using very sophisticated approaches, but I would be happy to be proved wrong!

Though I am fairly confident that if the distribution is not known in advance for a given table, then the percentage that needs to be processed to satisfy the limit is likely to be a lot larger than a uniform distribution would suggest.

Would it be feasible to get a competent statistician to advise what data to collect, and to analyze it?  Maybe it is possible to get a better estimate on how much of a table needs to be scanned, based on some fairly simple statistics.  But unless research is done, it is probably impossible to determine what statistics might be useful, and how effective a better estimate could be.
I have a nasty feeling that assuming a uniform distribution may still end up being the best we can do - but I may be being unduly pessimistic!


Cheers,
Gavin

Re: Yet another abort-early plan disaster on 9.3

From:
Greg Stark
Date:
On Fri, Sep 26, 2014 at 9:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If we can at least agree it is a problem, we can try to move forwards.

Well that's a good question. I don't think we do and I think the
reason why is because we haven't actually pinned down exactly what is
the problem.

The real problem here is that the ideal index for the query does not
exist, and Postgres is trying to choose between two inappropriate
indexes, a decision that is very, very hard for it to make. If it guesses
wrong in *either* direction it'll go very poorly. We can try to
improve the frequency of getting the right decision but it'll never be
100% and even if it was it'll still not perform as well as the right
index would have.

I have seen plenty of applications where the slowdown was in the
reverse direction --
where a query like "find the last login for the current user" was
planned just as Josh is asking for by retrieving all the records for
the user and sorting by login time and it caused large problems in
production when some users had a disproportionately large number of
records.

The real solution for users is to create the compound index on both columns (or
partial index in some cases). Trying to make do with an ordered scan
or an index scan and sort is, either way, going to cause problems in real
world usage.
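
(For the single-user form of this kind of query -- WHERE user_id = ?
ORDER BY recorded_on DESC LIMIT 20 -- the compound index would be something
like the following; the name and the DESC ordering are illustrative:

CREATE INDEX index_categories_on_user_id_recorded_on
    ON categories (user_id, recorded_on DESC);

With that in place the rows come back already ordered, so there is no sort
and no abort-early gamble on an unrelated index.)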

In fact I think the real story here is that Postgres is doing a
surprisingly good job at making do without the right index and that's
causing users to get surprisingly far before they run into problems.
That may not be the best thing for users in the long run but that's a
problem that should be solved by better development tools to help
users identify scalability problems early.



--
greg


Re: Yet another abort-early plan disaster on 9.3

From:
Simon Riggs
Date:
On 29 September 2014 22:54, Josh Berkus <josh@agliodbs.com> wrote:
> On 09/26/2014 01:06 AM, Simon Riggs wrote:
>> On 23 September 2014 00:56, Josh Berkus <josh@agliodbs.com> wrote:
>>
>>> We've hashed that out a bit, but frankly I think it's much more
>>> profitable to pursue fixing the actual problem than providing a
>>> workaround like "risk", such as:
>>>
>>> a) fixing n_distinct estimation
>>> b) estimating stacked quals using better math (i.e. not assuming total
>>> randomness)
>>> c) developing some kind of correlation stats
>>>
>>> Otherwise we would be just providing users with another knob there's no
>>> rational way to set.
>>
>> I believe this is a serious issue for PostgreSQL users and one that
>> needs to be addressed.
>>
>> n_distinct can be fixed manually, so that is less of an issue.
>
> It's an issue for the 99.8% of our users who don't know what n_distinct
> is, let alone how to calculate it.  Also, changing it requires an
> exclusive lock on the table. Of course, you and I have been over this
> issue before.

In 9.4 you'll be able to set n_distinct using only a Share Update
Exclusive lock.

So that's no longer a problem.

The quality of the n_distinct estimate itself is still an issue, with no
current solution; but then, that is why you can set it manually.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Yet another abort-early plan disaster on 9.3

From:
Simon Riggs
Date:
On 30 September 2014 00:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> The way I'm seeing it, you can't assume the LIMIT will apply to any
>> IndexScan that doesn't have an index condition. If it has just a
>> filter, or nothing at all, just an ordering then it could easily scan
>> the whole index if the stats are wrong.
>
> That statement applies with equal force to *any* plan with a LIMIT;
> it's not just index scans.

Agreed

> The real question is to what extent are the tuples satisfying the extra
> filter condition randomly distributed with respect to the index order
> (or physical order, if it's a seqscan).

Agreed

> The existing cost estimation
> code effectively assumes that they're perfectly uniformly distributed;
> which is a good average-case assumption but can be horribly wrong in
> the worst case.

Agreed. This is the main observation from which we can work.

> If we could settle on some other model for the probable distribution
> of the matching tuples, we could adjust the cost estimates for LIMIT
> accordingly.  I have not enough statistics background to know what a
> realistic alternative would be.

I'm not sure that the correlation alone is sufficient to be able to do
that. We'd need to estimate where the values looked for are likely to
be wrt other values, then increase estimate accordingly. That sounds
like a lot of pushups grovelling through quals and comparing against
stats. So my thinking is actually to rule that out, unless you've some
ideas for how to do that?

> Another possibility is to still assume a uniform distribution but estimate
> for, say, a 90% probability instead of 50% probability that we'll find
> enough tuples after scanning X amount of the table.  Again, I'm not too
> sure what that translates to in terms of the actual math, but it sounds
> like something a statistics person could do in their sleep.
>
> I do not think we should estimate for the worst case though.  If we do,
> we'll hear cries of anguish from a lot of people, including many of the
> same ones complaining now, because the planner stopped picking fast-start
> plans even for cases where they are orders of magnitude faster than the
> alternatives.

Fast start plans still make sense when performing an IndexScan with no
filter conditions. Those types of plan should not be changed from
current costing - they are accurate, good and very important because
of their frequency in real workloads.

What I think we are seeing is Ordered plans being selected too often
in preference to Sorted plans when we make selectivity or stats
errors, as well as data distributions that aren't correctly described
by the statistics, causing much longer execution times.

Here are some plan selection strategies

* Cost based - attempt to exactly calculate the cost based upon
existing stats - increase the complexity of cost calc to cover other
aspects. Even if we do that, these may not be that helpful in covering
the cases where the stats turn out to be wrong.

* Risk based - A risk adjusted viewpoint would be that we should treat
the cost as mid-way between the best and the worst. The worst is
clearly scanning (100% - N) of the tuples, the best is just N tuples.
So we should be costing scans with excess filter conditions as a (100%
Scan)/2, no matter the conditions, based purely upon risk.

* Simplified heuristic - deselect ordered plans when they are driven
from scans without quals or indexscans with filters, since the risk
adjusted cost is likely to be higher than the sorted cost. Inspecting
the plan tree for this could be quite costly, so would only be done
when the total cost is $high, prior to it being adjusted by LIMIT.


In terms of practical steps... I suggest the following:

* Implement enable_orderedscan = on (default) | off. A switch to allow
plans to de-select ordered plans, so we can more easily see the
effects of such plans in the wild.

* Code heuristic approach - I can see where to add my heuristic in the
grouping planner. So we just need to do a left? deep search of the
plan tree looking for scans of the appropriate type and bail out if we
find one.

Thoughts?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Yet another abort-early plan disaster on 9.3

From:
"Graeme B. Bell"
Date:
>> The existing cost estimation
>> code effectively assumes that they're perfectly uniformly distributed;
>> which is a good average-case assumption but can be horribly wrong in
>> the worst case.


Sorry, just an outsider jumping in with a quick comment.

Every year or two the core count goes up. Can/should/does postgres ever attempt two strategies in parallel, in cases
where strategy A is generally good but strategy B prevents bad worst case behaviour? Kind of like a Schrödinger's Cat
approach to scheduling. What problems would it raise?

Graeme.



Re: Yet another abort-early plan disaster on 9.3

From:
Claudio Freire
Date:
On Tue, Sep 30, 2014 at 8:34 AM, Graeme B. Bell <grb@skogoglandskap.no> wrote:
>
>>> The existing cost estimation
>>> code effectively assumes that they're perfectly uniformly distributed;
>>> which is a good average-case assumption but can be horribly wrong in
>>> the worst case.
>
>
> Sorry, just an outsider jumping in with a quick comment.
>
> Every year or two the core count goes up. Can/should/does postgres ever attempt two strategies in parallel, in cases
> where strategy A is generally good but strategy B prevents bad worst case behaviour? Kind of like a Schrödinger's Cat
> approach to scheduling.

> What problems would it raise?

Interleaved I/O, that would kill performance for both plans if it
happens on rotating media.


Re: Yet another abort-early plan disaster on 9.3

From:
Tom Lane
Date:
"Graeme B. Bell" <grb@skogoglandskap.no> writes:
> Every year or two the core count goes up. Can/should/does postgres ever attempt two strategies in parallel, in cases
> where strategy A is generally good but strategy B prevents bad worst case behaviour? Kind of like a Schrödinger's Cat
> approach to scheduling. What problems would it raise?

You can't run two plans and have them both returning rows to the client,
or performing inserts/updates/deletes as the case may be.

            regards, tom lane


Re: Yet another abort-early plan disaster on 9.3

From:
Jeff Janes
Date:
On Mon, Sep 29, 2014 at 7:12 PM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:

Would it be feasible to get a competent statistician to advise what data to collect, and to analyze it?  Maybe it is possible to get a better estimate on how much of a table needs to be scanned, based on some fairly simple statistics.  But unless research is done, it is probably impossible to determine what statistics might be useful, and how effective a better estimate could be.
I have a nasty feeling that assuming a uniform distribution may still end up being the best we can do - but I may be being unduly pessimistic!

As a semi-competent statistician, my gut feeling is that our best bet would be not to rely on the competence of statisticians for too much, and instead try to give the executor the ability to abandon a fruitless path and pick a different plan instead. Of course this option is foreclosed once a tuple is returned to the client (unless the ctid is also cached, so we can make sure not to send it again on the new plan).

I think that the exponential explosion of possibilities is going to be too great to analyze in any rigorous way.

Cheers,

Jeff

Re: Yet another abort-early plan disaster on 9.3

From:
Jeff Janes
Date:
On Mon, Sep 29, 2014 at 2:54 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/26/2014 01:06 AM, Simon Riggs wrote:
> On 23 September 2014 00:56, Josh Berkus <josh@agliodbs.com> wrote:
>
>> We've hashed that out a bit, but frankly I think it's much more
>> profitable to pursue fixing the actual problem than providing a
>> workaround like "risk", such as:
>>
>> a) fixing n_distinct estimation
>> b) estimating stacked quals using better math (i.e. not assuming total
>> randomness)
>> c) developing some kind of correlation stats
>>
>> Otherwise we would be just providing users with another knob there's no
>> rational way to set.
>
> I believe this is a serious issue for PostgreSQL users and one that
> needs to be addressed.
>
> n_distinct can be fixed manually, so that is less of an issue.

It's an issue for the 99.8% of our users who don't know what n_distinct
is, let alone how to calculate it.  Also, changing it requires an
exclusive lock on the table. Of course, you and I have been over this
issue before.


If 99.6% of our users don't have a problem with n_distinct in their system, that would mean that only 50% of the people with the problem don't know how to solve it.  And those people can usually get excellent free help on the internet.

But if the problem is not with n_distinct, but rather with most_common_freqs (which I encounter more often than problems with n_distinct), all I can do is shrug and say "yeah, I know about that problem.  Either crank up the statistics target as high as it will go, or it sucks to be you."
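
(For a single column, that looks like this -- 10000 is the current maximum,
and the table/column names are again the ones from Josh's example:)

ALTER TABLE categories ALTER COLUMN user_id SET STATISTICS 10000;
ANALYZE categories;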
 

One thing I'm wondering is why our estimator creates n_distinct as a
% so seldom.  Really, any time n_distinct is over 10K we should be
estimating a % instead.  Now, estimating that % has its own issues, but
it does seem like a peculiar quirk of our stats model.

Anyway, in the particular case I posted fixing n_distinct to realistic
numbers (%) fixed the query plan.

But wouldn't fixing the absolute number also have fixed the plan?  If you are going to set a number manually and then nail it in place so that analyze stops changing it, then I can certainly see how the fractional method is desirable.  But if the goal is not to do that but have the correct value estimated in the first place, I don't really see much benefit from converting the estimate into a fraction and then back again.


>
> The problem, as I see it, is different. We assume that if there are
> 100 distinct values and you use LIMIT 1 that you would only need to
> scan 1% of rows. We assume that the data is arranged in the table in a
> very homogenous layout. When data is not, and it seldom is, we get
> problems.
>
> Simply put, assuming that LIMIT will reduce the size of all scans is
> just way wrong. I've seen many plans where increasing the LIMIT
> dramatically improves the plan.
>
> If we can at least agree it is a problem, we can try to move forwards.

I don't think anyone doubts there is a problem (many more than one of them), there is just disagreement about the priority and what can be done about it.
 

That is certainly another problem.  Does correlation stat figure in the
LIMIT calculation at all, currently?  That's what correlation stat is
for, no?

I don't think correlation is up to the task as a complete solution, although it might help a little.  There is no way a simple correlation can encode that John retired 15 years ago and hasn't logged on since, while Johannes was hired yesterday and never logged on before then.

 Cheers,

Jeff

Re: Yet another abort-early plan disaster on 9.3

From:
"Graeme B. Bell"
Date:
Thanks for your replies everyone.

> You can't run two plans and have them both returning rows to the client,

That wasn't what I had in mind.

I can envisage cases where the worst case behaviour of one plan results in zero rows by the time the alternative plan
has generated the complete result, never mind a single row (e.g. anything with LIMIT in it could fall into that
category). Maybe it's enough to alleviate the problems caused by planning heuristics known to have bad worst-case
performance that is hard to avoid with a single-threaded approach?

Providing we're not modifying data in the query, and providing we kill the 'loser' thread when either (the first result /
all results) come in, maybe there's value in letting them race and picking the best plan retrospectively.


I guess it's going into another topic, but I wonder what % of DBs/queries look like this:

- little or no I/O thrash (e.g. tuples mostly in memory already or DB configured to have a relatively low
'random_page_cost')
- ordered results, or, the whole result set is being produced at once.
- SELECTs only

In my own work (national scale GIS) this is what most of our queries & query environments look like.

Graeme


On 30 Sep 2014, at 18:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Graeme B. Bell" <grb@skogoglandskap.no> writes:
>> Every year or two the core count goes up. Can/should/does postgres ever attempt two strategies in parallel, in cases
>> where strategy A is generally good but strategy B prevents bad worst case behaviour? Kind of like a Schrödinger's Cat
>> approach to scheduling. What problems would it raise?
>
> You can't run two plans and have them both returning rows to the client,
> or performing inserts/updates/deletes as the case may be.
>
>             regards, tom lane



Re: Yet another abort-early plan disaster on 9.3

From:
Gavin Flower
Date:
On 01/10/14 05:54, Jeff Janes wrote:
On Mon, Sep 29, 2014 at 7:12 PM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:

Would it be feasible to get a competent statistician to advise what data to collect, and to analyze it?  Maybe it is possible to get a better estimate on how much of a table needs to be scanned, based on some fairly simple statistics.  But unless research is done, it is probably impossible to determine what statistics might be useful, and how effective a better estimate could be.
I have a nasty feeling that assuming a uniform distribution may still end up being the best we can do - but I may be being unduly pessimistic!

As a semi-competent statistician, my gut feeling is that our best bet would be not to rely on the competence of statisticians for too much, and instead try to give the executor the ability to abandon a fruitless path and pick a different plan instead. Of course this option is foreclosed once a tuple is returned to the client (unless the ctid is also cached, so we can make sure not to send it again on the new plan).

I think that the exponential explosion of possibilities is going to be too great to analyze in any rigorous way.

Cheers,

Jeff
Many moons ago, I passed several 300 level statistics papers.

I looked at this problem and found it was too hard to even properly characterise the problem (looks 'simple' - if you don't look too closely), and ended up feeling it was definitely 'way above my pay grade'!  :-)

It might be possible to tackle it more pragmatically: instead of trying to be all analytic and rigorously list all the possible influences, have a look at queries of this nature that are taking far too long.  Then get a feel for combinations of issues involved and how they contribute.  If you have enough data, you might be able to use something like Principal Component Analysis (I was fortunate to meet a scientist who had got heavily into this area of statistics).  Such an approach might yield valuable insights, even if the problem is not fully characterised, let alone 'solved'.


Cheers,
Gavin

Re: Yet another abort-early plan disaster on 9.3

From:
Merlin Moncure
Date:
On Tue, Sep 30, 2014 at 11:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Sep 29, 2014 at 7:12 PM, Gavin Flower
> <GavinFlower@archidevsys.co.nz> wrote:
>>
>>
>> Would it be feasible to get a competent statistician to advise what data
>> to collect, and to analyze it?  Maybe it is possible to get a better
>> estimate on how much of a table needs to be scanned, based on some fairly
>> simple statistics.  But unless research is done, it is probably impossible
>> to determine what statistics might be useful, and how effective a better
>> estimate could be.
>>
>> I have a nasty feeling that assuming a uniform distribution may still end
>> up being the best we can do - but I may be being unduly pessimistic!
>
> As a semi-competent statistician, my gut feeling is that our best bet would
> be not to rely on the competence of statisticians for too much, and instead
> try to give the executor the ability to abandon a fruitless path and pick a
> different plan instead. Of course this option is foreclosed once a tuple is
> returned to the client (unless the ctid is also cached, so we can make sure
> not to send it again on the new plan).
>
> I think that the exponential explosion of possibilities is going to be too
> great to analyze in any rigorous way.

Call it the 'Parking in Manhattan' strategy -- you know when it's time
to pull forward when you've smacked into the car behind you.

Kidding aside, this might be the path forward since it's A. more
general and can catch all kinds of problem cases that our statistics
system won't/can't catch and B. At least in my case it seems like more
complicated plans tend to not return much data until the innermost
risky parts have been involved.  Even if that wasn't the case,
withholding data to the client until a user configurable time
threshold had been passed (giving the planner time to back up if
necessary) would be a reasonable user facing tradeoff via GUC:
'max_planner_retry_time'.

merlin


Re: Yet another abort-early plan disaster on 9.3

From:
Simon Riggs
Date:
On 30 September 2014 18:28, Jeff Janes <jeff.janes@gmail.com> wrote:

>> Anyway, in the particular case I posted fixing n_distinct to realistic
>> numbers (%) fixed the query plan.
>
>
> But wouldn't fixing the absolute number also have fixed the plan?

There are two causes of this issue.

1. Poor estimates of n_distinct. Fixable by user.

2. Poor assumption of homogeneous distribution. No way for user to
fix. Insufficient stats detail to be able to solve in current planner.

I see (2) as the main source of issues, since as we observe, (1) is fixable.

An example is a social media application where the business query is
"Display the last 10 posts". If the user is a frequent, recent user
then the query could come back very quickly, so a reverse scan on
post_id would work great. If the user hasn't logged on for ages, then
that plan needs to scan lots and lots of data to find 10 posts.
That gives the problem that only certain users experience poor
performance - even the data isn't consistent in its distribution, so
stats wouldn't help much, even if we could capture the profile of the
"typical user".

>> > The problem, as I see it, is different. We assume that if there are
>> > 100 distinct values and you use LIMIT 1 that you would only need to
>> > scan 1% of rows. We assume that the data is arranged in the table in a
>> > very homogenous layout. When data is not, and it seldom is, we get
>> > problems.
>> >
>> > Simply put, assuming that LIMIT will reduce the size of all scans is
>> > just way wrong. I've seen many plans where increasing the LIMIT
>> > dramatically improves the plan.
>> >
>> > If we can at least agree it is a problem, we can try to move forwards.
>
>
> I don't think anyone doubts there is a problem (many more than one of them),
> there is just disagreement about the priority and what can be done about it.


>> That is certainly another problem.  Does correlation stat figure in the
>> LIMIT calculation at all, currently?  That's what correlation stat is
>> for, no?
>
>
> I don't think correlation is up to the task as a complete solution, although
> it might help a little.  There is no way a simple correlation can encode
> that John retired 15 years ago and hasn't logged on since, while Johannes
> was hired yesterday and never logged on before then.

Ah, OK, essentially the same example.

Which is why I ruled out correlation stats based approaches and
suggested a risk-weighted cost approach.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Yet another abort-early plan disaster on 9.3

From:
Josh Berkus
Date:
On 09/30/2014 04:01 PM, Simon Riggs wrote:
> On 30 September 2014 18:28, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>>> Anyway, in the particular case I posted fixing n_distinct to realistic
>>> numbers (%) fixed the query plan.
>>
>>
>> But wouldn't fixing the absolute number also have fixed the plan?
>
> There are two causes of this issue.
>
> 1. Poor estimates of n_distinct. Fixable by user.
>
> 2. Poor assumption of homogeneous distribution. No way for user to
> fix. Insufficient stats detail to be able to solve in current planner.
>
> I see (2) as the main source of issues, since as we observe, (1) is fixable.

I disagree that (1) is not worth fixing just because we've provided
users with an API to override the stats.  It would unquestionably be
better for us to have a better n_distinct estimate in the first place.
Further, this is an easier problem to solve, and fixing n_distinct
estimates would fix a large minority of currently pathological queries.
 It's like saying "hey, we don't need to fix the leak in your radiator,
we've given you a funnel in the dashboard you can pour water into."

I do agree that (2) is worth fixing *as well*.  In a first
approximation, one possibility (as Tom suggests) would be to come up
with a mathematical model for a selectivity estimate which was somewhere
*between* homogenous distribution and the worst case.  While that
wouldn't solve a lot of cases, it would be a start towards having a
better model.

>> I don't think correlation is up to the task as a complete solution, although
>> it might help a little.  There is no way a simple correlation can encode
>> that John retired 15 years ago and hasn't logged on since, while Johannes
>> was hired yesterday and never logged on before then.
>
> Ah, OK, essentially the same example.
>
> Which is why I ruled out correlation stats based approaches and
> suggested a risk-weighted cost approach.

By "risk-weighted" you mean just adjusting cost estimates based on what
the worst case cost looks like, correct?  That seemed to be your
proposal from an earlier post.  If so, we're in violent agreement here.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Yet another abort-early plan disaster on 9.3

From:
Simon Riggs
Date:
On 1 October 2014 19:56, Josh Berkus <josh@agliodbs.com> wrote:
> On 09/30/2014 04:01 PM, Simon Riggs wrote:
>> On 30 September 2014 18:28, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>>>> Anyway, in the particular case I posted fixing n_distinct to realistic
>>>> numbers (%) fixed the query plan.
>>>
>>>
>>> But wouldn't fixing the absolute number also have fixed the plan?
>>
>> There are two causes of this issue.
>>
>> 1. Poor estimates of n_distinct. Fixable by user.
>>
>> 2. Poor assumption of homogeneous distribution. No way for user to
>> fix. Insufficient stats detail to be able to solve in current planner.
>>
>> I see (2) as the main source of issues, since as we observe, (1) is fixable.
>
> I disagree that (1) is not worth fixing just because we've provided
> users with an API to override the stats.  It would unquestionably be
> better for us to have a better n_distinct estimate in the first place.
> Further, this is an easier problem to solve, and fixing n_distinct
> estimates would fix a large minority of currently pathological queries.
>  It's like saying "hey, we don't need to fix the leak in your radiator,
> we've given you a funnel in the dashboard you can pour water into."

Having read papers on it, I believe the problem is intractable. Coding
is not the issue. To anyone: please prove me wrong, in detail, with
references so it can be coded.

> I do agree that (2) is worth fixing *as well*.  In a first
> approximation, one possibility (as Tom suggests) would be to come up
> with a mathematical model for a selectivity estimate which was somewhere
> *between* homogeneous distribution and the worst case.  While that
> wouldn't solve a lot of cases, it would be a start towards having a
> better model.

This may have a reasonable solution, but I don't know it. A more
accurate mathematical model still won't avoid the main problem: it is a
guess, not certain knowledge, and the risk will still remain.

>>> I don't think correlation is up to the task as a complete solution, although
>>> it might help a little.  There is no way a simple correlation can encode
>>> that John retired 15 years ago and hasn't logged on since, while Johannes
>>> was hired yesterday and never logged on before then.
>>
>> Ah, OK, essentially the same example.
>>
>> Which is why I ruled out correlation stats based approaches and
>> suggested a risk-weighted cost approach.
>
> By "risk-weighted" you mean just adjusting cost estimates based on what
> the worst case cost looks like, correct?  That seemed to be your
> proposal from an earlier post.  If so, we're in violent agreement here.

I proposed a clear path for this earlier in the thread and received no
comments as yet. Please look at that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Yet another abort-early plan disaster on 9.3

От
Peter Geoghegan
Дата:
On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I disagree that (1) is not worth fixing just because we've provided
>> users with an API to override the stats.  It would unquestionably be
>> better for us to have a better n_distinct estimate in the first place.
>> Further, this is an easier problem to solve, and fixing n_distinct
>> estimates would fix a large minority of currently pathological queries.
>>  It's like saying "hey, we don't need to fix the leak in your radiator,
>> we've given you a funnel in the dashboard you can pour water into."
>
> Having read papers on it, I believe the problem is intractable. Coding
> is not the issue. To anyone: please prove me wrong, in detail, with
> references so it can be coded.

I think it might be close to intractable if you're determined to use a
sampling model. HyperLogLog looks very interesting for n_distinct
estimation, though. My abbreviated key patch estimates the cardinality
of abbreviated keys (and original strings that are to be sorted) with
high precision and fixed overhead.  Maybe we can figure out a way to
do opportunistic streaming of HLL. Believe it or not, the way I use
HLL for estimating cardinality is virtually free. Hashing is really
cheap when the CPU is bottlenecked on memory bandwidth.

If you're interested, download the patch, and enable the debug traces.
You'll see HyperLogLog accurately indicate the cardinality of text
datums as they're copied into local memory before sorting.

--
Regards,
Peter Geoghegan


Re: Yet another abort-early plan disaster on 9.3

От
Ryan Johnson
Дата:
On 29/09/2014 9:00 AM, Merlin Moncure wrote:
> On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The problem, as I see it, is different. We assume that if there are
>> 100 distinct values and you use LIMIT 1 that you would only need to
>> scan 1% of rows. We assume that the data is arranged in the table in a
>> very homogenous layout. When data is not, and it seldom is, we get
>> problems.
> Hm, good point -- 'data proximity'.  At least in theory, can't this be
> measured and quantified?  For example, given a number of distinct
> values, you could estimate the % of pages read (or maybe non
> sequential seeks relative to the number of pages) you'd need to read
> all instances of a particular value in the average (or perhaps the
> worst) case.   One way of trying to calculate that would be to look at
> proximity of values in sampled pages (and maybe a penalty assigned for
> high update activity relative to table size).  Data proximity would
> then become a cost coefficient to the benefits of LIMIT.
Latecomer to the conversation here, but it seems like this issue (unlike
some) is really easy to recognize at runtime. The optimizer assumed the
scan would access O(1) pages; if the scan has not returned enough
results after k pages, that would be a really good indication that it's
time to rethink the plan, and probably before too much work has been
done higher in the plan (esp. if there's any kind of buffering between
operators, perhaps intentionally so in special cases like this).

Not sure pgsql has any dynamic reoptimization infrastructure in place,
tho. If not, these sorts of dangerous plans are best left alone IMO.

Ryan



Re: Yet another abort-early plan disaster on 9.3

От
Josh Berkus
Дата:
On 10/02/2014 02:30 AM, Peter Geoghegan wrote:
> On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Having read papers on it, I believe the problem is intractable. Coding
>> is not the issue. To anyone: please prove me wrong, in detail, with
>> references so it can be coded.
>
> I think it might be close to intractable if you're determined to use a
> sampling model. HyperLogLog looks very interesting for n_distinct
> estimation, though. My abbreviated key patch estimates the cardinality
> of abbreviated keys (and original strings that are to be sorted) with
> high precision and fixed overhead.  Maybe we can figure out a way to
> do opportunistic streaming of HLL. Believe it or not, the way I use
> HLL for estimating cardinality is virtually free. Hashing is really
> cheap when the CPU is bottlenecked on memory bandwidth.

Yes, it's only intractable if you're wedded to the idea of a tiny,
fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
can get a MUCH more accurate n_distinct estimate using multiple
algorithms, of which HLL is one.  While n_distinct will still have some
variance, it'll be over a much smaller range.

The n_distinct algo we use in Postgres is specifically designed (by its
author) to choose the smallest reasonable number of distinct values
capable of producing the observed distribution.  This made sense when we
added it because we didn't have query plans where underestimating
n_distinct produced a penalty.  Now we do, and we ought to change algos.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Yet another abort-early plan disaster on 9.3

От
Peter Geoghegan
Дата:
On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Yes, it's only intractable if you're wedded to the idea of a tiny,
> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
> can get a MUCH more accurate n_distinct estimate using multiple
> algorithms, of which HLL is one.  While n_distinct will still have some
> variance, it'll be over a much smaller range.

I think that HyperLogLog, as a streaming algorithm, will always
require that the entire set be streamed. This doesn't need to be a big
deal - in the case of my abbreviated key patch, it appears to
basically be free because of the fact that we were streaming
everything anyway. It's a very cool algorithm, with fixed overhead and
constant memory usage. It makes very useful guarantees around
accuracy.

I have this intuition that even though I'm more or less not paying
anything for a great cardinality estimate, it's kind of a shame that I
still throw it away after the sort, each and every time. I have a hard
time actually putting my finger on how it could be put to further use,
though. And besides, this only helps if you happen to need to do a
sort (or something that requires a sequential scan, since the cost
certainly isn't anywhere near "free" when you didn't need to do that
anyway).

Our current lack of block-based sampling probably implies that we are
almost as badly off as if we *did* a sequential scan. Not that I'm
suggesting that we give up on the idea of sampling (which would be
crazy).

Streaming algorithms like HyperLogLog are very recent ideas, as these
things go. I wouldn't be all that discouraged by the fact that it
might not have been put to use in this way (for database statistics)
by somebody before now.
--
Regards,
Peter Geoghegan


Re: Yet another abort-early plan disaster on 9.3

От
Jeff Janes
Дата:
On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 10/02/2014 02:30 AM, Peter Geoghegan wrote:
> On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Having read papers on it, I believe the problem is intractable. Coding
>> is not the issue. To anyone: please prove me wrong, in detail, with
>> references so it can be coded.
>
> I think it might be close to intractable if you're determined to use a
> sampling model. HyperLogLog looks very interesting for n_distinct
> estimation, though. My abbreviated key patch estimates the cardinality
> of abbreviated keys (and original strings that are to be sorted) with
> high precision and fixed overhead.  Maybe we can figure out a way to
> do opportunistic streaming of HLL. Believe it or not, the way I use
> HLL for estimating cardinality is virtually free. Hashing is really
> cheap when the CPU is bottlenecked on memory bandwidth.

Yes, it's only intractable if you're wedded to the idea of a tiny,
fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
can get a MUCH more accurate n_distinct estimate using multiple
algorithms, of which HLL is one.  While n_distinct will still have some
variance, it'll be over a much smaller range.

In my hands, the problems with poor n_distinct were not due to the insufficient size of the sample, but the insufficient randomness of it.  Increasing  default_statistics_target did help but not because it increases the number of rows sampled, but rather because it increases the number of blocks sampled.  Once substantially all of the blocks are part of the block sampling, the bias is eliminated even though it was still sampling a small fraction of the rows (roughly one per block).

So one idea would be to get rid of the 2-stage sampling algorithm (sample blocks, sample rows from the chosen blocks) and just read the whole table and sample rows from it unbiased, at least under some conditions.  Some low level benchmarking on my favorite server showed that reading 1% of a system's blocks (in block number order within each file) was no faster than reading all of them from an IO perspective. But that is a virtualized server that wasn't really speced out to be an IO intensive database server in the first place.  It would be interesting to see what people get on real hardware that they actually designed for the task.

A problem right now is that we only have one knob.  I want to compute more accurate n_distinct and most_common_freqs, but I don't want to store huge numbers of entries for most_common_vals and histogram_bounds.
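
(The one knob being the per-column statistics target, which scales the
sample size and the most_common_vals / histogram sizes together; a
sketch, with hypothetical names:

    ALTER TABLE test ALTER COLUMN val SET STATISTICS 1000;
    ANALYZE test;

There is currently no way to say "sample many more blocks, but keep the
MCV list and histogram at their default sizes".)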
 
Cheers,

Jeff

Re: Yet another abort-early plan disaster on 9.3

От
Tomas Vondra
Дата:
On 3.10.2014 21:58, Jeff Janes wrote:
> On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh@agliodbs.com
> <mailto:josh@agliodbs.com>> wrote:
>
>     Yes, it's only intractable if you're wedded to the idea of a tiny,
>     fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>     can get a MUCH more accurate n_distinct estimate using multiple
>     algorithms, of which HLL is one.  While n_distinct will still have some
>     variance, it'll be over a much smaller range.
>
>
> In my hands, the problems with poor n_distinct were not due to the
> insufficient size of the sample, but the insufficient randomness of it.
> Increasing  default_statistics_target did help but not because it
> increases the number of rows sampled, but rather because it increases
> the number of blocks sampled.  Once substantially all of the blocks are
> part of the block sampling, the bias is eliminated even though it was
> still sampling a small fraction of the rows (roughly one per block).

I don't think that's entirely accurate. According to [1], there's a
lower boundary on ratio error, depending on the number of sampled rows.

Say there's a table with 10M rows, we sample 30k rows (which is the
default). Then with probability 5% we'll get a ratio error over 20. That
is, we may either estimate <5% or >2000% of the actual ndistinct value.
Combined with our arbitrary 10% limit that we use to decide whether
ndistinct scales with the number of rows, this sometimes explodes.

By increasing the statistics target, you get a much larger sample and thus
a lower probability of such an error. But nevertheless, it breaks from time
to time, and the fact that the statistics target is static (and not scaling
with the size of the table to get an appropriate sample size) is not really
helping IMHO. A static sample size may work for histograms; for ndistinct,
not so much.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf


> So one idea would be go get rid of the 2-stage sampling algorithm
> (sample blocks, sample rows from the chosen blocks) and just read
> the whole table and sample rows from it unbiased, at least under
> some conditions. Some low level benchmarking on my favorite server
> showed that reading 1% of a system's blocks (in block number order
> within each file) was no faster than reading all of them from an IO
> perspective. But that is a virtualized server that wasn't really
> speced out to be an IO intensive database server in the first place.
> It would be interesting to see what people get on real hardware that
> they actually designed for the task.

I think there was a discussion about the sampling on pgsql-hackers a
while ago ... and yes, here it is [2]. However it seems there was no
clear conclusion on how to change it at that time ...


[2]
http://www.postgresql.org/message-id/CA+TgmoZaqyGSuaL2v+YFVsX06DQDQh-pEV0nobGPws-dNwAwBw@mail.gmail.com

regards
Tomas


Re: Yet another abort-early plan disaster on 9.3

От
Tomas Vondra
Дата:
On 3.10.2014 02:54, Peter Geoghegan wrote:
> On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample. If we're allowed to sample, say, 1% of the
>> table, we can get a MUCH more accurate n_distinct estimate using
>> multiple algorithms, of which HLL is one. While n_distinct will
>> still have some variance, it'll be over a much smaller range.
>
> I think that HyperLogLog, as a streaming algorithm, will always
> require that the entire set be streamed. This doesn't need to be a
> big deal - in the case of my abbreviated key patch, it appears to
> basically be free because of the fact that we were streaming
> everything anyway. It's a very cool algorithm, with fixed overhead
> and constant memory usage. It makes very useful guarantees around
> accuracy.

I think you're mixing two things here - estimating the number of
distinct values in a sample (which can be done very efficiently using
HLL) and estimating the number of distinct values in the whole table.
For that HLL is not usable, unless you process all the data.

Sadly HLL is rather incompatible with the usual estimators, because the
ones I'm aware of need to know the number of occurences for the distinct
values etc.

But couldn't we just piggyback this on autovacuum? One of the nice HLL
features is that it's additive - you can build "partial counters" for
ranges of blocks (say, a few MBs per range), and then merge them when
needed. By keeping the parts it's possible to rebuild it separately.

But maybe this idea is way too crazy ...

regards
Tomas


Re: Yet another abort-early plan disaster on 9.3

От
Greg Stark
Дата:
On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Yes, it's only intractable if you're wedded to the idea of a tiny,
> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
> can get a MUCH more accurate n_distinct estimate using multiple
> algorithms, of which HLL is one.  While n_distinct will still have some
> variance, it'll be over a much smaller range.

I've gone looking for papers on this topic but from what I read this
isn't so. To get any noticeable improvement you need to read 10-50% of
the table and that's effectively the same as reading the entire table
-- and it still had pretty poor results. All the research I could find
went into how to analyze the whole table while using a reasonable
amount of scratch space and how to do it incrementally.

--
greg


Re: Yet another abort-early plan disaster on 9.3

От
"Tomas Vondra"
Дата:
On 10 October 2014 at 13:16, Greg Stark wrote:
> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>> can get a MUCH more accurate n_distinct estimate using multiple
>> algorithms, of which HLL is one.  While n_distinct will still have some
>> variance, it'll be over a much smaller range.
>
> I've gone looking for papers on this topic but from what I read this
> isn't so. To get any noticeable improvement you need to read 10-50% of
> the table and that's effectively the same as reading the entire table
> -- and it still had pretty poor results. All the research I could find
> went into how to analyze the whole table while using a reasonable
> amount of scratch space and how to do it incrementally.

I think it's really difficult to discuss the estimation without some basic
agreement on what are the goals. Naturally, we can't get a perfect
estimator with small samples (especially when the sample size is fixed and
not scaling with the table). But maybe we can improve the estimates
without scanning most of the table?

FWIW I've been playing with the adaptive estimator described in [1] and
the results look really interesting, IMHO. So far I was testing it on
synthetic datasets outside the database, but I plan to use it instead of
our estimator, and do some more tests.

Would be helpful to get a collection of test cases that currently perform
poorly. I have collected a few from the archives, but if those who follow
this thread can provide additional test cases, or point to threads
describing related issues, that'd be great.

It certainly won't be perfect, but if it considerably improves the
estimates then I believe it's a step forward. Ultimately, it's impossible to
improve the estimates without increasing the sample size.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

regards
Tomas



Re: Yet another abort-early plan disaster on 9.3

От
Craig James
Дата:
On Fri, Oct 10, 2014 at 5:10 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> I've gone looking for papers on this topic but from what I read this
> isn't so. To get any noticeable improvement you need to read 10-50% of
> the table and that's effectively the same as reading the entire table
> -- and it still had pretty poor results. All the research I could find
> went into how to analyze the whole table while using a reasonable
> amount of scratch space and how to do it incrementally.

I think it's really difficult to discuss the estimation without some basic
agreement on what are the goals. Naturally, we can't get a perfect
estimator with small samples (especially when the sample size is fixed and
not scaling with the table). But maybe we can improve the estimates
without scanning most of the table?

FWIW I've been playing with the adaptive estimator described in [1] and
the results looks really interesting, IMHO. So far I was testing it on
synthetic datasets outside the database, but I plan to use it instead of
our estimator, and do some more tests.

We've solved this problem using an external (non-Postgres) dynamically optimizing index. In addition to the "early abort," we also require an efficient "late start", the equivalent of "offset 100 limit 10". It's a common problem for web sites that let users page through data with just a tiny amount of state information (a cookie).

Our index is for chemical structures. Chemicals are indexed on chemical fragments. A search typically starts with 50-200 indexed "columns" (chemical fragments). The query is always flat, "A and B and ... and Z". The indexed fragments are both correlated (the existence of one strongly raises the chances of another) and anti-correlated (certain combinations are very rare).

The dynamic optimizer watches the performance of each index in real time. It promotes highly selective indexes and demotes or removes redundant indexes. In a typical query, the initial 50-200 indexes are reduced to 5-10 indexes within the first 100-200 rows examined. The remaining indexes have little correlation yet retain most of the selectivity. (One critical factor with a dynamic optimizer is that the data must be randomized before it's presented to the optimizer. Databases tend to have clusters of similar data. If the optimizer starts in such a cluster, it will optimize poorly.)

Our query is simple (a flat AND) compared to what Postgres has to handle. Even so, a dynamic optimizer is the only effective solution.

Static planners simply can't handle the "early abort" condition, even with good statistics. Many have pointed out that data are "lumpy" rather than well distributed. A more subtle problem is that you can have evenly distributed data, but badly distributed correlations. "Agnes" and "Bob" may be names that are distributed well in a real-estate database, but it might happen that all of the information about homes whose owners' names are "Agnes" and "Bob" occurs at the very end of all of your data because they just got married and bought a house.

The end result is that even with perfect statistics on each column, you're still screwed. The combinatorial explosion of possible correlations between indexes is intractable.

Craig

Re: Yet another abort-early plan disaster on 9.3

От
Tomas Vondra
Дата:
On 10.10.2014 16:21, Craig James wrote:
> On Fri, Oct 10, 2014 at 5:10 AM, Tomas Vondra <tv@fuzzy.cz
> <mailto:tv@fuzzy.cz>> wrote:
>
>     > I've gone looking for papers on this topic but from what I read this
>     > isn't so. To get any noticeable improvement you need to read 10-50% of
>     > the table and that's effectively the same as reading the entire table
>     > -- and it still had pretty poor results. All the research I could find
>     > went into how to analyze the whole table while using a reasonable
>     > amount of scratch space and how to do it incrementally.
>
>     I think it's really difficult to discuss the estimation without some
>     basic
>     agreement on what are the goals. Naturally, we can't get a perfect
>     estimator with small samples (especially when the sample size is
>     fixed and
>     not scaling with the table). But maybe we can improve the estimates
>     without scanning most of the table?
>
>     FWIW I've been playing with the adaptive estimator described in [1] and
>     the results looks really interesting, IMHO. So far I was testing it on
>     synthetic datasets outside the database, but I plan to use it instead of
>     our estimator, and do some more tests.
>
>
> We've solved this problem using an external (non-Postgres) dynamically
> optimizing index. In addition to the "early abort," we also require an
> efficient "late start", the equivalent of "offset 100 limit 10". It's a
> common problem for web sites that let users page through data with just
> a tiny amount of state information (a cookie).

Yeah, paging is a known example, both for the inefficiency once you get
to pages far away, and because of the planning challenges. I think there
are known solutions to this problem
(http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way),
although those are not applicable to all cases.
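
The gist of the approach in that link, as a sketch with a hypothetical
table: instead of OFFSET you remember the last row shown and seek past
it,

    SELECT * FROM events
     WHERE (recorded_on, id) < (:last_recorded_on, :last_id)
     ORDER BY recorded_on DESC, id DESC
     LIMIT 20;

where :last_recorded_on / :last_id are placeholders for values taken
from the previous page, and a matching index on (recorded_on DESC, id
DESC) lets the scan seek straight to the right spot instead of visiting
the skipped rows.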

But I'm not sure how that's related to the ndistinct estimation problem,
discussed in this thread (or rather in this subthread)?

> Our index is for chemical structures. Chemicals are indexed on
> chemical fragments
> <http://emolecules.com/info/molecular-informatics>. A search
> typically starts with 50-200 indexed "columns" (chemical fragments).
> The query is always flat, "A and B and ... and Z". The indexed
> fragments are both correlated (the existence of one strongly raises
> the chances of another) and anti-correlated (certain combinations are
> very rare).

Maybe I don't understand the problem well enough, but isn't this a
perfect match for GIN indexes? I mean, you essentially need to do
queries like "WHERE substance @@ ('A & B & !C')" etc. Which is exactly
what GIN does, because it keeps pointers to tuples for each fragment.
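
Roughly, and with made-up names, something like:

    CREATE INDEX substances_fragments_idx
        ON substances USING gin (fragments);

    SELECT * FROM substances
     WHERE fragments @@ to_tsquery('A & B & !C');

assuming the fragments are stored as a tsvector; the same idea works
for arrays with the && / @> operators.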

> Static planners simply can't handle the "early abort" condition,
> even with good statistics. Many have pointed out that data are
> "lumpy" rather than well distributed. A more subtle problem is that
> you can have evenly distributed data, but badly distributed
> correlations. "Agnes" and "Bob" may be names that are distributed
> well in a real-estate database, but it might happen that all of the
> information about homes whose owners' names are "Agnes" and "Bob"
> occurs at the very end of all of your data because they just got
> married and bought a house.
>
> The end result is that even with perfect statistics on each column,
> you're still screwed. The combinatorial explosion of possible
> correlations between indexes is intractable.

Static planners clearly have limitations, but we don't have dynamic
planning in PostgreSQL, so we have to live with them. And if we can
improve the quality of the estimates - lowering the probability of
poorly performing plans - it's probably good to do that.

It won't be perfect, but until we have dynamic planning it's better than
nothing.

regards
Tomas


Re: Yet another abort-early plan disaster on 9.3

От
Tomas Vondra
Дата:
On 10.10.2014 14:10, Tomas Vondra wrote:
> On 10 October 2014 at 13:16, Greg Stark wrote:
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>>> can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one.  While n_distinct will still have some
>>> variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50% of
>> the table and that's effectively the same as reading the entire table
>> -- and it still had pretty poor results. All the research I could find
>> went into how to analyze the whole table while using a reasonable
>> amount of scratch space and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some basic
> agreement on what are the goals. Naturally, we can't get a perfect
> estimator with small samples (especially when the sample size is fixed and
> not scaling with the table). But maybe we can improve the estimates
> without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1] and
> the results looks really interesting, IMHO. So far I was testing it on
> synthetic datasets outside the database, but I plan to use it instead of
> our estimator, and do some more tests.

Attached is an experimental patch implementing the adaptive estimator.

It was fairly simple (although it's a bit messy). It only computes the
estimates for the  "scalar" case (i.e. data types that we can sort).
Implementing this for the "minimal" case is possible, but requires a bit
more work.

It only computes the estimate and prints a WARNING with both the current
and new estimates; the old estimate is what actually gets stored.

I also attach a few examples of synthetic datasets with differently
skewed distributions that I used for testing. In all cases there's a
single table with 10M rows and a single INT column. There are two kinds
of skew:

1) smooth skew

   - N distinct values (100, 10.000 and 100.000 values)
   - average moves to 0 as 'k' increases ('k' between 1 and 9)
   - smooth distribution of frequencies

   INSERT INTO test
        SELECT pow(random(),k) * 10000 FROM generate_series(1,10000000);

2) step skew

   - a few very frequent values, many rare values
   - for example this generates 5 very frequent and ~10k rare values

   INSERT INTO test
        SELECT (CASE WHEN (v < 90000) THEN MOD(v,5) ELSE v END)
          FROM (
             SELECT (random()*100000)::int AS v
               FROM generate_series(1,10000000)
          ) foo;


Results
=======

I tested this with various statistics target settings (10, 100, 1000),
which translates to different sample sizes.

statistics target 100 (default, 30k rows, 0.3% sample)
======================================================

a) smooth skew, 101 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   101        102
   3   101        102
   5   101        102
   7   101        102
   9   101        102

b) smooth skew, 10.001 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   9986       10542
   3   8902       10883
   5   7579       10824
   7   6639       10188
   9   5947       10013

c) step skew (different numbers of values)

   values   current    adaptive
   ------------------------------
   106      106        107
   106      35         104
   1006     259        1262
   10006    2823       11047


statistics target 10 (3k rows, 0.03% sample)
============================================

a) smooth skew, 101 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   101        102
   3   101        102
   5   101        102
   7   101        102
   9   101        102

b) smooth skew, 10.001 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   9846       10014
   3   4399       7190
   5   2532       5477
   7   1938       4932
   9   1623       1623

c) step skew (different numbers of values)

   values   current    adaptive
   ------------------------------
   106      100        114
   106      5          5
   1006     37         532
   10006    323        20970

statistics target 1000 (300k rows, 3% sample)
=============================================

a) smooth skew, 101 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   101        102
   3   101        102
   5   101        102
   7   101        102
   9   101        102

b) smooth skew, 10.001 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   10001      10002
   3   10000      10000
   5   9998       10011
   7   9973       10045
   9   9939       10114

c) step skew (different numbers of values)

   values   current    adaptive
   ------------------------------
   106      106        107
   106      101        107
   1006     957        1096
   10006    9551       10550


Summary
=======

I've yet to see an example where the adaptive estimator produces worse
results than the current estimator, irrespective of the distribution
and sample size / statistics target.

What really matters is the sample size, with respect to the table size,
so I'll use the 0.03%, 0.3%, and 3% instead of the statistics target.

For the large sample (3%) both estimators produce reasonably accurate
results. This however won't work as the tables grow, because the number
of rows we sample is static (does not grow with the table).

As the sample decreases, the adaptive estimator starts winning. For the
0.3% sample the difference is easily 3x for the high-skew cases. E.g.
for one of the "step skew" distributions the actual ndistinct value is
10006, current estimator gives 2823 while adaptive gives 11047. That's
ratio error ~3.5 vs. 1.1x.

For the tiny 0.03% sample the difference gets even more significant,
especially for the step-skew cases, where the improvement is often an
order of magnitude.


Proposal
========

I think the adaptive estimator works very well, and I plan to submit it
to the next commitfest after a bit of polishing. Examples of
distributions that break it are welcome of course.

Also, I think it's clear it'd be useful to be able to scale the sample
proportionally to the table (instead of only allowing the current
statistics target approach). I understand it may result in scanning a
large part of the table, but I don't see a problem with that as long as
it's not the default behavior (and is clearly documented). I don't see
a way around it - small samples simply result in poor estimates.

I was thinking that this could be done using statistics target, but it's
already used for other things (e.g. size of MCV list, histogram) and we
don't want to break that by adding yet another function.

Ideas?

regards
Tomas



Вложения

Re: Yet another abort-early plan disaster on 9.3

От
Craig James
Дата:
On Fri, Oct 10, 2014 at 9:53 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 10.10.2014 16:21, Craig James wrote:
> Our index is for chemical structures. Chemicals are indexed on
> chemical fragments
> <http://emolecules.com/info/molecular-informatics>. A search
> typically starts with 50-200 indexed "columns" (chemical fragments).
> The query is always flat, "A and B and ... and Z". The indexed
> fragments are both correlated (the existence of one strongly raises
> the chances of another) and anti-correlated (certain combinations are
> very rare).

Maybe I don't understand the problem well enough, but isn't this a
perfect match for GIN indexes? I mean, you essentially need to do
queries like "WHERE substance @@ ('A & B & !C')" etc. Which is exactly
what GIN does, because it keeps pointers to tuples for each fragment.

On the day our web site opened we were using tsearch. Before the end of the day we realized it was a bad idea, for the very reasons discussed here. The early-abort/late-start problem ("offset N limit M") could take minutes to return the requested page. With the external dynamically-optimized index, we can almost always get answers in less than a couple seconds, often in 0.1 seconds.

Craig

Re: Yet another abort-early plan disaster on 9.3

От
Tomas Vondra
Дата:
On 10.10.2014 19:59, Craig James wrote:
> On Fri, Oct 10, 2014 at 9:53 AM, Tomas Vondra <tv@fuzzy.cz
> <mailto:tv@fuzzy.cz>> wrote:
>
>
>     On 10.10.2014 16:21, Craig James wrote:
>     > Our index is for chemical structures. Chemicals are indexed on
>     > chemical fragments
>     > <http://emolecules.com/info/molecular-informatics>. A search
>     > typically starts with 50-200 indexed "columns" (chemical fragments).
>     > The query is always flat, "A and B and ... and Z". The indexed
>     > fragments are both correlated (the existence of one strongly raises
>     > the chances of another) and anti-correlated (certain combinations are
>     > very rare).
>
>     Maybe I don't understand the problem well enough, but isn't this a
>     perfect match for GIN indexes? I mean, you essentially need to do
>     queries like "WHERE substance @@ ('A & B & !C')" etc. Which is exactly
>     what GIN does, because it keeps pointers to tuples for each fragment.
>
>
> On the day our web site opened we were using tsearch. Before the end of
> the day we realized it was a bad idea, for the very reasons discussed
> here. The early-abort/late-start problem ("offset N limit M") could take
> minutes to return the requested page. With the external
> dynamically-optimized index, we can almost always get answers in less
> than a couple seconds, often in 0.1 seconds.

In the early days of tsearch, it did not support GIN indexes, and AFAIK
GiST are not nearly as fast for such queries. Also, the GIN fastscan
implemented by Alexander Korotkov in 9.4 makes a huge difference for
queries combining frequent and rare terms.

Maybe it'd be interesting to try this on 9.4. I'm not saying it will
make it faster than the optimized index, but it might be an interesting
comparison.

Tomas


Re: Yet another abort-early plan disaster on 9.3

От
Josh Berkus
Дата:
On 10/10/2014 04:16 AM, Greg Stark wrote:
> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>> can get a MUCH more accurate n_distinct estimate using multiple
>> algorithms, of which HLL is one.  While n_distinct will still have some
>> variance, it'll be over a much smaller range.
>
> I've gone looking for papers on this topic but from what I read this
> isn't so. To get any noticeable improvement you need to read 10-50% of
> the table and that's effectively the same as reading the entire table
> -- and it still had pretty poor results. All the research I could find
> went into how to analyze the whole table while using a reasonable
> amount of scratch space and how to do it incrementally.

So, right now our estimation is off on large tables by -10X to -10000X.
 First, the fact that it's *always* low is an indication we're using the
wrong algorithm.  Second, we can most certainly do better than a median
of -1000X.
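
(Easy to check on any real table; with hypothetical names:

    SELECT n_distinct FROM pg_stats
     WHERE tablename = 'big_table' AND attname = 'user_id';

    SELECT count(DISTINCT user_id) FROM big_table;

keeping in mind that a negative n_distinct means a fraction of the row
count. On big, skewed tables the estimate reliably comes out far below
reality.)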

One interesting set of algorithms is block-based sampling.  That is, you
read 5% of the physical table in random blocks, reading every row in the
block.  The block size is determined by your storage block size, so
you're not actually reading any more physically than you are logically;
it really is just 5% of the table, especially on SSD.

Then you apply algorithms which first estimate the correlation of common
values in the block (i.e. how likely is it that the table is completely
sorted?), and then estimate how many values there might be in total,
based on the correlation estimate.
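
(You can get a rough feel for that per-block clumpiness today with the
ctid hack; hypothetical names again:

    SELECT (ctid::text::point)[0]::bigint AS blkno,
           count(DISTINCT user_id) AS distinct_in_block
      FROM big_table
     GROUP BY 1
     ORDER BY 1
     LIMIT 20;

A sorted or append-ordered column shows only a value or two per block,
while a well-shuffled one shows close to the number of rows per block.)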

I no longer have my ACM membership, so I can't link this, but
researchers were able to get +/- 3X accuracy for a TPCH workload using
this approach.  A real database would be more variable, of course, but
even so we should be able to achieve +/- 50X, which would be an order of
magnitude better than we're doing now.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Yet another abort-early plan disaster on 9.3

От
Tomas Vondra
Дата:
On 15.10.2014 19:20, Josh Berkus wrote:
> On 10/10/2014 04:16 AM, Greg Stark wrote:
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>>> can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one.  While n_distinct will still have some
>>> variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50% of
>> the table and that's effectively the same as reading the entire table
>> -- and it still had pretty poor results. All the research I could find
>> went into how to analyze the whole table while using a reasonable
>> amount of scratch space and how to do it incrementally.
>
> So, right now our estimation is off on large tables by -10X to
> -10000X. First, the fact that it's *always* low is an indication
> we're using the wrong algorithm. Second, we can most certainly do
> better than a median of -1000X.

A few days ago I posted an experimental patch with the adaptive
estimator, described in [1]. Not perfect, but based on the testing I did
I believe it's a superior algorithm to the one we use now. Would be nice
to identify a few datasets where the current estimate is way off.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf


> One interesting set of algorithms is block-based sampling. That is,
> you read 5% of the physical table in random blocks, reading every row
> in the block. The block size is determined by your storage block
> size, so you're not actually reading any more physically than you are
> logically; it really is just 5% of the table, especially on SSD.
>
> Then you apply algorithms which first estimate the correlation of
> common values in the block (i.e. how likely is it that the table is
> completely sorted?), and then estimates of how many values there
> might be total based on the correlation estimate.

I think we might also use a different approach - instead of sampling the
data when ANALYZE kicks in, we might collect a requested sample of rows
on the fly. Say we want a 1% sample - whenever you insert a new row, you
do [random() < 0.01] and if it happens to be true you keep a copy of the
row aside. Then, when you need the sample, you simply read the sample
and you're done - no random access to the main table, no problems with
estimates being off due to block-level sampling, etc.
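
A minimal sketch of the insert side, with hypothetical table and
trigger names:

    CREATE TABLE test_sample (LIKE test);

    CREATE FUNCTION test_keep_sample() RETURNS trigger AS $$
    BEGIN
        -- keep roughly 1% of the inserted rows aside
        IF random() < 0.01 THEN
            INSERT INTO test_sample VALUES (NEW.*);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER test_sample_trg
        AFTER INSERT ON test
        FOR EACH ROW EXECUTE PROCEDURE test_keep_sample();

ANALYZE (or the estimator) could then read test_sample directly instead
of sampling test.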

Not sure how to track deletions/updates, though. Maybe rebuilding the
sample if the number of deletions exceeds some threshold, but that
contradicts the whole idea a bit.

> I no longer have my ACM membership, so I can't link this, but
> researchers were able to get +/- 3X accuracy for a TPCH workload
> using this approach. A real database would be more variable, of
> course, but even so we should be able to achieve +/- 50X, which
> would be an order of magnitude better than we're doing now.

If you know the title of the article, it's usually available elsewhere
on the web - either at the university site, or elsewhere. I found these
two articles about block-based sampling:


http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf

   https://www.stat.washington.edu/research/reports/1999/tr355.pdf

Maybe there are more, but most of the other links were about how Oracle
does this in 11g.

Tomas


Re: Yet another abort-early plan disaster on 9.3

От
Greg Stark
Дата:
On Wed, Oct 15, 2014 at 7:02 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> If you know the title of the article, it's usually available elsewhere
> on the web - either at the university site, or elsewhere. I found these
> two articles about block-based sampling:
>
>
> http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf

There are a series of papers with Chaudhuri as lead author which I
agree sounds like what Josh is talking about. Note that he's Microsoft
Research's database group lead and it would be a pretty safe bet
anything published from there is going to be covered by patents from
here till next Tuesday (and seventeen years beyond).

I think this is all putting the cart before the horse however. If we
could fix our current sampling to use the data more efficiently that
would be a good start before we start trying to read even more data.

We currently read just one row from each block on average instead of
using the whole block. That's what would be needed in the worst case
if the blocks were a very biased sample (which indeed they probably
are in most databases due to the way Postgres handles updates). But we
could at least give users the option to use more than one row per
block when they know it's ok (such as data that hasn't been updated)
or detect when it's ok (such as by carefully thinking about how
Postgres's storage mechanism would bias the data).

But I looked into this and ran into a problem. I think our algorithm
for calculating the most frequent values list is bunk. The more rows I
picked from each block the more biased that list was towards values
seen earlier in the table. What's worse, when that list was biased it
threw off the histogram since we exclude the most frequent values from
the histogram, so all estimates were thrown off.

If we could fix the most frequent values collection to not be biased
when it sees values in a clumpy way then I think we would be okay to
set the row sample size in Vitter's algorithm to a factor of N larger
than the block sample size where N is somewhat smaller than the
average number of rows per block. In fact even if we used all the rows
in the block I think I've convinced myself that the results would be
accurate in most circumstances.

I think to calculate the most frequent values more accurately it would
take a two-pass approach. Scan some random sample of blocks with a
counting bloom filter then do a second pass (possibly for the same
sample?) keeping counts only for values that the counting bloom filter
said hashed to the most common hash values. That might not be exactly
the most common values but should be at least a representative sample
of the most common values.

--
greg


Re: Yet another abort-early plan disaster on 9.3

От
Tomas Vondra
Дата:
On 17.10.2014 19:25, Greg Stark wrote:
> On Wed, Oct 15, 2014 at 7:02 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> If you know the title of the article, it's usually available
>> elsewhere on the web - either at the university site, or elsewhere.
>> I found these two articles about block-based sampling:
>>
>>
>> http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf
>
>>
> There are a series of papers with Chaudhuri as lead author which I
> agree sounds like what Josh is talking about. Note that he's
> Microsoft Research's database group lead and it would be a pretty
> safe bet anything published from there is going to be covered by
> patents from here till next Tuesday (and seventeen years beyond).

Hmmm. I have 0 experience with handling patents and related issues. Any
idea how to address that?

> I think this is all putting the cart before the horse however. If we
> could fix our current sampling to use the data more efficiently that
> would be a good start before we start trying to read even more data.
>
> We currently read just one row from each block on average instead of
> using the whole block. That's what would be needed in the worst case
> if the blocks were a very biased sample (which indeed they probably
> are in most databases due to the way Postgres handles updates). But
> we could at least give users the option to use more than one row per
> block when they know it's ok (such as data that hasn't been updated)
> or detect when it's ok (such as by carefully thinking about how
> Postgres's storage mechanism would bias the data).

I think this will be very tricky, and in fact it may easily make the
estimates much worse, because all the algorithms assume random sampling.

For example the ndistinct estimator uses the low-frequency values (those
that were observed only once or twice in the sample). By using multiple
rows from each block, you'll significantly influence this probability
for columns with values correlated to the block (which is quite common).

Take for example fact tables in data warehouses - those are usually
denormalized, mostly append-only. Say each row has "date_id" which is a
sequential number of a day, with 0 sometime in the past. Daily
increments are usually stored on many consecutive blocks, so on each
block there's usually a single date_id value.

By sampling all rows on a block you gain exactly nothing, and in fact it
results in observing no low-frequency values, making the estimator
absolutely useless.
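
A toy version of that layout, with hypothetical names (~100k rows per
"day", loaded in order):

    CREATE TABLE facts (date_id int, payload text);

    INSERT INTO facts
         SELECT i / 100000, md5(i::text)
           FROM generate_series(1, 10000000) i;

Sampling whole blocks of this table sees each date_id either many times
or not at all, so the estimator never observes the low-frequency counts
it relies on.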

I can imagine fixing this (although I don't see how exactly), but the
thing is we need to fix *all* the estimators we have, not just
ndistinct. And that's going to be tough.

I don't think adding a knob to tune the number of tuples sampled per
block is a good approach. Either we can solve the issues I described
(and in that case it's unnecessary), or we can't solve them and it turns
into a massive foot gun.

> But I looked into this and ran into a problem. I think our algorithm
> for calculating the most frequent values list is bunk. The more rows
> I picked from each block the more biased that list was towards
> values seen earlier in the table. What's worse, when that list was
> biased it threw off the histogram since we exclude the most frequent
> values from the histogram, so all estimates were thrown off.

I think the 'minimal' stats (when we have just '=' for the type) behaves
like this, but fixing it by switching to a two-pass approach should not
be that difficult (but would cost a few more CPU cycles).

Or do you suggest that even the scalar MCV algorithm has this bias
issue? I doubt that, because the MCV works with an array sorted by
number of occurrences, so the position within the table is irrelevant.

> If we could fix the most frequent values collection to not be biased
> when it sees values in a clumpy way then I think we would be okay to
> set the row sample size in Vitter's algorithm to a factor of N
> larger than the block sample size where N is somewhat smaller than
> the average number of rows per block. In fact even if we used all the
> rows in the block I think I've convinced myself that the results
> would be accurate in most circumstances.

I don't expect fixing the MCV to be overly difficult (although it will
need a few more CPU cycles).

But making it work with the block sampling will be much harder, because
of the bias. The phrase 'in most circumstances' doesn't sound really
convincing to me ...

> I think to calcualte the most frequent values more accurately it
> would take a two pass approach. Scan some random sample of blocks
> with a counting bloom filter then do a second pass (possibly for the
> same sample?) keeping counts only for values that the counting bloom
> filter said hashed to the most common hash values. That might not be
> exactly the most common values but should be at least a
> representative sample of the most common values.

I don't see why the counting bloom filter would be necessary, in a two
pass approach?

regards
Tomas


Re: Yet another abort-early plan disaster on 9.3

От
Greg Stark
Дата:
On Sat, Oct 18, 2014 at 6:01 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

> Hmmm. I have 0 experience with handling patents and related issues. Any
> idea how to address that?

Well there's no real way to address it. But to summarize: 1) We should
not go searching for patents; knowing that something is patented
increases your liability if you end up infringing on it. 2) You should
consult your lawyer for advice, which he can give under attorney-client
privilege without exposing you to such liability. 3) The whole patent
system is fundamentally broken and serves only to protect incumbents
from innovative competitors :(

I think realistically software patents are so vague and cover so many
basic algorithms which are obvious to anyone in the field because
they're part of the basic knowledge taught in every algorithms course.
So Postgres is probably infringing on hundreds of patents which would
all easily be declared invalid if the owners ever sought to use them
to attack widely used free software.

That doesn't mean we should be specifically seeking out
algorithms that are known to have been published in papers coming out
of major proprietary database vendors though. That just seems like
asking for trouble. We should be looking for solutions that seem
obvious to us or were published 17+ years ago or were published by
people who specifically claim they're not patented.


> I think this will be very tricky, and in fact it may make the estimates
> much worse easily, because  all the algorithms assume random sampling.

Well this is where Josh and I agree but come to different conclusions.
He wants to use more of each block so he can take larger samples in
the hopes it will produce better statistics. I want to use more of
each block so we can continue to take rigorously justified sample
sizes but without having to read such a large portion of the table.

Either way our current sampling method isn't meeting anyone's needs.
It requires you to do copious amounts of I/O which makes running
analyze after an upgrade or data load a major contributor to your
outage window.

>
> For example the ndistinct estimator

Yes, well, the ndistinct estimator just sucks. It's always going to
suck. Unless it has a chance to see every row of the table there's
absolutely no way it's ever going to not suck. Any attempt to estimate
ndistinct from a sample of any size is going to suck. That's just the
way it is.

Our sample sizes are based on the size needed to build the histogram.
That requires only a fairly small and slow-growing sample, though our
block-based sampling inflates the amount of I/O that it leads to.
ndistinct and MCV are a different kind of problem that would require a
proportional sample where the sample needs to grow linearly as the
table grows. To get a good estimate of ndistinct requires a large
enough proportion to effectively require reading the whole table. To
get decent MCV stats only requires a smaller proportion but it's still
a proportion that grows linearly which is going to be impractical for
large data.

There are two strategies for dealing with ndistinct that I can see
here. Either a) We decide that to estimate ndistinct we need to scan
the entire table and just make that a separate type of ANALYZE that
you run less frequently. Basically this is what we did with VACUUM and
indeed we could even invest in infrastructure like the FSM which
allows you to process only the changed pages so it can be done
incrementally.

Or we decide gathering periodic statistics isn't the way to handle
ndistinct and instead watch the incoming inserts and outgoing deletes
and keep the statistic up to date incrementally. That can be done
though obviously it's a bit tricky. But there's tons of published
research on updating database statistics incrementally -- which brings
us back to patents...

> I think the 'minimal' stats (when we have just '=' for the type) behaves
> like this, but fixing it by switching to a two-pass approach should not
> be that difficult (but would cost a few more CPU cycles).
>
> Or do you suggest that even the scalar MCV algorithm behaves has this
> bias issue? I doubt that, because the MCV works with an array sorted by
> number of occurences, so the position within the table is irrelevant.

Hum. I'll have to reconstruct the tests I was doing back then and see
what was going on. I never really understood what was causing it.

What I observed was that if I increased the number of rows read per
sampled block then the MCV was more and more dominated by the rows
found early in the table. This was in a column where the values were
actually uniformly distributed, so the MCV list should only have
contained values which happened, by chance, to be sampled substantially
more often than the mean frequency. They were integers though, so they
should have been covered by
the sorting approach.


> I don't see why the counting bloom filter would be necessary, in a two
> pass approach?

I guess I was imagining it was not keeping all the values in memory
at the same time. Isn't that the whole point of the lossy '='
algorithm? But now that I think of it I don't understand why that
algorithm is needed at all.

--
greg


Re: Yet another abort-early plan disaster on 9.3

От
Jeff Janes
Дата:
On Fri, Oct 10, 2014 at 10:53 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 10.10.2014 14:10, Tomas Vondra wrote:
> On 10 October 2014 at 13:16, Greg Stark wrote:
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>>> can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one.  While n_distinct will still have some
>>> variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50% of
>> the table and that's effectively the same as reading the entire table
>> -- and it still had pretty poor results. All the research I could find
>> went into how to analyze the whole table while using a reasonable
>> amount of scratch space and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some basic
> agreement on what are the goals. Naturally, we can't get a perfect
> estimator with small samples (especially when the sample size is fixed and
> not scaling with the table). But maybe we can improve the estimates
> without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1] and
> the results look really interesting, IMHO. So far I was testing it on
> synthetic datasets outside the database, but I plan to use it instead of
> our estimator, and do some more tests.

Attached is an experimental patch implementing the adaptive estimator.

It was fairly simple (although it's a bit messy). It only computes the
estimates for the "scalar" case (i.e. data types that we can sort).
Implementing this for the "minimal" case is possible, but requires a bit
more work.

It only computes the estimate and prints a WARNING with both the current
and new values; the old estimate is still what gets stored.

When I run this patch on the regression database, I get a case where the current method is exact but the adaptive one is off:

WARNING:  ndistinct estimate current=676.00 adaptive=906.00

select count(distinct stringu1) from onek;
676

It should be seeing every single row, so I don't know why the adaptive method is off.  Seems like a bug.

 
Cheers,

Jeff

Re: Yet another abort-early plan disaster on 9.3

From
Tomas Vondra
Date:
On 21.11.2014 19:38, Jeff Janes wrote:
>
> When I run this patch on the regression database, I get a case where
> the current method is exact but the adaptive one is off:
>
> WARNING:  ndistinct estimate current=676.00 adaptive=906.00
>
> select count(distinct stringu1) from onek;
> 676
>
> It should be seeing every single row, so I don't know why the
> adaptive method is off. Seems like a bug.

Thanks for noticing this. I wouldn't call it a bug, but there's clearly
room for improvement.

The estimator, as described in the original paper, does not expect the
sampling to be done "our" way (using a fixed number of rows) but assumes
it gets a fixed percentage of rows. Thus it does not expect the number
of sampled rows to get so close (or equal) to the total number of rows.

I think the only way to fix this is by checking if samplerows is close
to totalrows, and use a straightforward estimate in that case (instead
of a more sophisticated one). Something along these lines:

    if (samplerows >= 0.95 * totalrows)
        stats->stadistinct = (d + d/0.95) / 2;

which means "if we sampled >= 95% of the table, use the number of
observed distinct values directly".

I have modified the estimator to do the adaptive estimation, and then do
this correction too (and print the values). And with that in place I get
these results:

  WARNING:  ndistinct estimate current=676.00 adaptive=996.00
  WARNING:  corrected ndistinct estimate current=676.00 adaptive=693.79

So it gets fairly close to the original estimate (and exact value).
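
(For the record, plugging in the 676 observed distinct values:
(676 + 676/0.95) / 2 = (676 + 711.58) / 2 ≈ 693.79, which is where the
corrected value above comes from.)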

In the end, this check should be performed before calling the adaptive
estimator at all (i.e. the adaptive estimator should be skipped entirely
when we have sampled most of the rows).

I also discovered an actual bug in the optimize_estimate() function,
which was using 'f_max' instead of the number of sampled rows.

Attached is a patch fixing the bug, and implementing the sample size check.

regards
Tomas

Attachments

Re: Yet another abort-early plan disaster on 9.3

From
Simon Riggs
Date:
On 30 September 2014 at 05:53, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 29 September 2014 16:00, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> The problem, as I see it, is different. We assume that if there are
>>> 100 distinct values and you use LIMIT 1 that you would only need to
>>> scan 1% of rows. We assume that the data is arranged in the table in a
>>> very homogenous layout. When data is not, and it seldom is, we get
>>> problems.
>>
>> Hm, good point -- 'data proximity'.  At least in theory, can't this be
>> measured and quantified?  For example, given a number of distinct
>> values, you could estimate the % of pages read (or maybe non
>> sequential seeks relative to the number of pages) you'd need to read
>> all instances of a particular value in the average (or perhaps the
>> worst) case.   One way of trying to calculate that would be to look at
>> proximity of values in sampled pages (and maybe a penalty assigned for
>> high update activity relative to table size).  Data proximity would
>> then become a cost coefficient to the benefits of LIMIT.
>
> The necessary first step to this is to realise that we can't simply
> apply the LIMIT as a reduction in query cost, in all cases.
>
> The way I'm seeing it, you can't assume the LIMIT will apply to any
> IndexScan that doesn't have an index condition. If it has just a
> filter, or nothing at all, just an ordering then it could easily scan
> the whole index if the stats are wrong.
>
> So plans like this could be wrong, by assuming the scan will end
> earlier because of the LIMIT than it actually will.
>
> Limit
>   IndexScan (no index cond)
>
> Limit
>   NestJoin
>     IndexScan (no index cond)
>     SomeScan
>
> Limit
>   NestJoin
>     NestJoin
>       IndexScan (no index cond)
>       SomeScan
>    SomeScan
>
> and deeper...
>
> I'm looking for a way to identify and exclude such plans, assuming
> that this captures at least some of the problem plans.

After looking at this for some time I now have a patch that solves this.

It relies on the observation that index scans with no bounded quals
don't play nicely with LIMIT. The solution relies upon the point that
LIMIT does not reduce the startup cost of plans, only the total cost.
So we can solve the problem by keeping the total cost estimate, just
move some of that into startup cost so LIMIT does not reduce costs as
much as before.
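
To illustrate with made-up numbers (and assuming the usual linear
interpolation for LIMIT, roughly startup + (total - startup) *
wanted/expected rows): an ordered scan costed at startup 0 and total
1,000,000, expected to return 50,000 rows, appears to cost only
1,000,000 * 10/50,000 = 200 once LIMIT 10 is applied. Shift half of
the total into startup cost and the same arithmetic gives 500,000 +
500,000 * 10/50,000 = 500,100, so the risky ordered plan no longer
looks artificially cheap.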

It's a simple patch, but it solves the test cases I know about and
does almost nothing to planning time.

I tried much less subtle approaches involving direct prevention of
LIMIT pushdown but the code was much too complex for my liking.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Yet another abort-early plan disaster on 9.3

From
Merlin Moncure
Date:
On Fri, Dec 5, 2014 at 12:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 30 September 2014 at 05:53, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 29 September 2014 16:00, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> The problem, as I see it, is different. We assume that if there are
>>>> 100 distinct values and you use LIMIT 1 that you would only need to
>>>> scan 1% of rows. We assume that the data is arranged in the table in a
>>>> very homogenous layout. When data is not, and it seldom is, we get
>>>> problems.
>>>
>>> Hm, good point -- 'data proximity'.  At least in theory, can't this be
>>> measured and quantified?  For example, given a number of distinct
>>> values, you could estimate the % of pages read (or maybe non
>>> sequential seeks relative to the number of pages) you'd need to read
>>> all instances of a particular value in the average (or perhaps the
>>> worst) case.   One way of trying to calculate that would be to look at
>>> proximity of values in sampled pages (and maybe a penalty assigned for
>>> high update activity relative to table size).  Data proximity would
>>> then become a cost coefficient to the benefits of LIMIT.
>>
>> The necessary first step to this is to realise that we can't simply
>> apply the LIMIT as a reduction in query cost, in all cases.
>>
>> The way I'm seeing it, you can't assume the LIMIT will apply to any
>> IndexScan that doesn't have an index condition. If it has just a
>> filter, or nothing at all, just an ordering then it could easily scan
>> the whole index if the stats are wrong.
>>
>> So plans like this could be wrong, by assuming the scan will end
>> earlier because of the LIMIT than it actually will.
>>
>> Limit
>>   IndexScan (no index cond)
>>
>> Limit
>>   NestJoin
>>     IndexScan (no index cond)
>>     SomeScan
>>
>> Limit
>>   NestJoin
>>     NestJoin
>>       IndexScan (no index cond)
>>       SomeScan
>>    SomeScan
>>
>> and deeper...
>>
>> I'm looking for a way to identify and exclude such plans, assuming
>> that this captures at least some of the problem plans.
>
> After looking at this for some time I now have a patch that solves this.
>
> It relies on the observation that index scans with no bounded quals
> don't play nicely with LIMIT. The solution relies upon the point that
> LIMIT does not reduce the startup cost of plans, only the total cost.
> So we can solve the problem by keeping the total cost estimate, just
> move some of that into startup cost so LIMIT does not reduce costs as
> much as before.
>
> It's a simple patch, but it solves the test cases I know about and
> does almost nothing to planning time.
>
> I tried much less subtle approaches involving direct prevention of
> LIMIT pushdown but the code was much too complex for my liking.

Neat -- got any test cases (would this have prevented OP's problem)?

merlin


Re: Yet another abort-early plan disaster on 9.3

From
Simon Riggs
Date:
On 6 December 2014 at 00:45, Merlin Moncure <mmoncure@gmail.com> wrote:

> Neat -- got any test cases (would this have prevented OP's problem)?

No test case was posted, so I am unable to confirm.

A test case I produced that appears to be the same issue is fixed.

I await confirmation from the OP.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Yet another abort-early plan disaster on 9.3

From
Josh Berkus
Date:
On 12/05/2014 08:04 AM, Simon Riggs wrote:
> On 6 December 2014 at 00:45, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> Neat -- got any test cases (would this have prevented OP's problem)?
>
> No test case was posted, so I am unable to confirm.
>
> A test case I produced that appears to be the same issue is fixed.
>
> I await confirmation from the OP.
>

So that's proprietary/confidential data.  However, the company involved
has a large testbed and I could test their data using a patched version
of Postgres.  In the 3 months since, their data distribution has
drifted, so I'll need to do some work to recreate the original bad plan
circumstances.  I'll keep you posted on how the patch works for that setup.

It would be great to come up with a generic/public test for a bad
abort-early situation.  Ideas?
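
One shape I can think of (a sketch only, with hypothetical names): a
"recent calls" table where one user's few rows all sit at the far end
of the ordering, so the abort-early scan has to walk nearly the whole
index:

CREATE TABLE calls (user_id int, called_at timestamptz);

-- busy users, spread across the whole time range
INSERT INTO calls
  SELECT (g % 100000) + 1, now() - g * interval '1 second'
    FROM generate_series(1, 5000000) g;

-- one rare user whose 9 calls all sit at the old end of the ordering
INSERT INTO calls
  SELECT 999999, timestamptz '2001-01-01' + g * interval '1 day'
    FROM generate_series(1, 9) g;

CREATE INDEX ON calls (called_at);
CREATE INDEX ON calls (user_id);
ANALYZE calls;

-- the planner is tempted to walk the called_at index backwards and
-- abort early; this user never satisfies the limit, so the scan
-- visits the whole index
EXPLAIN ANALYZE
SELECT * FROM calls
 WHERE user_id = 999999
 ORDER BY called_at DESC
 LIMIT 10;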

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Yet another abort-early plan disaster on 9.3

From
Simon Riggs
Date:
On 10 December 2014 at 10:46, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/05/2014 08:04 AM, Simon Riggs wrote:
>> On 6 December 2014 at 00:45, Merlin Moncure <mmoncure@gmail.com> wrote:
>>
>>> Neat -- got any test cases (would this have prevented OP's problem)?
>>
>> No test case was posted, so I am unable to confirm.
>>
>> A test case I produced that appears to be the same issue is fixed.
>>
>> I await confirmation from the OP.
>>
>
> So that's proprietary/confidential data.  However, the company involved
> has a large testbed and I could test their data using a patched version
> of Postgres.   In 3 months their data distribution has drifted, so I'll
> need to do some work to recreate the original bad plan circumstances.
> I'll keep you posted on how the patch works for that setup.
>
> It would be great to come up with a generic/public test for a bad
> abort-early situation.  Ideas?

If you could contribute that, it would be welcome.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Yet another abort-early plan disaster on 9.3

From
Simon Riggs
Date:
On 30 September 2014 at 10:25, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 30 September 2014 00:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> The existing cost estimation
>> code effectively assumes that they're perfectly uniformly distributed;
>> which is a good average-case assumption but can be horribly wrong in
>> the worst case.
>
> Agreed. This is the main observation from which we can work.
>
>> If we could settle on some other model for the probable distribution
>> of the matching tuples, we could adjust the cost estimates for LIMIT
>> accordingly.  I have not enough statistics background to know what a
>> realistic alternative would be.
>
> I'm not sure that the correlation alone is sufficient to be able to do
> that. We'd need to estimate where the values being looked for are likely
> to be wrt other values, then increase the estimate accordingly. That
> sounds like a lot of pushups grovelling through quals and comparing
> against stats. So my thinking is actually to rule that out, unless
> you've some ideas for how to do that?
>
>> Another possibility is to still assume a uniform distribution but estimate
>> for, say, a 90% probability instead of 50% probability that we'll find
>> enough tuples after scanning X amount of the table.  Again, I'm not too
>> sure what that translates to in terms of the actual math, but it sounds
>> like something a statistics person could do in their sleep.


The problem is one of risk. Whatever distribution we use, it will be
wrong in some cases and good in others.

For example, if we look at "10 Most Recent Calls" for a user, then
frequent users would have one distribution, infrequent users another.
So we have multiple distributions in the same data. We just can't hold
enough information to make sense of this.

Think about how much data needs to be scanned if the user has only made
9 calls: a LIMIT 10 scan down the recency index can never satisfy the
limit, so it ends up walking the entire index.

What I've done in the past is to rewrite the query in different ways
to force different plans, then call each plan depending upon the user
characteristics. This can also be done with hints, in a more ignorant
way.
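
For example, two rewrites of that kind (sketches only, assuming a
calls(user_id, called_at) table as in the "recent calls" example
above):

-- hide the ORDER BY column behind an expression, so the recency
-- index no longer provides the ordering and the filter runs first
SELECT * FROM calls
 WHERE user_id = 999999
 ORDER BY called_at + interval '0' DESC
 LIMIT 10;

-- or wrap the filter in a CTE, which on these releases acts as an
-- optimization fence
WITH c AS (
    SELECT * FROM calls WHERE user_id = 999999
)
SELECT * FROM c ORDER BY called_at DESC LIMIT 10;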


>> I do not think we should estimate for the worst case though.  If we do,
>> we'll hear cries of anguish from a lot of people, including many of the
>> same ones complaining now, because the planner stopped picking fast-start
>> plans even for cases where they are orders of magnitude faster than the
>> alternatives.
>
> Fast start plans still make sense when performing an IndexScan with no
> filter conditions. Those types of plan should not be changed from
> current costing - they are accurate, good and very important because
> of their frequency in real workloads.
>
> What I think we are seeing is Ordered plans being selected too often
> in preference to Sorted plans when we make selectivity or stats
> errors. As well as data distributions that aren't correctly described
> by the statistics causing much longer execution times.
>
> Here are some plan selection strategies
>
> * Cost based - attempt to exactly calculate the cost based upon
> existing stats - increase the complexity of cost calc to cover other
> aspects. Even if we do that, these may not be that helpful in covering
> the cases where the stats turn out to be wrong.
>
> * Risk based - A risk adjusted viewpoint would be that we should treat
> the cost as mid-way between the best and the worst. The worst is
> clearly scanning (100% - N) of the tuples, the best is just N tuples.
> So we should be costing scans with excess filter conditions as a (100%
> Scan)/2, no matter the conditions, based purely upon risk.
>
> * Simplified heuristic - deselect ordered plans when they are driven
> from scans without quals or indexscans with filters, since the risk
> adjusted cost is likely to be higher than the sorted cost. Inspecting
> the plan tree for this could be quite costly, so would only be done
> when the total cost is $high, prior to it being adjusted by LIMIT.
>
>
> In terms of practical steps... I suggest the following:
>
> * Implement enable_orderedscan = on (default) | off. A switch to allow
> plans to de-select ordered plans, so we can more easily see the
> effects of such plans in the wild.
>
> * Code heuristic approach - I can see where to add my heuristic in the
> grouping planner. So we just need to do a left? deep search of the
> plan tree looking for scans of the appropriate type and bail out if we
> find one.

After looking at this for some time I now have a patch that solves this.

It relies on the observation that index scans with no bounded quals
don't play nicely with LIMIT. The solution relies upon the point that
LIMIT does not reduce the startup cost of plans, only the total cost.
So we can solve the problem by keeping the total cost estimate, just
move some of that into startup cost so LIMIT does not reduce costs as
much as before.

It's a simple patch, but it solves the test cases I know about and
does almost nothing to planning time.

I tried much less subtle approaches involving direct prevention of
LIMIT pushdown but the code was much too complex for my liking.

- - -

The only other practical way to do this would be to have a
LimitPlanDisaster node

LimitPlanDisaster
-> PreSortedPath
-> CheapestPath

The PlanDisaster node would read the PreSortedPath up to cost limit C.
After we reach the limit we switch to the CheapestPath and execute that
instead for the remainder of the Limit.

Or we could use time limits, though it's harder to make that make sense.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Yet another abort-early plan disaster on 9.3

From
Simon Riggs
Date:
On 12 December 2014 at 03:22, Simon Riggs <simon@2ndquadrant.com> wrote:

> It's a simple patch, but it solves the test cases I know about and
> does almost nothing to planning time.

Test cases attached. The files marked "pettus_*" are written up from
Christophe Pettus' blog.
The other test case is one of my own devising, based upon recent
customer problems.

The "10 most recent calls" is a restatement of actual problems seen in the past.


Also attached is a new parameter called enable_sortedpath which can be
used to turn on/off the sorted path generated by the planner.
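
With the patch applied, usage would presumably look like any other
planner GUC (sketch, reusing the hypothetical calls table from the
"recent calls" example):

SET enable_sortedpath = off;
EXPLAIN ANALYZE
SELECT * FROM calls
 WHERE user_id = 999999
 ORDER BY called_at DESC
 LIMIT 10;
RESET enable_sortedpath;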

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Yet another abort-early plan disaster on 9.3

From
Simon Riggs
Date:
On 12 December 2014 at 03:31, Simon Riggs <simon@2ndquadrant.com> wrote:

> Also attached is a new parameter called enable_sortedpath which can be
> used to turn on/off the sorted path generated by the planner.

Now with attachment. (Thanks Jeff!)

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Yet another abort-early plan disaster on 9.3

From
Michael Paquier
Date:


On Wed, Dec 17, 2014 at 4:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 12 December 2014 at 03:31, Simon Riggs <simon@2ndquadrant.com> wrote:

> Also attached is a new parameter called enable_sortedpath which can be
> used to turn on/off the sorted path generated by the planner.

Now with attachment. (Thanks Jeff!)

Moved this patch to CF 2015-02 because it did not receive any reviews. Better to not lose track of it.
--
Michael

Re: Yet another abort-early plan disaster on 9.3

From
Evgeniy Shishkin
Date:
Sorry for disrupting the thread,

I am wondering: would it be possible to use BRIN indexes to better
estimate the distribution?

I mean, create both a btree index and a BRIN index, probe the BRIN
index during planning, and estimate whether the abort-early plan using
the btree will be better.
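
The physical setup being suggested might look like this (sketch only,
on the hypothetical calls table; whether the planner could cheaply
probe the BRIN ranges at plan time is the open question):

CREATE INDEX calls_called_at_btree ON calls USING btree (called_at);
CREATE INDEX calls_called_at_brin  ON calls USING brin  (called_at);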