Discussion: PoC: using sampling to estimate joins / complex conditions

PoC: using sampling to estimate joins / complex conditions

From:
Tomas Vondra
Date:
Hi,

estimating joins is one of the significant gaps related to extended
statistics, and I've been regularly asked about what we might do about
that. This is an early experimental patch that I think might help us
with improving this, possibly even in PG15.

Note: I do not claim this is exactly how it should be implemented, but
it's probably sufficient to demonstrate the pros/cons of various
alternative approaches, etc.

In short, the patch samples the tables and uses those samples to
estimate selectivity for scans and joins. The samples are collected
during planning, which may be quite expensive - random I/O for each
query, etc. It'd be possible to build them during analyze, but that'd
require solving serialization, tweaking CREATE STATISTICS to handle
join queries, etc. I decided to keep the PoC simple.

It still uses CREATE STATISTICS with a new "sample" kind, instructing
the optimizer to use sampling when estimating clauses on the attributes.

A little example demonstrating what the patch does:

  create table t (a int, b int, c int);

  insert into t select mod(i,10), mod(i,20), mod(i,40)
    from generate_series(1,10000000) s(i);

  analyze t;

  -- estimate without any statistics / sampling
  explain analyze select * from t where a = 0 and b = 0 and c = 0;

                             QUERY PLAN
  -------------------------------------------------------------------
   Seq Scan on t  (cost=0.00..229055.00 rows=1361 width=12)
                  (actual time=0.025..761.571 rows=250000 loops=1)
     Filter: ((a = 0) AND (b = 0) AND (c = 0))
     Rows Removed by Filter: 9750000
   Planning Time: 0.471 ms
   Execution Time: 901.182 ms
  (5 rows)

  -- enable sampling on those columns
  create statistics s (sample) on a, b, c from t;

  explain analyze select * from t where a = 0 and b = 0 and c = 0;

                             QUERY PLAN
  -------------------------------------------------------------------
   Seq Scan on t  (cost=0.00..229055.00 rows=250390 width=12)
                  (actual time=0.307..717.937 rows=250000 loops=1)
     Filter: ((a = 0) AND (b = 0) AND (c = 0))
     Rows Removed by Filter: 9750000
   Planning Time: 194.528 ms
   Execution Time: 851.832 ms
  (5 rows)

Of course, in this case an MCV would work well too, because there are
very few combinations in (a,b,c) - but a sample works even when that's
not the case, and it has various other benefits (it can estimate almost
any expression, while an MCV supports only a subset, etc.).

Now, let's look at a join between a fact and a dimension table:

  create table f (d1 int, d2 int, f1 int, f2 int, f3 int);

  create table d (d1 int, d2 int, d3 int, d4 int, d5 int,
                  primary key (d1, d2));

  insert into d select i, i, mod(i,100), mod(i,100), mod(i,100)
    from generate_series(0,999) s(i);

  insert into f select mod(i,1000), mod(i,1000), mod(i,100), mod(i,100),
                       mod(i,100) from generate_series(1,1000000) s(i);

  analyze f, d;

  explain analyze select * from f join d using (d1,d2)
    where f1 < 50 and f2 < 50 and d3 < 50 and d4 < 50;

                                QUERY PLAN
  ----------------------------------------------------------------------
   Hash Join  (cost=25.75..22717.01 rows=63 width=32)
              (actual time=3.197..861.899 rows=500000 loops=1)
     Hash Cond: ((f.d1 = d.d1) AND (f.d2 = d.d2))
     ->  Seq Scan on f  (cost=0.00..21370.00 rows=251669 width=20)
                        (actual time=0.033..315.401 rows=500000 loops=1)
           Filter: ((f1 < 50) AND (f2 < 50))
           Rows Removed by Filter: 500000
     ->  Hash  (cost=22.00..22.00 rows=250 width=20)
               (actual time=3.139..3.141 rows=500 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 34kB
           ->  Seq Scan on d  (cost=0.00..22.00 rows=250 width=20)
                        (actual time=0.018..1.706 rows=500 loops=1)
                 Filter: ((d3 < 50) AND (d4 < 50))
                 Rows Removed by Filter: 500
   Planning Time: 0.806 ms
   Execution Time: 1099.229 ms
  (12 rows)

So, not great - an underestimate of roughly 10000x is likely to lead to
inefficient plans. And now with the samples enabled on both sides:

  create statistics s1 (sample) on d1, d2, f1, f2, f3 from f;
  create statistics s2 (sample) on d1, d2, d3, d4, d5 from d;

  explain analyze select * from f join d using (d1,d2)
    where f1 < 50 and f2 < 50 and d3 < 50 and d4 < 50;

                                QUERY PLAN
  ----------------------------------------------------------------------
   Hash Join  (cost=29.50..24057.25 rows=503170 width=32)
              (actual time=0.630..837.483 rows=500000 loops=1)
     Hash Cond: ((f.d1 = d.d1) AND (f.d2 = d.d2))
     ->  Seq Scan on f  (cost=0.00..21370.00 rows=503879 width=20)
                        (actual time=0.008..301.584 rows=500000 loops=1)
           Filter: ((f1 < 50) AND (f2 < 50))
           Rows Removed by Filter: 500000
     ->  Hash  (cost=22.00..22.00 rows=500 width=20)
               (actual time=0.616..0.618 rows=500 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 34kB
           ->  Seq Scan on d  (cost=0.00..22.00 rows=500 width=20)
                             (actual time=0.004..0.321 rows=500 loops=1)
                 Filter: ((d3 < 50) AND (d4 < 50))
                 Rows Removed by Filter: 500
   Planning Time: 603.442 ms
   Execution Time: 1071.735 ms
  (12 rows)

Yes, it takes 600ms to do the sampling, but I'm sure most of this can be
eliminated by optimizing the code and/or storing the samples just like
other types of stats.

Note that most of the roughly 10000x underestimate is not due to poor
estimates at the scan level, but due to the join condition having two
correlated clauses. Yes, adding a proper foreign key would probably
improve this (we already leverage this information in planning), but
there can be cross-table correlations between the other conditions, and
the FK can't help with that. Correlations between different dimension
tables are quite common, and sampling can help with those.

Note: There's another PoC patch using multi-column MCVs to improve join
estimates - that has the same limitations as MCVs for scans. It works
quite well (only) when the MCV represents a large part of the data, and
it does not support evaluating arbitrary expressions.


Now, a little bit about the implementation, sampling limitations etc.

At the scan level, sampling is fairly straightforward - the patch simply
runs a TABLESAMPLE query through SPI, with a sample fraction calculated
from a GUC (estimate_sample_rate, 1% by default) and the statistics
target. The samples may be too large and the calculation may need some
changes, but that's a minor detail I think. Not sure SPI is the right
way to do this, but for a PoC it's good enough.
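
To give an idea, the query looks roughly like this (just a sketch - the
sampling method and column list shown here are assumptions, and the
actual fraction is derived from estimate_sample_rate and the statistics
target):

  -- hypothetical example of the scan-level sampling query, reusing the
  -- earlier table t; the 1.0 (percent) fraction is only illustrative
  SELECT a, b, c FROM t TABLESAMPLE BERNOULLI (1.0);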

For joins, sampling is way more complicated - we can't sample both
tables randomly, because that'd require huge samples on both sides. As
shown in [3], getting a sample of n rows from a join with a table having
N rows requires sampling about sqrt(n * N) rows from that table - e.g. a
30000-row join sample against a table with 10M rows would need roughly
sqrt(30000 * 10000000) ~= 550k sampled rows. Which is a lot.

So what this patch attempts to do is "correlated sampling", described in
[1] and [3]. Imagine a join on a foreign key, as in the example query.
(The patch only looks for a PK, for simplicity.)

This is a pretty common pattern, especially in star and snowflake
queries, which join a "fact" table to one or more "dimension" tables.

The "correlated" sampling means the "fact" table (side of the join
without the PK) is sampled randomly, but the dimensions are simply
scanned for matching rows. The PK means there can only be one matching
row for each sample one, so we're "enriching" the random sample.
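
In terms of the earlier f/d example, it's as if for each row in the
random sample of f we did a PK lookup against the dimension (just a
sketch of the idea, not the exact query the patch builds):

  -- hypothetical per-sample-row lookup; $1/$2 are the d1/d2 values
  -- taken from the sampled fact row
  SELECT d.* FROM d WHERE d.d1 = $1 AND d.d2 = $2;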

This is what [1] describes as CS2, including how to extend the approach
to joins without the PK/FK requirement and various corner cases, and [3]
improves that to leverage indexes. [4] discusses various CS2 variations
addressing various problems - reducing space requirements, etc.

The current PoC patch is however very simplistic and naive - for example
it does not attempt to correlate joins with multiple dimensions, so when
joining F with D1 and then D2, we sample (F,D1) and then (F,D2)
independently. This means we sample F twice, which can be quite
expensive, and it also misses correlations between D1 and D2 (which are
common in actual data sets).

There are various other efficiency issues, because the joins go through
calc_joinrel_size_estimate and compute_semi_anti_join_factors, and each
place does the sampling again. The samples should probably be cached
somewhere and reused.

I'm sure there are plenty of open questions, some of which are mentioned
in the many XXX comments added to the patch.

FWIW the patch does have some issues with expressions, so joins on
complex expressions (e.g. ON ((a+b) = (c+d))) do not work properly. That
shouldn't be a big deal for a PoC, I think.


regards


[1] CS2: A new database synopsis for query estimation
https://www.researchgate.net/publication/262350868_CS2_A_new_database_synopsis_for_query_estimation

[2] Join Size Estimation Subject to Filter Conditions
https://www.semanticscholar.org/paper/Join-Size-Estimation-Subject-to-Filter-Conditions-Vengerov-Menck/c8bd4caf0fc9c8a4fbffc7e05416901d4fd7a41b

[3] Cardinality Estimation Done Right: Index-Based Join Sampling
https://www.semanticscholar.org/paper/Cardinality-Estimation-Done-Right%3A-Index-Based-Join-Leis-Radke/15f211eaafc6ce421a511a413613e1d2683879d2

[4] Improved Correlated Sampling for Join Size Estimation
https://www.comp.nus.edu.sg/~taining/estimation/report.pdf

[5] A Survey on Advancing the DBMS Query Optimizer: Cardinality
Estimation, Cost Model, and Plan Enumeration
https://arxiv.org/abs/2101.01507

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: PoC: using sampling to estimate joins / complex conditions

From:
Michael Paquier
Date:
On Sun, Jun 27, 2021 at 07:55:24PM +0200, Tomas Vondra wrote:
> estimating joins is one of the significant gaps related to extended
> statistics, and I've been regularly asked about what we might do about
> that. This is an early experimental patch that I think might help us
> with improving this, possibly even in PG15.

The patch does not apply, so a rebase would be in order.  I have
switched it to waiting on author for now, moving it to the next CF.
--
Michael

Attachments

Re: PoC: using sampling to estimate joins / complex conditions

From:
Andres Freund
Date:
Hi,

On 2021-06-27 19:55:24 +0200, Tomas Vondra wrote:
> estimating joins is one of the significant gaps related to extended
> statistics, and I've been regularly asked about what we might do about
> that. This is an early experimental patch that I think might help us
> with improving this, possibly even in PG15.

The tests in this patch fail:
https://cirrus-ci.com/task/5304621299138560
https://api.cirrus-ci.com/v1/artifact/task/5304621299138560/regress_diffs/src/test/regress/regression.diffs

Looks like the regression test output hasn't been updated?

Greetings,

Andres Freund



Re: PoC: using sampling to estimate joins / complex conditions

From:
Tomas Vondra
Date:
On 1/5/22 00:58, Andres Freund wrote:
> Hi,
> 
> On 2021-06-27 19:55:24 +0200, Tomas Vondra wrote:
>> estimating joins is one of the significant gaps related to extended
>> statistics, and I've been regularly asked about what we might do about
>> that. This is an early experimental patch that I think might help us
>> with improving this, possibly even in PG15.
> 
> The tests in this patch fail:
> https://cirrus-ci.com/task/5304621299138560
> https://api.cirrus-ci.com/v1/artifact/task/5304621299138560/regress_diffs/src/test/regress/regression.diffs
> 
> Looks like the regression test output hasn't been updated?
> 

Yeah, I haven't updated some of the test output because some of those
changes are a bit wrong (and I think that's fine for a PoC patch). I
should have mentioned that in the message, though. Sorry about that.

There are three types of failures:


1) Changes to deparsed statistics definition in \d command:

-    "public.ctlt_all_a_b_stat" ON a, b FROM ctlt_all
+    "public.ctlt_all_a_b_stat" (ndistinct, dependencies, mcv) ON a, b
FROM ctlt_all

This happens because there's a new kind "sample", but it's not set by
default when creating new statistics, so the deparsing logic decides it
has to list the kinds explicitly. I've fixed this in the attached patch,
but it was mostly harmless and I'm not sure this is how "sample" should
behave.


2) Three GUC parameters that allow enabling/disabling sampling for
different parts of a query (scan, join, correlated join sampling). I
still consider those GUCs temporary, for experiments, but I've added
them to the expected output.


3) Changes in estimates for OR conditions - a couple of estimates get
less accurate, because OR clauses are handled as a single clause the
first time we pass them to statext_clauselist_selectivity(), so we
combine the results incorrectly. OR clauses may need some changes,
because they're causing issues in other patches too (e.g. in the "Var
op Var" one).
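
Just to illustrate the shape of clause I mean (a made-up example reusing
the table t from the first message, not one of the failing regression
tests):

  -- an OR clause of the kind that statext_clauselist_selectivity()
  -- first sees as a single clause
  explain select * from t where (a = 0 and b = 0) or c = 0;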


I haven't done anything about (3) yet - it's a valid issue and needs to
be fixed (either by changing how we handle OR clauses, or maybe handling
samples and MCVs at the same time). Or maybe some other way. In any
case, there is more important stuff that needs fixing first.

The main issue is planning overhead - for the example in the first
message, with a simple query joining two tables, you'll see this:

   Planning Time: 603.442 ms
   Execution Time: 1071.735 ms

There are a couple of reasons why it takes this long:


1) We sample the same relation repeatedly - once as a "scan" and then
while estimating joins. And for this query we do that in three different
contexts:

- set_joinrel_size_estimates
- populate_joinrel_with_paths (twice)

I guess we'll get a different number of samples for different queries,
but it'll get worse for queries joining more tables etc. It seems fairly
simple to cache the samples - for example in StatisticExtInfo (or maybe
somewhere else, to keep just one sample per relation, not per RTE).

Unfortunately, this only works for "independent" samples, not for
correlated ones (which are just FK lookups driven by another sample, so
they depend on what that other sample is). Which is a bummer, because
correlated samples are the more expensive ones :-(


2) The correlated samples are currently built using a query, executed
through SPI in a loop. So given a "driving" sample of 30k rows, we do
30k lookups - that'll take time, even if we do that just once and cache
the results.

I'm sure there's room for some improvement, though - for example
we don't need to fetch all columns included in the statistics object,
but just stuff referenced by the clauses we're estimating. That could
improve the chance of using IOS etc.
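
For example (hypothetical - the covering index below is not something
the patch creates), if the estimated clauses only reference d3, the
per-row lookup could shrink to just that column and, with a suitable
index, use an index-only scan:

  -- assumed covering index on the dimension's PK columns plus d3
  CREATE INDEX d_d1_d2_incl_d3 ON d (d1, d2) INCLUDE (d3);

  SELECT d3 FROM d WHERE d1 = $1 AND d2 = $2;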

I wonder if there's a more efficient way to do this, in a batched manner
or something ... But even then it'll still be quite expensive.


The only other alternative I can think of is collecting samples during
ANALYZE, and storing them somewhere. That'll be difficult for correlated
samples, particularly for transitive cases (in a snowflake schema). But
I believe it's doable (at least for cases covered by FK constraints).

But I'm not sure where/how to store the samples. An obvious option would
be to serialize them into pg_statistic_ext_data, and I'll try doing that
(it's a bit like a huge MCV without any cutoff). But maybe there's a
better way that would not require constant serialization/deserialization
of many tuples.


Any ideas about these options?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: PoC: using sampling to estimate joins / complex conditions

From:
Andres Freund
Date:
Hi,

On 2022-01-21 01:06:37 +0100, Tomas Vondra wrote:
> Yeah, I haven't updated some of the test output because some of those
> changes are a bit wrong (and I think that's fine for a PoC patch). I
> should have mentioned that in the message, though. Sorry about that.

Given that the patch hasn't been updated since January and that it's a PoC in
the final CF, it seems like it should at least be moved to the next CF? Or
perhaps returned?

I've just marked it as waiting-on-author for now - iirc that leads to fewer
reruns by cfbot once it's failing...


> 2) The correlated samples are currently built using a query, executed
> through SPI in a loop. So given a "driving" sample of 30k rows, we do
> 30k lookups - that'll take time, even if we do that just once and cache
> the results.

Ugh, yea, that's going to increase overhead by at least a few factors.


> I'm sure there's room for some improvement, though - for example
> we don't need to fetch all columns included in the statistics object,
> but just stuff referenced by the clauses we're estimating. That could
> improve the chance of using IOS etc.

Yea. Even just avoiding SPI / planner + executor seems likely to be a
big win.


It seems like one more of those cases where we really need logic to
recognize "cheap" vs "expensive" plans, so that we only do sampling when
useful. I don't think that's solved just by having a declarative syntax.


Greetings,

Andres Freund



Re: PoC: using sampling to estimate joins / complex conditions

From:
Tomas Vondra
Date:

On 3/22/22 00:35, Andres Freund wrote:
> Hi,
> 
> On 2022-01-21 01:06:37 +0100, Tomas Vondra wrote:
>> Yeah, I haven't updated some of the test output because some of those
>> changes are a bit wrong (and I think that's fine for a PoC patch). I
>> should have mentioned that in the message, though. Sorry about that.
> 
> Given that the patch hasn't been updated since January and that it's a PoC in
> the final CF, it seems like it should at least be moved to the next CF? Or
> perhaps returned?
> 
> I've just marked it as waiting-on-author for now - iirc that leads to fewer
> reruns by cfbot once it's failing...
> 

Either option works for me.

> 
>> 2) The correlated samples are currently built using a query, executed
>> through SPI in a loop. So given a "driving" sample of 30k rows, we do
>> 30k lookups - that'll take time, even if we do that just once and cache
>> the results.
> 
> Ugh, yea, that's going to increase overhead by at least a few factors.
> 
> 
>> I'm sure there's room for some improvement, though - for example
>> we don't need to fetch all columns included in the statistics object,
>> but just stuff referenced by the clauses we're estimating. That could
>> improve the chance of using IOS etc.
> 
> Yea. Even just avoiding SPI / planner + executor seems likely to be a
> big win.
> 
> 
> It seems like one more of those cases where we really need logic to
> recognize "cheap" vs "expensive" plans, so that we only do sampling when
> useful. I don't think that's solved just by having a declarative syntax.
> 

Right.

I was thinking about walking the first table, collecting all the values,
and then doing a single IN () query for the second table - a bit like a
custom join (which seems a bit terrifying, TBH).
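
Roughly something like this, with the key values collected from the
sampled rows of f in the earlier example (just a sketch):

  -- hypothetical batched lookup replacing the per-row SPI queries
  SELECT d.* FROM d WHERE (d.d1, d.d2) IN ((1, 1), (17, 17), (42, 42));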

But even if we manage to make this much cheaper, there will still be
simple queries where it's going to be prohibitively expensive.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company