Discussion: PoC: using sampling to estimate joins / complex conditions
Hi,

estimating joins is one of the significant gaps related to extended
statistics, and I've been regularly asked about what we might do about
that. This is an early experimental patch that I think might help us
with improving this, possibly even in PG15.

Note: I do not claim this is exactly how it should be implemented, but
it's probably sufficient to demonstrate the pros/cons of various
alternative approaches, etc.

In short, the patch samples the tables and uses those samples to
estimate selectivity for scans and joins. The samples are collected
during planning, which may be quite expensive - random I/O for each
query, etc. It'd be possible to build them during ANALYZE, but that'd
require solving serialization, tweaking CREATE STATISTICS to handle
join queries, etc. I decided to keep the PoC simple. It still uses
CREATE STATISTICS with a new "sample" kind, instructing the optimizer
to use sampling when estimating clauses on the attributes.

A little example demonstrating what the patch does:

create table t (a int, b int, c int);

insert into t select mod(i,10), mod(i,20), mod(i,40)
  from generate_series(1,10000000) s(i);

analyze t;

-- estimate without any statistics / sampling
explain analyze select * from t where a = 0 and b = 0 and c = 0;

                              QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..229055.00 rows=1361 width=12) (actual time=0.025..761.571 rows=250000 loops=1)
   Filter: ((a = 0) AND (b = 0) AND (c = 0))
   Rows Removed by Filter: 9750000
 Planning Time: 0.471 ms
 Execution Time: 901.182 ms
(5 rows)

-- enable sampling on those columns
create statistics s (sample) on a, b, c from t;

explain analyze select * from t where a = 0 and b = 0 and c = 0;

                              QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..229055.00 rows=250390 width=12) (actual time=0.307..717.937 rows=250000 loops=1)
   Filter: ((a = 0) AND (b = 0) AND (c = 0))
   Rows Removed by Filter: 9750000
 Planning Time: 194.528 ms
 Execution Time: 851.832 ms
(5 rows)

Of course, in this case an MCV would work well too, because there are
very few combinations in (a,b,c) - but a sample works even when that's
not the case, and it has various other benefits (it can estimate almost
any expression, while an MCV supports only a subset, etc.).
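For illustration, this is the kind of clause where per-column MCV lists
are of little help but a sample-based estimate can still be computed,
simply by evaluating the expression against the sampled rows (a
hypothetical query on the same table t, not taken from the patch):

-- an expression over several columns; with the "sample" statistics the
-- optimizer can evaluate this predicate on the sampled rows, which an
-- MCV list on (a, b, c) cannot do for arbitrary expressions
explain analyze select * from t where (a + b) * c > 100 and c < 20;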
Now, let's look at a join between a fact and a dimension table:

create table f (d1 int, d2 int, f1 int, f2 int, f3 int);

create table d (d1 int, d2 int, d3 int, d4 int, d5 int,
                primary key (d1, d2));

insert into d select i, i, mod(i,100), mod(i,100), mod(i,100)
  from generate_series(0,999) s(i);

insert into f select mod(i,1000), mod(i,1000), mod(i,100), mod(i,100), mod(i,100)
  from generate_series(1,1000000) s(i);

analyze f, d;

explain analyze select * from f join d using (d1,d2)
 where f1 < 50 and f2 < 50 and d3 < 50 and d4 < 50;

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=25.75..22717.01 rows=63 width=32) (actual time=3.197..861.899 rows=500000 loops=1)
   Hash Cond: ((f.d1 = d.d1) AND (f.d2 = d.d2))
   ->  Seq Scan on f  (cost=0.00..21370.00 rows=251669 width=20) (actual time=0.033..315.401 rows=500000 loops=1)
         Filter: ((f1 < 50) AND (f2 < 50))
         Rows Removed by Filter: 500000
   ->  Hash  (cost=22.00..22.00 rows=250 width=20) (actual time=3.139..3.141 rows=500 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 34kB
         ->  Seq Scan on d  (cost=0.00..22.00 rows=250 width=20) (actual time=0.018..1.706 rows=500 loops=1)
               Filter: ((d3 < 50) AND (d4 < 50))
               Rows Removed by Filter: 500
 Planning Time: 0.806 ms
 Execution Time: 1099.229 ms
(12 rows)

So, not great - an underestimate of roughly 10000x is likely to lead to
inefficient plans. And now with the samples enabled on both sides:

create statistics s1 (sample) on d1, d2, f1, f2, f3 from f;
create statistics s2 (sample) on d1, d2, d3, d4, d5 from d;

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=29.50..24057.25 rows=503170 width=32) (actual time=0.630..837.483 rows=500000 loops=1)
   Hash Cond: ((f.d1 = d.d1) AND (f.d2 = d.d2))
   ->  Seq Scan on f  (cost=0.00..21370.00 rows=503879 width=20) (actual time=0.008..301.584 rows=500000 loops=1)
         Filter: ((f1 < 50) AND (f2 < 50))
         Rows Removed by Filter: 500000
   ->  Hash  (cost=22.00..22.00 rows=500 width=20) (actual time=0.616..0.618 rows=500 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 34kB
         ->  Seq Scan on d  (cost=0.00..22.00 rows=500 width=20) (actual time=0.004..0.321 rows=500 loops=1)
               Filter: ((d3 < 50) AND (d4 < 50))
               Rows Removed by Filter: 500
 Planning Time: 603.442 ms
 Execution Time: 1071.735 ms
(12 rows)

Yes, it takes 600ms to do the sampling, but I'm sure most of this can be
eliminated by optimizing the code and/or storing the samples just like
other types of stats.

Note that most of the ~10000x underestimate is not due to poor estimates
at the scan level, but mostly due to the join condition having two
correlated clauses. Yes, adding a proper foreign key would probably
improve this (we already leverage this information in planning), but
there can be cross-table correlations between the other conditions, and
the FK can't help with that. Correlations between different dimension
tables are quite common, and sampling can help with those.

Note: There's another PoC patch using multi-column MCVs to improve join
estimates - that has the same limitations as MCVs for scans. It works
fine (only) when the MCV represents a large part of the data, and it
does not support evaluating arbitrary expressions.

Now, a little bit about the implementation, sampling limitations, etc.

At the scan level, sampling is fairly straightforward - the patch simply
runs a TABLESAMPLE query through SPI, with a sample fraction calculated
from a GUC (estimate_sample_rate, 1% by default) and the statistics
target.
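To make the mechanics a bit more concrete, the sampling query issued
through SPI presumably looks roughly like this (a sketch only - the
exact sampling method and the fraction derived from estimate_sample_rate
and the statistics target are assumptions, not taken from the patch):

-- fetch roughly 1% of t, restricted to the columns covered by the
-- statistics object; the clauses being estimated are then evaluated
-- against the returned rows to compute the selectivity
select a, b, c from t tablesample bernoulli (1);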
The samples may be too large and the calculation may need some changes,
but that's a minor detail, I think. I'm not sure SPI is the right way to
do this, but for a PoC it's good enough.

For joins, sampling is way more complicated - we can't sample both
tables randomly, because that'd require huge samples on both sides. As
shown in [3], getting a sample of n rows from a join involving a table
with N rows requires a sample of sqrt(n * N) rows from that table (e.g.
a 10k-row sample of a join on a 1M-row table needs about
sqrt(10000 * 1000000) = 100000 rows). Which is a lot.

So what this patch attempts to do is "correlated sampling", described in
[1] and [3]. Imagine a join on a foreign key, as in the example query.
(The patch only looks for a PK, for simplicity.) This is a pretty common
pattern, especially in star and snowflake queries, which join a "fact"
table to one or more "dimension" tables.

The "correlated" sampling means the "fact" table (the side of the join
without the PK) is sampled randomly, while the dimensions are simply
scanned for matching rows. The PK means there can be only one matching
row for each sampled row, so we're "enriching" the random sample. This
is what [1] describes as CS2, including how to extend the approach to
joins without the PK/FK requirement and various corner cases, and [3]
improves that to leverage indexes. [4] discusses various CS2 variations,
addressing various problems - reducing space requirements, etc.

The current PoC patch is however very simplistic and naive - for example
it does not attempt to correlate joins with multiple dimensions, so when
joining F with D1 and then D2, we sample (F,D1) and then (F,D2)
independently. This means we sample F twice, which can be quite
expensive, and it also misses correlations between D1 and D2 (which are
common in actual data sets).

There are various other efficiency issues, because the joins go through
calc_joinrel_size_estimate and compute_semi_anti_join_factors, and each
place does the sampling again. The samples should probably be cached
somewhere and reused.

I'm sure there are plenty of open questions, some of which are mentioned
in the many XXX comments added to the patch.

FWIW, the patch does have some issues with expressions, so joins on
complex expressions (e.g. ON ((a+b) = (c+d))) do not work properly. That
shouldn't be a big deal for a PoC, I think.

regards

[1] CS2: A new database synopsis for query estimation
    https://www.researchgate.net/publication/262350868_CS2_A_new_database_synopsis_for_query_estimation

[2] Join Size Estimation Subject to Filter Conditions
    https://www.semanticscholar.org/paper/Join-Size-Estimation-Subject-to-Filter-Conditions-Vengerov-Menck/c8bd4caf0fc9c8a4fbffc7e05416901d4fd7a41b

[3] Cardinality Estimation Done Right: Index-Based Join Sampling
    https://www.semanticscholar.org/paper/Cardinality-Estimation-Done-Right%3A-Index-Based-Join-Leis-Radke/15f211eaafc6ce421a511a413613e1d2683879d2

[4] Improved Correlated Sampling for Join Size Estimation
    https://www.comp.nus.edu.sg/~taining/estimation/report.pdf

[5] A Survey on Advancing the DBMS Query Optimizer: Cardinality
    Estimation, Cost Model, and Plan Enumeration
    https://arxiv.org/abs/2101.01507

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Sun, Jun 27, 2021 at 07:55:24PM +0200, Tomas Vondra wrote:
> estimating joins is one of the significant gaps related to extended
> statistics, and I've been regularly asked about what we might do about
> that. This is an early experimental patch that I think might help us
> with improving this, possibly even in PG15.

The patch does not apply, so a rebase would be in order. I have switched
it to waiting on author for now and moved it to the next CF.
--
Michael
Attachments
Hi,

On 2021-06-27 19:55:24 +0200, Tomas Vondra wrote:
> estimating joins is one of the significant gaps related to extended
> statistics, and I've been regularly asked about what we might do about
> that. This is an early experimental patch that I think might help us
> with improving this, possibly even in PG15.

The tests in this patch fail:
https://cirrus-ci.com/task/5304621299138560
https://api.cirrus-ci.com/v1/artifact/task/5304621299138560/regress_diffs/src/test/regress/regression.diffs

Looks like the regression test output hasn't been updated?

Greetings,

Andres Freund
On 1/5/22 00:58, Andres Freund wrote:
> Hi,
>
> On 2021-06-27 19:55:24 +0200, Tomas Vondra wrote:
>> estimating joins is one of the significant gaps related to extended
>> statistics, and I've been regularly asked about what we might do about
>> that. This is an early experimental patch that I think might help us
>> with improving this, possibly even in PG15.
>
> The tests in this patch fail:
> https://cirrus-ci.com/task/5304621299138560
> https://api.cirrus-ci.com/v1/artifact/task/5304621299138560/regress_diffs/src/test/regress/regression.diffs
>
> Looks like the regression test output hasn't been updated?

Yeah, I haven't updated some of the test output because some of those
changes are a bit wrong (and I think that's fine for a PoC patch). I
should have mentioned that in the message, though. Sorry about that.

There are three types of failures:

1) Changes to the deparsed statistics definition in the \d command:

-    "public.ctlt_all_a_b_stat" ON a, b FROM ctlt_all
+    "public.ctlt_all_a_b_stat" (ndistinct, dependencies, mcv) ON a, b FROM ctlt_all

This happens because there's a new kind "sample", but it's not set by
default when creating new statistics, so the deparsing logic decides it
has to list the kinds explicitly. I've fixed this in the attached patch,
but it was mostly harmless and I'm not sure this is how "sample" should
behave.

2) Three GUC parameters that allow enabling/disabling sampling for
different parts of a query (scan, join, correlated join sampling). I
still consider those GUCs temporary, for experiments, but I've added
them to the expected output.

3) Changes in estimates for OR conditions - a couple of estimates get
less accurate, because OR clauses are handled as a single clause the
first time we pass them to statext_clauselist_selectivity(), so we
combine the results incorrectly. OR clauses may need some changes,
because they're causing issues in other patches too (e.g. in the
"Var op Var" one).

I haven't done anything about (3) yet - it's a valid issue and needs to
be fixed (either by changing how we handle OR clauses, or maybe by
handling samples and MCVs at the same time). Or maybe some other way. In
any case, there is more important stuff that needs fixing first.

The main issue is planning overhead - for the example in the first
message, with a simple query joining two tables, you'll see this:

 Planning Time: 603.442 ms
 Execution Time: 1071.735 ms

There are a couple of reasons why it takes this long:

1) We sample the same relation repeatedly - once as a "scan" and then
while estimating joins. And for this query we do that in three different
contexts:

 - set_joinrel_size_estimates
 - populate_joinrel_with_paths (twice)

I guess we'll get a different number of samples for different queries,
but it'll get worse for queries joining more tables, etc.

It seems fairly simple to cache the samples - for example in
StatisticExtInfo (or maybe somewhere else, to keep just one sample per
relation, not per RTE). Unfortunately, that only works for "independent"
samples, not for correlated ones (which are just FK lookups for another
sample, so they depend on what the other sample is). Which is a bummer,
because correlated samples are the more expensive ones :-(

2) The correlated samples are currently built using a query, executed
through SPI in a loop. So given a "driving" sample of 30k rows, we do
30k lookups - that'll take time, even if we do that just once and cache
the results.
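To illustrate the shape of that loop (a hypothetical sketch using the
earlier f/d example, not the actual query built by the patch): for every
row in the driving sample of f we effectively run a parameterized PK
lookup such as

-- one lookup per sampled fact row; $1/$2 are the join key values taken
-- from the current sample row
select d1, d2, d3, d4, d5 from d where d1 = $1 and d2 = $2;

so a 30k-row sample translates into 30k planner/executor invocations
through SPI.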
I'm sure there's room for some improvement, though - for example we
don't need to fetch all the columns included in the statistics object,
just the ones referenced by the clauses we're estimating. That could
improve the chance of using an index-only scan, etc. I wonder if there's
a more efficient way to do this, in a batched manner or something ...
But even then it'll still be quite expensive.

The only other alternative I can think of is collecting the samples
during ANALYZE and storing them somewhere. That'll be difficult for
correlated samples, particularly for transitive cases (in a snowflake
schema), but I believe it's doable (at least for cases covered by FK
constraints).

I'm not sure where/how to store the samples, though. An obvious option
would be to serialize them into pg_statistic_ext_data, and I'll try
doing that (it's a bit like a huge MCV without any cutoff). But maybe
there's a better way that would not require constant
serialization/deserialization of many tuples.

Any ideas about these options?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Hi,

On 2022-01-21 01:06:37 +0100, Tomas Vondra wrote:
> Yeah, I haven't updated some of the test output because some of those
> changes are a bit wrong (and I think that's fine for a PoC patch). I
> should have mentioned that in the message, though. Sorry about that.

Given that the patch hasn't been updated since January and that it's a
PoC in the final CF, it seems like it should at least be moved to the
next CF? Or perhaps returned?

I've just marked it as waiting-on-author for now - IIRC that leads to
fewer reruns by cfbot once it's failing...

> 2) The correlated samples are currently built using a query, executed
> through SPI in a loop. So given a "driving" sample of 30k rows, we do
> 30k lookups - that'll take time, even if we do that just once and cache
> the results.

Ugh, yea, that's going to increase overhead by at least a few factors.

> I'm sure there's room for some improvement, though - for example we
> don't need to fetch all the columns included in the statistics object,
> just the ones referenced by the clauses we're estimating. That could
> improve the chance of using an index-only scan, etc.

Yea. Even just avoiding SPI / planner + executor seems likely to be a
big win.

It seems like one more of the cases where we really need logic to
recognize "cheap" vs. "expensive" plans, so that we only do sampling
when useful. I don't think that's solved just by having a declarative
syntax.

Greetings,

Andres Freund
On 3/22/22 00:35, Andres Freund wrote:
> Hi,
>
> On 2022-01-21 01:06:37 +0100, Tomas Vondra wrote:
>> Yeah, I haven't updated some of the test output because some of those
>> changes are a bit wrong (and I think that's fine for a PoC patch). I
>> should have mentioned that in the message, though. Sorry about that.
>
> Given that the patch hasn't been updated since January and that it's a
> PoC in the final CF, it seems like it should at least be moved to the
> next CF? Or perhaps returned?
>
> I've just marked it as waiting-on-author for now - IIRC that leads to
> fewer reruns by cfbot once it's failing...

Either option works for me.

>> 2) The correlated samples are currently built using a query, executed
>> through SPI in a loop. So given a "driving" sample of 30k rows, we do
>> 30k lookups - that'll take time, even if we do that just once and cache
>> the results.
>
> Ugh, yea, that's going to increase overhead by at least a few factors.
>
>> I'm sure there's room for some improvement, though - for example we
>> don't need to fetch all the columns included in the statistics object,
>> just the ones referenced by the clauses we're estimating. That could
>> improve the chance of using an index-only scan, etc.
>
> Yea. Even just avoiding SPI / planner + executor seems likely to be a
> big win.
>
> It seems like one more of the cases where we really need logic to
> recognize "cheap" vs. "expensive" plans, so that we only do sampling
> when useful. I don't think that's solved just by having a declarative
> syntax.

Right. I was thinking about walking the first table, collecting all the
values, and then doing a single IN () query for the second table - a bit
like a custom join, which seems a bit terrifying, TBH (see the sketch
below). But even if we manage to make this much cheaper, there will
still be simple queries where it's going to be prohibitively expensive.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
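For what it's worth, a rough sketch of that batched variant, reusing the
earlier f/d example (purely illustrative - this is not something the
patch does today, and the key list would come from the sampled rows):

-- collect the join keys from the sampled fact rows, then fetch all
-- matching dimension rows in a single query instead of one SPI lookup
-- per sample row
select d1, d2, d3, d4, d5
  from d
 where (d1, d2) in ((0, 0), (1, 1), (2, 2));  -- keys taken from the sample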