Re: PoC: using sampling to estimate joins / complex conditions
| От | Tomas Vondra | 
|---|---|
| Тема | Re: PoC: using sampling to estimate joins / complex conditions | 
| Дата | |
| Msg-id | 280088a0-e364-a528-bcca-cc487bef2f84@enterprisedb.com обсуждение исходный текст | 
| Ответ на | Re: PoC: using sampling to estimate joins / complex conditions (Andres Freund <andres@anarazel.de>) | 
| Ответы | Re: PoC: using sampling to estimate joins / complex conditions | 
| Список | pgsql-hackers | 
On 1/5/22 00:58, Andres Freund wrote: > Hi, > > On 2021-06-27 19:55:24 +0200, Tomas Vondra wrote: >> estimating joins is one of the significant gaps related to extended >> statistics, and I've been regularly asked about what we might do about >> that. This is an early experimental patch that I think might help us >> with improving this, possible even in PG15. > > The tests in this patch fail: > https://cirrus-ci.com/task/5304621299138560 > https://api.cirrus-ci.com/v1/artifact/task/5304621299138560/regress_diffs/src/test/regress/regression.diffs > > Looks like the regression test output hasn't been updated? > Yeah, I haven't updated some of the test output because some of those changes are a bit wrong (and I think that's fine for a PoC patch). I should have mentioned that in the message, though. Sorry about that. There are three types of failures: 1) Changes to deparsed statistics definition in \d command: - "public.ctlt_all_a_b_stat" ON a, b FROM ctlt_all + "public.ctlt_all_a_b_stat" (ndistinct, dependencies, mcv) ON a, b FROM ctlt_all This happens because there's a new kind "sample" but it's not set by default if creating new statistics, and the deparsing logic decides it means it has to list the kinds explicitly. I've fixed this in the attached patch, but it was mostly harmless and I'm not sure this is how sample should behave. 2) Three GUC parameters allowing to enable/disable sampling for different parts of a query (scan, join, correlated join sampling). I still consider those GUCs temporary, for experiments, but I've added them to the expected output. 3) Changes in estimates for OR conditions - a couple estimates get less accurate, because OR clauses are handled as a single clause the first time we pass them to statext_clauselist_selectivity(). So we combine the results incorrectly. OR clauses may need some changes, because it's causing issues in other patches too (e.g. in the "Var op Var" one). I haven't done anything about (3) yet - it's a valid issue and needs to be fixed (either by changing how we handle OR clauses, or maybe handling samples and MCVs at the same time). Or maybe some other way. In any case, there is more important stuff that needs fixing first. The main issue is planning overhead - for the example in the first message, with a simple query joining two tables, you'll see this: Planning Time: 603.442 ms Execution Time: 1071.735 ms There's a couple reasons why it takes this long: 1) We sample the same relation repeatedly - once as a "scan" and then while estimating joins. And for this query we do that in three different contexts: - set_joinrel_size_estimates - populate_joinrel_with_paths (twice) I guess we'll get different number of samples for different queries, but it'll get worse for queries joining more tables etc. It seems fairly simple to cache the samples - for example in StatisticExtInfo (or maybe somewhere else, to keep just one sample per relation, not per RTE). Unfortunately, I'm not sure this works for "independent" samples, not for correlated ones (which are just FK lookups for another sample, so that depends on what's the other sample). Which is a bummer, because correlated samples are the more expensive ones :-( 2) The correlated samples are currently built using a query, executed through SPI in a loop. So given a "driving" sample of 30k rows, we do 30k lookups - that'll take time, even if we do that just once and cache the results. I'm sure there there's room for some improvement, though - for example we don't need to fetch all columns included in the statistics object, but just stuff referenced by the clauses we're estimating. That could improve chance of using IOS etc. I wonder if there's a more efficient way to do this, in a batched manner or something ... But even then it'll still be quite expensive. The only other alternative I can think of is collecting samples during ANALYZE, and storing them somewhere. That'll be difficult for correlated samples, particularly for transitive cases (in snowflake schema). But I believe it's doable (at least for cases covered by FK constraints). But I'm not sure where/how to store the samples. An obvious option would be to serialize them into pg_statistic_ext_data, and I'll try doing that (it's a bit like a huge MCV without any cutoff). But maybe there's a better way that would not require constant serialization/deserialization of many tuples. Any ideas about these options? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Вложения
В списке pgsql-hackers по дате отправления: