PoC/WIP: Extended statistics on expressions
From | Tomas Vondra |
---|---|
Subject | PoC/WIP: Extended statistics on expressions |
Date | |
Msg-id | ad7891d2-e90c-b446-9fe2-7419143847d7@enterprisedb.com |
Replies |
Re: PoC/WIP: Extended statistics on expressions
(Tomas Vondra <tomas.vondra@enterprisedb.com>)
|
List | pgsql-hackers |
Hi,

Attached is a PoC/WIP patch adding support for extended statistics on expressions. This is by no means "ready" - most of the stuff works, but often in a rather hackish way. I certainly don't expect this to pass regression tests, for example. There's an example demonstrating how this works for two queries at the end of this message.

Now let's talk about the main parts of the patch:

1) extending the grammar to allow expressions, not just plain columns

Fairly straightforward, I think. I'm sure the logic determining which expressions are allowed is not 100% correct (e.g. volatile functions etc.) but that's a detail we can deal with later.

2) store the expressions in the pg_statistic_ext catalog

I ended up adding a separate column, similar to indexprs, except that the order of columns/expressions does not matter, so we don't need to bother with storing 0s in stxkeys - we simply consider expressions to come "after" all the simple columns.

3) build the statistics

This works for all three types of statistics we have (mcv, dependencies and ndistinct), although the code changes are often very hackish, just "to make it work". The main challenge here was how to represent the expressions in the statistics - e.g. ndistinct tracks ndistinct estimates for combinations of parameters, and so far we used attnums for that. I decided the easiest way is to keep doing that, but offset the expressions by MaxHeapAttributeNumber. That seems to work, but maybe there's a better way.

4) apply the statistics

This is the hard part, really, and the exact state of the support depends on the type of statistics.

For ndistinct coefficients, it generally works. I'm sure there may be bugs in estimate_num_groups, etc. but in principle it works.

For MCV lists, it generally works too - you can define statistics on the expressions and the estimates should improve. The main downside here is that it requires at least two expressions, otherwise we can't build/apply the extended statistics.
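The offsetting scheme from part 3 can be sketched roughly like this - a hypothetical illustration, not the patch's actual code (in PostgreSQL, MaxHeapAttributeNumber is 1600):

```python
# Sketch of the attnum-offsetting idea: plain columns keep their regular
# attnums (1..1600), while the i-th statistics expression (0-based) is
# encoded as a pseudo-attnum above MaxHeapAttributeNumber. The helper
# names below are made up for illustration.

MAX_HEAP_ATTRIBUTE_NUMBER = 1600  # MaxHeapAttributeNumber in PostgreSQL

def expr_to_attnum(expr_index):
    """Map the i-th expression (0-based) to a pseudo-attnum."""
    return MAX_HEAP_ATTRIBUTE_NUMBER + expr_index + 1

def attnum_is_expression(attnum):
    """Does this attnum refer to an expression rather than a column?"""
    return attnum > MAX_HEAP_ATTRIBUTE_NUMBER

def attnum_to_expr(attnum):
    """Recover the 0-based expression index from a pseudo-attnum."""
    assert attnum_is_expression(attnum)
    return attnum - MAX_HEAP_ATTRIBUTE_NUMBER - 1
```

This way the existing ndistinct machinery, which stores sets of attnums, can carry expressions without any change to its on-disk representation.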
So for example

    SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0

may be estimated "correctly", but once you drop either of the conditions the estimate gets much worse, as we don't have stats for the individual expressions. That's rather annoying - it does not break the extended MCV, but the behavior will certainly cause confusion.

For functional dependencies, the estimation does not work yet. Also, the missing per-column statistics have a bigger impact than on MCV lists, because while MCV can work fine without them, the dependencies rely heavily on the per-column estimates. We only apply "corrections" based on the dependency degree, so we still need (good) per-column estimates, which does not quite work with expressions.

Of course, the lack of per-expression statistics may be somewhat fixed by adding indexes on the expressions, but that's kinda expensive.

Now, a simple example demonstrating how this improves estimates - let's create a table with 1M rows, and run queries with mod() expressions on it. It might be date_trunc() or something similar - that'd work too.
table with 1M rows
==================

test=# create table t (a int);
test=# insert into t select i from generate_series(1,1000000) s(i);
test=# analyze t;

poorly estimated queries
========================

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0 and mod(a,7) = 0;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=25 width=4) (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.329 ms
 Execution Time: 156.675 ms
(5 rows)

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t group by 1, 2;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 HashAggregate  (cost=75675.00..98487.50 rows=1000000 width=8) (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.277 ms
 Execution Time: 502.803 ms
(6 rows)

improved estimates
==================

test=# create statistics s1 (ndistinct) on mod(a,3), mod(a,7) from t;
test=# analyze t;
test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t group by 1, 2;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 HashAggregate  (cost=24425.00..24425.31 rows=21 width=8) (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.135 ms
 Execution Time: 500.092 ms
(6 rows)

test=# create statistics s2 (mcv) on mod(a,3), mod(a,7) from t;
test=# analyze t;
test=# explain (analyze, timing off) select * from t where mod(a,3) = 0 and mod(a,7) = 0;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=46433 width=4) (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.702 ms
 Execution Time: 152.280 ms
(5 rows)

Clearly, estimates for both queries are significantly improved. Of course, this example is kinda artificial/simplistic.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
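A quick back-of-the-envelope check of where the numbers in the first plan come from (a sketch; it assumes the planner falls back to the stock default equality selectivity of 1/200 for each expression clause, i.e. 1/DEFAULT_NUM_DISTINCT, and multiplies the clauses assuming independence):

```python
# Without statistics on the expressions, each mod(...) = 0 clause gets the
# default equality selectivity 0.005 (= 1/200), and the clauses are treated
# as independent - hence the badly underestimated row count.
nrows = 1_000_000
default_sel = 1 / 200

no_stats_estimate = nrows * default_sel * default_sel
print(no_stats_estimate)  # 25.0 -- matches "rows=25" in the first plan

# The true joint condition: mod(a,3) = 0 AND mod(a,7) = 0 holds exactly
# when a is divisible by 21, so roughly 1/21 of the rows qualify.
actual_rows = nrows // 21
print(actual_rows)        # 47619 -- matches "actual rows=47619"
```

With the MCV statistics in place, the planner gets the joint distribution of the two expressions directly and lands at rows=46433, within a few percent of the truth.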