Re: multivariate statistics v8
От | Tomas Vondra |
---|---|
Тема | Re: multivariate statistics v8 |
Дата | |
Msg-id | 567AF104.80700@2ndquadrant.com обсуждение исходный текст |
Ответ на | WIP: multivariate statistics / proof of concept (Tomas Vondra <tv@fuzzy.cz>) |
Ответы |
Re: multivariate statistics v9
Re: multivariate statistics v8 |
Список | pgsql-hackers |
Hi, attached is v8 of the multivariate statistics patch (or rather a patch series). The patch currently has 7 parts, but 0001 is just a fix of the pull_varnos issue (possibly incorrect/temporary), and 0007 is just an attempt to add the "multicolumn distinctness" (experimental for now). There are three noteworthy changes: 1) Correct estimation of OR-clauses - this turned out to be a rather minor change, thanks to simply transforming the OR-clauses to AND-clauses, see clauselist_selectivity_or() for details. 2) Abandoning the ALTER TABLE ... ADD STATISTICS syntax and instead adding separate commands CREATE STATISTICS / DROP STATISTICS, as proposed in the "multicolumn distinctness" thread: http://www.postgresql.org/message-id/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp This seems a better approach than the ALTER TABLE one - not only it nicely fixes the grammar issues, it also naturally extends to multi-table statistics (despite we don't know how those should work exactly). The syntax is this: CREATE STATISTICS name ON table (columns) WITH (options); DROP STATISTICS name; and the 'name' is optional (and if absent, should be generated just like for indexes, but that's not implemented yet). The remaining question is how unique the statistics name should be. My initial plan was to make it unique within a table, but that of course does not work well with the DROP STATISTICS (it'd have to specify the table name also), and it'd also now work with statistics on multiple tables (which is one of the reasons for abandoning ALTER TABLE stuff). So I think it should be unique across tables. Statistics are hardly a global object, so it should be unique within a schema. I thought that simply using the schema of the table would work, but that of course breaks with multiple tables in different schemas. So the only solution seems to be explicit schema for statistics. 3) I've also started hacking on adding the "multicolumn distinctness" proposed by Horiguchi-san, but I haven't really got that working. It seems to be a bit more complicated than I anticipated because of the "only equality conditions" restriction. So the 0007 patch only really adds basic syntax and trivial build. I do have bunch of ideas/questions about this statistics type. For example, should we compute just a single coefficient or the exact combination of columns specified in CREATE STATISTICS, or perhaps for some additional subsets? I.e. with CREATE STATISTICS ON t (a,b,c) WITH (ndistinct); should we compute just the coefficient for (a,b,c), or maybe also for (a,b), (b,c) and (a,c)? For N columns there's O(2^N) such combinations, but perhaps it's acceptable. Having the coefficient for just the single combination specified in CREATE STATISTICS makes the estimation difficult when some of the columns are not specified. For example, with coefficient just for (a,b,c), what should happen for (WHERE a=1 AND b=2)? Should we simply ignore the statistics, or apply it anyway and somehow compensate for the missing columns? I've also started working on something like a paper, hopefully explaining the ideas and implementation more clearly and consistently than possible on a mailing list (thanks to charts, figures and such). It's available here (both the .tex source and .pdf with the current version): https://bitbucket.org/tvondra/mvstats-paper/src It's not exactly short (~30 pages), and it's certainly incomplete with a plenty of TODO notes, but hopefully it's already useful and not entirely bogus. Comments and questions are welcome - both to the patch and paper. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
- 0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch
- 0002-shared-infrastructure-and-functional-dependencies.patch
- 0003-clause-reduction-using-functional-dependencies.patch
- 0004-multivariate-MCV-lists.patch
- 0005-multivariate-histograms.patch
- 0006-multi-statistics-estimation.patch
- 0007-initial-version-of-ndistinct-conefficient-statistics.patch
В списке pgsql-hackers по дате отправления: