Re: WIP: multivariate statistics / proof of concept
От | Tomas Vondra |
---|---|
Тема | Re: WIP: multivariate statistics / proof of concept |
Дата | |
Msg-id | 54C3FED3.1060600@2ndquadrant.com обсуждение исходный текст |
Ответ на | Re: WIP: multivariate statistics / proof of concept (Michael Paquier <michael.paquier@gmail.com>) |
Ответы |
Re: WIP: multivariate statistics / proof of concept
(Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
|
Список | pgsql-hackers |
Hi, attached is an updated version of the multivariate stats patch. This is going to be a bit longer mail, so I'll put here a small ToC ;-) 1) patch split into 4 parts 2) where to start / documentation 3) state of the code 4) main changes/improvements 5) remaining limitations The motivation and design ideas, explained in the first message of this thread are still valid. It might be a good idea to read it first: http://www.postgresql.org/message-id/flat/543AFA15.4080608@fuzzy.cz BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro into the patch in person, or discuss the patch in general. 1) Patch split into 4 parts --------------------------- Firstly, the patch got broken into the following four pieces, to make the reviews somewhat easier: 1) 0001-shared-infrastructure-and-functional-dependencies.patch - infrastructure, shared by all the kinds of stats added in the following patches (catalog, ALTER TABLE, ANALYZE ...) - implementation of a simple statistics, tracking functional dependencies between columns (previously called "associative rules", but that's incorrect for several reasons) - this does not modify the optimizer in any way 2) 0002-clause-reduction-using-functional-dependencies.patch - applies the functional dependencies to optimizer (i.e. considers the rules in clauselist_selectivity()) 3) 0003-multivariate-MCV-lists.patch - multivariate MCV lists (both ANALYZE and optimizer parts) 4) 0004-multivariate-histograms.patch - multivariate histograms (both ANALYZE and optimizer parts) You may look at the patches at github here: https://github.com/tvondra/postgres/tree/multivariate-stats-squashed The branch is not stable, i.e. I'll rebase / squash / force-push changes in the future. (There's also multivariate-stats development branch with unsquashed changes, but you don't want to look at that, trust me.) The patches are not exactly small (being in the 50-100 kB range), but that's mostly because of the amount of comments explaining the goals and implementation details. 2) Where to start / documentation --------------------------------- I strived to document all the pieces properly, mostly in the form of comments. There's no sgml documentation at this point, which should obviously change in the future. Anyway, I'd suggest reading the first e-mail in this thread, explaining the ideas, and then these comments: 1) functional dependencies (patch 0001) - src/backend/utils/mvstats/dependencies.c 2) MCV lists (patch 0003) - src/backend/utils/mvstats/mcv.c 3) histograms (patch 0004) - src/backend/utils/mvstats/mcv.c - also see clauselist_mv_selectivity_mcvlist() in clausesel.c - also see clauselist_mv_selectivity_histogram() in clausesel.c 4) selectivity estimation (patches 0002-0004) - all in src/backend/optimizer/path/clausesel.c - clauselist_selectivity() - overview of how the stats are applied - clauselist_apply_dependencies() - functional dependencies reduction - clauselist_mv_selectivity_mcvlist() - MCV list estimation - clauselist_mv_selectivity_histogram() - histogram estimation 3) State of the code -------------------- I've spent a fair amount of time testing the patches, and while I believe there are no segfaults or so, I know parts of the code need a bit more love. The part most in need of improvements / comments is probably the code in clausesel.c - that seems a bit quirky. Reviews / comments regarding this part of the code are very welcome - I'm sure there are many ways to improve this part. There are a few FIXMEs elsewhere (e.g. about memory allocation in the (de)serialization code), but those are mostly well-defined issues that I know how to address (at least I believe so). 4) Main changes/improvements ---------------------------- There are many significant improvements. The previous patch version was in the 'proof of concept' category (missing pieces, knowingly broken in some areas), the current patch should 'mostly work'. The patch fixes two most annoying limitations of the first version: (a) support for all data types (not just those passed by value) (b) handles NULL values properly (c) adds support for IS [NOT] NULL clauses Aside from that the code was significantly improved, there are proper regression tests and plenty of comments explaining the details. 5) Remaining limitations ------------------------ (a) limited to stats on 8 columns This is mostly just a 'safeguard' restriction. (b) only data types with '<' operator I don't think this will change anytime soon, because all the algorithms for building the stats rely on this. I don't see this as a serious limitation though. (c) not handling DROP COLUMN or DROP TABLE and so on Currently this is not handled at all (so the regression tests do an explicit DELETE from the pg_mv_statistic catalog). Handling the DROP TABLE won't be difficult, it's similar to the current stats. Handling ALTER TABLE ... DROP COLUMN will be much more tricky I guess - should we drop all the stats referencing that column, or should we just remove it from the stats? Or should we keep it and treat it as NULL? Not sure what's the best solution. (d) limited list of compatible WHERE clauses The initial patch handled only simple operator clauses (Var op Constant) where operator is one of ('<', '<=', '=', '>=', '>'). Now it also handles IS [NOT] NULL clauses. Adding more clause types should not be overly difficult - starting with more traditional 'BooleanTest' conditions, or even multi-column conditions (Var op Var) which are difficult to estimate using simple-column stats. (e) optimizer uses single stats per table This is still true and I don't think this will change soon. i do have some ideas on how to merge multiple stats etc. but it's certainly complex stuff, unlikely to happen within this CF. The patch makes a lot of sense even without this particular feature, because you can create multiple stats, each suitable for different queries. (f) no JOIN conditions Similarly to the previous point, it's on the TODO but it's not going to happen in this CF. kind regards -- Tomas Vondra http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
В списке pgsql-hackers по дате отправления:
Следующее
От: Tom LaneДата:
Сообщение: Shortcoming in CLOBBER_FREED_MEMORY coverage: disk buffer pointers