Re: WIP: multivariate statistics / proof of concept

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: WIP: multivariate statistics / proof of concept
Дата
Msg-id 54C3FED3.1060600@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: WIP: multivariate statistics / proof of concept  (Michael Paquier <michael.paquier@gmail.com>)
Ответы Re: WIP: multivariate statistics / proof of concept  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Список pgsql-hackers
Hi,

attached is an updated version of the multivariate stats patch. This is
going to be a bit longer mail, so I'll put here a small ToC ;-)

1) patch split into 4 parts
2) where to start / documentation
3) state of the code
4) main changes/improvements
5) remaining limitations

The motivation and design ideas, explained in the first message of this
thread are still valid. It might be a good idea to read it first:

  http://www.postgresql.org/message-id/flat/543AFA15.4080608@fuzzy.cz

BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro
into the patch in person, or discuss the patch in general.


1) Patch split into 4 parts
---------------------------
Firstly, the patch got broken into the following four pieces, to make
the reviews somewhat easier:

1) 0001-shared-infrastructure-and-functional-dependencies.patch

   - infrastructure, shared by all the kinds of stats added
     in the following patches (catalog, ALTER TABLE, ANALYZE ...)

   - implementation of a simple statistics, tracking functional
     dependencies between columns (previously called "associative
     rules", but that's incorrect for several reasons)

   - this does not modify the optimizer in any way

2) 0002-clause-reduction-using-functional-dependencies.patch

   - applies the functional dependencies to optimizer (i.e. considers
     the rules in clauselist_selectivity())

3) 0003-multivariate-MCV-lists.patch

   - multivariate MCV lists (both ANALYZE and optimizer parts)

4) 0004-multivariate-histograms.patch

   - multivariate histograms (both ANALYZE and optimizer parts)


You may look at the patches at github here:

  https://github.com/tvondra/postgres/tree/multivariate-stats-squashed

The branch is not stable, i.e. I'll rebase / squash / force-push changes
in the future. (There's also multivariate-stats development branch with
unsquashed changes, but you don't want to look at that, trust me.)

The patches are not exactly small (being in the 50-100 kB range), but
that's mostly because of the amount of comments explaining the goals and
implementation details.


2) Where to start / documentation
---------------------------------
I strived to document all the pieces properly, mostly in the form of
comments. There's no sgml documentation at this point, which should
obviously change in the future.

Anyway, I'd suggest reading the first e-mail in this thread, explaining
the ideas, and then these comments:

1) functional dependencies (patch 0001)
   - src/backend/utils/mvstats/dependencies.c

2) MCV lists (patch 0003)
   - src/backend/utils/mvstats/mcv.c

3) histograms (patch 0004)
   - src/backend/utils/mvstats/mcv.c

   - also see clauselist_mv_selectivity_mcvlist() in clausesel.c
   - also see clauselist_mv_selectivity_histogram() in clausesel.c

4) selectivity estimation (patches 0002-0004)
   - all in src/backend/optimizer/path/clausesel.c
   - clauselist_selectivity() - overview of how the stats are applied
   - clauselist_apply_dependencies() - functional dependencies reduction
   - clauselist_mv_selectivity_mcvlist() - MCV list estimation
   - clauselist_mv_selectivity_histogram() - histogram estimation


3) State of the code
--------------------
I've spent a fair amount of time testing the patches, and while I
believe there are no segfaults or so, I know parts of the code need a
bit more love.

The part most in need of improvements / comments is probably the code in
clausesel.c - that seems a bit quirky. Reviews / comments regarding this
part of the code are very welcome - I'm sure there are many ways to
improve this part.

There are a few FIXMEs elsewhere (e.g. about memory allocation in the
(de)serialization code), but those are mostly well-defined issues that I
know how to address (at least I believe so).


4) Main changes/improvements
----------------------------
There are many significant improvements. The previous patch version was
in the 'proof of concept' category (missing pieces, knowingly broken in
some areas), the current patch should 'mostly work'.

The patch fixes two most annoying limitations of the first version:

  (a) support for all data types (not just those passed by value)
  (b) handles NULL values properly
  (c) adds support for IS [NOT] NULL clauses

Aside from that the code was significantly improved, there are proper
regression tests and plenty of comments explaining the details.


5) Remaining limitations
------------------------

  (a) limited to stats on 8 columns

      This is mostly just a 'safeguard' restriction.

  (b) only data types with '<' operator

      I don't think this will change anytime soon, because all the
      algorithms for building the stats rely on this. I don't see
      this as a serious limitation though.

  (c) not handling DROP COLUMN or DROP TABLE and so on

      Currently this is not handled at all (so the regression tests
      do an explicit DELETE from the pg_mv_statistic catalog).

      Handling the DROP TABLE won't be difficult, it's similar to the
      current stats. Handling ALTER TABLE ... DROP COLUMN will be much
      more tricky I guess - should we drop all the stats referencing
      that column, or should we just remove it from the stats? Or
      should we keep it and treat it as NULL? Not sure what's the best
      solution.

  (d) limited list of compatible WHERE clauses

      The initial patch handled only simple operator clauses

          (Var op Constant)

      where operator is one of ('<', '<=', '=', '>=', '>'). Now it also
      handles IS [NOT] NULL clauses. Adding more clause types should
      not  be overly difficult - starting with more traditional
      'BooleanTest' conditions, or even multi-column conditions
          (Var op Var)

      which are difficult to estimate using simple-column stats.

  (e) optimizer uses single stats per table

      This is still true and I don't think this will change soon. i do
      have some ideas on how to merge multiple stats etc. but it's
      certainly complex stuff, unlikely to happen within this CF. The
      patch makes a lot of sense even without this particular feature,
      because you can create multiple stats, each suitable for different
      queries.

  (f) no JOIN conditions

      Similarly to the previous point, it's on the TODO but it's not
      going to happen in this CF.


kind regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: numeric access out of bounds
Следующее
От: Tom Lane
Дата:
Сообщение: Shortcoming in CLOBBER_FREED_MEMORY coverage: disk buffer pointers