Re: multivariate statistics / patch v7

Поиск

Список

Период

Сортировка

От	Tomas Vondra
Тема	Re: multivariate statistics / patch v7
Дата	25 июля 2015 г. 21:09:43
Msg-id	55B3FB0B.7000201@2ndquadrant.com обсуждение исходный текст
Ответ на	Re: multivariate statistics / patch v7 (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Ответы	Re: multivariate statistics / patch v7
Список	pgsql-hackers

Дерево обсуждения

Hi,

On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
> Hi, I'd like to show you the modified constitution of
> multivariate statistics application logic. Please find the
> attached. They apply on your v7 patch.

Sadly I do have some trouble getting it to apply correctly :-(
So for now all my comments are based on just reading the code.

FWIW I've rebased my patch to the current master, it's available on 
github as usual:
    https://github.com/tvondra/postgres/commits/mvstats

> The code to find mv-applicable clause is moved out of the main
> flow of clauselist_selectivity. As I said in the previous mail,
> the new function transformRestrictInfoForEstimate (too bad name
> but just for PoC:) scans clauselist and generates
> RestrictStatsData struct which drives mv-aware selectivity
> calculation. This struct isolates MV and non-MV estimation.
>
> The struct RestrictStatData mainly consists of the following
> three parts,
>
>    - clause to be estimated by current logic (MV is not applicable)
>    - clause to be estimated by MV-staistics.
>    - list of child RestrictStatDatas, which are to be run
>      recursively.
>
> mvclause_selectivty() is the topmost function where mv stats
> works. This structure effectively prevents main estimation flow
> from being broken by modifying mvstats part. Although I haven't
> measured but I'm positive the code is far reduced from yours.
>
> I attached two patches to this message. The first one is to
> rebase v7 patch to current(maybe) master and the second applies
> the refactoring.
>
> I'm a little anxious about performance but I think this makes the
> process to apply mv-stats far clearer. Regtests for mvstats
> succeeded asis except for fdep, which is not implememted in this
> patch.
>
> What do you think about this?

I'm not sure, at this point. I'm having a hard time understanding how 
exactly the code works - there are pretty much no comments explaining 
the implementation, so it takes time to understand the code. This is 
especially true about transformRestrictInfoForEstimate which is also 
quite long. I understand it's a PoC, but comments would really help.

On a conceptual level, I think the idea to split the estimation into two 
phases - enrich the expression tree with nodes with details about stats 
etc, and then actually do the estimation in the second phase might be 
interesting. Not because it's somehow clearer, but because it gives us a 
chance to see the expression tree as a whole, with details about all the 
stats (with the current code we process/estimate the tree 
incrementally). But I don't really know how useful that would be.

I don't think the proposed change makes the process somehow clearer. I 
know it's a PoC at this point, so I don't expect it to be perfect, but 
for me the original code is certainly clearer. Of course, I'm biased as 
I wrote the current code, and I (naturally) shaped it to match my ideas 
during the development process, and I'm much more familiar with it.

Omitting the support for functional dependencies is a bit unfortunate, I 
think. Is that merely to make the PoC simpler, or is there something 
that makes it impossible to support that kind of stats?

Another thing that I noticed is that you completely removed the code 
that combined multiple stats (and selected the best combination of 
stats). In other words, you've reverted to the intermediate single 
statistics approach, including removing the improved handling of OR 
clauses and conditions. It's a bit difficult to judge the proposed 
approach not knowing how well it supports those (quite crucial) 
features. What if it can't support some them., or what if it makes the 
code much more complicated (thus defeating the goal of making it more 
clear)?

I share your concern about the performance impact - one thing is that 
this new code might be slower than the original one, but a more serious 
issue IMHO is that the performance impact will happen even for relations 
with no multivariate stats at all. The original patch was very careful 
about getting ~0% overhead in such cases, and if the new code does not 
allow that, I don't see this approach as acceptable. We must not put 
additional overhead on people not using multivariate stats.

But I think it's worth exploring this idea a bit more - can you rebase 
it to the current patch version (as on github) and adding the missing 
pieces (functional dependencies, multi-statistics estimation and passing 
conditions)?

One more thing - I noticed you extended the pg_operator catalog with a 
oprmvstat attribute, used to flag operators that are compatible with 
multivariate stats. I'm not happy with the current approach (using 
oprrest to do this decision), but I'm not really sure this is a good 
solution either. The culprit is that it only answers one of the two 
important questions - Is it compatible? How to perform the estimation?

So we'd have to rely on oprrest anyway, when actually performing the 
estimation of a clause with "compatible" operator. And we'd have to keep 
in sync two places (catalog and checks in file), and we'd have to update 
the catalog after improving the implementation (adding support for 
another operator).

kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: multivariate statistics / patch v7