Re: multivariate statistics v14
От | Tatsuo Ishii |
---|---|
Тема | Re: multivariate statistics v14 |
Дата | |
Msg-id | 20160322.214615.1524680563827969904.t-ishii@sraoss.co.jp обсуждение исходный текст |
Ответ на | Re: multivariate statistics v14 (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Ответы |
Re: multivariate statistics v14
|
Список | pgsql-hackers |
> On 03/22/2016 11:41 AM, Tatsuo Ishii wrote: >>>> Hum. So without 0006 or beyond, there's not much benefit for the >>>> PostgreSQL users, and you are not too confident about 0006 or >>>> beyond. Then I would think it is a little bit hard to justify in >>>> putting 000[2-5] into 9.6. I really like this feature and would >>>> like to see in PostgreSQL someday, but I'm not sure if we should >>>> put the patches (0002-0005) into PostgreSQL now. Please let me >>>> know if there's some reaons we should put the patches into >>>> PostgreSQL now. >>> >>> I don't think so. While being able to combine multiple statistics >>> is certainly useful, I'm convinced that the initial patched add >>> enough >> >> Can you please elaborate a little bit more how combining multiple >> statistics is useful? > > Sure. > > The goal of multivariate statistics is to approximate a probability > distribution on a group of columns. The larger the number of columns, > the less accurate the statistics will be (with respect to individual > columns), assuming fixed size of the sample in ANALYZE, and fixed > statistics size. > > For example, if you add a column to multivariate histogram, you'll do > some "bucket splits" by this dimension, thus reducing the accuracy for > the other columns. You may of course allow larger statistics > (e.g. histograms with more buckets), but that also requires larger > samples, and so on. > > Now, let's assume you have a query like this: > > WHERE (a=1) AND (b=2) AND (c=3) AND (d=4) > > and that "a" and "b" are correlated, and "c" and "d" are correlated, > but that otherwise the columns are independent. It'd be a bit silly to > require building statistics on (a,b,c,d), when two statistics on each > of the column pairs would be cheaper and also more accurate. > > That's of course a trivial case - independent groups of correlated > columns. But I'd say this is actually a pretty common case, and I do > believe there's not much controversy that we should support it. > > Another reason to allow multiple statistics is that columns in one > group may be a good fit for MCV list (which works well for discrete > values), while the other group may be a good candidate for histogram > (which works well for continuous values). This can't be solved by > first building a MCV and then a histogram on the group. > > The question of course is what to do if the groups are not > independent. The patch does that by assuming the statistics overlap, > and uses conditions on the columns included in both statistics to > combine them using conditional probabilities. I do believe this works > quite well, but this is perhaps the part that needs further > discussion. There are other ways to combine the statistics, but I do > expect them to be considerably more expensive. > > Is this a sufficient explanation? > > Of course, there's a fair amount of additional complexity that I have > not mentioned here (e.g. selecting the right combination of stats). Sorry, maybe I did not explain clearyly. My question is, if put patches only 0002 to 0005 into 9.6, does it still give any visible benefit to users? Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
В списке pgsql-hackers по дате отправления: