Re: [HACKERS] PATCH: multivariate histograms and MCV lists
От | Tomas Vondra |
---|---|
Тема | Re: [HACKERS] PATCH: multivariate histograms and MCV lists |
Дата | |
Msg-id | 80bc98b2-1c43-1fe8-b316-d447bb68cff2@2ndquadrant.com обсуждение исходный текст |
Ответ на | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Ответы |
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
(Dean Rasheed <dean.a.rasheed@gmail.com>)
Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>) |
Список | pgsql-hackers |
On 07/16/2018 02:54 PM, Dean Rasheed wrote: > On 16 July 2018 at 13:23, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >>>> The top-level clauses allow us to make such deductions, with deeper >>>> clauses it's much more difficult (perhaps impossible). Because for >>>> example with (a=1 AND b=1) there can be just a single match, so if we >>>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR >>>> b=2)) it's not that simple, because there may be multiple combinations >>>> and so a match in MCV does not guarantee anything. >>> >>> Actually, it guarantees a lower bound on the overall selectivity, and >>> maybe that's the best that we can do in the absence of any other >>> stats. >>> >> Hmmm, is that actually true? Let's consider a simple example, with two >> columns, each with just 2 values, and a "perfect" MCV list: >> >> a | b | frequency >> ------------------- >> 1 | 1 | 0.5 >> 2 | 2 | 0.5 >> >> And let's estimate sel(a=1 & b=2). > > OK.In this case, there are no MCV matches, so there is no lower bound (it's 0). > > What we could do though is also impose an upper bound, based on the > sum of non-matching MCV frequencies. In this case, the upper bound is > also 0, so we could actually say the resulting selectivity is 0. > Hmmm, it's not very clear to me how would we decide which of these cases applies, because in most cases we don't have MCV covering 100% rows. Anyways, I've been thinking about how to modify the code to wort the way you proposed (in a way sufficient for a PoC). But after struggling with it for a while it occurred to me it might be useful to do it on paper first, to verify how would it work on your examples. So let's use this data create table foo(a int, b int); insert into foo select 1,1 from generate_series(1,50000); insert into foo select 1,2 from generate_series(1,40000); insert into foo select 1,x/10 from generate_series(30,250000) g(x); insert into foo select 2,1 from generate_series(1,30000); insert into foo select 2,2 from generate_series(1,20000); insert into foo select 2,x/10 from generate_series(30,500000) g(x); insert into foo select 3,1 from generate_series(1,10000); insert into foo select 3,2 from generate_series(1,5000); insert into foo select 3,x from generate_series(3,600000) g(x); insert into foo select x,x/10 from generate_series(4,750000) g(x); Assuming we have perfectly exact statistics, we have these MCV lists (both univariate and multivariate): select a, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by a order by 2 desc; a | count | frequency --------+--------+----------- 3 | 614998 | 0.2727 2 | 549971 | 0.2439 1 | 339971 | 0.1508 1014 | 1 | 0.0000 57220 | 1 | 0.0000 ... select b, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by b order by 2 desc; b | count | frequency --------+-------+----------- 1 | 90010 | 0.0399 2 | 65010 | 0.0288 3 | 31 | 0.0000 7 | 31 | 0.0000 ... select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by a, b order by 3 desc; a | b | count | frequency --------+--------+-------+----------- 1 | 1 | 50000 | 0.0222 1 | 2 | 40000 | 0.0177 2 | 1 | 30000 | 0.0133 2 | 2 | 20000 | 0.0089 3 | 1 | 10000 | 0.0044 3 | 2 | 5000 | 0.0022 2 | 12445 | 10 | 0.0000 ... And let's estimate the two queries with complex clauses, where the multivariate stats gave 2x overestimates. SELECT * FROM foo WHERE a=1 and (b=1 or b=2); -- actual 90000, univariate: 24253, multivariate: 181091 univariate: sel(a=1) = 0.1508 sel(b=1) = 0.0399 sel(b=2) = 0.0288 sel(b=1 or b=2) = 0.0673 multivariate: sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770) The second multivariate estimate comes from assuming only the first 5 items make it to the multivariate MCV list (covering 6.87% of the data) and extrapolating the selectivity to the non-MCV data too. (Notice it's about 2x the actual selectivity, so this extrapolation due to not realizing the MCV already contains all the matches is pretty much responsible for the whole over-estimate). So, how would the proposed algorithm work? Let's start with "a=1": sel(a=1) = 0.1508 I don't see much point in applying the two "b" clauses independently (or how would it be done, as it's effectively a single clause): sel(b=1 or b=2) = 0.0673 And we get total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101 From the multivariate MCV we get mcv_sel = 0.0399 And finally total_sel = Max(total_sel, mcv_sel) = 0.0399 Which is great, but I don't see how that actually helped anything? We still only have the estimate for the ~7% covered by the MCV list, and not the remaining non-MCV part. I could do the same thing for the second query, but the problem there is actually exactly the same - extrapolation from MCV to non-MCV part roughly doubles the estimate. So unless I'm applying your algorithm incorrectly, this does not seem like a very promising direction :-( There may be valuable information we could learn from the univariate estimates (using a Max() of them as an upper boundary seems reasonable), but that's still quite crude. And it will only ever work with simple top-level clauses. Once the clauses get more complicated, it seems rather tricky - presumably multivariate stats would be only used for correlated columns, so trying to deduce something from univariate estimates on complex clauses on such columns seems somewhat suspicious. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
В списке pgsql-hackers по дате отправления: