Re: [HACKERS] PATCH: multivariate histograms and MCV lists
From: Dean Rasheed
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date:
Msg-id: CAEZATCX04NOAqJm7+vb_jyrMxDo44Pzx8xd_z5rRH7snsLqxEA@mail.gmail.com
In reply to: Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
List: pgsql-hackers
On 16 July 2018 at 21:55, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 07/16/2018 02:54 PM, Dean Rasheed wrote:
>> On 16 July 2018 at 13:23, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>>> The top-level clauses allow us to make such deductions, with deeper
>>>>> clauses it's much more difficult (perhaps impossible). Because for
>>>>> example with (a=1 AND b=1) there can be just a single match, so if we
>>>>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>>>>> b=2)) it's not that simple, because there may be multiple combinations
>>>>> and so a match in MCV does not guarantee anything.
>>>>
>>>> Actually, it guarantees a lower bound on the overall selectivity, and
>>>> maybe that's the best that we can do in the absence of any other
>>>> stats.
>>>>
>>> Hmmm, is that actually true? Let's consider a simple example, with two
>>> columns, each with just 2 values, and a "perfect" MCV list:
>>>
>>>  a | b | frequency
>>> -------------------
>>>  1 | 1 | 0.5
>>>  2 | 2 | 0.5
>>>
>>> And let's estimate sel(a=1 & b=2).
>>
>> OK. In this case, there are no MCV matches, so there is no lower bound
>> (it's 0).
>>
>> What we could do though is also impose an upper bound, based on the
>> sum of non-matching MCV frequencies. In this case, the upper bound is
>> also 0, so we could actually say the resulting selectivity is 0.
>>
>
> Hmmm, it's not very clear to me how we would decide which of these cases
> applies, because in most cases we don't have an MCV list covering 100% of
> the rows.
>
> Anyways, I've been thinking about how to modify the code to work the way
> you proposed (in a way sufficient for a PoC). But after struggling with
> it for a while it occurred to me it might be useful to do it on paper
> first, to verify how it would work on your examples.
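The lower/upper bound reasoning quoted above can be sketched numerically. This is an illustrative Python sketch only (the names and structure are made up, not the actual PostgreSQL implementation), using the two-item "perfect" MCV list from the example:

```python
# (a, b) -> frequency; this MCV list happens to cover 100% of the rows
mcv = {(1, 1): 0.5, (2, 2): 0.5}

def matches(item):
    a, b = item
    return a == 1 and b == 2          # the clause: a=1 AND b=2

# Lower bound: sum of the frequencies of matching MCV items.
lower = sum(f for item, f in mcv.items() if matches(item))

# Upper bound: rows known not to match cannot contribute, so subtract
# the non-matching MCV frequencies from 1.
upper = 1.0 - sum(f for item, f in mcv.items() if not matches(item))

print(lower, upper)
```

With this MCV list both bounds come out as 0, so the estimate is forced to 0; with a partial MCV list the gap between the bounds is exactly the non-MCV fraction that the rest of the thread worries about.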
> So let's use this data
>
> create table foo(a int, b int);
> insert into foo select 1,1 from generate_series(1,50000);
> insert into foo select 1,2 from generate_series(1,40000);
> insert into foo select 1,x/10 from generate_series(30,250000) g(x);
> insert into foo select 2,1 from generate_series(1,30000);
> insert into foo select 2,2 from generate_series(1,20000);
> insert into foo select 2,x/10 from generate_series(30,500000) g(x);
> insert into foo select 3,1 from generate_series(1,10000);
> insert into foo select 3,2 from generate_series(1,5000);
> insert into foo select 3,x from generate_series(3,600000) g(x);
> insert into foo select x,x/10 from generate_series(4,750000) g(x);
>
> Assuming we have perfectly exact statistics, we have these MCV lists
> (both univariate and multivariate):
>
> select a, count(*), round(count(*) / 2254937.0, 4) AS frequency
> from foo group by a order by 2 desc;
>
>     a   | count  | frequency
>  -------+--------+-----------
>       3 | 614998 |    0.2727
>       2 | 549971 |    0.2439
>       1 | 339971 |    0.1508
>    1014 |      1 |    0.0000
>   57220 |      1 |    0.0000
>   ...
>
> select b, count(*), round(count(*) / 2254937.0, 4) AS frequency
> from foo group by b order by 2 desc;
>
>     b   | count | frequency
>  -------+-------+-----------
>       1 | 90010 |    0.0399
>       2 | 65010 |    0.0288
>       3 |    31 |    0.0000
>       7 |    31 |    0.0000
>   ...
>
> select a, b, count(*), round(count(*) / 2254937.0, 4) AS frequency
> from foo group by a, b order by 3 desc;
>
>     a   |    b   | count | frequency
>  -------+--------+-------+-----------
>       1 |      1 | 50000 |    0.0222
>       1 |      2 | 40000 |    0.0177
>       2 |      1 | 30000 |    0.0133
>       2 |      2 | 20000 |    0.0089
>       3 |      1 | 10000 |    0.0044
>       3 |      2 |  5000 |    0.0022
>       2 |  12445 |    10 |    0.0000
>   ...
>
> And let's estimate the two queries with complex clauses, where the
> multivariate stats gave 2x overestimates.
> SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
> -- actual 90000, univariate: 24253, multivariate: 181091
>
> univariate:
>
>   sel(a=1) = 0.1508
>   sel(b=1) = 0.0399
>   sel(b=2) = 0.0288
>   sel(b=1 or b=2) = 0.0673
>
> multivariate:
>
>   sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)
>
> The second multivariate estimate comes from assuming only the first 5
> items make it to the multivariate MCV list (covering 6.87% of the data)
> and extrapolating the selectivity to the non-MCV data too.
>
> (Notice it's about 2x the actual selectivity, so this extrapolation, due
> to not realizing the MCV already contains all the matches, is pretty much
> responsible for the whole over-estimate.)
>

Agreed. I think the actual MCV stats I got included the first 6 entries,
but yes, that's only around 7% of the data.

> So, how would the proposed algorithm work? Let's start with "a=1":
>
>   sel(a=1) = 0.1508
>
> I don't see much point in applying the two "b" clauses independently (or
> how that would be done, as it's effectively a single clause):
>
>   sel(b=1 or b=2) = 0.0673
>
> And we get
>
>   total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101
>
> From the multivariate MCV we get
>
>   mcv_sel = 0.0399
>
> And finally
>
>   total_sel = Max(total_sel, mcv_sel) = 0.0399
>
> Which is great, but I don't see how that actually helped anything? We
> still only have the estimate for the ~7% covered by the MCV list, and
> not the remaining non-MCV part.
>

Right. If these are the only stats available, and there are just 2
top-level clauses like this, it either returns the MCV estimate or the
old univariate estimate (whichever is larger). It avoids over-estimating,
but will almost certainly under-estimate when the MCV matches are
incomplete.

> I could do the same thing for the second query, but the problem there is
> actually exactly the same - extrapolation from the MCV to the non-MCV
> part roughly doubles the estimate.
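For concreteness, the combination step walked through above reduces to a few lines of arithmetic. This is only an illustrative sketch of the discussed proposal, using the numbers from the example (not code from the patch):

```python
sel_a = 0.1508                  # univariate sel(a=1)
sel_b = 0.0673                  # univariate sel(b=1 or b=2)

# Multiply the univariate estimates, assuming independence.
total_sel = sel_a * sel_b       # ~0.0101

# Frequencies of the matching multivariate MCV items (1,1) and (1,2).
mcv_sel = 0.0222 + 0.0177       # 0.0399

# Clamp from below by the MCV matches, which are a hard lower bound.
total_sel = max(total_sel, mcv_sel)
print(round(total_sel, 4))
```

This reproduces the 0.0399 result, and makes the objection visible: the `max()` only ever returns the MCV-covered portion or the independence estimate, never an estimate for the matching rows outside the MCV list.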
> So unless I'm applying your algorithm incorrectly, this does not seem
> like a very promising direction :-(
>

You could be right. Actually it's the order dependence with more than 2
top-level clauses that bothers me most about this algorithm. It's also
not entirely obvious how to include histogram stats in this scheme.

A different approach that I have been thinking about is, in
mcv_update_match_bitmap(), to attempt to work out the probability of all
the clauses matching and it not being an MCV value. For example, given a
clause like a=1 whose univariate estimate we know (0.1508 in the above
example), and knowing that the MCV values account for 0.0222+0.0177 of
that, the remainder must be from non-MCV values. So we could say that
the probability that a=1 and it is not an MCV value is
0.1508-0.0222-0.0177 = 0.1109.

So then the question is: could we combine these non-MCV probabilities to
give an estimate of how many non-MCV values we need to worry about? I've
not fully thought that through, but it might be useful.

The problem is, it's still likely to over-estimate when the MCV matches
fully cover all possibilities, and I still don't see a way to reliably
detect that case.

Regards,
Dean
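The "probability of matching and not being an MCV value" idea above amounts to subtracting the MCV-covered mass from the univariate estimate. A minimal sketch with the example's numbers (illustrative only; how to combine these per-clause remainders is exactly the open question in the message):

```python
sel_a = 0.1508              # univariate estimate of a=1
mcv_a = 0.0222 + 0.0177     # portion of a=1 covered by multivariate
                            # MCV items (1,1) and (1,2)

# Probability that a row satisfies a=1 AND is not in the MCV list.
non_mcv_a = sel_a - mcv_a
print(round(non_mcv_a, 4))
```

This yields 0.1109 for the a=1 clause; the unresolved part is turning a set of such per-clause remainders into a joint estimate for the non-MCV rows, and detecting the case where the remainder should really be zero.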