Re: [HACKERS] PATCH: multivariate histograms and MCV lists
От | Dean Rasheed |
---|---|
Тема | Re: [HACKERS] PATCH: multivariate histograms and MCV lists |
Дата | |
Msg-id | CAEZATCVEU=ZZ31yqQY=nUg+p7dBLxqJg6iDN-9MoE81uYs-8dw@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Ответы |
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
(Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Список | pgsql-hackers |
On Wed, 9 Jan 2019 at 15:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 1/8/19 3:18 PM, Dean Rasheed wrote: > > So actually, the estimate for a group of values will be either the MCV > > item's frequency (if the MCV item is kept), or (roughly) the MCV > > item's base_frequency (if the MCV item is not kept). That suggests > > that we should simply keep items that are significantly more or less > > common than the item's base frequency -- i.e., keep rule (b) and ditch > > rule (a). > > > > Hmmm, but won't that interfere with how we with how we extrapolate the > MCV estimate to the non-MCV part? Currently the patch does what you > proposed, i.e. > > other_sel = simple_sel - mcv_basesel; > > I'm worried that if we only include the items that are significantly > more or less common than the base frequency, it may skew the other_sel > estimate. > I don't see how that would skew other_sel. Items close to the base frequency would also tend to be close to simple_sel, making other_sel approximately zero, so excluding them should have little effect. However... Re-reading the thread where we enhanced the per-column MCV stats last year [1], it was actually the case that an algorithm based on just looking at the relative standard error worked pretty well for a very wide range of data distributions. The final algorithm chosen in analyze_mcv_list() was only a marginal improvement on that, and was directly based upon the fact that, in the univariate statistics case, all the values not included in the MCV list are assigned the same selectivity. However, that's not the case for multivariate stats, because each group not included in the multivariate MCV list gets assigned a different selectivity based on its per-column stats. So perhaps what we should do for multivariate stats is simply use the relative standard error approach (i.e., reuse the patch in [2] with a 20% RSE cutoff). That had a lot of testing at the time, against a wide range of data distributions, and proved to be very good, not to mention being very simple. That approach would encompass both groups more and less common than the base frequency, because it relies entirely on the group appearing enough times in the sample to infer that any errors on the resulting estimates will be reasonably well controlled. It wouldn't actually look at the base frequency at all in deciding which items to keep. Moreover, if the group appears sufficiently often in the sample to justify being kept, each of the individual column values must also appear at least that often as well, which means that the errors on the base frequency estimate are also well controlled. That was one of my concerns about other algorithms such as "keep items significantly more or less common than the base frequency" -- in the less common case, there's no lower bound on the number of occurrences seen, and so no guarantee that the errors are kept under control. Regards, Dean [1] https://www.postgresql.org/message-id/flat/CAMkU%3D1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO%2Ba%3D4DVkzpuQ2tw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAEZATCUEmHCZeOHJN8JO5O9LK_VuFeCecy_AxTk7S_2SmLXeyw%40mail.gmail.com
В списке pgsql-hackers по дате отправления:
Следующее
От: Tomas VondraДата:
Сообщение: Re: [HACKERS] PATCH: multivariate histograms and MCV lists