Re: multivariate statistics (v19)

From: Tomas Vondra
Subject: Re: multivariate statistics (v19)
Date:
Msg-id: 0ce73e37-7be4-8d9f-1ec4-46b0ca1d90c6@2ndquadrant.com
In reply to: Re: multivariate statistics (v19)  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses: Re: multivariate statistics (v19)  (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
Hi,

Thanks for looking into this!

On 09/12/2016 04:08 PM, Dean Rasheed wrote:
> On 3 August 2016 at 02:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is v19 of the "multivariate stats" patch series
>
> Hi,
>
> I started looking at this - just at a very high level - I've not read
> much of the detail yet, but here are some initial review comments.
>
> I think the overall infrastructure approach for CREATE STATISTICS
> makes sense, and I agree with other suggestions upthread that it would
> be useful to be able to build statistics on arbitrary expressions,
> although that doesn't need to be part of this patch, it's useful to
> keep that in mind as a possible future extension of this initial
> design.
>
> I can imagine it being useful to be able to create user-defined
> statistics on an arbitrary list of expressions, and I think that would
> include univariate as well as multivariate statistics. Perhaps that's
> something to take into account in the naming of things, e.g., as David
> Rowley suggested, something like pg_statistic_ext, rather than
> pg_mv_statistic.
>
> I also like the idea that this might one day be extended to support
> statistics across multiple tables, although I think that might be
> challenging to achieve -- you'd need a method of taking a random
> sample of rows from a join between 2 or more tables. However, if the
> intention is to be able to support that one day, I think that needs to
> be accounted for in the syntax now -- specifically, I think it will be
> too limiting to only support things extending the current syntax of
> the form table1(col1, col2, ...), table2(col1, col2, ...), because
> that precludes building statistics on an expression referring to
> columns from more than one table. So I think we should plan further
> ahead and use a syntax giving greater flexibility in the future, for
> example something structured more like a query (like CREATE VIEW):
>
> CREATE STATISTICS name
>   [ WITH (options) ]
>   ON expression [, ...]
>   FROM table [, ...]
>   WHERE condition
>
> where the first version of the patch would only support expressions
> that are simple column references, and would require at least 2 such
> columns from a single table with no WHERE clause, i.e.:
>
> CREATE STATISTICS name
>   [ WITH (options) ]
>   ON column1, column2 [, ...]
>   FROM table
>
> For multi-table statistics, a WHERE clause would typically be needed
> to specify how the tables are expected to be joined, but potentially
> such a clause might also be useful in single-table statistics, to
> build partial statistics on a commonly queried subset of the table,
> just like a partial index.

Hmm, the "partial statistics" idea seems interesting. It would allow us 
to provide additional, more detailed statistics for only a subset of a 
table.

However, I'm not sure about the join case - how would the syntax work 
with outer joins? But as you said, we only need

    CREATE STATISTICS name
      [ WITH (options) ]
      ON (column1, column2 [, ...])
      FROM table
      WHERE condition

until we add support for join statistics.
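
Just to check I understand the proposal - a partial statistics object 
would then look something like this (table/column names made up, of 
course):

    CREATE STATISTICS order_stats
      ON customer_id, status
      FROM orders
      WHERE NOT processed;

i.e. pretty much a partial index, minus the index.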

>
> Regarding the statistics themselves, I read the description of soft
> functional dependencies, and I'm somewhat skeptical about that
> algorithm. I don't like the arbitrary thresholds or the sudden jump
> from independence to dependence and clause reduction. As others have
> said, I think this should account for a continuous spectrum of
> dependence from fully independent to fully dependent, and combine
> clause selectivities in a way based on the degree of dependence. For
> example, if you computed an estimate for the fraction 'f' of the
> table's rows for which a -> b, then it might be reasonable to combine
> the selectivities using
>
>   P(a,b) = P(a) * (f + (1-f) * P(b))
>

Yeah, I agree that the thresholds resulting in sudden changes between 
"dependent" and "not dependent" are annoying. The question is whether it 
makes sense to fix that, though - the functional dependencies were meant 
as the simplest form of statistics, allowing us to get the rest of the 
infrastructure in.

I'm OK with replacing the true/false dependencies with a degree of 
dependency between 0 and 1, but I'm a bit afraid it'll result in 
complaints that the first patch got too large / complicated.
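
Just to make sure I read the proposed formula right - with made-up 
numbers, say P(a) = 0.01, P(b) = 0.1 and degree of dependency f = 0.9, 
it gives

    P(a,b) = 0.01 * (0.9 + 0.1 * 0.1) = 0.0091

which degrades nicely to the independent estimate 0.001 for f = 0 and 
to P(a) = 0.01 for f = 1, instead of the current all-or-nothing jump.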

It also contradicts the idea of using functional dependencies as a 
low-overhead type of statistics, filtering the list of clauses that need 
to be estimated using more expensive types of statistics (MCV lists, 
histograms, ...). Switching to a degree of dependency would prevent 
removal of "unnecessary" clauses.

> Of course, having just a single number that tells you the columns are
> correlated, tells you nothing about whether the clauses on those
> columns are consistent with that correlation. For example, in the
> following table
>
> CREATE TABLE t(a int, b int);
> INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);
>
> 'b' is functionally dependent on 'a' (and vice versa), but if you
> query the rows with a<50 and with b<50, those clauses behave
> essentially independently, because they're not consistent with the
> functional dependence between 'a' and 'b', so the best way to combine
> their selectivities is just to multiply them, as we currently do.
>
> So whilst it may be interesting to determine that 'b' is functionally
> dependent on 'a', it's not obvious whether that fact by itself should
> be used in the selectivity estimates. Perhaps it should, on the
> grounds that it's best to attempt to use all the available
> information, but only if there are no more detailed statistics
> available. In any case, knowing that there is a correlation can be
> used as an indicator that it may be worthwhile to build more detailed
> multivariate statistics like a MCV list or a histogram on those
> columns.
>

Right. IIRC this is actually described in the README as "incompatible 
conditions". While implementing it, I concluded that this is OK and it's 
up to the developer to decide whether the queries are compatible with 
the "assumption of compatibility". But maybe this is reasoning is bogus 
and makes (the current implementation of) functional dependencies 
unusable in practice.
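
To put some numbers on your example (assuming I did the arithmetic 
right): a<50 matches 500 rows, b<50 matches 500 rows, and the 
combination matches roughly 260 rows

    SELECT count(*) FROM t WHERE a < 50;             -- 500
    SELECT count(*) FROM t WHERE b < 50;             -- 500
    SELECT count(*) FROM t WHERE a < 50 AND b < 50;  -- ~260

which is quite close to the ~250 you'd get by simply multiplying the 
per-clause selectivities, and nowhere near the ~500 a blind "b is 
implied by a" clause reduction would produce.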

But I like the idea of reversing the order from

(a) look for functional dependencies
(b) reduce the clauses using functional dependencies
(c) estimate the rest using multivariate MCV/histograms

to

(a) estimate the clauses using multivariate MCV/histograms
(b) try to apply functional dependencies on the remaining clauses

It contradicts the idea of functional dependencies as "low-overhead 
statistics" but maybe it's worth it.

>
> Looking at the ndistinct coefficient 'q', I think it would be better
> if the recorded statistic were just the estimate for
> ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a
> more fundamental statistic, and it's easier to document and easier to
> interpret. Also, I don't believe that the coefficient 'q' is the right
> number to use for clause estimation:
>

IIRC the reason why I stored the coefficient instead of the ndistinct() 
values is that the coefficients are not directly related to the number 
of rows in the original relation, so you can apply them directly to 
whatever cardinality estimate you have.

Otherwise it's mostly the same information - it's trivial to compute one 
from the other.
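
E.g. for the table from your example earlier (if I read the README 
right, the coefficient is the product of the per-column ndistincts 
divided by the ndistinct of the combination):

    SELECT count(DISTINCT a)      AS nd_a,   -- 100
           count(DISTINCT b)      AS nd_b,   -- 100
           count(DISTINCT (a, b)) AS nd_ab   -- 100
    FROM t;

    -- q = nd_a * nd_b / nd_ab = 100 * 100 / 100 = 100

so storing ndistinct(a,b) = 100 instead of q = 100 carries exactly the 
same information, it's only a question of which form is easier to apply.
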
>
> Looking at README.ndistinct, I'm skeptical about the selectivity
> estimation argument. In the case where a -> b, you'd have q =
> ndistinct(b), so then P(a=1 & b=2) would become 1/ndistinct(a), which
> is fine for a uniform distribution. But typically, there would be
> univariate statistics on a and b, so if for example a=1 were 100x more
> likely than average, you'd probably know that and the existing code
> computing P(a=1) would reflect that, whereas simply using P(a=1 & b=2)
> = 1/ndistinct(a) would be a significant underestimate, since it would
> be ignoring known information about the distribution of a.
>
> But likewise if, as is later argued, you were to use 'q' as a
> correction factor applied to the individual clause selectivities, you
> could end up with significant overestimates: if you said P(a=1 & b=2)
> = q * P(a=1) * P(b=2), and a=1 were 100x more likely than average, and
> a -> b, then b=2 would also be 100x more likely than average (assuming
> that b=2 was the value implied by the functional dependency), and that
> would also be reflected in the univariate statistics on b, so then you'd
> end up with an overall selectivity of around 10000/ndistinct(a), which
> would be 100x too big. In fact, since a -> b means that q =
> ndistinct(b), there's a good chance of hitting data for which q * P(b)
> is greater than 1, so this formula would lead to a combined
> selectivity greater than P(a), which is obviously nonsense.

Well, yeah. The
    P(a=1) = 1/ndistinct(a)

was really just a simplification for the uniform distribution, and 
looking at "q" as a correction factor is much more practical - no doubt 
about that.
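
(And yes, the overestimate you describe is easy to see with actual 
numbers - with ndistinct(a) = ndistinct(b) = 1000, a value of "a" that 
is 100x more likely than average and a -> b, we'd get q = 1000 and

    q * P(a=1) * P(b=2) = 1000 * 0.1 * 0.1 = 10

i.e. a "selectivity" of 10, which is obviously nonsense.)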

As for the overestimates and underestimates - I don't think we can 
entirely prevent that. We're essentially replacing one assumption (AVIA) 
with other assumptions (homogeneity for ndistinct, compatibility for 
functional dependencies), hoping that those assumptions are weaker in 
some sense. But there'll always be cases that break those assumptions 
and I don't think we can prevent that.

Unlike the functional dependencies, this "homogeneity" assumption is not 
dependent on the queries at all, so it should be possible to verify it 
during ANALYZE.

Also, maybe we could/should use the same approach as for functional 
dependencies, i.e. try using more detailed statistics first and then 
apply ndistinct coefficients only on the remaining clauses?

>
> Having a better estimate for ndistinct(a,b,...) looks very useful by
> itself for GROUP BY estimation, and there may be other places that
> would benefit from it, but I don't think it's the best statistic for
> determining functional dependence or combining clause selectivities.
>

Not sure. I think it may be a very useful type of statistics, but I'm not 
going to fight for this very hard. I'm fine with ignoring this 
statistics type for now, getting the other "detailed" statistics types 
(MCV, histograms) in and then revisiting this.
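
That said, you're right that the GROUP BY case alone would benefit. IIRC 
on the example table from above the planner simply multiplies the 
per-column ndistinct estimates and clamps to the row count, so

    EXPLAIN SELECT a, b FROM t GROUP BY a, b;

gets estimated at something like 1000 groups (100 * 100, clamped to the 
1000 rows), while the actual number of groups is 100. An ndistinct(a,b) 
estimate would fix exactly that.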

> That's as much as I've looked at so far. It's such a big patch that
> it's difficult to consider all at once. I think perhaps the smallest
> committable self-contained unit providing a tangible benefit would be
> something containing the core infrastructure plus the ndistinct
> estimate and the improved GROUP BY estimation.
>

FWIW I find the ndistinct statistics rather uninteresting (at least 
compared to the other types of statistics), which is why it's the last 
patch in the patch series. Perhaps I shouldn't have included it at all, 
as it's just a distraction.


regards

Tomas Vondra


