Re: Cross-column statistics revisited

From Nathan Boley
Subject Re: Cross-column statistics revisited
Date
Msg-id 6fa3b6e20810171854p4bd1efe5o8e16653a52176727@mail.gmail.com
In reply to Re: Cross-column statistics revisited  ("Joshua Tolley" <eggyknap@gmail.com>)
Responses Re: Cross-column statistics revisited  ("Joshua Tolley" <eggyknap@gmail.com>)
List pgsql-hackers
> I'm still working my way around the math, but copulas sound better
> than anything else I've been playing with.

I think the easiest way to think of them is that, in 2-D finite spaces,
they are just a plot of the order statistics against one another. Feel
free to mail me off list if you have any math questions.
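
To make the "plot of the order statistics" picture concrete, here is a
minimal sketch of an empirical copula for two columns: replace each value
by its rank, then count how the rank pairs fall into a coarse grid. The
function names, the grid resolution, and the tie-breaking rule (ties are
ordered by position) are assumptions made for illustration, not anything
proposed in this thread.

```python
def ranks(values):
    """Return the zero-based rank of each value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r


def empirical_copula_grid(xs, ys, bins=4):
    """Bin rank pairs into a bins x bins grid holding the fraction of rows per cell.

    Because ranks are used instead of raw values, each row and each column
    of the grid sums to roughly 1/bins whatever the marginal distributions
    are; only the dependence between the two columns shapes the grid.
    """
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    grid = [[0.0] * bins for _ in range(bins)]
    for a, b in zip(rx, ry):
        # rank * bins // n maps ranks 0..n-1 onto cells 0..bins-1
        grid[a * bins // n][b * bins // n] += 1.0 / n
    return grid


if __name__ == "__main__":
    xs = list(range(8))
    ys = [x * x for x in xs]  # monotone in xs, so the mass sits on the diagonal
    for row in empirical_copula_grid(xs, ys):
        print(row)
```

Independent columns would spread the mass evenly over all cells, so the
deviation from a flat grid is what a cross-column statistic would have to
store.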

I've previously thought that, at least in the 2D case, we could use
image compression algorithms to compress the copula, but recently I've
realized that this is a change point problem. In terms of compression,
we want to decompose the copula into regions that are as homogeneous as
possible.  I'm not familiar with change point problems in multiple
dimensions, but I'll try to ask someone who is, probably late next
week. If you decide to go the copula route, I'd be happy to write the
decomposition algorithm - or at least work on the theory.

Finally, a couple of points that I hadn't seen mentioned earlier that
should probably be considered:

1) NULLs need to be treated specially - I suspect the assumption of
NULL independence is worse than other independence assumptions. Maybe
dealing with NULL dependence could be a half step towards full
dependence calculations?

2) Do we want to fold the MCVs into the dependence histogram? That
will cause problems in our copula approach but I'd hate to have to
keep an N^d histogram dependence relation in addition to the copula.

3) For equality selectivity estimates, I believe the assumption that
the ndistinct value distribution is uniform in the histogram will
become worse as the dimension increases. I proposed keeping track of
ndistinct per histogram bucket earlier in the marginal case, partially
motivated by this exact scenario. Does that proposal make more sense
in this case? If so, we'd need to store two distributions - one of the
counts and one of ndistinct.

4) How will this approach deal with histogram buckets that have
scaling count sizes (i.e. -0.4)?
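
To illustrate point 3 in the marginal (one-column) case, here is a
minimal sketch comparing an equality estimate that assumes distinct
values are spread uniformly over the whole histogram with one that uses
a per-bucket ndistinct. Everything here is hypothetical: the function
names, the layout of the bucket arrays, and the example numbers are
assumptions for illustration and do not correspond to PostgreSQL's
actual estimator or catalog layout.

```python
from bisect import bisect_right


def eq_sel_uniform(ndistinct_total):
    """Equality selectivity if every distinct value is assumed equally frequent."""
    return 1.0 / ndistinct_total


def eq_sel_per_bucket(value, bounds, bucket_fraction, bucket_ndistinct):
    """Equality selectivity using the count and ndistinct of the value's bucket.

    bounds has one more entry than there are buckets; bucket i covers
    bounds[i] <= value < bounds[i + 1], and the last bucket also includes
    its upper bound.
    """
    if value < bounds[0] or value > bounds[-1]:
        return 0.0  # outside the histogram range
    nbuckets = len(bounds) - 1
    i = min(bisect_right(bounds, value) - 1, nbuckets - 1)
    return bucket_fraction[i] / bucket_ndistinct[i]


if __name__ == "__main__":
    # Hypothetical equi-depth histogram: three buckets, one third of the rows each.
    bounds = [0, 10, 20, 1000]
    fraction = [1 / 3, 1 / 3, 1 / 3]
    ndistinct = [10, 10, 980]  # distinct values are very unevenly spread

    print(eq_sel_uniform(sum(ndistinct)))
    print(eq_sel_per_bucket(5, bounds, fraction, ndistinct))
    print(eq_sel_per_bucket(500, bounds, fraction, ndistinct))
```

With these made-up numbers the two estimates differ by more than an
order of magnitude in the dense buckets, which is the kind of error the
second (ndistinct) distribution would be stored to avoid.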

