Re: Cross-column statistics revisited

From Joshua Tolley
Subject Re: Cross-column statistics revisited
Date
Msg-id e7e0a2570810181333y5d0432f8ld6e0dea5fa8cf285@mail.gmail.com
Whole thread Raw
In response to Re: Cross-column statistics revisited  ("Nathan Boley" <npboley@gmail.com>)
Responses Re: Cross-column statistics revisited  ("Nathan Boley" <npboley@gmail.com>)
List pgsql-hackers
On Fri, Oct 17, 2008 at 7:54 PM, Nathan Boley <npboley@gmail.com> wrote:
>> I'm still working my way around the math, but copulas sound better
>> than anything else I've been playing with.
>
> I think the easiest way to think of them is, in 2-D finite spaces,
> they are just a plot of the order statistics against one another. Feel
> free to mail me off list if you have any math questions.

I'm still working out what a copula is, and I've got to figure out the
stuff below, too! :) /me has reading to do... The treatment of
copulae in one of my stats texts is much better than Wikipedia's, I
think, so I'm progressing. For whatever my opinion is worth, the more
I learn about copulae the more they sound like an excellent potential
solution, but making sure they work decently for all data types,
including those with no real notion of distance, may be interesting. I
still need to go through backend/utils/adt/selfuncs.c to figure out
exactly how we use the one-dimensional values.
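To make Nathan's description concrete, here's a minimal sketch (in Python, with invented sample data) of the "plot of the order statistics against one another": rank-transform each column into the unit interval and bin the resulting points. The correlated sample and the 10x10 grid size are assumptions for illustration only, not anything ANALYZE actually does.

```python
import random

random.seed(42)
n = 1000

# Hypothetical correlated sample, a stand-in for what ANALYZE might collect.
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.8 * x + 0.2 * random.gauss(0, 1) for x in xs]

def ranks(vals):
    # Map each value to its rank, scaled into (0, 1].
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    r = [0.0] * len(vals)
    for rank, i in enumerate(order, start=1):
        r[i] = rank / len(vals)
    return r

u, v = ranks(xs), ranks(ys)

# 10x10 grid over the unit square; each cell holds joint probability mass.
# If the columns were independent, every cell would hold about 1/100.
bins = 10
grid = [[0.0] * bins for _ in range(bins)]
for ui, vi in zip(u, v):
    grid[min(int(ui * bins), bins - 1)][min(int(vi * bins), bins - 1)] += 1 / n

# Mass on the diagonal far above bins * (1/bins**2) = 0.1 signals the
# positive dependence a per-column histogram can't see.
diag = sum(grid[i][i] for i in range(bins))
print(diag)
```

The marginals of this grid are uniform by construction, which is exactly why the copula separates the dependence structure from the per-column distributions we already keep.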

> I've previously thought that, at least in the 2D case, we could use
> image compression algorithms to compress the copula, but recently I've
> realized that this is a change point problem. In terms of compression,
> we want to decompose the copula into regions that are as homogenous as
> possible.  I'm not familiar with change point problems in multiple
> dimensions, but I'll try and ask someone that is, probably late next
> week. If you decide to go the copula route, I'd be happy to write the
> decomposition algorithm - or at least work on the theory.
>
> Finally, a couple points that I hadn't seen mentioned earlier that
> should probably be considered-
>
> 1) NULL's need to be treated specially - I suspect the assumption of
> NULL independence is worse than other independence assumptions. Maybe
> dealing with NULL dependence could be a half step towards full
> dependence calculations?

Agreed, though how to treat them I have no idea.

>
> 2) Do we want to fold the MCV's into the dependence histogram? That
> will cause problems in our copula approach but I'd hate to have to
> keep an N^d histogram dependence relation in addition to the copula.

Yeah, if we're already trying to figure out how to compress copulae,
also having to compress MCV matrices seems painful and error-prone.
But I'm not sure why it would cause problems to keep them in the
copula -- is that just because we are most interested in the copula
modeling the parts of the distribution that are most sparsely
populated?

> 3) For equality selectivity estimates, I believe the assumption that
> the ndistinct value distribution is uniform in the histogram will
> become worse as the dimension increases. I proposed keeping track of
> ndistinct per histogram bucket earlier in the marginal case, partially
> motivated by this exact scenario. Does that proposal make more sense
> in this case? If so we'd need to store two distributions - one of the
> counts and one of ndistinct.

It's definitely worth investigating (or seeing if someone else has
already published an investigation). There are probably several
similar stats we could track, provided using them didn't slow the
planner too much.
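As a rough illustration of the point in (3), here's a hedged sketch (hypothetical numbers, not from any real table) comparing the uniform-ndistinct assumption with a per-bucket ndistinct estimate for an equality predicate:

```python
# Invented example: a histogram bucket that holds a quarter of the rows
# but only a handful of the column's distinct values (a few very common
# values clustered together). All figures are made up for illustration.

bucket_fraction = 0.25   # fraction of the table's rows in this bucket
bucket_ndistinct = 5     # distinct values observed in this bucket
total_ndistinct = 100    # distinct values in the whole column

# Uniform assumption: every distinct value is equally likely everywhere.
uniform_estimate = 1.0 / total_ndistinct

# Per-bucket version: the bucket's row mass divided among its own values.
per_bucket_estimate = bucket_fraction / bucket_ndistinct

print(uniform_estimate, per_bucket_estimate)
```

Here the two estimates differ by a factor of five for the same predicate; the worry in the thread is that in d dimensions, skew like this compounds across buckets.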

> 4) How will this approach deal with histogram buckets that have
> scaling count sizes ( ie -0.4 )?

I'm not sure what you mean here.

- Josh / eggyknap


In the pgsql-hackers list, by sending date:

Previous
From: Martijn van Oosterhout
Date:
Message: Re: PGDay.it collation discussion notes
Next
From: "M. Edward (Ed) Borasky"
Date:
Message: Lisp as a procedural language?