Re: Cross-column statistics revisited

From	Tom Lane
Subject	Re: Cross-column statistics revisited
Date	16 October 2008, 23:39:11
Msg-id	6877.1224211124@sss.pgh.pa.us
In reply to	Re: Cross-column statistics revisited ("Joshua Tolley" <eggyknap@gmail.com>)
Responses	Re: Cross-column statistics revisited
List	pgsql-hackers

"Joshua Tolley" <eggyknap@gmail.com> writes:
> For what it's worth, neither version of correlation was what I had in
> mind. Statistical correlation between two variables is a single
> number, is fairly easy to calculate, and probably wouldn't help query
> plans much at all. I'm more interested in a more complex data
> gathering. The data model I have in mind (which I note I have *not*
> proven to actually help a large number of query plans -- that's
> obviously an important part of what I'd need to do in all this)
> involves instead a matrix of frequency counts.

Oh, I see the distinction you're trying to draw.  Agreed on both points:
a simple correlation number is pretty useless to us, and we don't have
hard evidence that a histogram-matrix will solve the problem.  However,
we do know that one-dimensional histograms of the sort currently
collected by ANALYZE work pretty well (if they're detailed enough).
It seems reasonable to me to extrapolate that same concept to two or
more dimensions.  The issue then becomes that a "detailed enough"
matrix might be impracticably bulky, so you need some kind of lossy
compression, and we don't have hard evidence about how well that will
work.  Nonetheless the road map seems clear enough to me.
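
To make the extrapolation concrete, here is a minimal Python sketch of a
two-dimensional histogram matrix built on per-column quantile bounds. It is
not how ANALYZE is implemented: the helper names (equi_depth_bounds,
bucket_index, histogram_matrix), the sample data, and the bucket count T are
all assumptions made up for illustration.

```python
from bisect import bisect_right

def equi_depth_bounds(values, buckets):
    """Return buckets+1 boundary values so that each bucket holds
    roughly 1/buckets of the rows (quantile-style bounds)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[(i * last) // buckets] for i in range(buckets + 1)]

def bucket_index(bounds, value):
    """Map a value to its bucket number, clamping to the first/last bucket."""
    idx = bisect_right(bounds, value) - 1
    return min(max(idx, 0), len(bounds) - 2)

def histogram_matrix(rows, bounds_a, bounds_b):
    """Count rows falling in each (bucket of a, bucket of b) cell."""
    matrix = [[0] * (len(bounds_b) - 1) for _ in range(len(bounds_a) - 1)]
    for a, b in rows:
        matrix[bucket_index(bounds_a, a)][bucket_index(bounds_b, b)] += 1
    return matrix

# Hypothetical, perfectly correlated sample: b always equals a.
rows = [(i, i) for i in range(1000)]
T = 4  # illustrative bucket count per column; the matrix has T * T cells

bounds_a = equi_depth_bounds([a for a, _ in rows], T)
bounds_b = equi_depth_bounds([b for _, b in rows], T)
matrix = histogram_matrix(rows, bounds_a, bounds_b)

total = len(rows)
# Joint selectivity of "a in bucket 0 AND b in bucket 0" read from the matrix.
joint = matrix[0][0] / total
# The same estimate under an independence assumption between the columns.
row_frac = sum(matrix[0]) / total
col_frac = sum(r[0] for r in matrix) / total
independent = row_frac * col_frac
```

On this sample every row lands on the diagonal, so the matrix estimate for
the first cell is about four times the independence estimate. The cost is
T * T counters per column pair, which is where the bulk concern comes from.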

> Right now our
> "histogram" values are really quantiles; the statistics_target T for a
> column determines a number of quantiles we'll keep track of, and we
> grab values from into an ordered list L so that approximately 1/T of
> the entries in that column fall between values L[n] and L[n+1]. I'm
> thinking that multicolumn statistics would instead divide the range of
> each column up into T equally sized segments,

Why would you not use the same histogram bin bounds derived for the
scalar stats (along each axis of the matrix, of course)?  This seems to
me to be arbitrarily replacing something proven to work with something
not proven.  Also, the above forces you to invent a concept of "equally
sized" ranges, which is going to be pretty bogus for a lot of datatypes.

			regards, tom lane
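
The difference between the two binning schemes discussed above can be
sketched in Python. This is only an illustration under assumed data: the
helper names and the skewed sample column are hypothetical, and the quantile
helper is a simplification of what the scalar stats collect.

```python
from bisect import bisect_right

def equi_depth_bounds(values, buckets):
    """Quantile-style bounds: needs only an ordering on the datatype."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[(i * last) // buckets] for i in range(buckets + 1)]

def equal_width_bounds(values, buckets):
    """Equally sized segments: needs subtraction and division, so it is
    only defined for datatypes with a meaningful notion of distance."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / buckets
    return [lo + i * step for i in range(buckets + 1)]

def bucket_counts(values, bounds):
    """Count how many values fall into each bucket defined by bounds."""
    counts = [0] * (len(bounds) - 1)
    for v in values:
        idx = min(max(bisect_right(bounds, v) - 1, 0), len(counts) - 1)
        counts[idx] += 1
    return counts

# Hypothetical skewed column: 990 small values plus 10 large outliers.
values = [i % 100 for i in range(990)] + [10_000 * k for k in range(1, 11)]
T = 4

depth_counts = bucket_counts(values, equi_depth_bounds(values, T))
width_counts = bucket_counts(values, equal_width_bounds(values, T))

# Quantile bounds also work for a datatype with ordering but no arithmetic.
names = ["ant", "bee", "cat", "dog", "eel"]
name_bounds = equi_depth_bounds(names, 2)
# equal_width_bounds(names, 2) would raise TypeError: str has no subtraction.
```

With this sample the quantile bounds keep the four buckets close to 250 rows
each, while the equal-width segments put nearly every row in the first
segment and leave the other three almost empty.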
