Re: Cross-column statistics revisited

From: Tom Lane
Subject: Re: Cross-column statistics revisited
Date:
Msg-id: 6877.1224211124@sss.pgh.pa.us
In response to: Re: Cross-column statistics revisited  ("Joshua Tolley" <eggyknap@gmail.com>)
Responses: Re: Cross-column statistics revisited  ("Joshua Tolley" <eggyknap@gmail.com>)
List: pgsql-hackers

"Joshua Tolley" <eggyknap@gmail.com> writes:
> For what it's worth, neither version of correlation was what I had in
> mind. Statistical correlation between two variables is a single
> number, is fairly easy to calculate, and probably wouldn't help query
> plans much at all. I'm more interested in a more complex data
> gathering. The data model I have in mind (which I note I have *not*
> proven to actually help a large number of query plans -- that's
> obviously an important part of what I'd need to do in all this)
> involves instead a matrix of frequency counts.

Oh, I see the distinction you're trying to draw.  Agreed on both points:
a simple correlation number is pretty useless to us, and we don't have
hard evidence that a histogram-matrix will solve the problem.  However,
we do know that one-dimensional histograms of the sort currently
collected by ANALYZE work pretty well (if they're detailed enough).
It seems reasonable to me to extrapolate that same concept to two or
more dimensions.  The issue then becomes that a "detailed enough"
matrix might be impracticably bulky, so you need some kind of lossy
compression, and we don't have hard evidence about how well that will
work.  Nonetheless the road map seems clear enough to me.
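
To make the bulk concern concrete, here is a minimal sketch (not code from the thread or from PostgreSQL) of how the cell count of a dense histogram matrix grows with the number of columns. The function name `matrix_cells` and the figure of 100 bins per column are illustrative assumptions.

```python
def matrix_cells(bins_per_column, columns):
    """Cells in a dense histogram matrix with the same bin count on every axis."""
    return bins_per_column ** columns


# Hypothetical setting: 100 bins per column, for 1, 2 and 3 columns.
for columns in (1, 2, 3):
    print(columns, matrix_cells(100, columns))
```

Under that assumption, two columns already need 10,000 cells and three need 1,000,000, which is why some form of lossy compression comes up as soon as the matrix is "detailed enough".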

> Right now our
> "histogram" values are really quantiles; the statistics_target T for a
> column determines a number of quantiles we'll keep track of, and we
> grab values into an ordered list L so that approximately 1/T of
> the entries in that column fall between values L[n] and L[n+1]. I'm
> thinking that multicolumn statistics would instead divide the range of
> each column up into T equally sized segments,

Why would you not use the same histogram bin bounds derived for the
scalar stats (along each axis of the matrix, of course)?  This seems to
me to be arbitrarily replacing something proven to work with something
not proven.  Also, the above forces you to invent a concept of "equally
sized" ranges, which is going to be pretty bogus for a lot of datatypes.
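
The suggestion above can be sketched roughly as follows: derive quantile-style (equi-depth) bounds for each column separately, then reuse those same bounds along each axis of a matrix of frequency counts. This is only an illustration under simplifying assumptions, not ANALYZE's actual algorithm; the helper names (`equi_depth_bounds`, `bin_index`, `frequency_matrix`) and the sample data are hypothetical. Note that it needs only an ordering on each column, not a notion of "equally sized" ranges.

```python
from bisect import bisect_right


def equi_depth_bounds(values, t):
    """Return t + 1 bounds so that roughly len(values) / t entries fall in each bin."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[(i * last) // t] for i in range(t + 1)]


def bin_index(bounds, value):
    """Map a value to a bin number in the range 0 .. len(bounds) - 2."""
    i = bisect_right(bounds, value) - 1
    return min(max(i, 0), len(bounds) - 2)


def frequency_matrix(rows, t):
    """Count rows per cell, using each column's own scalar bounds on its axis."""
    a_bounds = equi_depth_bounds([a for a, _ in rows], t)
    b_bounds = equi_depth_bounds([b for _, b in rows], t)
    matrix = [[0] * t for _ in range(t)]
    for a, b in rows:
        matrix[bin_index(a_bounds, a)][bin_index(b_bounds, b)] += 1
    return a_bounds, b_bounds, matrix


# Hypothetical, perfectly correlated sample: b is always 2 * a.
rows = [(a, 2 * a) for a in range(100)]
a_bounds, b_bounds, matrix = frequency_matrix(rows, 4)
for row in matrix:
    print(row)
```

With this sample every row lands on the diagonal of the matrix, which is exactly the kind of dependency that two independent one-dimensional histograms cannot express.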
        regards, tom lane

