Re: Cross-column statistics revisited

From: Joshua Tolley
Subject: Re: Cross-column statistics revisited
Msg-id: e7e0a2570810161830y6e7939dby24328e7ae0808f73@mail.gmail.com
In response to: Re: Cross-column statistics revisited  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Cross-column statistics revisited  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

On Thu, Oct 16, 2008 at 6:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It appears to me that a lot of people in this thread are confusing
> correlation in the sense of statistical correlation between two
> variables with correlation in the sense of how well physically-ordered
> a column is.

For what it's worth, neither version of correlation was what I had in
mind. Statistical correlation between two variables is a single
number, is fairly easy to calculate, and probably wouldn't help query
plans much at all. I'm interested in a more complex kind of data
gathering. The data model I have in mind (which I note I have *not*
proven to actually help a large number of query plans -- that's
obviously an important part of what I'd need to do in all this)
involves instead a matrix of frequency counts. Right now our
"histogram" values are really quantiles; the statistics_target T for a
column determines a number of quantiles we'll keep track of, and we
grab values from the column into an ordered list L so that approximately 1/T of
the entries in that column fall between values L[n] and L[n+1]. I'm
thinking that multicolumn statistics would instead divide the range of
each column up into T equally sized segments, to form in the
two-column case a matrix, where the values of the matrix are frequency
counts -- the number of rows whose values for each column fall within
the particular segments of their respective ranges represented by the
boundaries of that cell in the matrix. I just realized while writing
this that this might not extend to situations where the two columns
are from different tables and don't necessarily have the same row
count, but I'll have to think about that.
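
A minimal sketch of the two layouts described above, in plain Python rather than anything taken from the PostgreSQL source. The function names (quantile_bounds, equal_width_cell, frequency_matrix) and the sample data are hypothetical; the sketch assumes numeric columns from a single table and takes each column's range from its observed minimum and maximum.

```python
def quantile_bounds(values, t):
    """Equal-frequency boundaries: about 1/t of the values fall
    between neighbouring entries of the returned list."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (i * n) // t)] for i in range(t + 1)]


def equal_width_cell(value, lo, hi, t):
    """Index of the equal-width segment of [lo, hi] containing value."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * t)
    return min(idx, t - 1)  # the maximum value belongs to the last segment


def frequency_matrix(rows, t):
    """rows is a list of (a, b) pairs; returns a t x t matrix whose
    cell [i][j] counts rows with a in segment i and b in segment j."""
    a_vals = [a for a, _ in rows]
    b_vals = [b for _, b in rows]
    a_lo, a_hi = min(a_vals), max(a_vals)
    b_lo, b_hi = min(b_vals), max(b_vals)
    matrix = [[0] * t for _ in range(t)]
    for a, b in rows:
        i = equal_width_cell(a, a_lo, a_hi, t)
        j = equal_width_cell(b, b_lo, b_hi, t)
        matrix[i][j] += 1
    return matrix


# Hypothetical data: column b is fully determined by column a.
rows = [(x, 2 * x) for x in range(100)]
bounds = quantile_bounds([a for a, _ in rows], 4)
matrix = frequency_matrix(rows, 4)
```

With perfectly dependent columns like these, all of the counts land on the diagonal of the matrix, which is exactly the information that per-column quantiles alone cannot express.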

Anyway, the size of such a matrix would be exponential in the number of
columns (T^k cells for k columns), and
cross-column statistics involving just a few columns could easily
involve millions of values, for fairly normal statistics_targets.
That's where the compression ideas come into play. This would
obviously need a fair bit of testing, but it's certainly conceivable
that modern regression techniques could reduce that frequency matrix
to a set of functions with a small number of parameters. Whether that
would result in values the planner can look up for a given set of
columns without spending more time than it's worth is another question
that will need exploring.
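
To put rough numbers on the size concern, here is a small arithmetic sketch. It assumes every tracked column is split into the same number of segments T, so k columns give T^k cells; the helper name matrix_cells and the target of 100 are illustrative assumptions, not values from this thread.

```python
def matrix_cells(target, columns):
    """Number of cells in a frequency matrix with `target` segments
    per column across `columns` columns."""
    return target ** columns


# Illustrative statistics target of 100 segments per column.
for k in (2, 3, 4):
    print(f"{k} columns: {matrix_cells(100, k):,} cells")
```

Under these assumptions three columns already need a million counters, which is why some form of compression would be needed before the planner could use such a matrix.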

I started this thread knowing that past discussions have posed the
following questions:

1. What sorts of cross-column data can we really use?
2. Can we collect that information?
3. How do we know what columns to track?

For what it's worth, my original question was whether anyone had
concerns beyond these, and I think that has been fairly well answered
in this thread.

- Josh / eggyknap

