Re: proposal : cross-column stats

From: tv@fuzzy.cz
Subject: Re: proposal : cross-column stats
Date:
Msg-id: 3a60dfc58a70d5169062a828b32a4680.squirrel@sq.gransy.com
In response to: Re: proposal : cross-column stats  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Dne 13.12.2010 03:00, Robert Haas napsal(a):
>>> Well, the question is what data you are actually storing.  It's
>>> appealing to store a measure of the extent to which a constraint on
>>> column X constrains column Y, because you'd only need to store
>>> O(ncolumns^2) values, which would be reasonably compact and would
>>> potentially handle the zip code problem - a classic "hard case" rather
>>> neatly.  But that wouldn't be sufficient to use the above equation,
>>> because there A and B need to be things like "column X has value x",
>>> and it's not going to be practical to store a complete set of MCVs for
>>> column X for each possible value that could appear in column Y.
>>
>> O(ncolumns^2) values? You mean collecting such stats for each possible
>> pair of columns? Well, I meant something different.
>>
>> The proposed solution is based on contingency tables, built for selected
>> groups of columns (not for each possible group). And the contingency
>> table gives you the ability to estimate the probabilities needed to
>> compute the selectivity. Or am I missing something?
>
> Well, I'm not real familiar with contingency tables, but it seems like
> you could end up needing to store a huge amount of data to get any
> benefit out of it, in some cases.  For example, in the United States,
> there are over 40,000 postal codes, and some even larger number of
> city names, and doesn't the number of entries go as O(m*n)?  Now maybe
> this is useful enough anyway that we should Just Do It, but it'd be a
> lot cooler if we could find a way to give the planner a meaningful
> clue out of some more compact representation.

Yes, storing a complete contingency table is not feasible in such cases.
My original proposal actually did not run into this particular issue
(cities and ZIP codes), since it was based on simplified contingency tables
(with bins corresponding to the current histograms, not to individual
values), so the number of values to store grows much more slowly.
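Just to illustrate what I mean (a rough Python sketch, not code from any
patch; the names and the sampling interface are made up), the simplified
table only counts sampled rows per pair of histogram buckets:

```python
# Hypothetical sketch: a "simplified" contingency table whose cells follow
# the per-column histogram buckets rather than individual values, so the
# table size depends on the number of buckets, not on ndistinct.
from bisect import bisect_right

def bucket(bounds, value):
    """Map a value to its histogram bucket index (bounds = upper edges)."""
    return bisect_right(bounds, value)

def build_table(rows, bounds_a, bounds_b):
    """Count sampled (a, b) rows falling into each (bucket_a, bucket_b) cell."""
    table = {}
    for a, b in rows:
        cell = (bucket(bounds_a, a), bucket(bounds_b, b))
        table[cell] = table.get(cell, 0) + 1
    return table

def joint_selectivity(table, total_rows, cell):
    """Estimated P(A in bucket_a AND B in bucket_b) from the sample."""
    return table.get(cell, 0) / total_rows
```

With, say, 100 buckets per column this is at most 100 x 100 cells per
column pair, regardless of how many distinct values the columns have.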

On the other hand, this simplification really makes it unusable in some
cases, and the issue we're discussing here (cities and ZIP codes) is one
of them. I think in such cases we could build a contingency table over the
MCV lists and then use it to estimate the conditional probabilities we
need, but I expect it to be very tricky.
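Roughly, the MCV-based estimate I have in mind would look something like
this (again a hypothetical sketch, with made-up names; the fallback for
pairs not covered by the MCV lists is the tricky part):

```python
# Hypothetical sketch: estimate P(zip = z | city = c) from a contingency
# table restricted to MCV (most common value) pairs. Frequencies are the
# fractions stored in the statistics, not absolute counts.
def mcv_conditional(mcv_joint, mcv_city, city, zip_code):
    """
    mcv_joint: dict mapping (city, zip) -> joint frequency
    mcv_city:  dict mapping city -> marginal frequency
    Returns P(zip | city) when both values are covered by the MCV lists,
    or None so the caller can fall back (e.g. to assuming independence).
    """
    joint = mcv_joint.get((city, zip_code))
    marginal = mcv_city.get(city)
    if joint is None or not marginal:
        return None  # pair not covered by MCVs; must fall back
    return joint / marginal
```

The estimate is only as good as the MCV coverage, which is exactly why
I expect the non-MCV fallback to be the hard part.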

Thanks for the comments.

Tomas
