Re: cross column correlation revisted

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: cross column correlation revisted
Дата
Msg-id 4C3D9432.2050201@enterprisedb.com
обсуждение исходный текст
Ответ на cross column correlation revisted  (PostgreSQL - Hans-Jürgen Schönig<postgres@cybertec.at>)
Ответы Re: cross column correlation revisted  (PostgreSQL - Hans-Jürgen Schönig<postgres@cybertec.at>)
Re: cross column correlation revisted  (Yeb Havinga <yebhavinga@gmail.com>)
Re: cross column correlation revisted  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On 14/07/10 13:12, PostgreSQL - Hans-Jürgen Schönig wrote:
> hello everybody,
>
> we are currently facing some serious issues with cross correlation issue.
> consider: 10% of all people have breast cancer. we have 2 genders (50:50).
> if i select all the men with breast cancer, i will get basically nobody - the planner will overestimate the output.
> this is the commonly known problem ...
>
> this cross correlation problem can be quite nasty in many many cases.
> underestimated nested loops can turn joins into a never ending nightmare and so on and so on.
>
> my ideas is the following:
> what if we allow users to specifiy cross-column combinations where we keep separate stats?
> maybe somehow like this ...
>
>     ALTER TABLE x SET CORRELATION STATISTICS FOR (id = id2 AND id3=id4)
>
> or ...
>
>     ALTER TABLE x SET CORRELATION STATISTICS FOR (x.id = y.id AND x.id2 = y.id2)
>
> clearly we cannot store correlation for all combinations of all columns so we somehow have to limit it.
>
> what is the general feeling about something like that?

+1 is my general feeling, it's good if you can tell the system to 
collect additional statistics where needed. And once you have that, you 
can write an agent or something to detect automatically which extra 
statistics might be useful.

However, the problem is how to represent and store the 
cross-correlation. For fields with low cardinality, like "gender" and 
boolean "breast-cancer-or-not" you can count the prevalence of all the 
different combinations, but that doesn't scale. Another often cited 
example is zip code + street address. There's clearly a strong 
correlation between them, but how do you represent that?

For scalar values we currently store a histogram. I suppose we could 
create a 2D histogram for two columns, but that doesn't actually help 
with the zip code + street address problem.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


В списке pgsql-hackers по дате отправления:

Предыдущее
От: PostgreSQL - Hans-Jürgen Schönig
Дата:
Сообщение: cross column correlation revisted
Следующее
От: PostgreSQL - Hans-Jürgen Schönig
Дата:
Сообщение: Re: cross column correlation revisted