Re: cross column correlation revisted

Поиск
Список
Период
Сортировка
От PostgreSQL - Hans-Jürgen Schönig
Тема Re: cross column correlation revisted
Дата
Msg-id B4E477A9-4885-4046-BB59-AAE13012C76F@cybertec.at
обсуждение исходный текст
Ответ на Re: cross column correlation revisted  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Список pgsql-hackers
On Jul 14, 2010, at 12:40 PM, Heikki Linnakangas wrote:

> On 14/07/10 13:12, PostgreSQL - Hans-Jürgen Schönig wrote:
>> hello everybody,
>>
>> we are currently facing some serious issues with cross correlation issue.
>> consider: 10% of all people have breast cancer. we have 2 genders (50:50).
>> if i select all the men with breast cancer, i will get basically nobody - the planner will overestimate the output.
>> this is the commonly known problem ...
>>
>> this cross correlation problem can be quite nasty in many many cases.
>> underestimated nested loops can turn joins into a never ending nightmare and so on and so on.
>>
>> my ideas is the following:
>> what if we allow users to specifiy cross-column combinations where we keep separate stats?
>> maybe somehow like this ...
>>
>>     ALTER TABLE x SET CORRELATION STATISTICS FOR (id = id2 AND id3=id4)
>>
>> or ...
>>
>>     ALTER TABLE x SET CORRELATION STATISTICS FOR (x.id = y.id AND x.id2 = y.id2)
>>
>> clearly we cannot store correlation for all combinations of all columns so we somehow have to limit it.
>>
>> what is the general feeling about something like that?
>
> +1 is my general feeling, it's good if you can tell the system to collect additional statistics where needed. And
onceyou have that, you can write an agent or something to detect automatically which extra statistics might be useful. 
>


it seems i can leave my bunker where i was hiding for cover when i was waiting for a reply ;).
yes, my idea was to have an agent as well - but this is just some follow up problem.


> However, the problem is how to represent and store the cross-correlation. For fields with low cardinality, like
"gender"and boolean "breast-cancer-or-not" you can count the prevalence of all the different combinations, but that
doesn'tscale. Another often cited example is zip code + street address. There's clearly a strong correlation between
them,but how do you represent that? 


we could play the same story with a table storing people including their home country and the color of their skin.
obviously we will have more black people in african countries..


>
> For scalar values we currently store a histogram. I suppose we could create a 2D histogram for two columns, but that
doesn'tactually help with the zip code + street address problem. 
>


i think we might go for a second relation here specifically for this issue and a boolean flag in the current stats
tableindicating that additional correlation stats exist (to avoid an additional lookup unless really necessary). 
do you have a useful syntax in mind? the thing is: this issue can be isolated inside a table (e.g. WHERE a.id = a.id2
ANDa.id3 = a.id4) or it might span two tables with an arbitrary number of fields. 
many thanks,
    hans


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Heikki Linnakangas
Дата:
Сообщение: Re: cross column correlation revisted
Следующее
От: Yeb Havinga
Дата:
Сообщение: Re: cross column correlation revisted