Re: proposal : cross-column stats

From tv@fuzzy.cz
Subject Re: proposal : cross-column stats
Date
Msg-id 7482a3473efb8744e7e10b4dc9c47499.squirrel@sq.gransy.com
In response to Re: proposal : cross-column stats  (Florian Pflug <fgp@phlo.org>)
Responses Re: proposal : cross-column stats  (Florian Pflug <fgp@phlo.org>)
List pgsql-hackers
> On Dec21, 2010, at 11:37 , tv@fuzzy.cz wrote:
>> I doubt there is a way to make this decision with just dist(A), dist(B) and
>> dist(A,B) values. Well, we could go with a rule
>>
>>  if [dist(A) == dist(A,B)] then [A => B]
>>
>> but that's very fragile. Think about estimates (we're not going to work
>> with exact values of dist(?)), and then about data errors (e.g. a city
>> matched to an incorrect ZIP code or something like that).
>
> Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
> kind of fragility without penalizing the non-correlated case...

Yes, I understand the intention, but I'm not sure how exactly you want
to use the F(?,?) function to compute P(A,B), which is the value
we're looking for.

If I understand it correctly, you proposed something like this:

  IF (F(A,B) > F(B,A)) THEN
    P(A,B) := c*P(A);
  ELSE
    P(A,B) := d*P(B);
  END IF;

or something similar (I guess c=dist(A)/dist(A,B) and
d=dist(B)/dist(A,B)). But what if F(A,B)=0.6 and F(B,A)=0.59? This may
easily happen due to data errors or imprecise estimates.

And this actually matters, because P(A) and P(B) may differ
significantly. So such a rule would be really vulnerable to slight
changes in the estimates.
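
To make the concern concrete, here is a rough Python sketch of the rule as I understand it. The function name and all the numbers are made up for illustration, c and d are my guess from above, and F(A,B) / F(B,A) are just passed in as opaque scores since I'm not sure how exactly they would be computed:

```python
def hard_switch_estimate(f_ab, f_ba, p_a, p_b, dist_a, dist_b, dist_ab):
    """Pick a single one-sided estimate of P(A,B), depending on which F is larger."""
    c = dist_a / dist_ab  # assumed scaling factor for the A-based estimate
    d = dist_b / dist_ab  # assumed scaling factor for the B-based estimate
    return c * p_a if f_ab > f_ba else d * p_b


# Hypothetical statistics: A has 100 distinct values, B and (A,B) have 1000.
stats = dict(p_a=0.05, p_b=0.001, dist_a=100, dist_b=1000, dist_ab=1000)

est_1 = hard_switch_estimate(0.60, 0.59, **stats)  # F(A,B) slightly larger
est_2 = hard_switch_estimate(0.59, 0.60, **stats)  # F(B,A) slightly larger

print(est_1, est_2, est_1 / est_2)
```

With these made-up inputs, moving each F value by 0.01 flips the branch and changes the estimate by about a factor of five, although nothing about the data itself has really changed.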

>> This is the reason why they choose to always combine the values (with
>> varying weights).
>
> There are no varying weights involved there. What they do is to express
> P(A=x,B=y) once as
>
> ...
>
>   P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>               = dist(A)*P(A=x)/(2*dist(A,B)) +
>                 dist(B)*P(B=y)/(2*dist(A,B))
>               = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>
> That averaging step adds *no* further data-dependent weights.

Sorry, by 'varying weights' I didn't mean that the weights are different
for each value of A or B. What I meant is that they combine the values
with different weights (just as you explained).
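
A small Python sketch of what I mean, using the averaged formula quoted above. The function names and the numbers are hypothetical, only the formula itself comes from the discussion:

```python
def combination_weights(dist_a, dist_b, dist_ab):
    """Weights applied to P(A=x) and P(B=y) in the averaged formula."""
    w_a = dist_a / (2 * dist_ab)
    w_b = dist_b / (2 * dist_ab)
    return w_a, w_b


def averaged_estimate(p_a, p_b, dist_a, dist_b, dist_ab):
    """(dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))"""
    return (dist_a * p_a + dist_b * p_b) / (2 * dist_ab)


# Hypothetical statistics: A has 100 distinct values, B and (A,B) have 1000.
dist_a, dist_b, dist_ab = 100, 1000, 1000
p_a, p_b = 0.05, 0.001  # made-up selectivities of A=x and B=y

w_a, w_b = combination_weights(dist_a, dist_b, dist_ab)
estimate = averaged_estimate(p_a, p_b, dist_a, dist_b, dist_ab)

print(w_a, w_b, estimate)
```

The two weights differ from each other (here w_a is ten times smaller than w_b), but they depend only on the distinct counts, so they stay the same for every x and y. And a small error in the dist(?) estimates moves the result only slightly, instead of switching between two very different values.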

regards
Tomas


