Re: proposal : cross-column stats

From: Florian Pflug
Subject: Re: proposal : cross-column stats
Msg-id: 1ABB35BF-A361-4535-A443-8968D2D60D1F@phlo.org
In reply to: Re: proposal : cross-column stats  (tv@fuzzy.cz)
Responses: Re: proposal : cross-column stats  (tv@fuzzy.cz)
List: pgsql-hackers
On Dec21, 2010, at 11:37 , tv@fuzzy.cz wrote:
> I doubt there is a way to make this decision with just dist(A), dist(B) and
> dist(A,B) values. Well, we could go with a rule
> 
>  if [dist(A) == dist(A,B)] then [A => B]
> 
> but that's very fragile. Think about estimates (we're not going to work
> with exact values of dist(?)), and then about data errors (e.g. a city
> matched to an incorrect ZIP code or something like that).

Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
kind of fragility without penalizing the non-correlated case...

> This is the reason why they choose to always combine the values (with
> varying weights).

There are no varying weights involved there. What they do is to express
P(A=x,B=y) once as
 P(A=x,B=y) = P(B=y|A=x)*P(A=x) and then as P(A=x,B=y) = P(A=x|B=y)*P(B=y).

Then they assume
 P(B=y|A=x) ~= dist(A)/dist(A,B) and P(A=x|B=y) ~= dist(B)/dist(A,B),

and go on to average the two different ways of computing P(A=x,B=y), which
finally gives
 P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
             = dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=y)/(2*dist(A,B))
             = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
 

That averaging step adds *no* further data-dependent weights.
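
To make the formula concrete, here is a minimal Python sketch of the averaged
estimate. It is only an illustration: the function name and the sample numbers
are made up, and it takes the per-column selectivities and distinct counts as
plain inputs rather than reading them from any real statistics.

```python
def estimate_joint_selectivity(p_a, p_b, dist_a, dist_b, dist_ab):
    """Averaged estimate of P(A=x, B=y).

    p_a, p_b  -- per-column selectivities P(A=x) and P(B=y)
    dist_a    -- number of distinct values in A, dist(A)
    dist_b    -- number of distinct values in B, dist(B)
    dist_ab   -- number of distinct (A, B) combinations, dist(A,B)

    Hypothetical helper: assumes
        P(B=y|A=x) ~= dist(A)/dist(A,B)
        P(A=x|B=y) ~= dist(B)/dist(A,B)
    and averages the two resulting expressions with fixed weights 1/2.
    """
    if dist_ab <= 0:
        raise ValueError("dist(A,B) must be positive")
    return (dist_a * p_a + dist_b * p_b) / (2.0 * dist_ab)


# Made-up example 1: A = ZIP code, B = city, every ZIP belongs to one city,
# uniform distributions. dist(A) = dist(A,B) = 1000, dist(B) = 100.
functional_dependency = estimate_joint_selectivity(
    p_a=1 / 1000, p_b=1 / 100, dist_a=1000, dist_b=100, dist_ab=1000
)

# Made-up example 2: independent uniform columns, dist(A,B) = dist(A)*dist(B).
independent = estimate_joint_selectivity(
    p_a=1 / 10, p_b=1 / 20, dist_a=10, dist_b=20, dist_ab=200
)
```

With these inputs the first call gives about 0.001, i.e. P(A=x), and the
second about 0.005, i.e. P(A=x)*P(B=y). The only weights in the formula are
the two constant factors of 1/2.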

>> I'd like to find a statistical explanation for that definition of
>> F(A,B), but so far I couldn't come up with any. I created a Maple 14
>> worksheet while playing around with this - if you happen to have a
>> copy of Maple available I'd be happy to send it to you..
> 
> No, I don't have Maple. Have you tried Maxima
> (http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/). Sage
> even has an online notebook - that seems like a very comfortable way to
> exchange this kind of data.

I haven't tried them, but I will. That Java-based GUI of Maple is driving
me nuts anyway... Thanks for the pointers!

best regards,
Florian Pflug



