Re: tsvector pg_stats seems quite a bit off.

From	Tom Lane
Subject	Re: tsvector pg_stats seems quite a bit off.
Date	29 May 2010, 15:34:49
Msg-id	19743.1275147278@sss.pgh.pa.us
In reply to	Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>)
Responses	Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>)
List	pgsql-hackers

Jan Urbański <wulczer@wulczer.org> writes:
> On 29/05/10 17:09, Tom Lane wrote:
>> There is definitely something wrong with your math there.  It's not
>> possible for the 100'th most common word to have a frequency as high
>> as 0.06 --- the ones above it presumably have larger frequencies,
>> which makes the total quite a lot more than 1.0.

> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

Um, apparently I can't do simple arithmetic first thing in the morning
either, 'cause I got my number wrong too ;-)

After a bit more research: if you use the basic form of Zipf's law
with a 1/k distribution, the first frequency has to be about 0.07
to make the total come out to 1.0 for a reasonable number of words.
So we could use s = 0.07 / K when we wanted a final list of K words.
Some people (including the LC paper) prefer a higher exponent, i.e.
1/k^S with S around 1.25.  That makes the F1 value around 0.22, which
seems awfully high for the type of data we're working with, so I think
the 1/k rule is probably what we want here.
        regards, tom lane
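
For anyone who wants to re-check the arithmetic above, here is a rough
sketch in Python. It is not code from the thread or from PostgreSQL. The
function names and the vocabulary size of one million distinct words are
assumptions made only for illustration; the message itself just says "a
reasonable number of words". The first function normalizes a 1/k^S
distribution so the frequencies sum to 1.0 and returns the frequency of
the top-ranked word; the second applies the proposed s = 0.07 / K rule.

```python
def zipf_first_frequency(num_words: int, exponent: float = 1.0) -> float:
    """Frequency F1 of the most common word, assuming the k-th most common
    word has frequency F1 / k**exponent and all frequencies sum to 1.0."""
    # For exponent == 1 this sum is the harmonic number H(num_words).
    normalizer = sum(1.0 / k ** exponent for k in range(1, num_words + 1))
    return 1.0 / normalizer


def zipf_threshold(list_size: int, first_frequency: float = 0.07) -> float:
    """The s = 0.07 / K rule from the message, for a final list of K words."""
    return first_frequency / list_size


if __name__ == "__main__":
    vocabulary = 1_000_000  # assumed number of distinct words, not from the thread

    # Plain 1/k Zipf: F1 should land near 0.07.
    print(f"F1 for 1/k:      {zipf_first_frequency(vocabulary):.3f}")

    # Steeper 1/k^1.25 variant: F1 should land near 0.22.
    print(f"F1 for 1/k^1.25: {zipf_first_frequency(vocabulary, 1.25):.3f}")

    # Threshold for a final list of 100 words under the 1/k rule.
    print(f"s for K = 100:   {zipf_threshold(100):.4f}")
```

With a smaller assumed vocabulary the 1/k value of F1 comes out somewhat
higher, since the harmonic sum grows only logarithmically with the number
of words, so 0.07 is best read as a round figure rather than an exact one.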
