Re: tsvector pg_stats seems quite a bit off.

From: Tom Lane
Subject: Re: tsvector pg_stats seems quite a bit off.
Date:
Msg-id: 16212.1275014877@sss.pgh.pa.us
In reply to: Re: tsvector pg_stats seems quite a bit off.  (Jan Urbański <wulczer@wulczer.org>)
Responses: Re: tsvector pg_stats seems quite a bit off.  (Jan Urbański <wulczer@wulczer.org>)
Re: tsvector pg_stats seems quite a bit off.  (Jan Urbański <wulczer@wulczer.org>)
Re: tsvector pg_stats seems quite a bit off.  (Jesper Krogh <jesper@krogh.cc>)
List: pgsql-hackers
Jan Urbański <wulczer@wulczer.org> writes:
> On 19/05/10 21:01, Jesper Krogh wrote:
>> In practice, just cranking the statistics estimate up high enough seems
>> to solve the problem, but doesn't
>> there seem to be something wrong in how the statistics are collected?

> The algorithm to determine most common vals does not do it accurately.
> That would require keeping all lexemes from the analysed tsvectors in
> memory, which would be impractical. If you want to learn more about the
> algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
> ts_typanalyze.c
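
For readers who don't want to chase the PDF, here is a minimal C sketch of
the counting phase that paper describes (Lossy Counting).  This is purely
illustrative: it is not the ts_typanalyze.c code, the names are made up, and
it deliberately stops before the output step, which is what the rest of this
message is about.

/*
 * Minimal sketch of the Lossy Counting scheme from Manku & Motwani
 * (VLDB 2002).  Illustrative only: not the ts_typanalyze.c code, and
 * the names are made up.
 */
#include <math.h>
#include <stdio.h>
#include <string.h>

typedef struct
{
    char    lexeme[64];
    int     f;          /* count observed since this entry was created */
    int     delta;      /* maximum possible undercount at creation time */
} Entry;

#define MAX_ENTRIES 1024
static Entry D[MAX_ENTRIES];
static int  n_entries = 0;

/* Find the entry for a lexeme, or create one with delta = bucket - 1. */
static Entry *
lookup(const char *lexeme, int bucket)
{
    for (int i = 0; i < n_entries; i++)
        if (strcmp(D[i].lexeme, lexeme) == 0)
            return &D[i];
    if (n_entries >= MAX_ENTRIES)
        return NULL;

    Entry *e = &D[n_entries++];

    snprintf(e->lexeme, sizeof(e->lexeme), "%s", lexeme);
    e->f = 0;
    e->delta = bucket - 1;
    return e;
}

/* At each bucket boundary, drop entries whose f + delta <= current bucket. */
static void
prune(int bucket)
{
    int     kept = 0;

    for (int i = 0; i < n_entries; i++)
        if (D[i].f + D[i].delta > bucket)
            D[kept++] = D[i];
    n_entries = kept;
}

/* Feed a stream of N lexemes through the counter with error bound e = 1/w. */
static void
scan(const char **stream, int N, double epsilon)
{
    int     w = (int) ceil(1.0 / epsilon);  /* bucket width */

    for (int i = 0; i < N; i++)
    {
        int     bucket = i / w + 1;
        Entry  *e = lookup(stream[i], bucket);

        if (e != NULL)
            e->f++;
        if ((i + 1) % w == 0)
            prune(bucket);
    }
    /* What to emit from D at this point is the subject of this thread. */
}

int
main(void)
{
    const char *stream[] = {"foo", "bar", "foo", "baz", "foo", "bar"};

    scan(stream, 6, 0.25);      /* e = 0.25, so w = 4; toy values */
    for (int i = 0; i < n_entries; i++)
        printf("%s  f=%d  delta=%d\n", D[i].lexeme, D[i].f, D[i].delta);
    return 0;
}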

I re-scanned that paper and realized that there is indeed something
wrong with the way we are doing it.  The paper says (last sentence in
the definition of the algorithm, section 4.2):
When a user requests a list of items with threshold s, we output those entries in D where f >= (s-e)N.

What we are actually doing is emitting every entry with f >= 2.  Since
e is fixed at 1/w, this effectively means that we are setting s to be
only fractionally greater than e, which means very large relative errors
in the estimates.
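
To put illustrative numbers on that (these values are made up, not the
constants ts_typanalyze.c actually uses): suppose w = 1000, so e = 1/w =
0.001, and we scan N = 1,000,000 lexemes.  The paper's rule would emit
entries with f >= (s - e)N; emitting everything with f >= 2 behaves as if
s were e + 2/N = 0.001002.  Meanwhile an entry reported with f = 2 could
have a true count as high as f + delta, and delta can approach eN = 1000,
so the reported count of 2 might understate the truth by a factor of
several hundred.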

Or, if you want it explained another way: we are emitting words whose f
is very small and whose delta is very large, representing items that got
added to the scan very late.  These really shouldn't be there because
their true frequency is probably a lot less than the intended threshold.

The net effect of this is first that there are a lot of rather useless
entries in the MCV list whose claimed frequency is quite small, as
little as two occurrences.  Their true frequency could be quite a bit
more.  What's even worse is that we believe that the minimum of these
claimed frequencies is a reliable upper bound for the frequencies of
items not listed in the MCV list.  Both of these errors are manifest
in Jesper's description of his case, and they're also easy to reproduce
if you just take stats on any reasonable-size collection of documents.
Cranking up the stats target actually makes it worse not better, since
low-frequency items are then more likely to get into the MCV list.

So I think we have to fix this.  The right thing is to select s and e
values that are actually meaningful in the terms of the paper, then
compute w from that, and be sure to enforce the minimum f value
specified in the algorithm ... i.e., don't be afraid to throw away values
in the final D table.
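
As a rough illustration of that shape (hypothetical names and constants,
not a patch against ts_typanalyze.c), the parameter selection and the
final pruning step could look something like this, with e chosen well
below s:

/*
 * Hypothetical sketch only: derive the bucket width from explicitly
 * chosen s and e, and apply the paper's output threshold before
 * building the MCV list.
 */
#include <math.h>
#include <stddef.h>

typedef struct
{
    const char *lexeme;
    int         f;          /* observed count */
    int         delta;      /* maximum possible undercount */
} TrackedLexeme;

/* Pick the bucket width w = ceil(1/e) from the desired support s. */
static int
choose_bucket_width(double s, double *epsilon)
{
    *epsilon = s / 10.0;    /* one plausible choice: make e a small fraction of s */
    return (int) ceil(1.0 / *epsilon);
}

/*
 * After the scan, keep only entries satisfying f >= (s - e) * N;
 * everything else is discarded instead of being reported as an MCV.
 */
static size_t
prune_final(TrackedLexeme *d, size_t n, double s, double epsilon, long N)
{
    size_t  kept = 0;

    for (size_t i = 0; i < n; i++)
        if (d[i].f >= (s - epsilon) * N)
            d[kept++] = d[i];
    return kept;
}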
        regards, tom lane

