Re: tsvector pg_stats seems quite a bit off.

From: Jesper Krogh
Subject: Re: tsvector pg_stats seems quite a bit off.
Date:
Msg-id: 1D455DEB-6747-4C7D-8398-7EBD44F5D19C@krogh.cc
In reply to: Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>)
List: pgsql-hackers
On 26/05/2010, at 01.16, Jan Urbański <wulczer@wulczer.org> wrote:

> On 19/05/10 21:01, Jesper Krogh wrote:
>> The document base is around 350,000 documents and I have set the
>> statistics target on the tsvector column to 1000, since 100 seems way off.
>
> So for tsvectors the statistics target means more or less "at any time
> track at most 10 * <target> lexemes simultaneously", where "track" means
> keeping them in memory while going through the tuples being analysed.
>
> Remember that the measure is in lexemes, not whole tsvectors, and the 10
> factor is meant to approximate the average number of unique lexemes in a
> tsvector. If your documents are very large, this might not be a good
> approximation.

I just did an avg(length(document_tsvector)); it comes out at 154.
The doc count is 1.3m now in my sample set.
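
Back of the envelope, taking the "10 * <target>" rule above at face value
(a quick illustrative Python calculation; the 154 is my number, the rest is
just the rule as Jan describes it):

    # How fast does the tracking table fill up at my statistics target?
    stats_target    = 1000
    tracked_max     = 10 * stats_target   # "10 * <target>" lexemes tracked at once
    lexemes_per_doc = 154                 # avg(length(document_tsvector)) from above

    # If every lexeme were new, the table would be full after this many documents:
    print(tracked_max / lexemes_per_doc)  # ~65

So the working set can fill up after only a few dozen documents, and the
pruning then runs more or less continuously over the rest of the sample.
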
>
>> But the distribution is very "flat" at the end; the last 128 values are
>> exactly
>>   1.00189e-05
>> which means that any term sitting outside the array would get an
>> estimate of
>>   1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
>
> Yeah, this might mean that you could try cranking up the stats target a
> lot, to make the set of simultaneously tracked lexemes larger (it will
> cost time and memory during analyse, though). If the documents have
> completely different contents, what can happen is that almost all
> lexemes are only seen a few times and get removed during the pruning of
> the working set. I have seen similar behaviour while working on the
> typanalyze function for tsvectors.

I think I would prefer something less "magic". I can increase the
statistics target and get more reliable data, but that also increases the
number of tuples picked out for analysis, which is really time consuming.

But that also means that what gets stored as the lower bound of the
histogram isn't anywhere near the real lower bound; it is more like the
lower bound of the "artificial" histogram left over after the last pruning.

I would suggest that the final pruning should be aware of this. Perhaps by
keeping track of the least frequent value that never got pruned and using
that as the lower bound for the last pruning? (See the sketch at the end of
this mail for where that could hook in.)

Thanks a lot for the explanation; it fits fairly well with why I couldn't
construct a simple test set that reproduced the problem.

>
>> So far I have no idea if this is bad or good, so here are a couple of
>> sample runs of stuff that is sitting outside the "most_common_vals" array:
>>
>> [gathered statistics suck]
>>
>> So the "most_common_vals" seems to contain a lot of values that should
>> never have been kept in favor of other values that are more common.
>>
>> In practice, just cranking the statistics estimate up high enough seems
>> to solve the problem, but doesn't there seem to be something wrong in
>> how the statistics are collected?
>
> The algorithm to determine most common vals does not do it accurately.
> That would require keeping all lexemes from the analysed tsvectors in
> memory, which would be impractical. If you want to learn more about the
> algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
> ts_typanalyze.c

I'll do some reading.
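
In the meantime, here is how I read the pruning step in that paper (a
hand-written Python sketch of the lossy-counting idea, not the actual
ts_typanalyze.c code; epsilon, the bucket handling and the final cut-off
are all simplified):

    import math

    def lossy_count(stream, epsilon=0.001):
        """Approximate frequencies of the items in `stream`.

        Each tracked item carries (count, delta), where delta is the
        maximum number of occurrences that may have been missed before
        the item was (re)inserted into the tracking table.
        """
        width = math.ceil(1.0 / epsilon)   # bucket width
        table = {}                         # item -> [count, delta]
        seen = 0

        for item in stream:
            seen += 1
            bucket = math.ceil(seen / width)
            if item in table:
                table[item][0] += 1
            else:
                table[item] = [1, bucket - 1]

            # At every bucket boundary, drop the items whose counts are
            # too small to prove anything.  This is the step that throws
            # away lexemes seen only a handful of times -- and the place
            # where the "least frequent value that never got pruned"
            # could be remembered as a lower bound for the stored stats.
            if seen % width == 0:
                for key in [k for k, (c, d) in table.items() if c + d <= bucket]:
                    del table[key]

        return {item: count for item, (count, delta) in table.items()}

Feeding it the lexemes of all sampled documents and keeping the most
frequent survivors is, if I read it right, roughly what ANALYZE ends up
storing, which would explain the flat tail I am seeing.
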

Jesper
