tsvector pg_stats seems quite a bit off.

From: Jesper Krogh
Subject: tsvector pg_stats seems quite a bit off.
Date:
Msg-id: 4BF4357E.6000505@krogh.cc
Responses: Re: tsvector pg_stats seems quite a bit off.  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
           Re: tsvector pg_stats seems quite a bit off.  (Jan Urbański <wulczer@wulczer.org>)
List: pgsql-hackers
Hi.

I am working on getting full-text search to work and have
come across something I think looks a bit strange.

The document base is around 350,000 documents, and I have set the
statistics target on the tsvector column to 1000, since the default
of 100 seems way off.
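
For reference, the target was bumped with a statement along these
lines (give or take the exact spelling):

# alter table reference alter column document_tsvector set statistics 1000;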

# ANALYZE verbose reference (document_tsvector);
INFO:  analyzing "reference"
INFO:  "reference": scanned 14486 of 14486 pages, containing 350174 live 
rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
ANALYZE

Ok, so analyze examined almost all rows. Looking into
"most_common_freqs" I find:

# select count(unnest) from (select unnest(most_common_freqs) from
pg_stats where attname = 'document_tsvector') as foo;
 count
-------
  2810
(1 row)


But the distribution is very "flat" at the end: the last 128 values are
exactly
1.00189e-05
which means that any term sitting outside the array would get an estimate of
1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
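
The flat tail is easy to see directly from pg_stats; something along
these lines (filtering on the table name as well, just to be safe)
shows how many entries share the lowest frequencies:

# select freq, count(*) from
   (select unnest(most_common_freqs) as freq from pg_stats
    where tablename = 'reference' and attname = 'document_tsvector') as s
  group by freq order by freq limit 3;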

So far I have no idea whether this is bad or good, so here are a couple
of sample runs for terms sitting outside the "most_common_vals" array:

# explain analyze select id from efam.reference where document_tsvector
@@ to_tsquery('searchterm') order by id limit 2000;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35.99..35.99 rows=2 width=4) (actual time=20.717..28.135 rows=1612 loops=1)
   ->  Sort  (cost=35.99..35.99 rows=2 width=4) (actual time=20.709..23.190 rows=1612 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 124kB
         ->  Bitmap Heap Scan on reference  (cost=28.02..35.98 rows=2 width=4) (actual time=3.522..17.238 rows=1612 loops=1)
               Recheck Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..28.02 rows=2 width=0) (actual time=3.378..3.378 rows=1613 loops=1)
                     Index Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
 Total runtime: 30.743 ms
(9 rows)

Ok, the query planner estimates that there are 2 rows .. exactly as
predicted, so that part works as expected, but in fact there are 1612
rows that match.
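
A plain count over the same predicate (which should simply return the
1612 matches seen in the plan above) is the quickest way to see the
actual selectivity:

# select count(*) from efam.reference
  where document_tsvector @@ to_tsquery('searchterm');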

So, analyze has sampled roughly 6 out of every 7 rows in the table, and
this term occurs in 1612/350174 rows ~ freq: 0.0046, which is way higher
than the lower bound of 1.00189e-05 .. it should have been sitting
somewhere around the middle of the 2810 collected values.
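
Just to see roughly where a frequency of 0.0046 would have landed had
the term been picked up, one can count how many of the collected
entries are at least that common (which, by the argument above, ought
to be somewhere around half of the 2810 entries):

# select count(*) from
   (select unnest(most_common_freqs) as freq from pg_stats
    where tablename = 'reference' and attname = 'document_tsvector') as s
  where freq >= 0.0046;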

So the "most_common_vals" seems to contain a lot of values that should 
never have been kept in favor
of other values that are more common.

In practice, just cranking the statistics target up high enough seems
to solve the problem, but isn't there something wrong in how the
statistics are collected?
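
The workaround, for anyone hitting the same thing, is simply something
along these lines (10000 being just an example; it is the maximum the
target can be set to):

# alter table reference alter column document_tsvector set statistics 10000;
# analyze verbose reference (document_tsvector);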

# select version();
                                                          version
---------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0beta1 on x86_64-unknown-linux-gnu, compiled by GCC gcc-4.2.real (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit


Jesper
-- 
Jesper

