Calculating selectivity for the query-planner on ts_vector colums.

Поиск
Список
Период
Сортировка
От Jesper Krogh
Тема Calculating selectivity for the query-planner on ts_vector colums.
Дата
Msg-id 4AE1F840.5090205@krogh.cc
обсуждение исходный текст
Ответы Re: Calculating selectivity for the query-planner on ts_vector colums.  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-performance
Hi

It seems to me that the row estimates on a ts_vector search is a bit on
the low side for terms that is not in th MCV-list in pg_stats:

ftstest=# explain select id from ftstest where ftstest_body_fts @@
to_tsquery('nonexistingterm') order by id limit 10;
                                         QUERY PLAN

---------------------------------------------------------------------------------------------
 Limit  (cost=221.93..221.95 rows=10 width=4)
   ->  Sort  (cost=221.93..222.01 rows=33 width=4)
         Sort Key: id
         ->  Bitmap Heap Scan on ftstest  (cost=154.91..221.22 rows=33
width=4)
               Recheck Cond: (ftstest_body_fts @@
to_tsquery('nonexistingterm'::text))
               ->  Bitmap Index Scan on ftstest_tfs_idx
(cost=0.00..154.90 rows=33 width=0)
                     Index Cond: (ftstest_body_fts @@
to_tsquery('nonexistingterm'::text))
(7 rows)

Then I have been reading:
http://www.postgresql.org/docs/8.4/static/row-estimation-examples.html
and trying to reproduce the selectivity number for this query:

selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)

num_distinct is around 10m.
ftstest=# SELECT
attname,array_dims(most_common_vals),array_dims(most_common_freqs) FROM
pg_stats                          WHERE tablename='ftstest' AND
attname='ftstest_body_fts';
     attname      | array_dims | array_dims
------------------+------------+------------
 ftstest_body_fts | [1:2088]   | [1:2090]
(1 row)

ftstest=# select tablename,attname,freq from (select tablename,attname,
sum(freq) as freq from (SELECT
tablename,attname,unnest(most_common_freqs) as freq FROM pg_stats) as
foo  group by tablename,attname) as foo2 where freq > 1;
 tablename |     attname      |  freq
-----------+------------------+---------
 ftstest   | ftstest_body_fts | 120.967
(1 row)

then the selectivity is
(1-120.967)/(10000000 - 2088) = -.00001199920543409463

Which seem .. well wrong.

The algorithm makes the assumption that if a record is matching one of
the MCV's then it is not in the matching a rare-term. The above
algorithm doesnt give me the 33 rows about, so can anyone shortly
describe the changes for this algorithm when using ts_vectors?

Thanks.

--
Jesper

В списке pgsql-performance по дате отправления:

Предыдущее
От: Jesper Krogh
Дата:
Сообщение: Re: Queryplan within FTS/GIN index -search.
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Calculating selectivity for the query-planner on ts_vector colums.