Using FTI-Search (likely a more general runtime-puzzle)

Поиск
Список
Период
Сортировка
От Markus Wollny
Тема Using FTI-Search (likely a more general runtime-puzzle)
Дата
Msg-id 2266D0630E43BB4290742247C8910575014CE308@dozer.computec.de
обсуждение исходный текст
Ответы Re: Using FTI-Search (likely a more general runtime-puzzle)  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
Re: Using FTI-Search (likely a more general runtime-puzzle)  (Oleg Bartunov <oleg@sai.msu.su>)
Список pgsql-general
Hello!

I have implemented the full text index feature from /contrib and so far
the results look quite promising. We use it to index several columns of
an article table. The main column is called 'text'. An example for a
search would be to retrieve all the articles with "grand theft auto" in
them somewhere (not necessarily one string but individual words).

------------------------------------------------------------------------
--------------------
Case 1: Our Good Old Fashioned Search (TM) over the article-table:

ANALYZE select distinct (p1.article_id)
from article p1, article p2, article p3
where lower(p1.text) LIKE '%grand%'
AND lower(p2.text) LIKE '%theft%'
AND lower(p3.text) LIKE '%auto%'
AND p1.article_id=p2.article_id
and p1.article_id=p3.article_id

Unique  (cost=3067.03..3067.04 rows=1 width=12) (actual
time=1455.25..1455.44 rows=30 loops=1)
  ->  Sort  (cost=3067.03..3067.03 rows=1 width=12) (actual
time=1455.24..1455.29 rows=30 loops=1)
        ->  Nested Loop  (cost=1098.52..3067.02 rows=1 width=12) (actual
time=1211.54..1453.82 rows=30 loops=1)
              ->  Hash Join  (cost=1098.52..2197.47 rows=6 width=8)
(actual time=1121.32..1309.23 rows=30 loops=1)
                    ->  Seq Scan on article p2  (cost=0.00..1098.43
rows=35 width=4) (actual time=352.03..539.28 rows=30 loops=1)
                    ->  Hash  (cost=1098.43..1098.43 rows=35 width=4)
(actual time=766.21..766.21 rows=0 loops=1)
                          ->  Seq Scan on article p3
(cost=0.00..1098.43 rows=35 width=4) (actual time=13.63..763.73 rows=411
loops=1)
              ->  Index Scan using idx0_article on article p1
(cost=0.00..140.79 rows=1 width=4) (actual time=4.79..4.80 rows=1
loops=30)
Total runtime: 1456.00 msec

This is the time, fti needs to beat in order to be of any use to us.

------------------------------------------------------------------------
--------------------

Case 2:
Searching the FTI-Table:
EXPLAIN ANALYZE select distinct (p.article_id)
from article p, article_fti f1, article_fti f2, article_fti f3
where f1.string ='grand' and f2.string ='theft' and f3.string ='auto'
and p.oid=f1.id and p.oid=f2.id and p.oid=f3.id;


Unique  (cost=39200652.79..39647427.57 rows=17870991 width=20) (actual
time=52506.81..52506.81 rows=0 loops=1)
  ->  Sort  (cost=39200652.79..39200652.79 rows=178709913 width=20)
(actual time=52506.81..52506.81 rows=0 loops=1)
        ->  Merge Join  (cost=222181.40..313718.83 rows=178709913
width=20) (actual time=52505.69..52505.69 rows=0 loops=1)
              ->  Merge Join  (cost=147175.62..170464.70 rows=1855421
width=8) (actual time=37084.36..37084.36 rows=1 loops=1)
                    ->  Sort  (cost=73587.81..73587.81 rows=19264
width=4) (actual time=22423.27..22423.27 rows=1 loops=1)
                          ->  Seq Scan on article_fti f2
(cost=0.00..72216.86 rows=19264 width=4) (actual time=18297.22..22422.20
rows=97 loops=1)
                    ->  Sort  (cost=73587.81..73587.81 rows=19264
width=4) (actual time=14661.04..14661.04 rows=1 loops=1)
                          ->  Seq Scan on article_fti f3
(cost=0.00..72216.86 rows=19264 width=4) (actual time=1344.87..14659.59
rows=159 loops=1)
              ->  Sort  (cost=75005.78..75005.78 rows=19264 width=12)
(actual time=15421.32..15421.32 rows=0 loops=1)
                    ->  Hash Join  (cost=1080.86..73634.84 rows=19264
width=12) (actual time=15421.10..15421.10 rows=0 loops=1)
                          ->  Seq Scan on article_fti f1
(cost=0.00..72216.86 rows=19264 width=4) (actual time=4879.59..14306.97
rows=350 loops=1)
                          ->  Hash  (cost=1063.29..1063.29 rows=7029
width=8) (actual time=1109.79..1109.79 rows=0 loops=1)
                                ->  Seq Scan on article p
(cost=0.00..1063.29 rows=7029 width=8) (actual time=12.20..1076.46
rows=7029 loops=1)
Total runtime: 52507.92 msec

This is way too long. I don't know why it doesn't use the indexes I gave
to it... If someone has got some idea as to what can be done to optimize
query-planner decisions, I'd be happy to comply. But anyway, I don't
worry about this too much and just force it to use them, once I know its
decision for seqential scan is wrong:

------------------------------------------------------------------------
--------------------

Case 3:
Searching the FTI-Table forcing use of indexes:
SET ENABLE_SEQSCAN=OFF;
EXPLAIN ANALYZE select distinct (p.article_id)
from article p, article_fti f1, article_fti f2, article_fti f3
where f1.string ='grand' and f2.string ='theft' and f3.string ='auto'
and p.oid=f1.id and p.oid=f2.id and p.oid=f3.id;
SET ENABLE_SEQSCAN=ON;

NOTICE:  QUERY PLAN:

Unique  (cost=5341.68..5342.61 rows=37 width=20) (actual
time=40.52..41.83 rows=65 loops=1)
  ->  Sort  (cost=5341.68..5341.68 rows=372 width=20) (actual
time=40.52..40.96rows=376 loops=1)
        ->  Nested Loop  (cost=3067.05..5325.79 rows=372 width=20)
(actual time=14.02..38.38 rows=376 loops=1)
              ->  Merge Join  (cost=3067.05..3084.32 rows=372 width=12)
(actualtime=13.95..20.45 rows=376 loops=1)
                    ->  Merge Join  (cost=2044.70..2054.37 rows=502
width=8) (actual time=5.75..7.81 rows=176 loops=1)
                          ->  Sort  (cost=1022.35..1022.35 rows=678
width=4) (actual time=2.31..2.44 rows=97 loops=1)
                                ->  Index Scan using
article_fti_string_idx on article_fti f2  (cost=0.00..990.47 rows=678
width=4) (actual time=0.16..1.74 rows=97 loops=1)
                          ->  Sort  (cost=1022.35..1022.35 rows=678
width=4) (actual time=3.42..3.66 rows=194 loops=1)
                                ->  Index Scan using
article_fti_string_idx on article_fti f3  (cost=0.00..990.47 rows=678
width=4) (actual time=0.13..2.53 rows=159 loops=1)
                    ->  Sort  (cost=1022.35..1022.35 rows=678 width=4)
(actual time=8.18..8.89 rows=532 loops=1)
                          ->  Index Scan using article_fti_string_idx on
article_fti f1  (cost=0.00..990.47 rows=678 width=4) (actual
time=0.13..5.98 rows=350 loops=1)
              ->  Index Scan using article_oid_idx on article p
(cost=0.00..6.01 rows=1 width=8) (actual time=0.03..0.03 rows=1
loops=376)
Total runtime: 42.41 msec

Not that's much more like it :) This query returns exactly 65 numbers.
But really we want not only the article-IDs, but also some info on the
article as well:

------------------------------------------------------------------------
--------------------

Case 3b:
Retrieving additional article-info:

SET ENABLE_SEQSCAN=OFF;
EXPLAIN ANALYZE
SELECT article_id, site_id, article_type, topstory, headline, published
from article where article_id in (
select distinct (p.article_id)
from article p, article_fti f1, article_fti f2, article_fti f3
where f1.string ='grand' and f2.string ='theft' and f3.string ='auto'
and p.oid=f1.id and p.oid=f2.id and p.oid=f3.id);
SET ENABLE_SEQSCAN=ON;

NOTICE:  QUERY PLAN:

Seq Scan on article  (cost=100000000.00..137645255.53 rows=3523
width=77) (actual time=1756.30..3720.44 rows=65 loops=1)
  SubPlan
    ->  Materialize  (cost=5342.61..5342.61 rows=37 width=20) (actual
time=0.17..0.28 rows=65 loops=7046)
          ->  Unique  (cost=5341.68..5342.61 rows=37 width=20) (actual
time=1182.07..1183.29 rows=65 loops=1)
                ->  Sort  (cost=5341.68..5341.68 rows=372 width=20)
(actual time=1182.06..1182.52 rows=376 loops=1)
                      ->  Nested Loop  (cost=3067.05..5325.79 rows=372
width=20) (actual time=318.55..1179.04 rows=376 loops=1)
                            ->  Merge Join  (cost=3067.05..3084.32
rows=372 width=12) (actual time=300.97..314.39 rows=376 loops=1)
                                  ->  Merge Join  (cost=2044.70..2054.37
rows=502 width=8) (actual time=159.24..163.45 rows=176 loops=1)
                                        ->  Sort  (cost=1022.35..1022.35
rows=678 width=4) (actual time=71.23..71.45 rows=97 loops=1)
                                              ->  Index Scan using
article_fti_string_idx on article_fti f2  (cost=0.00..990.47 rows=678
width=4) (actual time=1.46..69.08 rows=97 loops=1)
                                        ->  Sort  (cost=1022.35..1022.35
rows=678 width=4) (actual time=87.97..88.39 rows=194 loops=1)
                                              ->  Index Scan using
article_fti_string_idx on article_fti f3  (cost=0.00..990.47 rows=678
width=4) (actual time=0.19..85.06 rows=159 loops=1)
                                  ->  Sort  (cost=1022.35..1022.35
rows=678 width=4) (actual time=141.70..143.03 rows=532 loops=1)
                                        ->  Index Scan using
article_fti_string_idx on article_fti f1  (cost=0.00..990.47 rows=678
width=4) (actual time=0.17..138.95 rows=350 loops=1)
                            ->  Index Scan using article_oid_idx on
article p  (cost=0.00..6.01 rows=1 width=8) (actual time=1.65..2.27
rows=1 loops=376)
Total runtime: 3722.74 msec

Whoa! Why is that? How come this takes so extremely much longer?
Because...

------------------------------------------------------------------------
--------------------

Case 3c:
If I just select all the ids, I get a total runtime of  42.41 msec. And
if I do a select ... where article_id in (all the results from case3):

explain analyze SELECT article_id, site_id, article_type, topstory,
headline, published from article where article_id in
(8965,8966,9701,10204,11114,11788,11941,12339,12685,12696,12784,13324,14
049,30982,31697,31924,31941,31944,32470,36810,37950,38913,39736,40470,41
330,45651,47050,47997,49692,51253,51730,51817,52111,52233,52653,52770,53
090,54410,54812,54890,55310,55974,56051,56590,56610,56630,56950,57210,57
690,59670,60951,62630,62650,63070,63130,63551,63570,65691,65710,65771,66
050,66490,66531,67030,67110);

NOTICE:  QUERY PLAN:

Index Scan using idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article, idx0_article, idx0_article, idx0_article, idx0_article,
idx0_article on article  (cost=0.00..378.61 rows=67 width=77) (actual
time=0.08..5.17 rows=65 loops=1)
Total runtime: 6.15 msec

I get just another 6.15msec.

Now I do think that the total runtime of 3b should really be 3+3c, which
amounts to roughly 50msec, but not nearly 4 seconds. What am I doing
wrong?

Thanks for your advice!

Regards,

    Markus

В списке pgsql-general по дате отправления:

Предыдущее
От: Elielson Fontanezi
Дата:
Сообщение: RES: RES: [SQL] Queries not using Index
Следующее
От: Stephan Szabo
Дата:
Сообщение: Re: Using FTI-Search (likely a more general runtime-puzzle)