Обсуждение: bitmap heap scan recheck on full text search with gin index

Поиск
Список
Период
Сортировка

bitmap heap scan recheck on full text search with gin index

От
Hal Roberts
Дата:
Hi All,

I'm getting poor performance on full text searches that return lots of
entries from a table with about 7 million rows.  I think the cause is
rechecking the text match on all of the returned rows, even though I'm
using a @@ query on a gin index, which the docs say should not require
a recheck.

Here's the table:

****
mediacloud=> \d download_texts;
                                     Table "public.download_texts"
     Column       |  Type   |                                 Modifiers
-------------------+---------
+
----------------------------------------------------------------------------
download_texts_id | integer | not null default
nextval('download_texts_download_texts_id_seq'::regclass)
downloads_id      | integer | not null
download_text     | text    | not null
Indexes:
   "download_texts_pkey" PRIMARY KEY, btree (download_texts_id)
   "download_texts_downloads_id_index" UNIQUE, btree (downloads_id)
   "download_texts_textsearch_idx" gin
(to_tsvector('english'::regconfig, download_text)), tablespace
"large_table_space"
Foreign-key constraints:
   "download_texts_downloads_id_fkey" FOREIGN KEY (downloads_id)
REFERENCES downloads(downloads_id)
Tablespace: "large_table_space"
****

And here's the query:

****
mediacloud=> explain analyze select count(dt.download_texts_id) from
download_texts dt where to_tsvector('english', download_text) @@
to_tsquery('english', 'stimulus');
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=26161.16..26161.17 rows=1 width=4) (actual
time=153640.083..153640.083 rows=1 loops=1)
  ->  Bitmap Heap Scan on download_texts dt  (cost=3937.41..26146.11
rows=6018 width=4) (actual time=1957.074..153529.351 rows=72225 loops=1)
        Recheck Cond: (to_tsvector('english'::regconfig,
download_text) @@ '''stimulus'''::tsquery)
        ->  Bitmap Index Scan on download_texts_textsearch_idx
(cost=0.00..3935.90 rows=6018 width=0) (actual time=1048.556..1048.556
rows=72225 loops=1)
              Index Cond: (to_tsvector('english'::regconfig,
download_text) @@ '''stimulus'''::tsquery)
Total runtime: 153642.249 ms
****

Nearly all of the time is being spent in the bitmap heap scan, I
suspect because of the work of rereading and rechecking the text of
all the matched entries.  Is this indeed what's going on here?  Is
there any way to make postgres not do that recheck?


Thanks!

-hal

Re: bitmap heap scan recheck on full text search with gin index

От
Oleg Bartunov
Дата:
Hal,

just create separate column with tsvector and create index on it.

Oleg
On Thu, 5 Feb 2009, Hal Roberts wrote:

> Hi All,
>
> I'm getting poor performance on full text searches that return lots of
> entries from a table with about 7 million rows.  I think the cause is
> rechecking the text match on all of the returned rows, even though I'm using
> a @@ query on a gin index, which the docs say should not require a recheck.
>
> Here's the table:
>
> ****
> mediacloud=> \d download_texts;
>                                   Table "public.download_texts"
>   Column       |  Type   |                                 Modifiers
> -------------------+---------+----------------------------------------------------------------------------
> download_texts_id | integer | not null default
> nextval('download_texts_download_texts_id_seq'::regclass)
> downloads_id      | integer | not null
> download_text     | text    | not null
> Indexes:
> "download_texts_pkey" PRIMARY KEY, btree (download_texts_id)
> "download_texts_downloads_id_index" UNIQUE, btree (downloads_id)
> "download_texts_textsearch_idx" gin (to_tsvector('english'::regconfig,
> download_text)), tablespace "large_table_space"
> Foreign-key constraints:
> "download_texts_downloads_id_fkey" FOREIGN KEY (downloads_id) REFERENCES
> downloads(downloads_id)
> Tablespace: "large_table_space"
> ****
>
> And here's the query:
>
> ****
> mediacloud=> explain analyze select count(dt.download_texts_id) from
> download_texts dt where to_tsvector('english', download_text) @@
> to_tsquery('english', 'stimulus');
> QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate  (cost=26161.16..26161.17 rows=1 width=4) (actual
> time=153640.083..153640.083 rows=1 loops=1)
> ->  Bitmap Heap Scan on download_texts dt  (cost=3937.41..26146.11 rows=6018
> width=4) (actual time=1957.074..153529.351 rows=72225 loops=1)
>      Recheck Cond: (to_tsvector('english'::regconfig, download_text) @@
> '''stimulus'''::tsquery)
>      ->  Bitmap Index Scan on download_texts_textsearch_idx
> (cost=0.00..3935.90 rows=6018 width=0) (actual time=1048.556..1048.556
> rows=72225 loops=1)
>            Index Cond: (to_tsvector('english'::regconfig, download_text) @@
> '''stimulus'''::tsquery)
> Total runtime: 153642.249 ms
> ****
>
> Nearly all of the time is being spent in the bitmap heap scan, I suspect
> because of the work of rereading and rechecking the text of all the matched
> entries.  Is this indeed what's going on here?  Is there any way to make
> postgres not do that recheck?
>
>
> Thanks!
>
> -hal
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83