Обсуждение: pg_trgm performance

Поиск
Список
Период
Сортировка

pg_trgm performance

От
Florian Weimer
Дата:
I've got a table with a few million rows, consisting of a single text
column.  The average length is about 17 characters.  For the sake of
an experiment, I put a trigram index on that table.  Unfortunately, %
queries without smallish LIMITs are ridiculously slow (they take
longer than an hour).  A full table scan with a "WHERE similarity(...)
>= 0.4" clause completes in just a couple of minutes.  The queries
only select a few hundred rows, so an index scan has got a real chance
to be faster than a sequential scan.

Am I missing something?  Or are trigrams just a poor match for my data
set?  Are the individual strings too long, maybe?

(This is with PostgreSQL 8.2.0, BTW.)

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: pg_trgm performance

От
"Steinar H. Gunderson"
Дата:
On Mon, Jan 15, 2007 at 11:16:36AM +0100, Florian Weimer wrote:
> Am I missing something?  Or are trigrams just a poor match for my data
> set?  Are the individual strings too long, maybe?

FWIW, I've seen the same results with 8.1.x.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: pg_trgm performance

От
"Guillaume Smet"
Дата:
Florian, Steinar,

Could you try to see if the GIN implementation of pg_trgm is faster in
your cases?

Florian, instead of using WHERE similarity(...) > 0.4, you should use
set_limit (SELECT set_limit(0.4);).

I posted it on -patches and it is available here:
http://people.openwide.fr/~gsmet/postgresql/pg_trgm_gin3.diff .

The patch is against HEAD but It applies on 8.2 without any problem.

After applying this patch and installing pg_trgm.sql (you should
uninstall pg_trgm before compiling using the old uninstall script),
then you can create:
CREATE INDEX idx_table_word ON table USING gin(word gin_trgm_ops);

17 characters is quite long so I'm not sure it will help you because
it usually has to recheck a high number of rows due to the GIN
implementation but I'd like to know if it's faster or slower in this
case.

If your data are not private and you don't have the time to test it, I
can test it here without any problem.

Thanks.

--
Guillaume

Re: pg_trgm performance

От
"Steinar H. Gunderson"
Дата:
On Sat, Feb 24, 2007 at 12:09:41AM +0100, Guillaume Smet wrote:
> Could you try to see if the GIN implementation of pg_trgm is faster in
> your cases?

I'm sorry, I can no longer remember where I needed pg_trgm. Simple testing of
your patch seems to indicate that the GiN version is about 65% _slower_ (18ms
vs. 30ms) for a test data set I found lying around, but I remember that on
the data set I needed it, the GIST version was a lot slower than that (think
3-400ms). The 18 vs. 30ms test is a random Amarok database, on 8.2.3
(Debian).

Sorry I couldn't be of more help.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: pg_trgm performance

От
"Guillaume Smet"
Дата:
Hi Steinar,

On 2/24/07, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
> I'm sorry, I can no longer remember where I needed pg_trgm. Simple testing of
> your patch seems to indicate that the GiN version is about 65% _slower_ (18ms
> vs. 30ms) for a test data set I found lying around, but I remember that on
> the data set I needed it, the GIST version was a lot slower than that (think
> 3-400ms). The 18 vs. 30ms test is a random Amarok database, on 8.2.3
> (Debian).

Could you post EXPLAIN ANALYZE for both queries (after 2 or 3 runs)?
And if you can provide  EXPLAIN ANALYZE for a couple of searches
(short length, medium length and long) in both cases, it could be nice
too.

The GiN version is not selective enough currently compared to GiST. It
generally finds the matching rows faster but it has a slower recheck
cond so it's sometimes interesting (in my case) and sometimes not that
interesting (it seems to be your case).

Thanks.

--
Guillaume

Re: pg_trgm performance

От
"Steinar H. Gunderson"
Дата:
On Sat, Feb 24, 2007 at 02:04:36AM +0100, Guillaume Smet wrote:
> Could you post EXPLAIN ANALYZE for both queries (after 2 or 3 runs)?

GIST version, short:

amarok=# explain analyze select count(*) from tags where title % 'foo';
                                                        QUERY PLAN
  

--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=147.84..147.85 rows=1 width=0) (actual time=16.873..16.875 rows=1 loops=1)
   ->  Bitmap Heap Scan on tags  (cost=4.59..147.74 rows=41 width=0) (actual time=16.828..16.850 rows=7 loops=1)
         Recheck Cond: (title % 'foo'::text)
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..4.58 rows=41 width=0) (actual time=16.818..16.818 rows=7
loops=1)
               Index Cond: (title % 'foo'::text)
 Total runtime: 16.935 ms
(6 rows)

GiN version, short:

amarok=# explain analyze select count(*) from tags where title % 'foo';
                                                        QUERY PLAN
   

---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=151.89..151.90 rows=1 width=0) (actual time=30.197..30.199 rows=1 loops=1)
   ->  Bitmap Heap Scan on tags  (cost=8.64..151.79 rows=41 width=0) (actual time=5.555..30.157 rows=7 loops=1)
         Filter: (title % 'foo'::text)
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..8.63 rows=41 width=0) (actual time=2.857..2.857 rows=5555
loops=1)
               Index Cond: (title % 'foo'::text)
 Total runtime: 30.292 ms
(6 rows)


GIST version, medium:

amarok=# explain analyze select count(*) from tags where title % 'chestnuts roasting on an 0pen fire';
                                                         QUERY PLAN
    

----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=147.84..147.85 rows=1 width=0) (actual time=216.149..216.151 rows=1 loops=1)
   ->  Bitmap Heap Scan on tags  (cost=4.59..147.74 rows=41 width=0) (actual time=216.135..216.137 rows=1 loops=1)
         Recheck Cond: (title % 'chestnuts roasting on an 0pen fire'::text)
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..4.58 rows=41 width=0) (actual time=216.124..216.124 rows=1
loops=1)
               Index Cond: (title % 'chestnuts roasting on an 0pen fire'::text)
 Total runtime: 216.214 ms
(6 rows)


amarok=# explain analyze select count(*) from tags where title % 'chestnuts roasting on an 0pen fire';
                                                         QUERY PLAN
     

-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=151.89..151.90 rows=1 width=0) (actual time=156.310..156.312 rows=1 loops=1)
   ->  Bitmap Heap Scan on tags  (cost=8.64..151.79 rows=41 width=0) (actual time=156.205..156.299 rows=1 loops=1)
         Filter: (title % 'chestnuts roasting on an 0pen fire'::text)
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..8.63 rows=41 width=0) (actual time=155.748..155.748 rows=36
loops=1)
               Index Cond: (title % 'chestnuts roasting on an 0pen fire'::text)
 Total runtime: 156.376 ms
(6 rows)


GIST version, long:

amarok=# explain analyze select count(*) from tags where title % 'Donaueschingen (Peter Kruders
Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'; 
;
                                                              QUERY PLAN
              

--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=147.84..147.85 rows=1 width=0) (actual time=597.115..597.117 rows=1 loops=1)
   ->  Bitmap Heap Scan on tags  (cost=4.59..147.74 rows=41 width=0) (actual time=597.102..597.104 rows=1 loops=1)
         Recheck Cond: (title % 'Donaueschingen (Peter Kruders
Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'::text)
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..4.58 rows=41 width=0) (actual time=597.093..597.093 rows=1
loops=1)
               Index Cond: (title % 'Donaueschingen (Peter Kruders
Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'::text)
 Total runtime: 597.173 ms
(6 rows)


GiN version, long:

amarok=# explain analyze select count(*) from tags where title % 'Donaueschingen (Peter Kruders
Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'; 
;
                                                              QUERY PLAN
              

--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=151.89..151.90 rows=1 width=0) (actual time=435.789..435.791 rows=1 loops=1)
   ->  Bitmap Heap Scan on tags  (cost=8.64..151.79 rows=41 width=0) (actual time=435.777..435.779 rows=1 loops=1)
         Filter: (title % 'Donaueschingen (Peter Kruders
Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'::text)
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..8.63 rows=41 width=0) (actual time=435.729..435.729 rows=1
loops=1)
               Index Cond: (title % 'Donaueschingen (Peter Kruders
Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'::text)
 Total runtime: 435.851 ms
(6 rows)


So, the GiN version seems to be a bit faster for long queries, but it's still
too slow -- in fact, _unindexed_ versions give 141ms, 342ms, 725ms for these
three queries, so for the longer queries, the gain is only about a factor
two. (By the way, I would like to stress that this is not my personal music
collection! :-P)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: pg_trgm performance

От
"Guillaume Smet"
Дата:
On 2/24/07, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:

Thanks for your time.

> GiN version, short:
>    ->  Bitmap Heap Scan on tags  (cost=8.64..151.79 rows=41 width=0) (actual time=5.555..30.157 rows=7 loops=1)
>          Filter: (title % 'foo'::text)
>          ->  Bitmap Index Scan on trgm_idx  (cost=0.00..8.63 rows=41 width=0) (actual time=2.857..2.857 rows=5555
loops=1)
>                Index Cond: (title % 'foo'::text)

This is currently the worst case in the gist - gin comparison because
in the index scan, gin version doesn't have the length of the indexed
string. So it returns a lot of rows which have every trigram of your
search string but has in fact a low similarity due to the length of
the indexed string (5555 rows -> 7 rows).
It cannot be fixed at the moment due to the way GIN indexes work.

> So, the GiN version seems to be a bit faster for long queries, but it's still
> too slow -- in fact, _unindexed_ versions give 141ms, 342ms, 725ms for these
> three queries, so for the longer queries, the gain is only about a factor
> two. (By the way, I would like to stress that this is not my personal music
> collection! :-P)

The fact is that pg_trgm is designed to index words and not to index
long sentences. I'm not that surprised it's slow in your case.

It's also my case but following the instructions in README.pg_trgm I
created a dictionary of words using tsearch2 (stat function) and I use
pg_trgm on this dictionary to find similar words in my dictionary.

For example, I rewrite the search:
auberge cevenes
as:
(auberge | auberges | aubberge | auberg) & (ceven | cene | cevenol | cevennes)
using pg_trgm and my query can find Auberge des Cévennes (currently
it's limited to the 4th most similar words but I can change it
easily).

--
Guillaume

Re: pg_trgm performance

От
Oleg Bartunov
Дата:
On Mon, 26 Feb 2007, Guillaume Smet wrote:

> On 2/24/07, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
>
> Thanks for your time.
>
>> GiN version, short:
>>    ->  Bitmap Heap Scan on tags  (cost=8.64..151.79 rows=41 width=0)
>> (actual time=5.555..30.157 rows=7 loops=1)
>>          Filter: (title % 'foo'::text)
>>          ->  Bitmap Index Scan on trgm_idx  (cost=0.00..8.63 rows=41
>> width=0) (actual time=2.857..2.857 rows=5555 loops=1)
>>                Index Cond: (title % 'foo'::text)
>
> This is currently the worst case in the gist - gin comparison because
> in the index scan, gin version doesn't have the length of the indexed
> string. So it returns a lot of rows which have every trigram of your
> search string but has in fact a low similarity due to the length of
> the indexed string (5555 rows -> 7 rows).
> It cannot be fixed at the moment due to the way GIN indexes work.
>
>> So, the GiN version seems to be a bit faster for long queries, but it's
>> still
>> too slow -- in fact, _unindexed_ versions give 141ms, 342ms, 725ms for
>> these
>> three queries, so for the longer queries, the gain is only about a factor
>> two. (By the way, I would like to stress that this is not my personal music
>> collection! :-P)
>
> The fact is that pg_trgm is designed to index words and not to index
> long sentences. I'm not that surprised it's slow in your case.
>
> It's also my case but following the instructions in README.pg_trgm I
> created a dictionary of words using tsearch2 (stat function) and I use
> pg_trgm on this dictionary to find similar words in my dictionary.
>
> For example, I rewrite the search:
> auberge cevenes
> as:
> (auberge | auberges | aubberge | auberg) & (ceven | cene | cevenol |
> cevennes)
> using pg_trgm and my query can find Auberge des C?vennes (currently
> it's limited to the 4th most similar words but I can change it
> easily).

Did you rewrite query manually or use rewrite feature of tsearch2 ?

>
> --
> Guillaume
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: pg_trgm performance

От
"Guillaume Smet"
Дата:
On 2/26/07, Oleg Bartunov <oleg@sai.msu.su> wrote:
> Did you rewrite query manually or use rewrite feature of tsearch2 ?

Currently, it's manual. I perform a pg_trgm query for each word of the
search words (a few stop words excluded) and I generate the ts_query
with the similar words instead of using the search words.
Is there any benefit of using rewrite() in this case?

--
Guillaume

Re: pg_trgm performance

От
Oleg Bartunov
Дата:
On Mon, 26 Feb 2007, Guillaume Smet wrote:

> On 2/26/07, Oleg Bartunov <oleg@sai.msu.su> wrote:
>> Did you rewrite query manually or use rewrite feature of tsearch2 ?
>
> Currently, it's manual. I perform a pg_trgm query for each word of the
> search words (a few stop words excluded) and I generate the ts_query
> with the similar words instead of using the search words.
> Is there any benefit of using rewrite() in this case?

not really, just a matter of possible interesting architectural design.

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83