Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

From: Laurence Parry
Subject: Workaround: Planner preference for tsquery filter vs. GIN index in fast text search
Date:
Msg-id: BLU178-DS1116AC5CC07E7D50A4589CA55F0@phx.gbl
Responses: Re: Workaround: Planner preference for tsquery filter vs. GIN index in fast text search  (Oleg Bartunov <obartunov@gmail.com>)
List: pgsql-performance
As mentioned here and elsewhere (most recently in "How can I get the query
planner to use a bitmap index scap instead of an index scan ?" - 8 Mar
2014), estimation of the relative cost of text search operations using
GIN-indexed columns sometimes goes awry, particularly when there will be a
large number of matches.

The planner may choose to use a sequential or unrelated index scan with @@
as a filter, especially when incorporated as a subquery, incurring
significant cost (even without considering de-TOASTing). Pre-tsvectorizing
the column offers only a slight mitigation and can cause regressions (if
nothing else, it adds another large column).
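(For reference, the plans below rely on a GIN expression index on the keyword column. The exact definition here is my reconstruction, but judging by the index name in the plan it was created along these lines:

    CREATE INDEX keyword_to_tsvector_keywords
        ON keywords
        USING gin (to_tsvector('english_nostop', keyword));

An expression index like this lets @@ use the index without storing a separate tsvector column.)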

What worked for me (and I'm hoping for others, though YMMV) was adding
'OFFSET 0' to the subquery involving the indexed column, e.g.

...
(SELECT sk1.submission_id
 FROM submission_keywords sk1, keywords k1
 WHERE sk1.keyword_id = k1.keyword_id
   AND to_tsvector('english_nostop', k1.keyword)
       @@ to_tsquery('english_nostop', 'tails')
 OFFSET 0)
...

The result is a bitmap scan:
------------------------------------------------------------------------------------------
Nested Loop
(cost=8.73..4740.29 rows=21348 width=4)
(actual time=0.621..13.661 rows=20097 loops=1)
    ->  Bitmap Heap Scan on keywords k1
        (cost=8.30..1028.72 rows=755 width=4)
        (actual time=0.603..2.276 rows=752 loops=1)
        Recheck Cond: (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
        ->  Bitmap Index Scan on keyword_to_tsvector_keywords
            (cost=0.00..8.11 rows=755 width=0)
            (actual time=0.496..0.496 rows=756 loops=1)
            Index Cond: (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
    ->  Index Only Scan using keyword_id_submission_id_submission_keywords on submission_keywords sk1
        (cost=0.43..3.47 rows=145 width=8)
        (actual time=0.005..0.010 rows=27 loops=752)
        Index Cond: (keyword_id = k1.keyword_id)
        Heap Fetches: 99
Total runtime: 14.809 ms

Without this the test was moved to a filter inside a nested loop, with
disastrous results:
->  Hash Semi Join
    (cost=23.37..23.51 rows=1 width=8)
    (actual time=0.090..0.090 rows=0 loops=594670)
    Hash Cond: (s1.submission_id = sk1.submission_id)
    ->  Index Only Scan using submissions_pkey on submissions s1
        (cost=0.42..0.56 rows=1 width=4)
        (actual time=0.007..0.007 rows=1 loops=17352)
        Index Cond: (submission_id = s.submission_id)
        Heap Fetches: 8372
    ->  Hash
        (cost=22.94..22.94 rows=1 width=4)
        (actual time=0.086..0.086 rows=0 loops=594670)
        Buckets: 1024  Batches: 1  Memory Usage: 0kB
        ->  Nested Loop
            (cost=0.85..22.94 rows=1 width=4)
            (actual time=0.083..0.085 rows=0 loops=594670)
            ->  Index Only Scan using file_keyword on submission_keywords sk1
                (cost=0.43..0.80 rows=13 width=8)
                (actual time=0.006..0.008 rows=9 loops=594670)
                Index Cond: (submission_id = s.submission_id)
                Heap Fetches: 21324
            ->  Index Scan using keywords_pkey on keywords k1
                (cost=0.42..1.69 rows=1 width=4)
                (actual time=0.008..0.008 rows=0 loops=5329219)
                Index Cond: (keyword_id = sk1.keyword_id)
                Filter: (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
Total runtime: 55194.034 ms [there are other lines, but the 50 seconds is accounted for above]

Yes, that's a ~3000x speedup! Not all search terms benefit so much, but we get a lot of searches for the most common terms, and the filter-based scans only get worse the more terms you add.

I got the idea from Seamus Abshere:
http://seamusabshere.github.io/2013/03/29/hinting-postgres-and-mysql-with-offset-and-limit/

I've heard it said that "any Postgres DBA worth his salt" knows this trick, as well as the use of "WITH" to create a common table expression. Alas, many of us are still learning... I beat my head against this for a week, and it has affected our site for far longer. This kind of issue makes people think they need to replace PostgreSQL with a dedicated search solution in order to scale, which is a shame.

I know hinting has a bad rep, but this is a localized fix, and what has been
said before leads me to believe that estimating the cost of such situations
is a hard nut to crack - one which is not on anyone's plate right now.

Incidentally, documentation section 7.6, "LIMIT and OFFSET", states that "OFFSET 0 is the same as omitting the OFFSET clause", which is clearly not the case here. I appreciate that this is an implementation detail which might change, but it's an important one that I think deserves a mention.

Hope this helps,
--
Laurence "GreenReaper" Parry
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?"


