Discussion: Workaround: Planner preference for tsquery filter vs. GIN index in fast text search


Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

From:
"Laurence Parry"
Date:
As mentioned here and elsewhere (most recently in "How can I get the query
planner to use a bitmap index scap instead of an index scan ?" - 8 Mar
2014), estimation of the relative cost of text search operations using
GIN-indexed columns sometimes goes awry, particularly when there will be a
large number of matches.

The planner may choose to use a sequential or unrelated index scan with @@
as a filter, especially when incorporated as a subquery, incurring
significant cost (even without considering de-TOASTing). Pre-tsvectorizing
the column offers only a slight mitigation and can cause regressions (if
nothing else, it adds another large column).
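For reference, the pre-tsvectorizing approach is roughly the following (a sketch only; the column and index names here are illustrative, not our actual schema):

```sql
-- Store a precomputed tsvector alongside the text column and index it
-- with GIN, so queries need not call to_tsvector() per row.
ALTER TABLE keywords
    ADD COLUMN keyword_tsv tsvector;

UPDATE keywords
   SET keyword_tsv = to_tsvector('english_nostop', keyword);

CREATE INDEX keywords_tsv_idx
    ON keywords USING gin (keyword_tsv);

-- Queries then match against the stored column:
--   WHERE keyword_tsv @@ to_tsquery('english_nostop', 'tails')
```

As noted above, in our case this helped only slightly and adds a second large column that must be kept in sync (typically via a trigger).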

What worked for me (and I'm hoping for others, though YMMV) was adding
'OFFSET 0' to the subquery involving the indexed column, e.g.

...
(SELECT sk1.submission_id
   FROM submission_keywords sk1, keywords k1
  WHERE sk1.keyword_id = k1.keyword_id
    AND to_tsvector('english_nostop', k1.keyword)
        @@ to_tsquery('english_nostop', 'tails')
 OFFSET 0)
...
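In general form, the trick is just to wrap the part you want planned on its own in a subquery and append OFFSET 0, which PostgreSQL currently treats as an optimization fence (a sketch with placeholder table and column names, not our schema):

```sql
-- OFFSET 0 prevents the planner from flattening the subquery into the
-- outer query, so the inner tsquery match is planned (and indexed)
-- independently instead of becoming a per-row filter.
SELECT o.*
  FROM outer_table o
 WHERE o.id IN (
         SELECT i.outer_id
           FROM indexed_table i
          WHERE to_tsvector('english_nostop', i.col)
                @@ to_tsquery('english_nostop', 'term')
         OFFSET 0   -- the fence; returns all rows, skips none
       );
```

Since OFFSET 0 skips nothing, the result set is unchanged; only the plan shape differs.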

The result is a bitmap scan:
------------------------------------------------------------------------------------------
Nested Loop
(cost=8.73..4740.29 rows=21348 width=4)
(actual time=0.621..13.661 rows=20097 loops=1)
    ->  Bitmap Heap Scan on keywords k1
        (cost=8.30..1028.72 rows=755 width=4)
        (actual time=0.603..2.276 rows=752 loops=1)
        Recheck Cond:
        (to_tsvector('english_nostop'::regconfig, keyword) @@
'''tail'''::tsquery)
        ->  Bitmap Index Scan on keyword_to_tsvector_keywords
            (cost=0.00..8.11 rows=755 width=0)
            (actual time=0.496..0.496 rows=756 loops=1)
            Index Cond:
            (to_tsvector('english_nostop'::regconfig, keyword) @@
'''tail'''::tsquery)
    ->  Index Only Scan using keyword_id_submission_id_submission_keywords
on submission_keywords sk1
        (cost=0.43..3.47 rows=145 width=8)
        (actual time=0.005..0.010 rows=27 loops=752)
        Index Cond: (keyword_id = k1.keyword_id)
        Heap Fetches: 99
Total runtime: 14.809 ms

Without this, the test was pushed down as a filter inside a nested loop, with
disastrous results:
->  Hash Semi Join
    (cost=23.37..23.51 rows=1 width=8)
    (actual time=0.090..0.090 rows=0 loops=594670)
    Hash Cond: (s1.submission_id = sk1.submission_id)
    ->  Index Only Scan using submissions_pkey on submissions s1
        (cost=0.42..0.56 rows=1 width=4)
        (actual time=0.007..0.007 rows=1 loops=17352)
        Index Cond: (submission_id = s.submission_id)
        Heap Fetches: 8372
        ->  Hash
            (cost=22.94..22.94 rows=1 width=4)
            (actual time=0.086..0.086 rows=0 loops=594670)
            Buckets: 1024  Batches: 1  Memory Usage: 0kB
            ->  Nested Loop
                (cost=0.85..22.94 rows=1 width=4)
                (actual time=0.083..0.085 rows=0 loops=594670)
                ->  Index Only Scan using file_keyword on
submission_keywords sk1
                    (cost=0.43..0.80 rows=13 width=8)
                    (actual time=0.006..0.008 rows=9 loops=594670)
                    Index Cond: (submission_id = s.submission_id)
                    Heap Fetches: 21324
                    ->  Index Scan using keywords_pkey on keywords k1
                        (cost=0.42..1.69 rows=1 width=4)
                        (actual time=0.008..0.008 rows=0 loops=5329219)
                        Index Cond: (keyword_id = sk1.keyword_id)
                        Filter: (to_tsvector('english_nostop'::regconfig,
keyword) @@ '''tail'''::tsquery)
Total runtime: 55194.034 ms [the plan has more lines, but the 50+ seconds are accounted for above]

Yes, that's a ~3000x speedup! Not all search terms benefit so much, but we
get a lot of searches for the most common terms, and scans just get worse
the more you add.

I got the idea from Seamus Abshere:
http://seamusabshere.github.io/2013/03/29/hinting-postgres-and-mysql-with-offset-and-limit/

I've heard it said that "any Postgres DBA worth his salt" knows this trick,
as well as the use of "WITH" to create a common table expression. Alas, many
of us are still learning . . . I beat my head over this for a week, and it's
affected our site for far longer. This kind of issue makes people think they
need to replace PostgreSQL with a dedicated search solution to be able to
scale, which is a shame.
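The WITH variant of the same fence, for comparison, looks roughly like this (placeholder outer query; in current releases a WITH subquery is always materialized, so it acts as a fence as well):

```sql
-- A CTE is currently always evaluated on its own, so the tsquery match
-- cannot be flattened into the outer query.
WITH matches AS (
    SELECT sk1.submission_id
      FROM submission_keywords sk1
      JOIN keywords k1 ON sk1.keyword_id = k1.keyword_id
     WHERE to_tsvector('english_nostop', k1.keyword)
           @@ to_tsquery('english_nostop', 'tails')
)
SELECT s.*
  FROM submissions s
 WHERE s.submission_id IN (SELECT submission_id FROM matches);
```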

I know hinting has a bad rep, but this is a localized fix, and what has been
said before leads me to believe that estimating the cost of such situations
is a hard nut to crack - one which is not on anyone's plate right now.

Incidentally, documentation section 7.6, "LIMIT and OFFSET", states that
"OFFSET 0 is the same as omitting the OFFSET clause" which is clearly not
the case here. I appreciate that this is an implementation detail which
might change, but it's an important one that I think deserves mentioning.

Hope this helps,
--
Laurence "GreenReaper" Parry
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?"



Re: Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

From:
Oleg Bartunov
Date:
btw, 9.4 should be wiser in case of rare+common terms, thanks to GIN
fast scan feature.

On Sun, Apr 20, 2014 at 8:30 AM, Laurence Parry <greenreaper@hotmail.com> wrote:
> As mentioned here and elsewhere (most recently in "How can I get the query
> planner to use a bitmap index scap instead of an index scan ?" - 8 Mar
> 2014), estimation of the relative cost of text search operations using
> GIN-indexed columns sometimes goes awry, particularly when there will be a
> large number of matches.
>
> [rest of quoted message trimmed; see the original post above]


Re: Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

From:
"Laurence Parry"
Date:
> btw, 9.4 should be wiser in case of rare+common terms,
> thanks to GIN fast scan feature.

I'll look forward to it! We have a few other GIN indexes . . .

I don't want to misrepresent my impression of Postgres performance; the only
other case where I've made a significant improvement by tweaking was
pre-checking a couple of tables with count(*) > 0 before using them against
several thousand submissions (checking lists of blocked artists and keywords
against submission details). I've been pleasantly surprised at what it can
handle, especially after index-only scans came out.
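That pre-check is essentially the following (a sketch with placeholder names, not our actual tables):

```sql
-- Cheap guard query: find out whether the user has any blocked artists
-- at all before running the expensive anti-join against thousands of
-- submissions.
SELECT count(*) > 0 AS has_blocks
  FROM blocked_artists
 WHERE user_id = $1;

-- An equivalent form that can stop at the first matching row:
SELECT EXISTS (
    SELECT 1 FROM blocked_artists WHERE user_id = $1
) AS has_blocks;
```

When the guard returns false, the application skips the blocking join entirely.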

--
Laurence "GreenReaper" Parry



Re: Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

From:
Heikki Linnakangas
Date:
On 04/20/2014 07:46 AM, Oleg Bartunov wrote:
> btw, 9.4 should be wiser in case of rare+common terms, thanks to GIN
> fast scan feature.

Indeed, although we didn't actually do anything to the planner to make
it understand when fast scan helps. Doing something about cost
estimation is still on the 9.4 Open Items list, but I don't have any
ideas on what to do about it, and I haven't heard anything from
Alexander about that either. That means that the cost estimation issue
Laurence saw is going to be even worse in 9.4, because GIN is going to
be faster than a seq scan in more cases than before and the planner
doesn't know about it.

- Heikki


Re: Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

From:
Oleg Bartunov
Date:
On Tue, Apr 22, 2014 at 10:28 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 04/20/2014 07:46 AM, Oleg Bartunov wrote:
>>
>> btw, 9.4 should be wiser in case of rare+common terms, thanks to GIN
>> fast scan feature.
>
>
> Indeed, although we didn't actually do anything to the planner to make it
> understand when fast scan helps. Doing something about cost estimation is
> still on the 9.4 Open Items list, but I don't have any ideas on what to do
> about it, and I haven't heard anything from Alexander about that either.
> That means that the cost estimation issue Laurence saw is going to be even
> worse in 9.4, because GIN is going to be faster than a seq scan in more
> cases than before and the planner doesn't know about it.
>
> - Heikki

You are right, we should return to that topic.


Re: Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

From:
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 04/20/2014 07:46 AM, Oleg Bartunov wrote:
>> btw, 9.4 should be wiser in case of rare+common terms, thanks to GIN
>> fast scan feature.

> Indeed, although we didn't actually do anything to the planner to make
> it understand when fast scan helps.

The given query has nothing to do with rare+common terms, since there is
only one term in the search --- and what's more, the planner's estimate
for that term is spot on already (755 estimated matches vs 752 actual).

It looks to me like the complaint is more probably about inappropriate
choice of join order; but since we've been allowed to see only some small
portion of either the query or the plan, speculating about the root cause
is a fool's errand.

            regards, tom lane