Discussion: query not using GIN index


query not using GIN index

From
"Guo, Yun"
Date:
Hi,

We have a query on a column with a GIN index, but the planner chooses not to use the index and does a seq scan instead, which is much slower

CREATE INDEX idx_access_grants_on_access_tokens ON access_grants USING gin (access_tokens); 

explain analyze SELECT "access_grants".* FROM "access_grants"  WHERE (access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']) ;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..7.46 rows=1 width=157) (actual time=260.376..260.377 rows=1 loops=1)
   ->  Seq Scan on access_grants  (cost=0.00..29718.03 rows=3985 width=157) (actual time=260.373..260.373 rows=1 loops=1)
         Filter: (access_tokens @> '{124e5a1f9de325fc176a7c89152ac734}'::text[])
         Rows Removed by Filter: 796818
 Total runtime: 260.408 ms


We tested on a smaller table in the development region and it chooses to use the index there. However, on the production-size table it decides to ignore the index for unknown reasons.

Is the large number of tuples skewing the query planner's decision, or is the index itself larger than the table, so that it decides to do a table scan instead?

Any suggestions are greatly appreciated!

Yun

Re: query not using GIN index

From
Tomas Vondra
Date:
Hi,

On 08/22/2015 03:55 AM, Guo, Yun wrote:
> Hi,
>
> We have a query on a column with a GIN index, but the planner chooses not
> to use the index and does a seq scan instead, which is much slower
>
> CREATE INDEX idx_access_grants_on_access_tokens ON access_grants USING
> gin (access_tokens);
>
> explain analyze SELECT "access_grants".* FROM "access_grants"  WHERE
> (access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']) ;
>                                                          QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------
>   Limit  (cost=0.00..7.46 rows=1 width=157) (actual
> time=260.376..260.377 rows=1 loops=1)
>     ->  Seq Scan on access_grants  (cost=0.00..29718.03 rows=3985
> width=157) (actual time=260.373..260.373 rows=1 loops=1)
>           Filter: (access_tokens @>
> '{124e5a1f9de325fc176a7c89152ac734}'::text[])
>           Rows Removed by Filter: 796818
>   Total runtime: 260.408 ms
>

I find it very likely that the explain output actually comes from a
slightly different query, including a LIMIT 1 clause.

That might easily be the problem here, because the optimizer expects the
3985 "matches" to be uniformly distributed in the table, so it thinks
it'll scan just a tiny fraction of the table (1/3985) until the first
match. But it's quite possible all matching rows are at the end of the
table, and the executor actually has to scan the whole table.
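A quick way to see this effect (a sketch, using the table and index names from the original post) is to compare the two plans side by side:

```sql
-- With LIMIT 1 the planner bets it will hit an early match during a
-- seq scan, so only a tiny fraction of the total cost is charged:
EXPLAIN ANALYZE
SELECT * FROM access_grants
WHERE access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']
LIMIT 1;

-- Without the LIMIT the full scan cost is charged up front, which
-- usually makes the GIN index scan the cheaper plan:
EXPLAIN ANALYZE
SELECT * FROM access_grants
WHERE access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734'];
```

If the two plans differ only in the presence of the LIMIT node, that confirms the cost model, not the index, is the issue.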

It's difficult to say without further details of the table and how the
data are generated.

> We tested on smaller table in development region and it chooses to use
> the index there. However, in production size table it decides to ignore
> the index for unknown reasons.

Please provide explain output from that table. It's difficult to say
what's different without seeing the details.

Also please provide important details about the system (e.g. which
PostgreSQL version, how much RAM, what work_mem/shared_buffers and such
stuff).
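For reference, all of these can be pulled from the server itself with standard commands:

```sql
SELECT version();           -- exact PostgreSQL version
SHOW shared_buffers;        -- buffer cache size
SHOW work_mem;              -- per-operation sort/hash memory
SHOW effective_cache_size;  -- planner's estimate of OS + PG cache
```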

> Is the large number of tuples skewing the query planner’s decision
> or the index itself is larger than the table therefor it would decide
> to do table scan?

What large number of tuples? The indexes are supposed to be more
efficient the larger the table is.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: query not using GIN index

From
"Guo, Yun"
Date:
Hi Tom,
Thanks for your valuable input. You're right, the plan was coming from
explain analyze SELECT "access_grants".* FROM "access_grants"  WHERE
(access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']) limit 1;
We tried removing "limit 1", which did give us the benefit of using the index
for some time. However, after a while, it went back to the old behavior of
ignoring the index for "SELECT "access_grants".* FROM "access_grants"
WHERE (access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']) ;".
We had to turn off sequential scans (enable_seqscan) to force it to use the
index. But I'm not sure this should be the permanent fix.
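(If this is only needed for diagnosis, a safer pattern is to flip the setting per session rather than globally — a sketch against the same query:)

```sql
SET enable_seqscan = off;   -- discourage seq scans in this session only

EXPLAIN ANALYZE
SELECT * FROM access_grants
WHERE access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734'];

RESET enable_seqscan;       -- restore the default afterwards
```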
The access_grants table has 797415 rows and the schema as below:

         Column          |            Type             |                         Modifiers
-------------------------+-----------------------------+------------------------------------------------------------
 id                      | integer                     | not null default nextval('access_grants_id_seq'::regclass)
 user_id                 | integer                     | not null
 code                    | text                        | not null
 client_application_name | text                        | not null
 access_tokens           | text[]                      | default '{}'::text[]
 created_at              | timestamp without time zone | not null
 updated_at              | timestamp without time zone | not null
 mongo_id                | text                        |
Indexes:
    "access_grants_pkey" PRIMARY KEY, btree (id)
    "index_access_grants_on_code" UNIQUE, btree (code)
    "index_access_grants_on_mongo_id" UNIQUE, btree (mongo_id)
    "idx_access_grants_on_access_tokens" gin (access_tokens)
    "index_access_grants_on_user_id" btree (user_id)



The array-length distribution of access_tokens is below:
309997 rows have only one element, 248334 rows have an empty array, 432 rows
have array length >100, and 1 row has array length 3575.
The table size is 154MB, and the index size is 180MB.
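(For anyone reproducing these numbers, they come straight from the standard size functions:)

```sql
SELECT pg_size_pretty(pg_relation_size('access_grants'))                      AS table_size,
       pg_size_pretty(pg_relation_size('idx_access_grants_on_access_tokens')) AS index_size;
```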



It's on an AWS db.r3.xlarge instance with 4 virtual cores, 30.5 GiB of memory,
general-purpose SSD, with shared_buffers 1048576 and work_mem 159744.






On 8/22/15, 12:36 AM, "pgsql-performance-owner@postgresql.org on behalf of
Tomas Vondra" <pgsql-performance-owner@postgresql.org on behalf of
tomas.vondra@2ndquadrant.com> wrote:

>Hi,
>
>On 08/22/2015 03:55 AM, Guo, Yun wrote:
>> Hi,
>>
>> We have a query on a column with a GIN index, but the planner chooses not
>> to use the index and does a seq scan instead, which is much slower
>>
>> CREATE INDEX idx_access_grants_on_access_tokens ON access_grants USING
>> gin (access_tokens);
>>
>> explain analyze SELECT "access_grants".* FROM "access_grants"  WHERE
>> (access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']) ;
>>                                                          QUERY PLAN
>>
>>-------------------------------------------------------------------------
>>-------------------------------------------------
>>   Limit  (cost=0.00..7.46 rows=1 width=157) (actual
>> time=260.376..260.377 rows=1 loops=1)
>>     ->  Seq Scan on access_grants  (cost=0.00..29718.03 rows=3985
>> width=157) (actual time=260.373..260.373 rows=1 loops=1)
>>           Filter: (access_tokens @>
>> '{124e5a1f9de325fc176a7c89152ac734}'::text[])
>>           Rows Removed by Filter: 796818
>>   Total runtime: 260.408 ms
>>
>
>I find it very likely that the explain output actually comes from a
>slightly different query, including a LIMIT 1 clause.
>
>That might easily be the problem here, because the optimizer expects the
>3985 "matches" to be uniformly distributed in the table, so it thinks
>it'll scan just a tiny fraction of the table (1/3985) until the first
>match. But it's quite possible all matching rows are at the end of the
>table, and the executor actually has to scan the whole table.
>
>It's difficult to say without further details of the table and how the
>data are generated.
>
>> We tested on smaller table in development region and it chooses to use
>> the index there. However, in production size table it decides to ignore
>> the index for unknown reasons.
>
>Please provide explain output from that table. It's difficult to say
>what's different without seeing the details.
>
>Also please provide important details about the system (e.g. which
>PostgreSQL version, how much RAM, what work_mem/shared_buffers and such
>stuff).
>
>> Is the large number of tuples skewing the query planner's decision
>> or the index itself is larger than the table therefor it would decide
>> to do table scan?
>
>What large number of tuples? The indexes are supposed to be more
>efficient the larger the table is.
>
>regards
>
>--
>Tomas Vondra                  http://www.2ndQuadrant.com
>PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
>--
>Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-performance



Re: query not using GIN index

From
Jeff Janes
Date:
On Mon, Aug 24, 2015 at 8:18 AM, Guo, Yun <YGuo@cvent.com> wrote:


From: Jeff Janes <jeff.janes@gmail.com>
Date: Friday, August 21, 2015 at 10:44 PM
To: Yun <yguo@cvent.com>
Subject: Re: [PERFORM] query not using GIN index

On Fri, Aug 21, 2015 at 6:55 PM, Guo, Yun <YGuo@cvent.com> wrote:
Hi,

We have a query on a column with a GIN index, but the planner chooses not to use the index and does a seq scan instead, which is much slower

CREATE INDEX idx_access_grants_on_access_tokens ON access_grants USING gin (access_tokens); 

explain analyze SELECT "access_grants".* FROM "access_grants"  WHERE (access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']) ;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..7.46 rows=1 width=157) (actual time=260.376..260.377 rows=1 loops=1)
   ->  Seq Scan on access_grants  (cost=0.00..29718.03 rows=3985 width=157) (actual time=260.373..260.373 rows=1 loops=1)
         Filter: (access_tokens @> '{124e5a1f9de325fc176a7c89152ac734}'::text[])
         Rows Removed by Filter: 796818
 Total runtime: 260.408 ms


What version are you running?  What are your non-default configuration settings (particularly for the *_cost parameters)?

 select name,setting from pg_settings where name like '%cost';
         name         | setting
----------------------+---------
 cpu_index_tuple_cost | 0.005
 cpu_operator_cost    | 0.0025
 cpu_tuple_cost       | 0.01
 random_page_cost     | 4
 seq_page_cost        | 1


OK, thanks.  I had overlooked the "LIMIT" in the first plan you posted, and so thought you must have some pretty weird settings.  But noticing the LIMIT, it makes more sense with the normal settings, like the ones you show.
 

Can you turn track_io_timing on and then report a explain (analyze, buffers) of the same query? 

I didn’t try this as our prod instance is on AWS and setting this would require a reboot.

OK, but you can still do an "explain (analyze,buffers)".  It is less useful than with track_io_timing on, but it is still more useful than just "explain analyze".
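For reference, the buffers form is just an extra option on EXPLAIN and needs no server restart (a sketch against the same query):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM access_grants
WHERE access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734'];
-- The "Buffers: shared hit=... read=..." lines in the output show how
-- much data came from cache versus disk, even without track_io_timing.
```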
 

Then do a "set enable_seqscan=off" and repeat.

This is the life saver!  After applying this, it’s able to use the index.  But should we consider it as the permanent solution?

No, probably not a permanent solution.  Or at least, I only do things like that in production as a last resort.  I suggested doing that so you can force it to use the index and so see what the explain (analyze,buffers) look like when it does use the index.  Sorry for not being more clear.

The seq scan thinks it is going to find a matching row pretty early in the scan and can stop at the first one, but based on "Rows Removed by Filter: 796818" it isn't actually finding a match until the end.  There probably isn't much you can do about this, other than not using a LIMIT.

The reason it thinks it will find a row soon is that it thinks 0.5% of the rows meet your criteria.  That is default selectivity estimate it uses when it has nothing better to use.  Raising the statistics target on the column might help.  But I doubt it, because access tokens are probably nearly unique, and so even the highest possible setting for statistics target is not going get it to record MCE statistics.  See https://commitfest.postgresql.org/6/323/ for a possible solution, but any fix for that won't be released to production for a long time.
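Raising the target would look like this (a sketch; 10000 is the maximum allowed value, and as noted above it may well not help for near-unique tokens):

```sql
-- Collect much larger samples for this column's statistics:
ALTER TABLE access_grants
  ALTER COLUMN access_tokens SET STATISTICS 10000;

-- Rebuild the statistics at the new target:
ANALYZE access_grants;
```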


If your GIN index has a large pending list, that will make the index scan look very expensive.  Vacuuming the table will clear that up.  Setting fastupdate off for the index will prevent it growing again.  Based on your description of most lists having 0 or 1 elements in them, and my assumption that a table named "access_grants" isn't getting updated hundreds of times a second, I don't think fastupdate being off is going to cause any problems at all.
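Concretely, that would be (a sketch using the index name from earlier in the thread):

```sql
-- Flush the GIN pending list into the main index structure:
VACUUM access_grants;

-- Make future inserts go straight into the index instead of the
-- pending list, trading slightly slower writes for predictable scans:
ALTER INDEX idx_access_grants_on_access_tokens SET (fastupdate = off);
```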

Cheers,

Jeff