Обсуждение: Why is GIN index slowing down my query?

Поиск
Список
Период
Сортировка

Why is GIN index slowing down my query?

От
AlexK987
Дата:
I've created a GIN index on an INT[] column, but it slows down the selects.
Here is my table:

create table talent(person_id INT NOT NULL,
skills INT[] NOT NULL);

insert into talent(person_id, skills)
select generate_series, array[0, 1] || generate_series
from generate_series(3, 1048575);

create index talent_skills on talent using gin(skills);

analyze talent;

Here is my select:


explain analyze
select * from talent
where skills <@ array[1, 15]

"Bitmap Heap Scan on talent  (cost=52.00..56.01 rows=1 width=37) (actual
time=590.022..590.022 rows=0 loops=1)"
"  Recheck Cond: (skills <@ '{1,15}'::integer[])"
"  Rows Removed by Index Recheck: 1048573"
"  Heap Blocks: exact=8739"
"  ->  Bitmap Index Scan on talent_skills  (cost=0.00..52.00 rows=1 width=0)
(actual time=207.661..207.661 rows=1048573 loops=1)"
"        Index Cond: (skills <@ '{1,15}'::integer[])"
"Planning time: 1.310 ms"
"Execution time: 590.078 ms"


If I drop my GIN index, my select is faster:


drop index talent_skills

explain analyze
select * from talent
where skills <@ array[1, 15]

"Seq Scan on talent  (cost=0.00..21846.16 rows=1 width=37) (actual
time=347.442..347.442 rows=0 loops=1)"
"  Filter: (skills <@ '{1,15}'::integer[])"
"  Rows Removed by Filter: 1048573"
"Planning time: 0.130 ms"
"Execution time: 347.470 ms"

Am I missing something?




--
View this message in context: http://postgresql.nabble.com/Why-is-GIN-index-slowing-down-my-query-tp5836319.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Why is GIN index slowing down my query?

От
Tom Lane
Дата:
AlexK987 <alex.cue.987@gmail.com> writes:
> I've created a GIN index on an INT[] column, but it slows down the selects.
> Here is my table:

> create table talent(person_id INT NOT NULL,
> skills INT[] NOT NULL);

> insert into talent(person_id, skills)
> select generate_series, array[0, 1] || generate_series
> from generate_series(3, 1048575);

> create index talent_skills on talent using gin(skills);

> analyze talent;

> Here is my select:

> explain analyze
> select * from talent
> where skills <@ array[1, 15]

Well, that's pretty much going to suck given that data distribution.
Since "1" is a member of every last entry, the GIN scan will end up
examining every entry, and then rejecting all of them as not being
true subsets of [1,15].  I'm not sure whether it'd be practical to
teach GIN about negative proofs, ie noticing that rows containing "0"
could be eliminated based on the index contents.  But in any case it
does not know that today.

Another problem, at least with the default statistics target, is that
the entries for "0" and "1" swamp everything else so that the planner
doesn't know that eg "15" is really rare.

You'd be much better off if you could refactor the data representation so
that whatever you mean by "0" and "1" is stored separately from whatever
you mean by the other entries, ie, don't keep both extremely common and
extremely rare entries in the same array.

Also ... perhaps I'm reading too much into the name you chose for the
column, but I'm finding it hard to imagine why you'd care about the
performance of this query as opposed to "where skills @> array[1, 15]".
That is, wouldn't you typically be searching for people with at least
certain specified skills, rather than at most certain specified skills?

Another thing that maybe is a question for -hackers is why we consider
arraycontained to be an indexable operator at all.  As this example
demonstrates, the current GIN infrastructure isn't really capable
of coping efficiently, at least not in the general case.  It might be
all right in specific cases, but the planner doesn't have the smarts
to tell which is which.

            regards, tom lane


Re: Why is GIN index slowing down my query?

От
AlexK987
Дата:
Tom,

This is a realistic case: everyone have Python and Java skills, but PostGis
and Haskell and Closure are rare. If we are looking for a person that has
all the skills required for a task (array[1, 15]), that is "skills <@
array[1, 15] " and not the opposite, right?

Also can you explain why " entries for "0" and "1" swamp everything else so
that the planner
doesn't know that eg "15" is really rare. " I thought that if a value is not
found in the histogram, than clearly that value is rare, correct? What am I
missing here?

I hear what you are saying about "don't keep both extremely common and
extremely rare entries in the same array", but I cannot predict the future,
so I do not know which values are going to be common next year, or two years
later. So I think it would be very difficult to follow this advice.

What do you think?



--
View this message in context:
http://postgresql.nabble.com/Why-is-GIN-index-slowing-down-my-query-tp5836319p5836323.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Why is GIN index slowing down my query?

От
AlexK987
Дата:
Tom,

Oops, you were absolutely right: I needed to use @> instead of <@. Thanks
again!



--
View this message in context:
http://postgresql.nabble.com/Why-is-GIN-index-slowing-down-my-query-tp5836319p5836327.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Why is GIN index slowing down my query?

От
Tom Lane
Дата:
AlexK987 <alex.cue.987@gmail.com> writes:
> This is a realistic case: everyone have Python and Java skills, but PostGis
> and Haskell and Closure are rare. If we are looking for a person that has
> all the skills required for a task (array[1, 15]), that is "skills <@
> array[1, 15] " and not the opposite, right?

One of us has this backwards.  It might be me, but I don't think so.
Consider a person who has the two desired skills plus skill #42:

regression=# select array[1,15,42] <@ array[1,15];
 ?column?
----------
 f
(1 row)

regression=# select array[1,15,42] @> array[1,15];
 ?column?
----------
 t
(1 row)

> Also can you explain why " entries for "0" and "1" swamp everything else so
> that the planner
> doesn't know that eg "15" is really rare. " I thought that if a value is not
> found in the histogram, than clearly that value is rare, correct? What am I
> missing here?

The problem is *how* rare.  The planner will take the lowest frequency
seen among the most common elements as an upper bound for the frequency of
unlisted elements --- but if all you have in the stats array is 0 and 1,
and they both have frequency 1.0, that doesn't tell you anything.  And
that's what I see for this example:

regression=# select most_common_elems,most_common_elem_freqs from pg_stats where tablename = 'talent' and attname =
'skills';
 most_common_elems | most_common_elem_freqs
-------------------+------------------------
 {0,1}             | {1,1,1,1,0}
(1 row)

With a less skewed distribution, that rule of thumb would work better :-(

            regards, tom lane


Re: Why is GIN index slowing down my query?

От
Marc Mamin
Дата:
AlexK987 <alex.cue.987@gmail.com> writes:
>> I've created a GIN index on an INT[] column, but it slows down the selects.
>> Here is my table:
>
>> create table talent(person_id INT NOT NULL,
>> skills INT[] NOT NULL);
>
>> insert into talent(person_id, skills)
>> select generate_series, array[0, 1] || generate_series
>> from generate_series(3, 1048575);
>
>> create index talent_skills on talent using gin(skills);
>
>> analyze talent;
>
>> Here is my select:
>
>> explain analyze
>> select * from talent
>> where skills <@ array[1, 15]
>
>Well, that's pretty much going to suck given that data distribution.
>Since "1" is a member of every last entry, the GIN scan will end up
>examining every entry, and then rejecting all of them as not being
>true subsets of [1,15].

This is equivalent and fast:

explain analyze
WITH rare AS (
 select * from talent
 where skills @> array[15])
select * from rare
 where skills @> array[1]
 -- (with changed operator)

You might variate your query according to an additional table that keeps the occurrence count of all skills.
Not really pretty though.

regards,

Marc Mamin

Re: Why is GIN index slowing down my query?

От
Marc Mamin
Дата:
AlexK987 <alex.cue.987@gmail.com> writes:
>>> I've created a GIN index on an INT[] column, but it slows down the selects.
>>> Here is my table:
>>
>>> create table talent(person_id INT NOT NULL,
>>> skills INT[] NOT NULL);
>>
>>> insert into talent(person_id, skills)
>>> select generate_series, array[0, 1] || generate_series
>>> from generate_series(3, 1048575);
>>
>>> create index talent_skills on talent using gin(skills);
>>
>>> analyze talent;
>>
>>> Here is my select:
>>
>>> explain analyze
>>> select * from talent
>>> where skills <@ array[1, 15]
>>
>>Well, that's pretty much going to suck given that data distribution.
>>Since "1" is a member of every last entry, the GIN scan will end up
>>examining every entry, and then rejecting all of them as not being
>>true subsets of [1,15].
>
>This is equivalent and fast:
>
>explain analyze
>WITH rare AS (
> select * from talent
> where skills @> array[15])
>select * from rare
> where skills @> array[1]
> -- (with changed operator)
>
>You might variate your query according to an additional table that keeps the occurrence count of all skills.
>Not really pretty though.

I wonder if in such cases, the Bitmap Index Scan could discard entries that would result in a table scan
and use them only in the recheck part:

explain
 select * from talent
 where skills @> array[1]

 Seq Scan on talent  (cost=0.00..21846.16 rows=1048573 width=37)
   Filter: (skills @> '{1}'::integer[])