Discussion: Why is GIN index slowing down my query?
I've created a GIN index on an INT[] column, but it slows down the selects.
Here is my table:

create table talent(person_id INT NOT NULL,
  skills INT[] NOT NULL);

insert into talent(person_id, skills)
select generate_series, array[0, 1] || generate_series
from generate_series(3, 1048575);

create index talent_skills on talent using gin(skills);

analyze talent;

Here is my select:

explain analyze
select * from talent
where skills <@ array[1, 15]

"Bitmap Heap Scan on talent  (cost=52.00..56.01 rows=1 width=37) (actual time=590.022..590.022 rows=0 loops=1)"
"  Recheck Cond: (skills <@ '{1,15}'::integer[])"
"  Rows Removed by Index Recheck: 1048573"
"  Heap Blocks: exact=8739"
"  ->  Bitmap Index Scan on talent_skills  (cost=0.00..52.00 rows=1 width=0) (actual time=207.661..207.661 rows=1048573 loops=1)"
"        Index Cond: (skills <@ '{1,15}'::integer[])"
"Planning time: 1.310 ms"
"Execution time: 590.078 ms"

If I drop my GIN index, my select is faster:

drop index talent_skills

explain analyze
select * from talent
where skills <@ array[1, 15]

"Seq Scan on talent  (cost=0.00..21846.16 rows=1 width=37) (actual time=347.442..347.442 rows=0 loops=1)"
"  Filter: (skills <@ '{1,15}'::integer[])"
"  Rows Removed by Filter: 1048573"
"Planning time: 0.130 ms"
"Execution time: 347.470 ms"

Am I missing something?

--
View this message in context: http://postgresql.nabble.com/Why-is-GIN-index-slowing-down-my-query-tp5836319.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
AlexK987 <alex.cue.987@gmail.com> writes:
> I've created a GIN index on an INT[] column, but it slows down the selects.
> Here is my table:

> create table talent(person_id INT NOT NULL,
>   skills INT[] NOT NULL);

> insert into talent(person_id, skills)
> select generate_series, array[0, 1] || generate_series
> from generate_series(3, 1048575);

> create index talent_skills on talent using gin(skills);

> analyze talent;

> Here is my select:

> explain analyze
> select * from talent
> where skills <@ array[1, 15]

Well, that's pretty much going to suck given that data distribution.
Since "1" is a member of every last entry, the GIN scan will end up
examining every entry, and then rejecting all of them as not being
true subsets of [1,15].

I'm not sure whether it'd be practical to teach GIN about negative
proofs, ie noticing that rows containing "0" could be eliminated based
on the index contents. But in any case it does not know that today.

Another problem, at least with the default statistics target, is that
the entries for "0" and "1" swamp everything else, so the planner
doesn't know that eg "15" is really rare.

You'd be much better off if you could refactor the data representation
so that whatever you mean by "0" and "1" is stored separately from
whatever you mean by the other entries, ie, don't keep both extremely
common and extremely rare entries in the same array.

Also ... perhaps I'm reading too much into the name you chose for the
column, but I'm finding it hard to imagine why you'd care about the
performance of this query as opposed to "where skills @> array[1, 15]".
That is, wouldn't you typically be searching for people with at least
certain specified skills, rather than at most certain specified skills?

Another thing that maybe is a question for -hackers is why we consider
arraycontained to be an indexable operator at all. As this example
demonstrates, the current GIN infrastructure isn't really capable of
coping with it efficiently, at least not in the general case. It might
be all right in specific cases, but the planner doesn't have the smarts
to tell which is which.

			regards, tom lane
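[A rough sketch of the refactoring Tom suggests, for illustration only. The table and column names are hypothetical, and it assumes the ubiquitous skills (0 and 1 in the test data) can be identified up front:]

```sql
-- Hypothetical layout: keep the skills that nearly everyone has in
-- plain boolean columns, and only the rare ones in the GIN-indexed array.
create table talent2(
    person_id   INT NOT NULL,
    has_skill_0 BOOLEAN NOT NULL,
    has_skill_1 BOOLEAN NOT NULL,
    rare_skills INT[] NOT NULL   -- only uncommon skill ids go here
);

create index talent2_rare_skills on talent2 using gin(rare_skills);

-- "has at least skills 1 and 15" now hits the GIN index only for the
-- genuinely selective key:
select * from talent2
where has_skill_1 and rare_skills @> array[15];
```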
Tom,

This is a realistic case: everyone has Python and Java skills, but
PostGIS and Haskell and Clojure are rare. If we are looking for a person
that has all the skills required for a task (array[1, 15]), that is
"skills <@ array[1, 15]" and not the opposite, right?

Also, can you explain why "entries for "0" and "1" swamp everything else
so that the planner doesn't know that eg "15" is really rare"? I thought
that if a value is not found in the histogram, then clearly that value
is rare, correct? What am I missing here?

I hear what you are saying about "don't keep both extremely common and
extremely rare entries in the same array", but I cannot predict the
future, so I do not know which values are going to be common next year,
or two years later. So I think it would be very difficult to follow this
advice. What do you think?
Tom,

Oops, you were absolutely right: I needed to use @> instead of <@.
Thanks again!
AlexK987 <alex.cue.987@gmail.com> writes:
> This is a realistic case: everyone has Python and Java skills, but
> PostGIS and Haskell and Clojure are rare. If we are looking for a person
> that has all the skills required for a task (array[1, 15]), that is
> "skills <@ array[1, 15]" and not the opposite, right?

One of us has this backwards. It might be me, but I don't think so.
Consider a person who has the two desired skills plus skill #42:

regression=# select array[1,15,42] <@ array[1,15];
 ?column?
----------
 f
(1 row)

regression=# select array[1,15,42] @> array[1,15];
 ?column?
----------
 t
(1 row)

> Also, can you explain why "entries for "0" and "1" swamp everything else
> so that the planner doesn't know that eg "15" is really rare"? I thought
> that if a value is not found in the histogram, then clearly that value
> is rare, correct? What am I missing here?

The problem is *how* rare. The planner will take the lowest frequency
seen among the most common elements as an upper bound for the frequency
of unlisted elements --- but if all you have in the stats array is 0 and
1, and they both have frequency 1.0, that doesn't tell you anything.
And that's what I see for this example:

regression=# select most_common_elems, most_common_elem_freqs
regression-# from pg_stats where tablename = 'talent' and attname = 'skills';
 most_common_elems | most_common_elem_freqs
-------------------+------------------------
 {0,1}             | {1,1,1,1,0}
(1 row)

With a less skewed distribution, that rule of thumb would work better :-(

			regards, tom lane
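[An aside on the "default statistics target" remark above: the number of most-common elements that ANALYZE tracks scales with the column's statistics target, so raising it lets the planner record frequencies for more values. This is a sketch, not something Tom prescribes for this data set, and 1000 is an arbitrary example value:]

```sql
-- Raise the per-column statistics target so ANALYZE tracks more
-- most-common elements, tightening the planner's upper bound on the
-- frequency of values that are absent from the stats.
alter table talent alter column skills set statistics 1000;
analyze talent;

-- Inspect what was collected:
select most_common_elems, most_common_elem_freqs
from pg_stats
where tablename = 'talent' and attname = 'skills';
```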
AlexK987 <alex.cue.987@gmail.com> writes:
>> I've created a GIN index on an INT[] column, but it slows down the selects.
>> Here is my table:

>> create table talent(person_id INT NOT NULL,
>>   skills INT[] NOT NULL);

>> insert into talent(person_id, skills)
>> select generate_series, array[0, 1] || generate_series
>> from generate_series(3, 1048575);

>> create index talent_skills on talent using gin(skills);

>> analyze talent;

>> Here is my select:

>> explain analyze
>> select * from talent
>> where skills <@ array[1, 15]

> Well, that's pretty much going to suck given that data distribution.
> Since "1" is a member of every last entry, the GIN scan will end up
> examining every entry, and then rejecting all of them as not being
> true subsets of [1,15].

This is equivalent and fast:

explain analyze
WITH rare AS (
    select * from talent
    where skills @> array[15])
select * from rare
where skills @> array[1]
-- (with changed operator)

You might vary your query according to an additional table that keeps
the occurrence count of all skills. Not really pretty though.

regards,
Marc Mamin
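[A rough sketch of the occurrence-count table Marc alludes to. The table name is hypothetical, and the counts would have to be refreshed as the data changes:]

```sql
-- Hypothetical helper table: how often each skill occurs, so a query
-- generator can put the rarest requested skill into the indexed
-- predicate and filter on the common ones afterwards.
create table skill_counts as
select skill, count(*) as occurrences
from talent, unnest(skills) as skill
group by skill;

-- Pick the rarest of the requested skills to drive the index scan:
select skill
from skill_counts
where skill in (1, 15)
order by occurrences
limit 1;
```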
>> Well, that's pretty much going to suck given that data distribution.
>> Since "1" is a member of every last entry, the GIN scan will end up
>> examining every entry, and then rejecting all of them as not being
>> true subsets of [1,15].

> This is equivalent and fast:

> explain analyze
> WITH rare AS (
>     select * from talent
>     where skills @> array[15])
> select * from rare
> where skills @> array[1]
> -- (with changed operator)

> You might vary your query according to an additional table that keeps
> the occurrence count of all skills. Not really pretty though.

I wonder if in such cases the Bitmap Index Scan could discard entries
that would result in a table scan, and use them only in the recheck part:

explain
select * from talent
where skills @> array[1]

Seq Scan on talent  (cost=0.00..21846.16 rows=1048573 width=37)
  Filter: (skills @> '{1}'::integer[])