Re: Unexpected sequential scan on an indexed column

Поиск
Список
Период
Сортировка
От Eddy Escardo-Raffo
Тема Re: Unexpected sequential scan on an indexed column
Дата
Msg-id 4eaa4a5e0911151706s31b2f9e1w23dae77f2db8c001@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Unexpected sequential scan on an indexed column  (Eddy Escardo-Raffo <eescardo@kikini.com>)
Ответы Re: Unexpected sequential scan on an indexed column  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-performance
Yeah, that was it. Thanks! I do have one more question at the bottom, though, if anyone has enough time to read through my analysis
 
If I create the table as:
 
CREATE TABLE users
(
userid integer NOT NULL,
location integer NOT NULL,
CONSTRAINT pk_users PRIMARY KEY (userid)
)
WITH (
OIDS=FALSE
);
 
CREATE INDEX idx_users_location
  ON users
  USING btree
  (location);
 
INSERT INTO users (userid,location) SELECT GENERATE_SERIES(1,1000000) , GENERATE_SERIES(1,1000000)/100000;
UPDATE users SET location=76543 WHERE userid=100092;
UPDATE users SET location=76543 WHERE userid=997000;
 
So, now we have 10 distinct location values, evenly distributed, one value (10) with only one row and one value (76543) with 2 rows. If, after this setup I do statement C:
 
explain analyze SELECT userid FROM users, (VALUES (76543), (892), (10)) val (id) WHERE location = val.id;
 
 Nested Loop  (cost=0.00..17277.21 rows=300000 width=4) (actual time=0.023..0.06 rows=3 loops=1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4) (actual time0.002..0.004 rows=3 loops=1)
   ->  Index Scan using idx_users_location on users  (cost=0.00..4509.06 rows=10000 width=8) (actual time=0.008..0.009 rows=1 loops=3)
         Index Cond: (users.location = "*VALUES*".column1)
 Total runtime: 0.078 ms
(5 rows)
 
and if I do statement D:
 
explain analyze SELECT userid FROM users WHERE location IN (VALUES (76543), (892), (10));
 Nested Loop  (cost=0.05..17277.24 rows=300000 width=4) (actual time=0.033..0.056 rows=3 loops=1)
   ->  HashAggregate  (cost=0.05..0.08 rows=3 width=4) (actual time=0.012..0.015 rows=3 loops=1)
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4) (actual time=0.002..0.004 rows=3 loops=1)
   ->  Index Scan using idx_users_location on users  (cost=0.00..4509.06 rows=100000 width=8) (actual time=0.007..0.009 rows=1 loops=3)
         Index Cond: (users.location = "*VALUES*".column1)
 Total runtime: 0.094 ms
(6 rows)
 
Where C has a slight edge over D (I ran them both about 5 times and it seems like C is approx. 20% faster for this specific data set). So, I think this will work pretty good. However, I'm still curious (for my own education) as to why something like the following has even more of an edge over the previous two alternatives. Statement E:
 
explain analyze SELECT userid FROM users WHERE location IN (76543, 892, 10);
 
 Bitmap Heap Scan on users  (cost=12.91..16.93 rows=1 width=4) (actual time=0.035..0.038 rows=3 loops=1)
   Recheck Cond: (location = ANY ('{76543,892,10}'::integer[]))
   ->  Bitmap Index Scan on idx_users_location  (cost=0.00..12.91 rows=1 width=0) (actual time=0.027..0.027 rows=3 loops=1)
         Index Cond: (location = ANY ('{76543,892,10}'::integer[]))
 Total runtime: 0.072 ms
(5 rows)
 
For C, the planner estimated 10 thousand rows. For D, the planner estimated 100 thousand rows, yet for E the planner estimated only 1 row, which is the closest to reality. So, is there any way to specify a query that has a SUB-SELECT that returns a small set of values so that the planner treats it similar to how it treats statement E, or does statement E get its additional edge precisely from the fact that the restriction is defined by integer literals? If so, I think it's ok, because it seems like statements C or D will work well enough when the location distribution is realistic, but I'd like to be educated for the future :)
 
Thanks again!
Eddy
 
On Sun, Nov 15, 2009 at 3:59 PM, Eddy Escardo-Raffo <eescardo@kikini.com> wrote:
Thanks, Tom. I had discarded the possibility of data type mismatch already, which was your first guess, but was wondering if the lopsided distribution of location values would lead the planner to make a decision that is good on average but bad for this particular query, as you point out in your second guess.
 
I'll try populating the test users with a more evenly distributed location field, which will be more realistic anyway, and see if that works out better.
 
BTW, the -1 is not really a dummy value, but it's just a value that we have been using in tests for "fake test location ID". I just started performance measurement for my application and so far had measured performance with every user being in the same default location and things seemed to be going well, so I tried to switch a couple users to a different location and see what happened, and that made performance drop significantly.
(even more detail: my queries also limit results to 10 approx, so DB quickly found 10 rows that match location -1, but it took a while to discover there weren't more than 2 rows with the other value).
 
Thanks!
Eddy

On Sun, Nov 15, 2009 at 3:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Eddy Escardo-Raffo <eescardo@kikini.com> writes:
> The table used in this query is called "users", and it has columns "userid"
> (primary key) and "location".
> The "location" column is indexed.
> The users table has 1 million rows, and all rows have integer typed value
> '-1' for  "location" column, except for 2 rows that have the integer value
> '76543'.

Oh, after poking at it a bit more, I realize the problem: the planner
doesn't want to use an indexscan because it assumes there's a
significant probability that the search will be for -1 (in which case
the indexscan would be slower than a seqscan, as indeed your results
prove).  Even though it could know in this particular case that the
comparison value isn't -1, I doubt that teaching it that would help your
real queries where it will probably be impossible to determine the
comparison values in advance.

I would suggest considering using NULL rather than inventing a dummy
value for unknown locations.  The estimation heuristics will play a
lot nicer with that choice.

                       regards, tom lane


В списке pgsql-performance по дате отправления:

Предыдущее
От: Eddy Escardo-Raffo
Дата:
Сообщение: Re: Unexpected sequential scan on an indexed column
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Unexpected sequential scan on an indexed column