Thread: Unexpected sequential scan on an indexed column


Unexpected sequential scan on an indexed column

From: Eddy Escardo-Raffo
Date:
Hi, everyone.
 
Between the postgres docs, forum posts, previously answered questions, and random blogs, I've read as much as I could about why others have had similar problems before turning to you for help, so I really hope this is not some completely obvious oversight on my part (I am fairly new to DB programming, after all).
 
So, my postgres version is: PostgreSQL 8.4.1, compiled by Visual C++ build 1400, 32-bit
 
The table used in this query is called "users", and it has columns "userid" (primary key) and "location".
The "location" column is indexed.
The users table has 1 million rows; all rows have the integer value -1 in the "location" column, except for 2 rows that have the integer value 76543.
 
I've attached a file with SQL commands that will set up this condition.
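In outline, the attached script does the following (a sketch; the index name matches the EXPLAIN output below, and the exact userids picked for the two special rows are arbitrary):

CREATE TABLE users
(
  userid integer NOT NULL,
  location integer NOT NULL,
  CONSTRAINT pk_users PRIMARY KEY (userid)
);

CREATE INDEX idx_users_location ON users USING btree (location);

-- one million rows, all with the placeholder location -1 ...
INSERT INTO users (userid, location)
SELECT GENERATE_SERIES(1, 1000000), -1;

-- ... except two rows with the real location 76543
UPDATE users SET location = 76543 WHERE userid IN (100092, 997000);

VACUUM ANALYZE users;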
 
Then I run statement A:
SELECT userid FROM users, (VALUES (76543)) val (id) WHERE location = val.id;
 
and the correct 2 results are returned, but after much more time than I would expect given that the location column is indexed.
I know that if all I wanted was the results from this specific query I could simply do statement B
 
SELECT userid FROM users WHERE location = 76543;
 
and that works 100% of the time, at the speed I would expect. However, the example I'm giving here is a simplification of significantly more complex statements involving more joins, where I'm trying to minimize round trips to the database. I've narrowed things down to the point where I think that if I can figure out how to make something like statement A perform well, the overall performance problem will be pretty easy to fix.
 
When I run EXPLAIN ANALYZE for statement A, I get the following:
 
  Nested Loop  (cost=0.00..27906.01 rows=1000000 width=8) (actual time=15.670..5411.503 rows=2 loops=1)
   Join Filter: (users.location = "*VALUES*".column1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.01 rows=1 width=4) (actual time=0.005..0.007 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.028..2903.398 rows=1000000 loops=1)
 Total runtime: 5411.620 ms
(5 rows)
 
Note that I did run VACUUM ANALYZE before running EXPLAIN ANALYZE.
 
I was curious why the index wasn't being used, so I forced it with "SET enable_seqscan TO off" and then saw the following EXPLAIN ANALYZE output:
 
 Nested Loop  (cost=0.00..43889.37 rows=1000000 width=8) (actual time=5813.875..5813.897 rows=2 loops=1)
   Join Filter: (users.location = "*VALUES*".column1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.006 rows=1 loops=1)
   ->  Index Scan using idx_users_location on users  (cost=0.00..31389.36 rows=1000000 width=12) (actual time=0.375..2967.249 rows=1000000 loops=1)
 Total runtime: 5814.029 ms
 
So, even when it uses the index, the planner seems to force the query to scan through all rows, rather than stopping the scan once it knows that no more rows will be returned (given the presence of the index).
 
If I use an ORDER BY clause to force the table scan to happen in descending order by location, then the statement performs wonderfully:

postgres=# explain analyze SELECT userid FROM users, (VALUES (76543)) val (id) WHERE location = val.id ORDER BY location DESC;
 
But that's only due to the specific values used in this example and wouldn't work in general. If we ORDER BY in ascending order, the performance is still really slow. So basically, the planner seems to always want to scan the entire index sequentially, without placing any restriction on it, and it may abort the full index scan early under ordered conditions if the scan gets lucky.
 
Do you guys have any idea why this is not working as I expect? What I hope to accomplish is to have a construct such as the table I labeled "val" obtained from a sub-select. Given the size of the pool from which I'm selecting these values, I very rarely expect the number of values in the sub-select results to exceed 10, so I was hoping that the database would try something like a bitmap scan after restricting the users table to the values in the small value table. Maybe it's not doing so given the lopsided distribution of location values in the users table, but I'm just not sure.
 
Any help is appreciated.
 
Thanks!
Eddy
Attachments

Re: Unexpected sequential scan on an indexed column

From: Tom Lane
Date:
Eddy Escardo-Raffo <eescardo@kikini.com> writes:
> Do you guys have any idea why this is not working as I expect?

Datatype issue maybe?  When I try what seems to be the same case here
I get the expected indexscan, so I'm thinking the problem is that
the comparison isn't indexable, which is a possibility if the location
column isn't actually integer.

The fact that it's estimating 1000000 rows out is also extremely
suspicious --- it might or might not get the exact "2" estimate,
but I'd sure expect it to know that the majority of rows don't match.

            regards, tom lane

Re: Unexpected sequential scan on an indexed column

From: Tom Lane
Date: Sun, Nov 15, 2009 3:33 PM
Eddy Escardo-Raffo <eescardo@kikini.com> writes:
> The table used in this query is called "users", and it has columns "userid"
> (primary key) and "location".
> The "location" column is indexed.
> The users table has 1 million rows, and all rows have integer typed value
> '-1' for  "location" column, except for 2 rows that have the integer value
> '76543'.

Oh, after poking at it a bit more, I realize the problem: the planner
doesn't want to use an indexscan because it assumes there's a
significant probability that the search will be for -1 (in which case
the indexscan would be slower than a seqscan, as indeed your results
prove).  Even though it could know in this particular case that the
comparison value isn't -1, I doubt that teaching it that would help your
real queries where it will probably be impossible to determine the
comparison values in advance.

I would suggest considering using NULL rather than inventing a dummy
value for unknown locations.  The estimation heuristics will play a
lot nicer with that choice.
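Concretely, that would look something like this (an untested sketch):

-- allow NULL and use it for "location unknown"
ALTER TABLE users ALTER COLUMN location DROP NOT NULL;
UPDATE users SET location = NULL WHERE location = -1;
ANALYZE users;

NULLs are counted separately in the statistics (null_frac), and an equality comparison never matches NULL, so the planner will no longer assume that a lookup is likely to hit the one overwhelmingly common value.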

            regards, tom lane

Re: Unexpected sequential scan on an indexed column

From: Eddy Escardo-Raffo
Date: Sun, Nov 15, 2009 3:59 PM
Thanks, Tom. I had already discarded the possibility of a data type mismatch (your first guess), but I was wondering whether the lopsided distribution of location values would lead the planner to a decision that is good on average but bad for this particular query, as you point out in your second guess.
 
I'll try populating the test users with a more evenly distributed location field, which will be more realistic anyway, and see if that works out better.
 
BTW, the -1 is not really a dummy value; it's just a value we have been using in tests as a "fake test location ID". I just started performance measurement for my application. So far I had measured performance with every user in the same default location, and things seemed to be going well; then I switched a couple of users to a different location to see what happened, and performance dropped significantly.
(Even more detail: my queries also limit results to roughly 10 rows, so the DB quickly found 10 rows matching location -1, but it took a while to discover there were no more than 2 rows with the other value.)
 
Thanks!
Eddy

Re: Unexpected sequential scan on an indexed column

From: Eddy Escardo-Raffo
Date:
Yeah, that was it. Thanks! I do have one more question at the bottom, though, if anyone has enough time to read through my analysis.
 
If I create the table as:
 
CREATE TABLE users
(
  userid integer NOT NULL,
  location integer NOT NULL,
  CONSTRAINT pk_users PRIMARY KEY (userid)
)
WITH (
  OIDS=FALSE
);

CREATE INDEX idx_users_location
  ON users
  USING btree
  (location);

-- location = userid/100000, i.e. values 0..10 with ~100k rows each
-- (value 10 gets a single row: userid 1000000)
INSERT INTO users (userid, location)
SELECT GENERATE_SERIES(1,1000000), GENERATE_SERIES(1,1000000)/100000;
UPDATE users SET location=76543 WHERE userid=100092;
UPDATE users SET location=76543 WHERE userid=997000;
 
So now we have the location values 0 through 10, roughly evenly distributed except that value 10 has only one row, plus the value 76543 with 2 rows. If, after this setup, I do statement C:
 
explain analyze SELECT userid FROM users, (VALUES (76543), (892), (10)) val (id) WHERE location = val.id;
 
 Nested Loop  (cost=0.00..17277.21 rows=300000 width=4) (actual time=0.023..0.06 rows=3 loops=1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4) (actual time=0.002..0.004 rows=3 loops=1)
   ->  Index Scan using idx_users_location on users  (cost=0.00..4509.06 rows=10000 width=8) (actual time=0.008..0.009 rows=1 loops=3)
         Index Cond: (users.location = "*VALUES*".column1)
 Total runtime: 0.078 ms
(5 rows)
 
and if I do statement D:
 
explain analyze SELECT userid FROM users WHERE location IN (VALUES (76543), (892), (10));
 Nested Loop  (cost=0.05..17277.24 rows=300000 width=4) (actual time=0.033..0.056 rows=3 loops=1)
   ->  HashAggregate  (cost=0.05..0.08 rows=3 width=4) (actual time=0.012..0.015 rows=3 loops=1)
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4) (actual time=0.002..0.004 rows=3 loops=1)
   ->  Index Scan using idx_users_location on users  (cost=0.00..4509.06 rows=100000 width=8) (actual time=0.007..0.009 rows=1 loops=3)
         Index Cond: (users.location = "*VALUES*".column1)
 Total runtime: 0.094 ms
(6 rows)
 
Here C has a slight edge over D (I ran them both about 5 times, and C seems approximately 20% faster for this specific data set), so I think this will work pretty well. However, I'm still curious (for my own education) why something like the following has even more of an edge over the previous two alternatives. Statement E:
 
explain analyze SELECT userid FROM users WHERE location IN (76543, 892, 10);
 
 Bitmap Heap Scan on users  (cost=12.91..16.93 rows=1 width=4) (actual time=0.035..0.038 rows=3 loops=1)
   Recheck Cond: (location = ANY ('{76543,892,10}'::integer[]))
   ->  Bitmap Index Scan on idx_users_location  (cost=0.00..12.91 rows=1 width=0) (actual time=0.027..0.027 rows=3 loops=1)
         Index Cond: (location = ANY ('{76543,892,10}'::integer[]))
 Total runtime: 0.072 ms
(5 rows)
 
For C, the planner estimated 10 thousand rows. For D, the planner estimated 100 thousand rows, yet for E the planner estimated only 1 row, which is the closest to reality. So, is there any way to specify a query with a sub-select that returns a small set of values so that the planner treats it similarly to how it treats statement E, or does statement E get its additional edge precisely from the fact that the restriction is defined by integer literals? If so, I think that's OK, because statements C or D seem to work well enough when the location distribution is realistic, but I'd like to be educated for the future :)
 
Thanks again!
Eddy
 
Re: Unexpected sequential scan on an indexed column

From: Tom Lane
Date: Sun, Nov 15, 2009 5:23 PM
Eddy Escardo-Raffo <eescardo@kikini.com> writes:
> For C, the planner estimated 10 thousand rows. For D, the planner estimated
> 100 thousand rows, yet for E the planner estimated only 1 row, which is the
> closest to reality. So, is there any way to specify a query that has a
> SUB-SELECT that returns a small set of values so that the planner treats it
> similar to how it treats statement E, or does statement E get its additional
> edge precisely from the fact that the restriction is defined by integer
> literals?

Currently there is no attempt to look at the exact contents of a VALUES
construct for planning purposes.  For the examples you're showing it
seems like the IN (list) notation is more compact and more widely used,
so improving the VALUES alternative doesn't seem that exciting.

            regards, tom lane

Re: Unexpected sequential scan on an indexed column

From: Eddy Escardo-Raffo
Date:
OK, after reading this doc about the optimizer (which I hadn't encountered before), something clicked in my brain and I think I can answer my own question. I was basically thinking from my own perspective rather than from the query planner's perspective:
- From my perspective, I know that the sub-select will return very few values, so I naively expected the planner to do a bitmap index scan with the small set of values returned, without needing a join (such as the nested loop join it ended up choosing).
- However (and this is probably obvious to all of you), the query planner doesn't actually know that a sub-select will produce a small number of rows; it guesses the best kind of join from its statistics. A bitmap index scan is not one of the choices for a join, I'm guessing because a nested loop join with an inner index scan is a more generally applicable strategy that achieves the same order of magnitude of performance in cases that turn out to be as simple as an IN (list) restriction. But there are more competing possibilities when picking a join strategy than when applying an IN (list) restriction, so the planner may not pick the nested loop join with an inner index scan if the ANALYZE statistics don't guide it that way, even when that would be the best strategy in the end.
I guess the only way I can think of to make a generic planner perform well even in the lopsided-statistics case is to create plan nodes with contingency conditions. E.g.:
 
Plan: Nested loop join with sequential scan
Assumption: all table values are the same
Contingency plan: nested loop join with index scan
 
Then, if the assumption for the plan is violated early enough while executing the plan, the query executor would abort that plan node execution and start over with the contingency plan.
 
I guess implementing this kind of system in a generic way could get pretty hairy. Given my limited experience, I don't know whether the proportion of query plans that would be improved by such contingency plans is significant enough to warrant the cost of developing it, but I gather that most query planners (including the postgres planner) don't do this kind of contingency planning :)
 
Thanks!
Eddy
On Sun, Nov 15, 2009 at 5:46 PM, Eddy Escardo-Raffo <eescardo@kikini.com> wrote:
I was using VALUES in my examples to more closely mirror the results of a sub-select (I abstracted everything else away and noticed that even just using VALUES syntax instead of a sub-select, performance was bad). The full statement that led me into this narrower investigation in the first place looks more like:
 
explain analyze SELECT u.userid FROM users u, (SELECT locid FROM locations WHERE ...) l WHERE u.location = l.locid LIMIT 10;
 
Based on the investigation so far, it seems like this kind of statement will perform well when the users.location distribution is not overwhelmingly lopsided, but not otherwise. However, using the IN (list) notation with a list of integer literals seems to perform well no matter what is the specific distribution of values in the users.location column.
 
I would like to understand why this is so, to help me write better queries in the future.
 
Thanks,
Eddy

Re: Unexpected sequential scan on an indexed column

From: Dave Crooke
Date: Mon, Nov 16, 2009 9:44 AM
Hi Eddy

Perhaps a slightly naive suggestion .... have you considered
converting the query to a small stored procedure ('function' in
Postgres speak)? You can pull the location values, and then iterate
over a query like this:

select userid from users where location=:x

which is more-or-less guaranteed to use the index.
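For instance, something along these lines (a sketch I haven't run; the function and parameter names are made up):

CREATE FUNCTION users_for_locations(p_locids integer[])
RETURNS SETOF integer
LANGUAGE plpgsql STABLE AS $$
DECLARE
  loc integer;
BEGIN
  -- one simple equality probe per location; each query plans as a
  -- plain index scan on the location index
  FOR loc IN SELECT unnest(p_locids) LOOP
    RETURN QUERY SELECT userid FROM users WHERE location = loc;
  END LOOP;
END;
$$;

-- e.g.: SELECT * FROM users_for_locations(ARRAY[76543, 892, 10]);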


I had a somewhat similar situation recently, where I was passing in a
list of id's (from outwith Postgres) and it would on occasion avoid
the index in favour of a full table scan .... I changed this to
iterate over the id's with separate queries (in Java, but using a
function will achieve the same thing) and went from one 5 minute query
doing full table scan to a handful of queries doing sub-millisecond
direct index lookups.

Cheers
Dave

Re: Unexpected sequential scan on an indexed column

From: Eddy Escardo-Raffo
Date: Mon, Nov 16, 2009 12:45 PM
Yeah, this kind of thing would probably work. Doing this in Java with separate queries would be easy to code but would require multiple round trips. Doing it as a stored procedure would be nicer, but I'd have to think a little more about how to refactor the Java code around the query to make this happen. Thanks for the suggestion.
 
Eddy

Re: Unexpected sequential scan on an indexed column

From: Kenneth Marshall
Date: Mon, Nov 16, 2009 12:55 PM
Hi Eddy,

Here is a lookup wrapper that is used in DSPAM to work around
a similar problem. Maybe you can use it as a template for your
function:

create function lookup_tokens(integer,bigint[])
  returns setof dspam_token_data
  language plpgsql stable
  as '
declare
  v_rec record;
begin
  for v_rec in select * from dspam_token_data
    where uid=$1
      and token in (select $2[i]
        from generate_series(array_lower($2,1),array_upper($2,1)) s(i))
  loop
    return next v_rec;
  end loop;
  return;
end;';
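It's called like any other set-returning function, e.g. (made-up values):

SELECT * FROM lookup_tokens(1, ARRAY[12345, 67890]::bigint[]);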

Regards,
Ken

Re: Unexpected sequential scan on an indexed column

From: Eddy Escardo-Raffo
Date:
This is incredibly helpful, Kenneth. I didn't know about the SETOF syntax at all. This could help minimize the amount of refactoring I need to do.
 
Thanks!
Eddy

Re: Unexpected sequential scan on an indexed column

From: Dave Crooke
Date: Mon, Nov 16, 2009 1:52 PM
With Postgres, you can transparently replace a regular select with a function that takes the same types and returns a record iterator with the same columns. The only change needed is the SQL used to invoke it; you won't need any logic changes in your app code (Java or whatever), e.g.

select ............ where x=:x ......(select ...... where ..... y=:y)

Becomes

select myfunction(:x, :y)
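
For the query earlier in this thread, that could look something like the following (untested; the function name and the p_region stand-in for the elided locations filter are made up). Declaring RETURNS TABLE keeps the column name, so the caller's result set looks just like the original select:

CREATE FUNCTION users_in_region(p_region integer)
RETURNS TABLE (userid integer)
LANGUAGE plpgsql STABLE AS $$
DECLARE
  loc integer;
BEGIN
  -- iterate over the matching locations, probing the users index once each
  FOR loc IN SELECT locid FROM locations WHERE region = p_region LOOP
    RETURN QUERY SELECT u.userid FROM users u WHERE u.location = loc;
  END LOOP;
END;
$$;

-- invoked in the FROM clause, as a drop-in for the original join:
SELECT userid FROM users_in_region(:x) LIMIT 10;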

Re: Unexpected sequential scan on an indexed column

From: Eddy Escardo-Raffo
Date:
Thanks, Dave.
Eddy
