Обсуждение: Nested query performance issue
(This is related to an earlier post on -sql.) I'm querying for the N high scores for each game, with two tables: scores and games. CREATE TABLE game (id SERIAL NOT NULL PRIMARY KEY); CREATE TABLE score (id SERIAL NOT NULL PRIMARY KEY, score REAL, game_id INTEGER REFERENCES game (id)); -- test data: 1000 games, 100000 scores INSERT INTO game (id) select generate_series(1,1000); INSERT INTO score (game_id, score) select game.id, random() from game, generate_series(1,100); CREATE INDEX score_idx1 ON score (game_id, score desc); ANALYZE; This query retrieves the single highest score for each game, but doesn't allow any more than that--I can't get the top five scores for each game. However, it's very fast: on the above test data, it runs in 25ms on my system. With 1000000 scores, it takes 40ms. SELECT s.* FROM score s WHERE s.id IN ( -- Get the high scoring score ID for each game: SELECT ( -- Get the high score for game g: SELECT s2.id FROM score s2 WHERE s2.game_id = g.id ORDER BY s2.score DESC LIMIT 1 ) FROM game g ); This rewrite allows getting the top N scores. Unfortunately, this one takes 950ms for the same data. With 1000000 scores, it takes 14800ms. SELECT s.* FROM score s, game g WHERE s.game_id = g.id AND s.id IN ( SELECT s2.id FROM score s2 WHERE s2.game_id=g.id ORDER BY s2.score DESC LIMIT 1 ); This seems simple: for each game, search for the highest score, and then scan the tree to get the next N-1 highest scores. The first version does just that, but the second one is doing a seq scan over score. I do want to be able to use a LIMIT higher than 1, which only works with the second form. Any suggestions of how to get the efficient scanning of the first while being able to use a LIMIT greater than 1? (It'd even be faster to make several calls to the first version, varying an OFFSET to get each high score--but that's terrible.) -- Glenn Maynard
2009/4/9 Glenn Maynard <glennfmaynard@gmail.com>
(This is related to an earlier post on -sql.)
I'm querying for the N high scores for each game, with two tables:
scores and games.
CREATE TABLE game (id SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE score (id SERIAL NOT NULL PRIMARY KEY, score REAL,
game_id INTEGER REFERENCES game (id));
-- test data: 1000 games, 100000 scores
INSERT INTO game (id) select generate_series(1,1000);
INSERT INTO score (game_id, score) select game.id, random() from game,
generate_series(1,100);
CREATE INDEX score_idx1 ON score (game_id, score desc);
ANALYZE;
How about
select s1.*
from score s1 join score s2 on s1.game_id=s2.game_id and s2.score >= s1.score
group by s1.*
having count(s2.*) <= N
Note: you can have problems if you have same scores - you will loose last group that overlap N
In any case, you don't need to join game since all you need is game_id you already have in score.
P.S. EXPLAIN ANALYZE could help
Best regards, Vitalii Tymchyshyn
(I didn't notice that I ended up with "score.score" in this test case. Oops.) 2009/4/8 Віталій Тимчишин <tivv00@gmail.com>: > How about > > select s1.* > from score s1 join score s2 on s1.game_id=s2.game_id and s2.score >= > s1.score > group by s1.* > having count(s2.*) <= N I can see what this is doing, but I'm getting: ERROR: could not identify an ordering operator for type score HINT: Use an explicit ordering operator or modify the query. I'm not sure why; if I replace s1.* and s2.* with s1.id and s2.id it works, but then I only get IDs. Unfortunately, with N = 1 this takes 8100ms (vs. 950ms and 25ms)... -- Glenn Maynard
OK, got to my postgres. Here you are:
create or replace function explode_array(in_array anyarray) returns setof anyelement as
$$
select ($1)[s] from generate_series(1,array_upper($1, 1)) as s;
$$
language sql immutable;
SELECT s.* FROM score s
WHERE s.id IN (
select
-- Get the high scoring score ID for each game:
explode_array(ARRAY(
-- Get the high score for game g:
SELECT s2.id FROM score s2 WHERE s2.game_id = g.id ORDER BY
s2.score DESC LIMIT 5
))
FROM game g
);
It takes ~64ms for me
Best regards, Vitaliy Tymchyshyn
create or replace function explode_array(in_array anyarray) returns setof anyelement as
$$
select ($1)[s] from generate_series(1,array_upper($1, 1)) as s;
$$
language sql immutable;
SELECT s.* FROM score s
WHERE s.id IN (
select
-- Get the high scoring score ID for each game:
explode_array(ARRAY(
-- Get the high score for game g:
SELECT s2.id FROM score s2 WHERE s2.game_id = g.id ORDER BY
s2.score DESC LIMIT 5
))
FROM game g
);
It takes ~64ms for me
Best regards, Vitaliy Tymchyshyn
Glenn Maynard wrote: > This rewrite allows getting the top N scores. Unfortunately, this one > takes 950ms for the same data. With 1000000 scores, it takes 14800ms. > > SELECT s.* FROM score s, game g > WHERE s.game_id = g.id AND > s.id IN ( > SELECT s2.id FROM score s2 WHERE s2.game_id=g.id ORDER BY s2.score > DESC LIMIT 1 > ); You don't really need the join with game here, simplifying this into: SELECT s.* FROM score s WHERE s.id IN ( SELECT s2.id FROM score s2 WHERE s2.game_id=s.game_id ORDER BY s2.score DESC LIMIT 1 ); I don't think it makes it any faster, though. You can also do this in a very nice and clean fashion using the upcoming PG 8.4 window functions: SELECT * FROM ( SELECT s.*, rank() OVER (PARTITION BY s.game_id ORDER BY score DESC) AS rank FROM score s ) AS sub WHERE rank <= 5; but I'm not sure how much faster it is. At least here on my laptop it does a full index scan on score, which may or may not be faster than just picking the top N values for each game using the index. > This seems simple: for each game, search for the highest score, and > then scan the tree to get the next N-1 highest scores. The first > version does just that, but the second one is doing a seq scan over > score. You can do that approach with a SQL function: CREATE FUNCTION topnscores(game_id int , n int) RETURNS SETOF score LANGUAGE SQL AS $$ SELECT * FROM score s WHERE s.game_id = $1 ORDER BY score DESC LIMIT $2 $$; SELECT (sub.ts).id, (sub.ts).score, (sub.ts).game_id FROM (SELECT topnscores(g.id, 5) ts FROM game g) sub; -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Apr 9, 2009 at 7:29 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> SELECT s.* FROM score s, game g >> WHERE s.game_id = g.id AND >> s.id IN ( >> SELECT s2.id FROM score s2 WHERE s2.game_id=g.id ORDER BY s2.score >> DESC LIMIT 1 >> ); > > You don't really need the join with game here, simplifying this into: > > SELECT s.* FROM score s > WHERE s.id IN ( > SELECT s2.id FROM score s2 WHERE s2.game_id=s.game_id ORDER BY s2.score > DESC LIMIT 1 > ); > > I don't think it makes it any faster, though. It's about 10% faster for me. I'm surprised the planner can't figure out that this join is redundant. > SELECT * FROM ( > SELECT s.*, rank() OVER (PARTITION BY s.game_id ORDER BY score DESC) AS > rank FROM score s > ) AS sub WHERE rank <= 5; > > but I'm not sure how much faster it is. At least here on my laptop it does a > full index scan on score, which may or may not be faster than just picking > the top N values for each game using the index. I'll definitely check this out when 8.4 is released. > You can do that approach with a SQL function: > > CREATE FUNCTION topnscores(game_id int , n int) RETURNS SETOF score LANGUAGE > SQL AS $$ > SELECT * FROM score s WHERE s.game_id = $1 ORDER BY score DESC LIMIT $2 > $$; > > SELECT (sub.ts).id, (sub.ts).score, (sub.ts).game_id > FROM (SELECT topnscores(g.id, 5) ts FROM game g) sub; ("as ts", for anyone trying this at home) Thanks--this one runs in 32ms, which seems about right compared against the original fast LIMIT 1 version. I see a slight improvement if I mark the function stable: 31.9ms to 31.2; minor but consistent. Just out of curiosity, any explanations for this difference? I don't see any change in the resulting query plan, but the plan doesn't enter the function call. -- Glenn Maynard
2009/4/9 Віталій Тимчишин <tivv00@gmail.com>: > create or replace function explode_array(in_array anyarray) returns setof > anyelement as > $$ > select ($1)[s] from generate_series(1,array_upper($1, 1)) as s; > $$ > language sql immutable; I tried using an ARRAY like this, but didn't quite figure out the explode_array piece. -- Glenn Maynard
On Thu, 9 Apr 2009, tiv00 wrote: > create or replace function explode_array(in_array anyarray) returns setof anyelement as > $$ > select ($1)[s] from generate_series(1,array_upper($1, 1)) as s; > $$ > language sql immutable; Note that you can make this function a bit more general by using array_lower as the bottom bound: create or replace function explode_array(in_array anyarray) returns setof anyelement as $$ select ($1)[s] from generate_series (array_lower($1, 1), array_upper($1, 1)) as s; $$ language sql immutable; While you won't run into them in most situations, it is possible to create arrays where the lower bound isn't 1 by using the subscript syntax. The example in the manual even shows that somewhat odd possibilities like assigning something to "myarray[-2:7]" works. As already pointed out, once you're in 8.4 the windowing functions might be a better fit here, but 8.4 does have "unnest" built-in that replaces the need to code this sort of thing yourself. You might want to name this function accordingly to match that upcoming standard (or not, depending on whether you want to avoid or be reminding of the potential for using the built-in). See http://www.depesz.com/index.php/2008/11/14/waiting-for-84-array-aggregate-and-array-unpacker/ for some examples. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, Apr 9, 2009 at 7:29 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > CREATE FUNCTION topnscores(game_id int , n int) RETURNS SETOF score LANGUAGE > SQL AS $$ > SELECT * FROM score s WHERE s.game_id = $1 ORDER BY score DESC LIMIT $2 > $$; > > SELECT (sub.ts).id, (sub.ts).score, (sub.ts).game_id > FROM (SELECT topnscores(g.id, 5) ts FROM game g) sub; The inner query: SELECT topnscores(g.id, 5) ts FROM game g http://www.postgresql.org/docs/8.3/static/xfunc-sql.html says this is deprecated (though no deprecation warning is being generated): > Currently, functions returning sets can also be called in the select list of a query. For each row that the query generatesby itself, the function returning set is invoked, and an output row is generated for each element of the function'sresult set. Note, however, that this capability is deprecated and might be removed in future releases. It doesn't say how else to write this, though, and it's not obvious to me. "SELECT ts.* FROM topnscores(g.id, 5) AS ts, game g" doesn't work ("function expression in FROM cannot refer to other relations of same query level"). Is there an equivalent way to do this so I won't have deprecation looming over my back? I'm likely to become very dependent on this pattern. -- Glenn Maynard
Glenn Maynard <glennfmaynard@gmail.com> writes: > http://www.postgresql.org/docs/8.3/static/xfunc-sql.html says this is > deprecated (though no deprecation warning is being generated): >> Currently, functions returning sets can also be called in the select list of a query. For each row that the query generatesby itself, the function returning set is invoked, and an output row is generated for each element of the function'sresult set. Note, however, that this capability is deprecated and might be removed in future releases. The way to parse that is "we don't like this and we will get rid of it if we can ever figure out a good substitute". Right now there is no 100% substitute, so it stays. (In fact, 8.4 will extend the feature so it works in cases that don't work today, like for PL functions.) There are, however, good reasons not to like it, such as the rather questionable behavior if there's more than one SRF in the same select list. Don't complain if you run into that wart. regards, tom lane
On Thu, 9 Apr 2009, Glenn Maynard wrote: > On Thu, Apr 9, 2009 at 7:29 AM, Heikki Linnakangas wrote: >>> SELECT s.* FROM score s, game g >>> WHERE s.game_id = g.id AND >>> s.id IN ( >>> SELECT s2.id FROM score s2 WHERE s2.game_id=g.id ORDER BY s2.score >>> DESC LIMIT 1 >>> ); >> >> You don't really need the join with game here, simplifying this into: >> >> SELECT s.* FROM score s >> WHERE s.id IN ( >> SELECT s2.id FROM score s2 WHERE s2.game_id=s.game_id ORDER BY s2.score >> DESC LIMIT 1 >> ); >> >> I don't think it makes it any faster, though. > > It's about 10% faster for me. I'm surprised the planner can't figure > out that this join is redundant. Because the join isn't redundant? You're making the assumption that for every score.game_id there is exactly one game.id that matches. Of course, you may have a unique constraint and foreign key/trigger that ensures this. Matthew -- The third years are wandering about all worried at the moment because they have to hand in their final projects. Please be sympathetic to them, say things like "ha-ha-ha", but in a sympathetic tone of voice -- Computer Science Lecturer
On Tue, Apr 14, 2009 at 5:33 AM, Matthew Wakeling <matthew@flymine.org> wrote: >> It's about 10% faster for me. I'm surprised the planner can't figure >> out that this join is redundant. > > Because the join isn't redundant? You're making the assumption that for > every score.game_id there is exactly one game.id that matches. Of course, > you may have a unique constraint and foreign key/trigger that ensures this. That's the definition of the tables I gave. CREATE TABLE game (id SERIAL NOT NULL PRIMARY KEY); -- pk implies unique CREATE TABLE score (id SERIAL NOT NULL PRIMARY KEY, score REAL, game_id INTEGER REFERENCES game (id)); (I don't think it makes any difference to whether this can be optimized, but adding NOT NULL back to game_id doesn't change it, either.) -- Glenn Maynard
2009/4/9 Віталій Тимчишин <tivv00@gmail.com>: > OK, got to my postgres. Here you are: > > create or replace function explode_array(in_array anyarray) returns setof > anyelement as > $$ > select ($1)[s] from generate_series(1,array_upper($1, 1)) as s; > $$ > language sql immutable; > in 8.4, this will be replaced by the built in 'unnest'. Also we have array_agg. merlin