Обсуждение: Best way to get the latest revision from a table
I'm using 8.3 and I have a table that contains many revisions of the same entity and I have a query that is super slow. Please help! I'm going to paste in some SQL to set up test cases and some plans below. If that isn't the right way to post to this list please let me know and I'll revise.
My table looks kind of like this but wider:
CREATE TEMPORARY TABLE test (revision SERIAL NOT NULL PRIMARY KEY, a INTEGER NOT NULL, b INTEGER NOT NULL, c INTEGER NOT NULL);
INSERT INTO test (a, b, c) SELECT a, 1, 25 FROM generate_series(1, 100000) AS t1(a), generate_series(1, 10) as t2(b);
CREATE INDEX test_a ON test (a);
ANALYZE test;
I need to SELECT all the columns with the latest revision for a subset of As. What is the right way to do this quickly?
When I do it like this:
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT *
FROM request
JOIN (SELECT a, MAX(b) as b FROM test GROUP BY a) max USING (a)
JOIN test USING (a, b);
DROP TABLE request;
The plan for the SELECT is pretty bad:
"Hash Join (cost=32792.50..77907.29 rows=62288 width=20) (actual time=769.570..2222.050 rows=199 loops=1)"
" Hash Cond: ((max(pg_temp_7.test.revision)) = pg_temp_7.test.revision)"
" -> Hash Join (cost=5.48..38659.23 rows=62288 width=8) (actual time=20.621..830.235 rows=199 loops=1)"
" Hash Cond: (pg_temp_7.test.a = request.a)"
" -> GroupAggregate (cost=0.00..37170.11 rows=62601 width=8) (actual time=16.847..808.475 rows=100000 loops=1)"
" -> Index Scan using test_a on test (cost=0.00..31388.04 rows=999912 width=8) (actual time=16.826..569.035 rows=1000000 loops=1)"
" -> Hash (cost=2.99..2.99 rows=199 width=4) (actual time=3.736..3.736 rows=199 loops=1)"
" -> Seq Scan on request (cost=0.00..2.99 rows=199 width=4) (actual time=3.658..3.689 rows=199 loops=1)"
" -> Hash (cost=15405.12..15405.12 rows=999912 width=16) (actual time=723.673..723.673 rows=1000000 loops=1)"
" -> Seq Scan on test (cost=0.00..15405.12 rows=999912 width=16) (actual time=0.006..290.313 rows=1000000 loops=1)"
"Total runtime: 2222.267 ms"
If I instead issue the query as:
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL, revision INTEGER);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
UPDATE request SET revision = (SELECT MAX(revision) FROM test WHERE request.a = test.a);
ANALYZE request;
SELECT *
FROM request
JOIN test USING (revision)
DROP TABLE request;
The whole thing runs tons faster. The UPDATE uses the right index and is way sub second and the SELECT's plan is fine:
"Merge Join (cost=11.66..76.09 rows=199 width=20) (actual time=0.131..0.953 rows=199 loops=1)"
" Merge Cond: (test.revision = request.revision)"
" -> Index Scan using test_pkey on test (cost=0.00..31388.04 rows=999912 width=16) (actual time=0.017..0.407 rows=2001 loops=1)"
" -> Sort (cost=11.59..12.09 rows=199 width=8) (actual time=0.102..0.133 rows=199 loops=1)"
" Sort Key: request.revision"
" Sort Method: quicksort Memory: 34kB"
" -> Seq Scan on request (cost=0.00..3.99 rows=199 width=8) (actual time=0.020..0.050 rows=199 loops=1)"
"Total runtime: 1.005 ms"
Am I missing something or is this really the best way to do this in 8.3?
Thanks for slogging through all this,
Nik Everett
Nikolas Everett <nik9000@gmail.com> wrote: > Am I missing something or is this really the best way to do this in 8.3? How about this?: SELECT y.* from (select a, max(revision) as revision from test where a between 2 and 200 group by a) x join test y using (a, revision); -Kevin
On Fri, Jan 14, 2011 at 5:30 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
SELECT y.*from (select a, max(revision) as revision
from test where a between 2 and 200
group by a) x
join test y using (a, revision);
While certainly simpler than my temp table this really just exposes a flaw in my example - I'm really going to be doing this with an arbitrary list of As.
Nikolas Everett <nik9000@gmail.com> wrote: > I'm really going to be doing this with an arbitrary list of As. OK, how about this?: CREATE TEMPORARY TABLE request (a INTEGER NOT NULL); INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a); ANALYZE request; SELECT y.* from (select a, max(revision) as revision from test join request using (a) group by a) x join test y using (a, revision); DROP TABLE request; -Kevin
On 01/14/2011 03:17 PM, Nikolas Everett wrote: > SELECT * > FROM request > JOIN (SELECT a, MAX(b) as b FROM test GROUP BY a) max USING (a) > JOIN test USING (a, b); This actually looks like a perfect candidate for DISTINCT ON. SELECT DISTINCT ON (a, b) a, b, revision FROM test ORDER BY a, b DESC; Maybe I'm just misunderstanding your situation, though. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 sthomas@peak6.com ______________________________________________ See http://www.peak6.com/email_disclaimer.php for terms and conditions related to this email
Shaun Thomas <sthomas@peak6.com> wrote: > This actually looks like a perfect candidate for DISTINCT ON. > > SELECT DISTINCT ON (a, b) a, b, revision > FROM test > ORDER BY a, b DESC; I wouldn't say perfect. It runs about eight times slower than what I suggested and returns a fairly random value for revision instead of the max(revision). -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Shaun Thomas <sthomas@peak6.com> wrote: >> This actually looks like a perfect candidate for DISTINCT ON. >> >> SELECT DISTINCT ON (a, b) a, b, revision >> FROM test >> ORDER BY a, b DESC; > I wouldn't say perfect. It runs about eight times slower than what > I suggested and returns a fairly random value for revision instead > of the max(revision). Shaun's example is a bit off: normally, when using DISTINCT ON, you want an ORDER BY key that uses all the given DISTINCT keys and then some more. To get the max revision for each a/b combination it ought to be SELECT DISTINCT ON (a, b) a, b, revision FROM test ORDER BY a, b, revision DESC; As for speed, either one might be faster in a particular situation. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Shaun's example is a bit off > As for speed, either one might be faster in a particular > situation. After fixing a mistake in my testing and learning from Tom's example I generated queries against the OP's test data which produce identical results, and I'm finding no significant difference between run times for the two versions. The OP should definitely try both against the real tables. Here are the queries which run against the test set: DROP TABLE IF EXISTS request; CREATE TEMPORARY TABLE request (a INTEGER NOT NULL); INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a); ANALYZE request; SELECT y.* from (select a, max(revision) as revision from test join request using (a) group by a) x join test y using (a, revision) order by a, revision DESC; DROP TABLE IF EXISTS request; CREATE TEMPORARY TABLE request (a INTEGER NOT NULL); INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a); ANALYZE request; SELECT DISTINCT ON (a, b, c) revision, a, b, c FROM test join request using (a) ORDER BY a, b, c, revision DESC; Sorry for not sorting it out better initially. -Kevin
On Fri, Jan 14, 2011 at 7:59 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> As for speed, either one might be faster in a particularAfter fixing a mistake in my testing and learning from Tom's example
> situation.
I generated queries against the OP's test data which produce
identical results, and I'm finding no significant difference between
run times for the two versions. The OP should definitely try both
against the real tables.
<snip>
-Kevin
After trying both against the real tables DISTINCT ON seems to be about two orders of magnitude faster than the other options.
Thanks so much!
Nik Everett
> Shaun's example is a bit off: normally, when using DISTINCT ON, you want > an ORDER BY key that uses all the given DISTINCT keys and then some > more. To get the max revision for each a/b combination it ought to be Hah, well i figured I was doing something wrong. I just thought about it a little bit, said to myself: "Hey, I've used thisbefore to get the most recent x for a bunch of y without a sub-query. We always used it to get the newest update to anevent log. But that's why I said I was probably misunderstanding something. :) I was trying to pick apart the logic to his temp tablesand saw the max(b) and it threw me off. Glad you're around to set it straight. Heh. ______________________________________________ See http://www.peak6.com/email_disclaimer.php for terms and conditions related to this email
> After trying both against the real tables DISTINCT ON seems to be > about two orders of magnitude faster than the other options. Glad it worked. It looked at least naively similar to situations I've run into and DISTINCT ON always helped me out. It'sall the fun of GROUP BY with the ability to discard non-aggregate results just by screwing around with your sorting.Still one of my favorite tricks. -- Shaun Thomas Peak6 | 141 W. Jackson Blvd. | Suite 800 | Chicago, IL 60604 312-676-8870 sthomas@peak6.com ______________________________________________ See http://www.peak6.com/email_disclaimer.php for terms and conditions related to this email
On Fri, Jan 14, 2011 at 8:50 PM, Nikolas Everett <nik9000@gmail.com> wrote: > > > On Fri, Jan 14, 2011 at 7:59 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> >> Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> > Shaun's example is a bit off >> >> > As for speed, either one might be faster in a particular >> > situation. >> >> After fixing a mistake in my testing and learning from Tom's example >> I generated queries against the OP's test data which produce >> identical results, and I'm finding no significant difference between >> run times for the two versions. The OP should definitely try both >> against the real tables. >> > <snip> >> >> -Kevin > > After trying both against the real tables DISTINCT ON seems to be about two > orders of magnitude faster than the other options. What I've often done in these situations is add a Boolean to the table that defaults to true, and an ON INSERT trigger that flips the Boolean for any existing row with the same key to false. Then you can just do something like "SELECT * FROM tab WHERE latest". And you can create partial indexes etc: CREATE INDEX foo ON tab (a) WHERE latest. Although if using DISTINCT ON is working, no reason to do anything more complicated. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Distinct on is working really well! If I need to be able to index something I might start thinking along those lines.
On Fri, Jan 21, 2011 at 12:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
What I've often done in these situations is add a Boolean to the tableOn Fri, Jan 14, 2011 at 8:50 PM, Nikolas Everett <nik9000@gmail.com> wrote:
>
>
> On Fri, Jan 14, 2011 at 7:59 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>>
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> > Shaun's example is a bit off
>>
>> > As for speed, either one might be faster in a particular
>> > situation.
>>
>> After fixing a mistake in my testing and learning from Tom's example
>> I generated queries against the OP's test data which produce
>> identical results, and I'm finding no significant difference between
>> run times for the two versions. The OP should definitely try both
>> against the real tables.
>>
> <snip>
>>
>> -Kevin
>
> After trying both against the real tables DISTINCT ON seems to be about two
> orders of magnitude faster than the other options.
that defaults to true, and an ON INSERT trigger that flips the Boolean
for any existing row with the same key to false. Then you can just do
something like "SELECT * FROM tab WHERE latest". And you can create
partial indexes etc: CREATE INDEX foo ON tab (a) WHERE latest.
Although if using DISTINCT ON is working, no reason to do anything
more complicated.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company