Обсуждение: Best way to get the latest revision from a table

Поиск
Список
Период
Сортировка

Best way to get the latest revision from a table

От
Nikolas Everett
Дата:
I'm using 8.3 and I have a table that contains many revisions of the same entity and I have a query that is super slow.  Please help!  I'm going to paste in some SQL to set up test cases and some plans below.  If that isn't the right way to post to this list please let me know and I'll revise.

My table looks kind of like this but wider:
CREATE TEMPORARY TABLE test (revision SERIAL NOT NULL PRIMARY KEY, a INTEGER NOT NULL, b INTEGER NOT NULL, c INTEGER NOT NULL);
INSERT INTO test (a, b, c) SELECT a, 1, 25 FROM generate_series(1, 100000) AS t1(a), generate_series(1, 10) as t2(b);
CREATE INDEX test_a ON test (a);
ANALYZE test;

I need to SELECT all the columns with the latest revision for a subset of As.  What is the right way to do this quickly?

When I do it like this:
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT *
  FROM request
  JOIN (SELECT a, MAX(b) as b FROM test GROUP BY a) max USING (a)
  JOIN test USING (a, b);
DROP TABLE request;

The plan for the SELECT is pretty bad:
"Hash Join  (cost=32792.50..77907.29 rows=62288 width=20) (actual time=769.570..2222.050 rows=199 loops=1)"
"  Hash Cond: ((max(pg_temp_7.test.revision)) = pg_temp_7.test.revision)"
"  ->  Hash Join  (cost=5.48..38659.23 rows=62288 width=8) (actual time=20.621..830.235 rows=199 loops=1)"
"        Hash Cond: (pg_temp_7.test.a = request.a)"
"        ->  GroupAggregate  (cost=0.00..37170.11 rows=62601 width=8) (actual time=16.847..808.475 rows=100000 loops=1)"
"              ->  Index Scan using test_a on test  (cost=0.00..31388.04 rows=999912 width=8) (actual time=16.826..569.035 rows=1000000 loops=1)"
"        ->  Hash  (cost=2.99..2.99 rows=199 width=4) (actual time=3.736..3.736 rows=199 loops=1)"
"              ->  Seq Scan on request  (cost=0.00..2.99 rows=199 width=4) (actual time=3.658..3.689 rows=199 loops=1)"
"  ->  Hash  (cost=15405.12..15405.12 rows=999912 width=16) (actual time=723.673..723.673 rows=1000000 loops=1)"
"        ->  Seq Scan on test  (cost=0.00..15405.12 rows=999912 width=16) (actual time=0.006..290.313 rows=1000000 loops=1)"
"Total runtime: 2222.267 ms"

If I instead issue the query as:
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL, revision INTEGER);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
UPDATE request SET revision = (SELECT MAX(revision) FROM test WHERE request.a = test.a);
ANALYZE request;
SELECT *
  FROM request
  JOIN test USING (revision)
DROP TABLE request;

The whole thing runs tons faster.  The UPDATE uses the right index and is way sub second and the SELECT's plan is fine:
"Merge Join  (cost=11.66..76.09 rows=199 width=20) (actual time=0.131..0.953 rows=199 loops=1)"
"  Merge Cond: (test.revision = request.revision)"
"  ->  Index Scan using test_pkey on test  (cost=0.00..31388.04 rows=999912 width=16) (actual time=0.017..0.407 rows=2001 loops=1)"
"  ->  Sort  (cost=11.59..12.09 rows=199 width=8) (actual time=0.102..0.133 rows=199 loops=1)"
"        Sort Key: request.revision"
"        Sort Method:  quicksort  Memory: 34kB"
"        ->  Seq Scan on request  (cost=0.00..3.99 rows=199 width=8) (actual time=0.020..0.050 rows=199 loops=1)"
"Total runtime: 1.005 ms"

Am I missing something or is this really the best way to do this in 8.3?

Thanks for slogging through all this,

Nik Everett

Re: Best way to get the latest revision from a table

От
"Kevin Grittner"
Дата:
Nikolas Everett <nik9000@gmail.com> wrote:

> Am I missing something or is this really the best way to do this in
8.3?

How about this?:

SELECT y.*
  from (select a, max(revision) as revision
          from test where a between 2 and 200
          group by a) x
  join test y using (a, revision);

-Kevin

Re: Best way to get the latest revision from a table

От
Nikolas Everett
Дата:
On Fri, Jan 14, 2011 at 5:30 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
SELECT y.*
 from (select a, max(revision) as revision
         from test where a between 2 and 200
         group by a) x
 join test y using (a, revision);

While certainly simpler than my temp table this really just exposes a flaw in my example - I'm really going to be doing this with an arbitrary list of As.

Re: Best way to get the latest revision from a table

От
"Kevin Grittner"
Дата:
Nikolas Everett <nik9000@gmail.com> wrote:

> I'm really going to be doing this with an arbitrary list of As.

OK, how about this?:

CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT y.*
  from (select a, max(revision) as revision
          from test join request using (a)
          group by a) x
  join test y using (a, revision);
DROP TABLE request;

-Kevin

Re: Best way to get the latest revision from a table

От
Shaun Thomas
Дата:
On 01/14/2011 03:17 PM, Nikolas Everett wrote:

> SELECT *
>    FROM request
>    JOIN (SELECT a, MAX(b) as b FROM test GROUP BY a) max USING (a)
>    JOIN test USING (a, b);

This actually looks like a perfect candidate for DISTINCT ON.

SELECT DISTINCT ON (a, b) a, b, revision
   FROM test
  ORDER BY a, b DESC;

Maybe I'm just misunderstanding your situation, though.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: Best way to get the latest revision from a table

От
"Kevin Grittner"
Дата:
Shaun Thomas <sthomas@peak6.com> wrote:

> This actually looks like a perfect candidate for DISTINCT ON.
>
> SELECT DISTINCT ON (a, b) a, b, revision
>    FROM test
>   ORDER BY a, b DESC;

I wouldn't say perfect.  It runs about eight times slower than what
I suggested and returns a fairly random value for revision instead
of the max(revision).

-Kevin

Re: Best way to get the latest revision from a table

От
Tom Lane
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Shaun Thomas <sthomas@peak6.com> wrote:
>> This actually looks like a perfect candidate for DISTINCT ON.
>>
>> SELECT DISTINCT ON (a, b) a, b, revision
>> FROM test
>> ORDER BY a, b DESC;

> I wouldn't say perfect.  It runs about eight times slower than what
> I suggested and returns a fairly random value for revision instead
> of the max(revision).

Shaun's example is a bit off: normally, when using DISTINCT ON, you want
an ORDER BY key that uses all the given DISTINCT keys and then some
more.  To get the max revision for each a/b combination it ought to be

SELECT DISTINCT ON (a, b) a, b, revision
   FROM test
   ORDER BY a, b, revision DESC;

As for speed, either one might be faster in a particular situation.

            regards, tom lane

Re: Best way to get the latest revision from a table

От
"Kevin Grittner"
Дата:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Shaun's example is a bit off

> As for speed, either one might be faster in a particular
> situation.

After fixing a mistake in my testing and learning from Tom's example
I generated queries against the OP's test data which produce
identical results, and I'm finding no significant difference between
run times for the two versions.  The OP should definitely try both
against the real tables.

Here are the queries which run against the test set:

DROP TABLE IF EXISTS request;
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT y.*
  from (select a, max(revision) as revision
          from test join request using (a)
          group by a) x
  join test y using (a, revision)
  order by a, revision DESC;

DROP TABLE IF EXISTS request;
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT DISTINCT ON (a, b, c) revision, a, b, c
   FROM test join request using (a)
   ORDER BY a, b, c, revision DESC;

Sorry for not sorting it out better initially.

-Kevin

Re: Best way to get the latest revision from a table

От
Nikolas Everett
Дата:


On Fri, Jan 14, 2011 at 7:59 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Shaun's example is a bit off

> As for speed, either one might be faster in a particular
> situation.

After fixing a mistake in my testing and learning from Tom's example
I generated queries against the OP's test data which produce
identical results, and I'm finding no significant difference between
run times for the two versions.  The OP should definitely try both
against the real tables.

<snip> 
-Kevin

After trying both against the real tables DISTINCT ON seems to be about two orders of magnitude faster than the other options.

Thanks so much!

Nik Everett

Re: Best way to get the latest revision from a table

От
Shaun Thomas
Дата:
> Shaun's example is a bit off: normally, when using DISTINCT ON, you want
> an ORDER BY key that uses all the given DISTINCT keys and then some
> more.  To get the max revision for each a/b combination it ought to be

Hah, well i figured I was doing something wrong. I just thought about it a little bit, said to myself: "Hey, I've used
thisbefore to get the most recent x for a bunch of y without a sub-query. We always used it to get the newest update to
anevent log. 

But that's why I said I was probably misunderstanding something. :) I was trying to pick apart the logic to his temp
tablesand saw the max(b) and it threw me off. Glad you're around to set it straight. Heh. 


______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: Best way to get the latest revision from a table

От
Shaun Thomas
Дата:
> After trying both against the real tables DISTINCT ON seems to be
> about two orders of magnitude faster than the other options.

Glad it worked. It looked at least naively similar to situations I've run into and DISTINCT ON always helped me out.
It'sall the fun of GROUP BY with the ability to discard non-aggregate results just by screwing around with your
sorting.Still one of my favorite tricks. 

--
Shaun Thomas
Peak6 | 141 W. Jackson Blvd. | Suite 800 | Chicago, IL 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: Best way to get the latest revision from a table

От
Robert Haas
Дата:
On Fri, Jan 14, 2011 at 8:50 PM, Nikolas Everett <nik9000@gmail.com> wrote:
>
>
> On Fri, Jan 14, 2011 at 7:59 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>>
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> > Shaun's example is a bit off
>>
>> > As for speed, either one might be faster in a particular
>> > situation.
>>
>> After fixing a mistake in my testing and learning from Tom's example
>> I generated queries against the OP's test data which produce
>> identical results, and I'm finding no significant difference between
>> run times for the two versions.  The OP should definitely try both
>> against the real tables.
>>
> <snip>
>>
>> -Kevin
>
> After trying both against the real tables DISTINCT ON seems to be about two
> orders of magnitude faster than the other options.

What I've often done in these situations is add a Boolean to the table
that defaults to true, and an ON INSERT trigger that flips the Boolean
for any existing row with the same key to false.  Then you can just do
something like "SELECT * FROM tab WHERE latest".  And you can create
partial indexes etc: CREATE INDEX foo ON tab (a) WHERE latest.

Although if using DISTINCT ON is working, no reason to do anything
more complicated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Best way to get the latest revision from a table

От
Nikolas Everett
Дата:
Distinct on is working really well!  If I need to be able to index something I might start thinking along those lines.

On Fri, Jan 21, 2011 at 12:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 14, 2011 at 8:50 PM, Nikolas Everett <nik9000@gmail.com> wrote:
>
>
> On Fri, Jan 14, 2011 at 7:59 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>>
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> > Shaun's example is a bit off
>>
>> > As for speed, either one might be faster in a particular
>> > situation.
>>
>> After fixing a mistake in my testing and learning from Tom's example
>> I generated queries against the OP's test data which produce
>> identical results, and I'm finding no significant difference between
>> run times for the two versions.  The OP should definitely try both
>> against the real tables.
>>
> <snip>
>>
>> -Kevin
>
> After trying both against the real tables DISTINCT ON seems to be about two
> orders of magnitude faster than the other options.

What I've often done in these situations is add a Boolean to the table
that defaults to true, and an ON INSERT trigger that flips the Boolean
for any existing row with the same key to false.  Then you can just do
something like "SELECT * FROM tab WHERE latest".  And you can create
partial indexes etc: CREATE INDEX foo ON tab (a) WHERE latest.

Although if using DISTINCT ON is working, no reason to do anything
more complicated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company