small table, huge table, and a join = slow and tough query. cake inside!

From:
edfialk
Date:
Hey everybody, I'm having an issue executing the query I want.  I let
a couple of queries run a good 20 minutes and they still hadn't
finished.  For my app, this is way too long to be useful, and I'm not
really sure what else I can do, let alone whether my query is even
what I want.

So I'm really hoping the great internets will help me out.

example rows from huge table (almost 900,000 rows):
fips, pollutant, value
1, co, 0.1
1, co, 0.2
1, co, 0.3
1, so, 1.0
1, so, 2.0
2, co, 0.5
2, co, 0.3
etc.

example rows from small table (3233 rows):
fips, geom, name
1, some geometry, "Some County, Some State"
2, some geometry, "Some Other County, Some State"

Every fips in the small table has MULTIPLE rows in the huge table
(roughly 300 each).

So, what I need is all kinds of things, but to start (pseudo-query):

SELECT small.fips, small.geom, small.name, SUM(huge.value) from small
JOIN huge on huge.fips = small.fips WHERE (SUM(huge.value)) > 500 AND
huge.pollutant='co';

I wonder if that makes sense.  Obviously, you can't have an aggregate
in the WHERE clause, so I've tried a couple of WHERE (SELECT) kinds of
things, but nothing has worked out too well.

So first, if anyone has any idea on the best way I can do a WHERE
(sum(huge.value) > 500)
or...
any ideas on how I could speed up the query, I would be so extremely
grateful.

Thanks in advance!
-Ed

P.S. Sorry, the cake was a lie.

Re: small table, huge table, and a join = slow and tough query. cake inside!

From:
"Stephen Denne"
Date:
edfialk wrote:
> SELECT small.fips, small.geom, small.name, SUM(huge.value) from small
> JOIN huge on huge.fips = small.fips WHERE (SUM(huge.value)) > 500 AND
> huge.pollutant='co';
>
> wonder if that makes sense.  Obviously, can't have an aggregate in
> where clause, so I've tried a couple WHERE (SELECT) kind of things,
> nothing working out too well.
>
> So first, if anyone has any idea on the best way I can do a WHERE
> (sum(huge.value) > 500)
> or...
> any ideas on how I could speed up the query, I would be so extremely
> grateful.

SELECT small.fips, small.geom, small.name, SUM(huge.value)
from small JOIN huge on huge.fips = small.fips
WHERE huge.pollutant='co'
GROUP BY small.fips, small.geom, small.name
HAVING SUM(huge.value) > 500;

Regards,
Stephen Denne

Re: small table, huge table, and a join = slow and tough query. cake inside!

From:
Gregory Stark
Date:
"edfialk" <edfialk@gmail.com> writes:

> So, what I need is all kinds of things, but to start (pseudo-query):
>
> SELECT small.fips, small.geom, small.name, SUM(huge.value) from small
> JOIN huge on huge.fips = small.fips WHERE (SUM(huge.value)) > 500 AND
> huge.pollutant='co';
>
> wonder if that makes sense.  Obviously, can't have an aggregate in
> where clause, so I've tried a couple WHERE (SELECT) kind of things,
> nothing working out too well.

For this case GROUP BY plus HAVING will suffice:

select small.*, sum(huge.value)
  from small
  join huge on (huge.fips = small.fips)
 where huge.pollutant='co'
 group by small.fips, small.geom, small.name
 having sum(huge.value) > 500

But in more complex cases you may have to use a subquery and further where
clauses or even joins outside the subquery.

You could write this, for example, as:

select *
  from small join (
        select fips, sum(huge.value) as sum
          from huge
         where pollutant='co'
         group by fips
       ) as huge_sum using (fips)
 where huge_sum.sum > 500

Which may actually run faster. (Unfortunately, Postgres doesn't use the
foreign key relationship when planning, so it can't reorder the join and
the WHERE clause; it doesn't know that every "huge" record will have a
matching "small" record.)

> any ideas on how I could speed up the query, I would be so extremely
> grateful.

You could try an index on <pollutant> or <fips,pollutant>, but with the
numbers you're talking about they probably won't help unless you have a
*lot* of different pollutants and not all that many records for each
pollutant.
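
For example, something along these lines (the index names are just
illustrative):

create index huge_pollutant_idx on huge (pollutant);
-- or, covering both the join column and the filter:
create index huge_fips_pollutant_idx on huge (fips, pollutant);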

To get anything more you would have to post an EXPLAIN output and preferably
an EXPLAIN ANALYZE output if you can find a query which completes.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!

Re: small table, huge table, and a join = slow and tough query. cake inside!

From:
Klint Gore
Date:
edfialk wrote:
> So, what I need is all kinds of things, but to start (pseudo-query):
>
> SELECT small.fips, small.geom, small.name, SUM(huge.value) from small
> JOIN huge on huge.fips = small.fips WHERE (SUM(huge.value)) > 500 AND
> huge.pollutant='co';
>
> wonder if that makes sense.  Obviously, can't have an aggregate in
> where clause, so I've tried a couple WHERE (SELECT) kind of things,
> nothing working out too well.
> So first, if anyone has any idea on the best way I can do a WHERE
> (sum(huge.value) > 500)
>
See GROUP BY and HAVING; the example below puts them together.

> or...
> any ideas on how I could speed up the query, I would be so extremely
> grateful.
>
>
Which columns are primary keys or indexed?

Run this and post what it says:

  vacuum;
  analyze;
  explain <insert query here>

If you can wait for the query to finish, change the last line to
"explain analyze ..."

klint.

--
Klint Gore
Database Manager
Sheep CRC
A.G.B.U.
University of New England
Armidale NSW 2350

Ph: 02 6773 3789
Fax: 02 6773 3266
EMail: kgore4@une.edu.au


Re: small table, huge table, and a join = slow and tough query. cake inside!

From:
Alban Hertroys
Date:
On May 28, 2008, at 9:27 PM, edfialk wrote:

> example rows from huge table (almost 900,000 rows):
> fips, pollutant, value

That's not really huge in comparison with other pgsql databases.
There are terabyte installations out there ;)

> SELECT small.fips, small.geom, small.name, SUM(huge.value) from small
> JOIN huge on huge.fips = small.fips WHERE (SUM(huge.value)) > 500 AND
> huge.pollutant='co';
>
> wonder if that makes sense.  Obviously, can't have an aggregate in
> where clause, so I've tried a couple WHERE (SELECT) kind of things,
> nothing working out too well.
>
> So first, if anyone has any idea on the best way I can do a WHERE
> (sum(huge.value) > 500)
> or...
> any ideas on how I could speed up the query, I would be so extremely
> grateful.

That's what GROUP BY and HAVING are for:

SELECT fips, small.geom, small.name, SUM(huge.value)
FROM small
JOIN huge USING (fips)
WHERE huge.pollutant='co'
GROUP BY fips, small.geom, small.name
HAVING SUM(huge.value) > 500;

Guessing from your performance problem, you may not have an index on
huge.fips? And did you vacuum/analyse those tables anytime recently?
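
If not, something along these lines might help (the index name is just
an example):

CREATE INDEX huge_fips_idx ON huge (fips);
VACUUM ANALYZE huge;
VACUUM ANALYZE small;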

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.



Re: small table, huge table, and a join = slow and tough query. cake inside!

From:
"Ed Fialkowski"
Date:
Hey guys, thanks so much for all the help.  I had never seen the HAVING clause anywhere.  That solved so many problems.

As for speed, I do not have indexing (most of my SQL experience so far has been on fairly 'small' tables, which I guess would be considered tiny by everyone else).

test=# VACUUM nei_area_val;
VACUUM
test=# ANALYZE nei_area_val;
ANALYZE

and... I think one of those two helped out quite a bit.  Here's an EXPLAIN ANALYZE for one of my more complicated queries (nei_area_val = "huge", nei_area = "small"):

test=# EXPLAIN ANALYZE SELECT nei_area.fips, AsText(nei_area.the_geom) as fs_text_geom, nei_area.name, nei_area_val.strpollutantcode, SUM(nei_area_val.dblemissionnumericvalue_ton) FROM nei_area INNER JOIN nei_area_val ON nei_area.fips = nei_area_val.fips WHERE nei_area_val.strpollutantcode='CO' AND the_geom && SetSRID('BOX3D(-100.000000 40.000000,-90.000000 50.000000)'::box3d, 4269) AND intersects(the_geom, SetSRID('BOX3D(-100.000000 40.000000,-90.000000 50.000000)'::box3d, 4269)) GROUP BY nei_area.fips, nei_area.the_geom, nei_area.name, nei_area_val.strpollutantcode HAVING SUM(nei_area_val.dblemissionnumericvalue_ton) > 500;
                                                                                                                                                                                                                                                QUERY PLAN                                                                                                                                                                                                                                                
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=106998.22..107035.89 rows=39 width=3797) (actual time=4712.708..5743.313 rows=230 loops=1)
   Filter: (sum(dblemissionnumericvalue_ton) > 500::double precision)
   ->  Sort  (cost=106998.22..107003.49 rows=2108 width=3797) (actual time=4708.411..5330.771 rows=15679 loops=1)
         Sort Key: nei_area.fips, nei_area.the_geom, nei_area.name, nei_area_val.strpollutantcode
         ->  Hash Join  (cost=30627.47..103430.84 rows=2108 width=3797) (actual time=2555.057..3938.329 rows=15679 loops=1)
               Hash Cond: (nei_area_val.fips = nei_area.fips)
               ->  Seq Scan on nei_area_val  (cost=0.00..72346.21 rows=116288 width=25) (actual time=46.964..2446.264 rows=122885 loops=1)
                     Filter: (strpollutantcode = 'CO'::text)
               ->  Hash  (cost=30626.84..30626.84 rows=50 width=3781) (actual time=1193.834..1193.834 rows=415 loops=1)
                     ->  Seq Scan on nei_area  (cost=0.00..30626.84 rows=50 width=3781) (actual time=1038.950..1187.324 rows=415 loops=1)
                           Filter: ((the_geom && '0103000020AD100000010000000500000000000000000059C0000000000000444000000000000059C0000000000000494000000000008056C0000000000000494000000000008056C0000000000000444000000000000059C00000000000004440'::geometry) AND intersects(the_geom, '0103000020AD100000010000000500000000000000000059C0000000000000444000000000000059C0000000000000494000000000008056C0000000000000494000000000008056C0000000000000444000000000000059C00000000000004440'::geometry))
 Total runtime: 5762.061 ms
(12 rows)



5.7 seconds! I can live with that! So maybe it was just the vacuum I needed?

Anyway, I don't have indexing yet; I'll read up on it and try not to screw anything up when I add it (something like the sketch below, maybe). But seriously, thanks so much, all of you!
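
From the seq scans in the plan above, I'm guessing at something like this (index names made up, and I haven't tested any of it yet):

CREATE INDEX nei_area_val_pollutant_fips_idx
    ON nei_area_val (strpollutantcode, fips);

CREATE INDEX nei_area_the_geom_gist
    ON nei_area USING GIST (the_geom);

ANALYZE nei_area_val;
ANALYZE nei_area;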

-Ed