Discussion: optimizing query with multiple aggregates
I have a reporting query that is taking nearly all of its time in aggregate functions and I'm trying to figure out how to optimize it. The query takes approximately 170ms when run with "select *", but when run with all the aggregate functions the query takes 18 seconds. The slowness comes from our attempt to find distribution data using selects of the form:
SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
repeated across many different x,y values and fields to build out several histograms of the data. The main culprit appears to be the CASE statement, but I'm not sure what to use instead. I'm sure other people have had similar queries and I was wondering what methods they used to build out data like this?
Thanks for your help,
Doug
On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@gmail.com> wrote:
> I have a reporting query that is taking nearly all of its time in aggregate
> functions and I'm trying to figure out how to optimize it. The query takes
> approximately 170ms when run with "select *", but when run with all the
> aggregate functions the query takes 18 seconds. The slowness comes from our
> attempt to find distribution data using selects of the form:
>
> SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
>
> repeated across many different x,y values and fields to build out several
> histograms of the data. The main culprit appears to be the CASE statement,
> but I'm not sure what to use instead. I'm sure other people have had
> similar queries and I was wondering what methods they used to build out data
> like this?

have you tried:

count(*) where field >= x AND field < y;

??

merlin
On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@gmail.com> wrote:
> > I have a reporting query that is taking nearly all of its time in
> > aggregate functions and I'm trying to figure out how to optimize it. [...]
>
> have you tried:
>
> count(*) where field >= x AND field < y;
>
> ??
>
> merlin

Unless I'm misunderstanding you, that would require breaking each bin
into a separate sql statement, and since I'm trying to calculate more
than 100 bins across the different fields, any improvement in the
aggregate functions would be overwhelmed by the cost of the actual
query, which is about 170ms.

Thanks,
Doug
On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@gmail.com> wrote:
> repeated across many different x,y values and fields to build out several
> histograms of the data. The main culprit appears to be the CASE statement,
> but I'm not sure what to use instead. I'm sure other people have had
> similar queries and I was wondering what methods they used to build out data
> like this?
Use GROUP BY with an appropriate division/rounding to create the buckets, if they're all the same size.
select round(field/100) as bucket, count(*) as cnt from foo group by round(field/100);
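A minimal, runnable sketch of this single-pass approach (sqlite3 is used only so the example is self-contained; the table name, data, and a bucket width of 10 are all invented, and integer division stands in for the round(field/100) above):

```python
import sqlite3

# Toy stand-in for the "foo" table with one numeric column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (field INTEGER)")
conn.executemany("INSERT INTO foo VALUES (?)",
                 [(v,) for v in [5, 12, 18, 25, 31, 99, 105]])

# One pass over the table: integer-divide each value into a width-10
# bucket and GROUP BY the bucket number, instead of one CASE per bin.
rows = conn.execute("""
    SELECT field / 10 AS bucket, COUNT(*) AS cnt
    FROM foo
    GROUP BY field / 10
    ORDER BY bucket
""").fetchall()
print(rows)  # [(0, 1), (1, 2), (2, 1), (3, 1), (9, 1), (10, 1)]
```

The same shape works in Postgres; only the division/rounding expression and bucket width need to match the real bins.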
--
- David T. Wilson
david.t.wilson@gmail.com
So you've got a query like:
SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN 1 ELSE 0 END) as zeroToTen,
SUM(CASE WHEN field >= 10 AND field < 20 THEN 1 ELSE 0 END) as tenToTwenty,
SUM(CASE WHEN field >= 20 AND field < 30 THEN 1 ELSE 0 END) as twentyToThirty,
...
FROM bigtable
My guess is this is forcing a whole bunch of if checks and you're getting CPU bound. Could you try something like:
SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN count ELSE 0 END) as zeroToTen,
SUM(CASE WHEN field >= 10 AND field < 20 THEN count ELSE 0 END) as tenToTwenty,
SUM(CASE WHEN field >= 20 AND field < 30 THEN count ELSE 0 END) as twentyToThirty,
...
FROM (SELECT field, count(*) AS count FROM bigtable GROUP BY field) AS counts
which will allow a hash aggregate. You'd do a hash aggregate on the whole table, which should be quick, and then you'd summarize your bins.
This all supposes that you don't want to just query postgres's column statistics.
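The two-level rewrite above can be sketched end-to-end with toy data (sqlite3 here purely so the example is self-contained; table name and values are invented): the inner query collapses the table to one row per distinct value, and the CASE checks then run over that much smaller set.

```python
import sqlite3

# Hypothetical stand-in for "bigtable" with one numeric column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigtable (field INTEGER)")
conn.executemany("INSERT INTO bigtable VALUES (?)",
                 [(v,) for v in [3, 7, 12, 15, 22, 22, 28, 5]])

# Pre-aggregate to one row per distinct value, then sum the per-value
# counts into the histogram bins.
row = conn.execute("""
    SELECT SUM(CASE WHEN field >=  0 AND field < 10 THEN cnt ELSE 0 END),
           SUM(CASE WHEN field >= 10 AND field < 20 THEN cnt ELSE 0 END),
           SUM(CASE WHEN field >= 20 AND field < 30 THEN cnt ELSE 0 END)
    FROM (SELECT field, COUNT(*) AS cnt FROM bigtable GROUP BY field) AS counts
""").fetchone()
print(row)  # (3, 2, 3)
```

How much this helps depends on how many distinct values the column has relative to the row count.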
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
On Wed, Oct 21, 2009 at 03:51:25PM -0700, Doug Cole wrote:
> I have a reporting query that is taking nearly all of its time in aggregate
> functions and I'm trying to figure out how to optimize it. [...] The slowness
> comes from our attempt to find distribution data using selects of the form:
>
> SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
>
> repeated across many different x,y values and fields to build out several
> histograms of the data. [...]

Hi Doug,

Have you tried using the width_bucket() function? Here is a nice
article describing its use for making histograms:

http://quantmeditate.blogspot.com/2005/03/creating-histograms-using-sql-function.html

Regards,
Ken
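For readers unfamiliar with it: width_bucket(operand, b1, b2, count) maps a value to one of `count` equal-width buckets over [b1, b2), returning 0 below the range and count+1 at or above it, so an entire histogram becomes a single GROUP BY instead of one CASE per bin. A sketch of the semantics as a pure-Python reference (not the Postgres source; the data and bucket parameters are made up):

```python
from collections import Counter

def width_bucket(value, low, high, nbuckets):
    # Mirrors Postgres width_bucket(operand, b1, b2, count) for b1 < b2:
    # bucket 0 below the range, nbuckets + 1 at or above the upper bound,
    # otherwise equal-width buckets numbered 1..nbuckets.
    if value < low:
        return 0
    if value >= high:
        return nbuckets + 1
    return int((value - low) * nbuckets / (high - low)) + 1

# The SQL equivalent would be something like (table/column names invented):
#   SELECT width_bucket(field, 0, 30, 3) AS bucket, count(*)
#   FROM bigtable GROUP BY 1 ORDER BY 1;
data = [3, 7, 12, 15, 22, 22, 28, 5]
hist = Counter(width_bucket(v, 0, 30, 3) for v in data)
print(sorted(hist.items()))  # [(1, 3), (2, 2), (3, 3)]
```

Because the bucket number is computed per row, the table is scanned once per histogram rather than once per bin.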
On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole <dougcole@gmail.com> wrote:
> On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> have you tried:
>>
>> count(*) where field >= x AND field < y;
>>
>> ??
>
> Unless I'm misunderstanding you, that would require breaking each bin
> into a separate sql statement and since I'm trying to calculate more
> than 100 bins between the different fields any improvement in the
> aggregate functions would be overwhelmed by the cost of the actual
> query, which is about 170ms.

Well, you might be able to use subselects to fetch all the results in
a single query, but it might still be slow.

...Robert
On 10/21/09 3:51 PM, "Doug Cole" <dougcole@gmail.com> wrote:
> I have a reporting query that is taking nearly all of its time in aggregate
> functions and I'm trying to figure out how to optimize it. [...] The slowness
> comes from our attempt to find distribution data using selects of the form:
>
> SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
>
> repeated across many different x,y values and fields to build out several
> histograms of the data. [...]

You might be able to do this with plain aggregates. Define a function
that generates your partitions and GROUP BY it, then use aggregate
functions for the outputs.

In either case, rather than each result being a column in one result
row, each result will be its own row. Each row would have a column that
defines the type of the result (the one you grouped on) and one with
the result value. If each is just a sum, it's easy; if there are lots
of different calculation types, it would be harder. Potentially, you
could wrap that in a subselect to pull each result out into its own
column, but that is a bit messy.

Also, in 8.4 window functions could be helpful: PARTITION BY something
that represents your buckets, perhaps?
http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html
This will generally force a sort, but shouldn't be that bad.

The function used for the GROUP BY or PARTITION BY could just be a big
CASE statement that generates a unique int per bucket, or a
truncate/rounding function. It just needs to spit out a unique result
for each bucket for the group or partition.
Hello,
I didn't try it, but the following should be slightly faster:
COUNT( CASE WHEN field >= x AND field < y THEN true END)
instead of
SUM( CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
HTH,
Marc Mamin
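The point of the COUNT form is that COUNT ignores NULLs, and a CASE with no ELSE branch yields NULL for non-matching rows, so only matching rows are tallied. A self-contained sketch comparing the two forms (sqlite3, invented table and data; `1` is used in place of `true` for portability):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (field INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(v,) for v in [1, 5, 12, 18, 25]])

# COUNT(CASE ... THEN 1 END) counts only the non-NULL (matching) rows,
# giving the same answer as SUM(CASE ... THEN 1 ELSE 0 END).
cnt, total = conn.execute("""
    SELECT COUNT(CASE WHEN field >= 0 AND field < 10 THEN 1 END),
           SUM(CASE WHEN field >= 0 AND field < 10 THEN 1 ELSE 0 END)
    FROM t
""").fetchone()
print(cnt, total)  # 2 2
```

Whether the COUNT form is measurably faster in Postgres would need benchmarking; the results are identical either way.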
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Nikolas Everett
Sent: Thursday, October 22, 2009 4:48 AM
To: Doug Cole
Cc: pgsql-performance
Subject: Re: [PERFORM] optimizing query with multiple aggregates
On Thu, Oct 22, 2009 at 6:22 AM, Kenneth Marshall <ktm@rice.edu> wrote:
> On Wed, Oct 21, 2009 at 03:51:25PM -0700, Doug Cole wrote:
>> I have a reporting query that is taking nearly all of its time in aggregate
>> functions and I'm trying to figure out how to optimize it. [...]
>
> Hi Doug,
>
> Have you tried using the width_bucket() function? Here is a nice
> article describing its use for making histograms:
>
> http://quantmeditate.blogspot.com/2005/03/creating-histograms-using-sql-function.html
>
> Regards,
> Ken

Thanks Ken, I ended up going with this approach. It meant I had to
break it into a lot more queries, one for each histogram, but even
with that added overhead I cut the time down from 18 seconds to right
around 1 second.

Doug