Discussion: Calculating statistic via function rather than with query is slowing my query


Calculating statistic via function rather than with query is slowing my query

From:
Anish Kejariwal
Date:
Hi everyone,

I'm using postgres 9.0.3, and here's the OS I'm running this on:
Linux 2.6.18-238.12.1.el5xen #1 SMP Tue May 31 14:02:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

I have a fairly straightforward query.  I'm doing a group by on an ID, and then calculating a statistic on the resulting data.  The problem I'm running into is that when I calculate the statistic via a function, it's twice as slow as when I calculate the statistic directly in the query.  I want to be able to use a function, since I'll be using this particular calculation in many places.

Any idea of what's going on?  Below, I've included my function and both queries (I removed the type_ids and just wrote ...ids...).

Here's my function (I also tried stable):
CREATE OR REPLACE FUNCTION calc_test(a double precision, b integer, c integer)
RETURNS double precision AS $$
BEGIN
        return a/b/c* 1000000000::double precision;
END;
$$ LANGUAGE plpgsql immutable;


The query that takes 7.6 seconds, with the statistic calculated directly in the query:
explain analyze 
select
   agg.primary_id,
   avg(agg.a / agg.b / agg.c * 1000000000::double precision) foo,
   stddev(agg.a / agg.b / agg.c * 1000000000::double precision) bar
from mytable agg
where agg.type_id in (....ids....)
group by agg.primary_id;

The execution plan:
 HashAggregate  (cost=350380.58..350776.10 rows=9888 width=20) (actual time=7300.414..7331.659 rows=20993 loops=1)
   ->  Bitmap Heap Scan on mytable agg  (cost=28667.90..337509.63 rows=1716127 width=20) (actual time=200.064..2861.600 rows=2309230 loops=1)
         Recheck Cond: (type_id = ANY ('{....ids....}'::integer[]))
         ->  Bitmap Index Scan on mytable_type_id_idx  (cost=0.00..28238.87 rows=1716127 width=0) (actual time=192.725..192.725 rows=2309230 loops=1)
               Index Cond: (type_id = ANY ('{....ids....}'::integer[]))
 Total runtime: 7358.337 ms
(6 rows)




The same query, but now calling the function instead; this takes 15.5 seconds.
explain analyze select
   agg.primary_id,
   avg(calc_test(agg.a,agg.b,agg.c)) foo,
   stddev(calc_test(agg.a,agg.b,agg.c)) bar
from mytable agg
where agg.type_id in (....ids....)
group by agg.primary_id;

And here's the execution plan:

 HashAggregate  (cost=350380.58..355472.90 rows=9888 width=20) (actual time=13660.838..13686.618 rows=20993 loops=1)
   ->  Bitmap Heap Scan on mytable agg  (cost=28667.90..337509.63 rows=1716127 width=20) (actual time=170.385..2881.122 rows=2309230 loops=1)
         Recheck Cond: (type_id = ANY ('{....ids....}'::integer[]))
         ->  Bitmap Index Scan on mytable_type_id_idx  (cost=0.00..28238.87 rows=1716127 width=0) (actual time=162.834..162.834 rows=2309230 loops=1)
               Index Cond: (type_id = ANY ('{....ids....}'::integer[]))
 Total runtime: 13707.560 ms


Thanks!

Anish

Re: Calculating statistic via function rather than with query is slowing my query

From:
Pavel Stehule
Date:
Hello

2011/8/17 Anish Kejariwal <anishkej@gmail.com>:
> Here's my function (I also tried stable):
> CREATE OR REPLACE FUNCTION calc_test(a double precision, b integer, c
> integer)
> RETURNS double precision AS $$
> BEGIN
>         return a/b/c* 1000000000::double precision;
> END;
> $$ LANGUAGE plpgsql immutable;
>

This is the overhead of the PL/pgSQL call. For a simple function like this, use an SQL function instead:

CREATE OR REPLACE FUNCTION calc_test(a double precision, b integer, c integer)
RETURNS double precision AS $$
        SELECT $1/$2/$3 * 1000000000::double precision;
$$ LANGUAGE sql;
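
A minimal sketch of the same definition with the IMMUTABLE marker carried over from the original plpgsql version; a simple SQL function like this can be inlined into the calling query, so the per-row call overhead effectively disappears:

-- Sketch only: same arithmetic, declared IMMUTABLE as in the original
-- plpgsql definition, so the planner can fold it into the calling query.
CREATE OR REPLACE FUNCTION calc_test(a double precision, b integer, c integer)
RETURNS double precision AS $$
    SELECT $1 / $2 / $3 * 1000000000::double precision;
$$ LANGUAGE sql IMMUTABLE;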

Regards

Pavel Stehule


Re: Calculating statistic via function rather than with query is slowing my query

From:
Anish Kejariwal
Date:
Thanks Pavel! That definitely solved it.

Unfortunately, the function I gave you was a simplified, short version of what the actual function is going to be.  The actual function will take parameters, go through some if...else conditions based on them, and maybe even call another function.  Given that, I was definitely hoping to use plpgsql, so the overhead is unfortunate.

Is there any way to get around this overhead?  Will I still have the same overhead if I use plperl, plpython, pljava, or write the function in C?
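
One rough way to isolate the per-call overhead is a micro-benchmark over generated rows, comparing the function call against the equivalent inline expression (a sketch; generate_series stands in for mytable here):

-- Hypothetical micro-benchmark: per-row cost of the function call versus
-- the bare expression, over one million generated rows.
EXPLAIN ANALYZE
SELECT calc_test(s.x::double precision, 2, 3)
FROM generate_series(1, 1000000) AS s(x);

EXPLAIN ANALYZE
SELECT s.x::double precision / 2 / 3 * 1000000000::double precision
FROM generate_series(1, 1000000) AS s(x);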

Anish



Re: Calculating statistic via function rather than with query is slowing my query

From:
Pavel Stehule
Date:
2011/8/17 Anish Kejariwal <anishkej@gmail.com>:
> Thanks Pavel! that definitely solved it.
> Unfortunately, the function I gave you was a simple/short version of what
> the actual function is going to be.  The actual function is going to get
> parameters passed to it, and based on the parameters will go through some
> if...else conditions, and maybe even call another function.  Based on that,
> I was definitely hoping to use plpgsql, and the overhead is unfortunate.
> Is there any way to get around this overhead?  Will I still have the same
> overhead if I use plperl, plpython, pljava, or write the function in C?

Only SQL and C have zero overhead - SQL because it uses inlining, and C because it is just readable assembler.

I think the overhead of PL/pgSQL is the lowest of the languages on your list.
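
As a rough way to verify the inlining (a sketch against the original query), EXPLAIN VERBOSE shows the expanded arithmetic in the plan's output list once the SQL version of calc_test has been inlined:

-- Sketch: if calc_test has been inlined, the verbose plan lists the
-- expanded expression rather than a calc_test(...) call inside avg().
EXPLAIN VERBOSE
SELECT agg.primary_id,
       avg(calc_test(agg.a, agg.b, agg.c))
FROM mytable agg
GROUP BY agg.primary_id;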

Regards

Pavel


Re: Calculating statistic via function rather than with query is slowing my query

From:
Craig Ringer
Date:
On 18/08/2011 3:00 AM, Anish Kejariwal wrote:
> Thanks Pavel! that definitely solved it.
>
> Unfortunately, the function I gave you was a simple/short version of what the actual function is going to be.  The actual function is going to get parameters passed to it, and based on the parameters will go through some if...else conditions, and maybe even call another function.  Based on that, I was definitely hoping to use plpgsql, and the overhead is unfortunate.
>
> Is there any way to get around this overhead?  Will I still have the same overhead if I use plperl, plpython, pljava, or write the function in C?

You can probably still write it as an SQL function if you use CASE WHEN appropriately.
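
For illustration, a hedged sketch of that approach; the extra "mode" parameter and its branches are invented here purely as an example of replacing if...else logic with CASE WHEN in a plain (and therefore inlinable) SQL function:

-- Hypothetical example only: "mode" and its branches stand in for the real
-- conditions, which aren't shown in the thread.
CREATE OR REPLACE FUNCTION calc_test(a double precision, b integer, c integer,
                                     mode integer)
RETURNS double precision AS $$
    SELECT CASE
               WHEN $4 = 1 THEN $1 / $2 / $3 * 1000000000::double precision
               WHEN $4 = 2 THEN $1 / $2 * 1000000::double precision
               ELSE $1
           END;
$$ LANGUAGE sql IMMUTABLE;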

--
Craig Ringer

Re: Calculating statistic via function rather than with query is slowing my query

From:
Anish Kejariwal
Date:
Thanks for the help, Pavel and Craig.  I really appreciate it.  I'm going to try a few of these options (write a C function, use an SQL function with CASE statements, and try plperl), so I can see which gives me the realtime performance I need and which keeps the code cleanest in my particular case.

thanks!

Anish


Re: Calculating statistic via function rather than with query is slowing my query

From:
Craig Ringer
Date:
On 18/08/2011 9:03 AM, Anish Kejariwal wrote:
> Thanks for the help Pavel and Craig.  I really appreciate it.  I'm
> going to try a couple of these different options (write a c function,
> use a sql function with case statements, and use plperl), so I can see
> which gives me the realtime performance that I need, and works best
> for clean code in my particular case.
Do you really mean "realtime"? Or just "fast"?

If you have strongly bounded latency requirements, any SQL-based,
disk-based system is probably not for you. Especially not one that
relies on a statistics-based query planner, caching, and periodic
checkpoints. I'd be looking into in-memory databases designed for
realtime environments where latency is critical.

Hard realtime: If this system fails to respond within <x> milliseconds,
all the time, every time, then something will go "smash" or "boom"
expensively and unrecoverably.

Soft realtime: If this system responds late, the late response is
expensive or less useful. Frequent late responses are unacceptable but
the occasional one might be endurable.

Just needs to be fast: If it responds late, the user gets irritated
because they're sitting and waiting for a response. Regular long stalls
are unacceptable, but otherwise the user can put up with it. You're more
concerned with average latency than maximum latency.

--
Craig Ringer

Re: Calculating statistic via function rather than with query is slowing my query

From:
Anish Kejariwal
Date:
Hi Craig,

Fair point.  For now, I mean "just fast" - queries currently take 5-15 seconds, and I'd like to get that down to the 1-2 second range.

From the query I provided, I have approximately 30,000 unique keys (what I called primary_id) that I'm grouping by, and each key has a series of numerical values for each of the type_ids.  I'm looking at averages, stddev, and other statistics across a few hundred type_ids (where agg.type_id in ....).  The part of the query that varies is the user-specified type_ids, which makes it impossible to precalculate my statistics.

I'd like this to eventually scale to a million unique keys, and a thousand type_ids.

For now, Postgres has been great for modeling the data, understanding where I hit performance bottlenecks, and providing a fast enough user interface.  But I'm definitely starting to think about whether I can cache my data (with millions of keys and thousands of type_ids, the data might be too large), and whether to look into distributed databases (even though I can't precompute the stats, my queries are easily distributable across multiple processors, since each processor could take a batch of keys).  I might even want to consider a column-oriented database - since my keys don't change often, I could potentially add new columns when there are new type_ids.

I've been thinking of looking into memcached or HBase.  If you have any suggestions on which options I should explore, I'd greatly appreciate it.

Sorry for veering a bit off topic from Postgres.

thanks,
Anish