Discussion: Update with Subquery Performance

Update with Subquery Performance

From:
"Linux Guru"
Date:
We have a large data warehouse stored in Postgres, and temp tables are created based on user queries. The process of temp table creation involves selecting data from the main fact table; this includes several select and update statements, and one of the update statements is having performance issues.

The newly created temp table for this scenario contains 22712 rows. Here is the query:

alter table dummy add column gp numeric(40,15);

update dummy set gp = (
    select case when sum(temp.pd) <> 0
                then sum(temp.gd) / sum(temp.pd) * 100
                else 0 end
    from dummy as temp
    where temp.product = dummy.product);

Now this query basically updates the table using values from within itself in the subquery, but it takes far too much time, i.e. approx 5 minutes. The whole temp table creation process is stuck on this query (there are 4 additional such updates with the same problem). Index creation is useless here since it's only a one-time process.

Here is the stripped-down version (the part causing the performance issue) of the above query, i.e. only the select statement:
-------------------------------
select case when sum(temp.pd) <> 0 then sum(temp.gd)/sum(temp.pd)*100 else 0 end
from dummy as temp, dummy as temp2
where temp.product = temp2.product
group by temp.product

"HashAggregate  (cost=1652480.98..1652481.96 rows=39 width=39)"
"  ->  Hash Join  (cost=1636.07..939023.13 rows=71345785 width=39)"
"        Hash Cond: (("temp".product)::text = (temp2.product)::text)"
"        ->  Seq Scan on dummy "temp"  (cost=0.00..1311.03 rows=26003 width=39)"
"        ->  Hash  (cost=1311.03..1311.03 rows=26003 width=21)"
"              ->  Seq Scan on dummy temp2  (cost=0.00..1311.03 rows=26003 width=21)"
-------------------------------


What's the solution to this problem, or is there an alternate way to write this query?


Re: Update with Subquery Performance

From:
Tom Lane
Date:
"Linux Guru" <linux.binary@gmail.com> writes:
> We have a large data warehouse stored in Postgres, and temp tables are
> created based on user queries. The process of temp table creation involves
> selecting data from the main fact table; this includes several select and
> update statements, and one of the update statements is having performance
> issues.

Try ANALYZEing the temp table before the step that's too slow.

If that doesn't help, let's see EXPLAIN ANALYZE (not just EXPLAIN)
output.
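
For example (a minimal sketch, assuming the temp table is named dummy as in
the original query):

analyze dummy;   -- refresh planner statistics after the temp table is populated

-- note that EXPLAIN ANALYZE actually executes the statement it measures
explain analyze
update dummy set gp = (
    select case when sum(temp.pd) <> 0
                then sum(temp.gd) / sum(temp.pd) * 100
                else 0 end
    from dummy as temp
    where temp.product = dummy.product);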

            regards, tom lane

Re: Update with Subquery Performance

From:
"Scott Marlowe"
Date:
On Feb 11, 2008 5:06 AM, Linux Guru <linux.binary@gmail.com> wrote:
> We have a large data warehouse stored in Postgres, and temp tables are
> created based on user queries. The process of temp table creation involves
> selecting data from the main fact table; this includes several select and
> update statements, and one of the update statements is having performance
> issues.
>
> The newly created temp table for this scenario contains 22712 rows. Here is
> the query:
>
> alter table dummy add column gp numeric(40,15);
> update dummy set gp=(select (
> case when sum(temp.pd) <> 0 then sum(temp.gd)/sum(temp.pd)*100 else 0 end  )
> from dummy as temp
>  where temp.product=dummy.product)

Is this supposed to be updating every single row with one value?
Because I'm guessing it's running that subselect over and over instead
of one time.  I'm guessing that with more work_mem the planner might
use a more efficient plan.  Try adding

analyze;
set work_mem = 128000;

between the alter and update and see if that helps.
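
Concretely, the whole sequence would look something like this (a sketch
reusing the table and column names from the original post; with no unit
given, work_mem is taken as kB, so 128000 is roughly 125MB):

alter table dummy add column gp numeric(40,15);

analyze dummy;          -- refresh planner statistics for the temp table
set work_mem = 128000;  -- give hash-based plans more room in memory

update dummy set gp = (
    select case when sum(temp.pd) <> 0
                then sum(temp.gd) / sum(temp.pd) * 100
                else 0 end
    from dummy as temp
    where temp.product = dummy.product);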

Also, as Tom said, post explain analyze output of the statement.


>
> Now this query basically updates the table using values from within itself
> in the subquery, but it takes far too much time, i.e. approx 5 minutes. The
> whole temp table creation process is stuck on this query (there are 4
> additional such updates with the same problem). Index creation is useless
> here since it's only a one-time process.
>
> Here is the stripped-down version (the part causing the performance issue)
> of the above query, i.e. only the select statement:
> -------------------------------
> select (case when sum(temp.pd) <> 0 then sum(temp.gd)/sum(temp.pd)*100 else
> 0 end  )   from dummy as temp, dummy as temp2
>  where temp.product=temp2.product group by temp.product
>
> "HashAggregate  (cost=1652480.98..1652481.96 rows=39 width=39)"
> "  ->  Hash Join  (cost=1636.07..939023.13 rows=71345785 width=39)"
>  "        Hash Cond: (("temp".product)::text = (temp2.product)::text)"
> "        ->  Seq Scan on dummy "temp"  (cost=0.00..1311.03 rows=26003
> width=39)"
> "        ->  Hash  (cost=1311.03..1311.03 rows=26003 width=21)"
>  "              ->  Seq Scan on dummy temp2  (cost=0.00..1311.03 rows=26003
> width=21)"
> -------------------------------
>
>
> What's the solution to this problem, or is there an alternate way to write
> this query?

Re: Update with Subquery Performance

From:
"Linux Guru"
Date:
Analyzing did not help. Here is the output of EXPLAIN ANALYZE for the update query:

"Seq Scan on dummy  (cost=0.00..56739774.24 rows=23441 width=275) (actual time=18.927..577929.014 rows=22712 loops=1)"
"  SubPlan"
"    ->  Aggregate  (cost=2420.41..2420.43 rows=1 width=19) (actual time=25.423..25.425 rows=1 loops=22712)"
"          ->  Seq Scan on dummy "temp"  (cost=0.00..2416.01 rows=586 width=19) (actual time=0.049..17.834 rows=2414 loops=22712)"
"                Filter: ((product)::text = ($0)::text)"
"Total runtime: 578968.885 ms"


On Feb 11, 2008 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Linux Guru" <linux.binary@gmail.com> writes:
> We have a large data warehouse stored in Postgres, and temp tables are
> created based on user queries. The process of temp table creation involves
> selecting data from the main fact table; this includes several select and
> update statements, and one of the update statements is having performance
> issues.

Try ANALYZEing the temp table before the step that's too slow.

If that doesn't help, let's see EXPLAIN ANALYZE (not just EXPLAIN)
output.

                       regards, tom lane

Re: Update with Subquery Performance

From:
"Linux Guru"
Date:
See, it's calculating the sum by grouping on the product field. Here is an example:

Product    GP
-------    --
A          30
B          40
A          30
C          50
C          50

Now the query calculates the aggregated sums and their ratio grouped by product, so all A rows must get the same result, and likewise the B and C rows.
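
In other words, a plain grouped select computes each group's value exactly
once (a sketch of the aggregate that the update instead repeats per row):

select product,
       case when sum(pd) <> 0 then sum(gd)/sum(pd)*100 else 0 end as gp
from dummy
group by product;
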
> Is this supposed to be updating every single row with one value?
> Because I'm guessing it's running that subselect over and over instead
> of one time.
Yes, you are right that it's recalculating for every element in each group, i.e. GP(A) is calculated twice for the A rows, when it should only be calculated once per group. Is there any way to achieve this?

> analyze;
> set work_mem = 128000;
> between the alter and update and see if that helps.

That did not help.
> Also, as Tom said, post explain analyze output of the statement.

"Seq Scan on dummy  (cost=0.00..56739774.24 rows=23441 width=275) (actual time=18.927..577929.014 rows=22712 loops=1)"
"  SubPlan"
"    ->  Aggregate  (cost=2420.41..2420.43 rows=1 width=19) (actual time=25.423..25.425 rows=1 loops=22712)"
"          ->  Seq Scan on dummy "temp"  (cost=0.00..2416.01 rows=586 width=19) (actual time=0.049..17.834 rows=2414 loops=22712)"
"                Filter: ((product)::text = ($0)::text)"
"Total runtime: 578968.885 ms"

Thanks

On Feb 12, 2008 2:29 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Feb 11, 2008 5:06 AM, Linux Guru <linux.binary@gmail.com> wrote:
> We have a large data warehouse stored in Postgres, and temp tables are
> created based on user queries. The process of temp table creation involves
> selecting data from the main fact table; this includes several select and
> update statements, and one of the update statements is having performance
> issues.
>
> The newly created temp table for this scenario contains 22712 rows. Here is
> the query:
>
> alter table dummy add column gp numeric(40,15);
> update dummy set gp=(select (
> case when sum(temp.pd) <> 0 then sum(temp.gd)/sum(temp.pd)*100 else 0 end  )
> from dummy as temp
>  where temp.product=dummy.product)

Is this supposed to be updating every single row with one value?
Because I'm guessing it's running that subselect over and over instead
of one time.  I'm guessing that with more work_mem the planner might
use a more efficient plan.  Try adding

analyze;
set work_mem = 128000;

between the alter and update and see if that helps.

Also, as Tom said, post explain analyze output of the statement.


>
> Now this query basically updates the table using values from within itself
> in the subquery, but it takes far too much time, i.e. approx 5 minutes. The
> whole temp table creation process is stuck on this query (there are 4
> additional such updates with the same problem). Index creation is useless
> here since it's only a one-time process.
>
> Here is the stripped-down version (the part causing the performance issue)
> of the above query, i.e. only the select statement:
> -------------------------------
> select (case when sum(temp.pd) <> 0 then sum(temp.gd)/sum(temp.pd)*100 else
> 0 end  )   from dummy as temp, dummy as temp2
>  where temp.product=temp2.product group by temp.product
>
> "HashAggregate  (cost=1652480.98..1652481.96 rows=39 width=39)"
> "  ->  Hash Join  (cost=1636.07..939023.13 rows=71345785 width=39)"
>  "        Hash Cond: (("temp".product)::text = (temp2.product)::text)"
> "        ->  Seq Scan on dummy "temp"  (cost=0.00..1311.03 rows=26003
> width=39)"
> "        ->  Hash  (cost=1311.03..1311.03 rows=26003 width=21)"
>  "              ->  Seq Scan on dummy temp2  (cost=0.00..1311.03 rows=26003
> width=21)"
> -------------------------------
>
>
> What's the solution to this problem, or is there an alternate way to write
> this query?

Re: Update with Subquery Performance

From:
Tom Lane
Date:
"Linux Guru" <linux.binary@gmail.com> writes:
> Analyzing did not help. Here is the output of EXPLAIN ANALYZE for the update query:
> "Seq Scan on dummy  (cost=0.00..56739774.24 rows=23441 width=275) (actual
> time=18.927..577929.014 rows=22712 loops=1)"
> "  SubPlan"
> "    ->  Aggregate  (cost=2420.41..2420.43 rows=1 width=19) (actual time=
> 25.423..25.425 rows=1 loops=22712)"
> "          ->  Seq Scan on dummy "temp"  (cost=0.00..2416.01 rows=586
> width=19) (actual time=0.049..17.834 rows=2414 loops=22712)"
> "                Filter: ((product)::text = ($0)::text)"
> "Total runtime: 578968.885 ms"

Yeah, that's just not going to be fast.  An index on the product column
might help a bit, but the real issue is that you're repetitively
calculating the same aggregates.  I think you need a separate temp
table, along the lines of

create temp table dummy_agg as
  select product,
         (case when sum(pd) <> 0 then sum(gd)/sum(pd)*100 else 0 end) as s
  from dummy
  group by product;

create index dummy_agg_i on dummy_agg(product); -- optional

update dummy
  set gp= (select s from dummy_agg where dummy_agg.product = dummy.product);

The index would only be needed if you expect a lot of rows (lot of
different product values).
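
For reference, the same idea can also be written as a single statement using
PostgreSQL's UPDATE ... FROM join syntax (a sketch, equivalent in effect to
the two-step version above):

update dummy
  set gp = agg.s
  from (select product,
               (case when sum(pd) <> 0 then sum(gd)/sum(pd)*100 else 0 end) as s
        from dummy
        group by product) as agg
  where dummy.product = agg.product;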

            regards, tom lane

Re: Update with Subquery Performance

From:
"Linux Guru"
Date:
Yes, I also thought of this method and tested it before I got your mail, and this solution seems workable.

Thanks for the help

On Feb 12, 2008 9:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Linux Guru" <linux.binary@gmail.com> writes:
> Analyzing did not help. Here is the output of EXPLAIN ANALYZE for the update query:
> "Seq Scan on dummy  (cost=0.00..56739774.24 rows=23441 width=275) (actual
> time=18.927..577929.014 rows=22712 loops=1)"
> "  SubPlan"
> "    ->  Aggregate  (cost=2420.41..2420.43 rows=1 width=19) (actual time=
> 25.423..25.425 rows=1 loops=22712)"
> "          ->  Seq Scan on dummy "temp"  (cost=0.00..2416.01 rows=586
> width=19) (actual time=0.049..17.834 rows=2414 loops=22712)"
> "                Filter: ((product)::text = ($0)::text)"
> "Total runtime: 578968.885 ms"

Yeah, that's just not going to be fast.  An index on the product column
might help a bit, but the real issue is that you're repetitively
calculating the same aggregates.  I think you need a separate temp
table, along the lines of

create temp table dummy_agg as
 select product,
        (case when sum(pd) <> 0 then sum(gd)/sum(pd)*100 else 0 end) as s
 from dummy
 group by product;

create index dummy_agg_i on dummy_agg(product); -- optional

update dummy
 set gp= (select s from dummy_agg where dummy_agg.product = dummy.product);

The index would only be needed if you expect a lot of rows (lot of
different product values).

                       regards, tom lane