Re: weird execution plan
От | Huang, Suya |
---|---|
Тема | Re: weird execution plan |
Дата | |
Msg-id | D83E55F5F4D99B4A9B4C4E259E6227CD01EBF653@AUX1EXC02.apac.experian.local обсуждение исходный текст |
Ответ на | Re: weird execution plan (David G Johnston <david.g.johnston@gmail.com>) |
Список | pgsql-performance |
-----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of David G Johnston Sent: Saturday, September 13, 2014 7:34 AM To: pgsql-performance@postgresql.org Subject: Re: [PERFORM] weird execution plan >Not everyone does so its nice to make certain - especially since I'm not all that familiar with the code involved. Butsince no one else has answered I will theorize. > >SELECT count(*) FROM ( SELECT DISTINCT col FROM tbl ) > >vs > >SELECT count(DISTINCT col) FROM tbl > >The code for "SELECT DISTINCT col" is likely highly efficient because it works on complete sets of records. > >The code for "SELECT count(DISTINCT col)" is at a relative disadvantage since it must evaluate one row at a time and rememberwhether it had seen the same value previously before deciding whether to >increment a counter. > >With a large number of duplicate rows the process of making the row set smaller before counting the end result will performbetter since fewer rows must be evaluated in the less efficient count(DISTINCT) >expression - the time saved theremore than offset by the fact that you are effectively passing over that subset of the data a second time. > >HashAggregate(1M rows) + Aggregate(200k rows) < Aggregate(1M rows) > >David J. Thanks David! I am so surprised to the findings you put here. Just did an explain plan on the example you gave and pasted the result below,you're correct. "Select count(distinct col1)" is really a very common SQL statement we write daily, in Postgres, we need to rewrite it sothat the aggregate doesn't happen on a very large data sets... I am wondering if this is something to be improved fromthe optimizer ifself, instead of developers to rewrite SQL. Like having the optimizer just do the counting in the endinstead of doing it each time. I used Oracle before, never saw this issue... But really thank you for pointing this out, very valuable lesson-learnt in PG SQL writing for me and our developers. dev=# explain analyze select count(*) from (select distinct wid from terms_weekly) foo; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1278656.00..1278656.01 rows=1 width=0) (actual time=24316.335..24316.336 rows=1 loops=1) -> HashAggregate (cost=1278651.50..1278653.50 rows=200 width=42) (actual time=23899.916..24242.010 rows=1298124 loops=1) -> Append (cost=0.00..1171738.20 rows=42765321 width=42) (actual time=0.028..13631.898 rows=42765320 loops=1) -> Seq Scan on search_terms_weekly (cost=0.00..0.00 rows=1 width=516) (actual time=0.001..0.001 rows=0 loops=1) -> Seq Scan on search_terms_weekly_20140503 (cost=0.00..293352.90 rows=10702190 width=42) (actual time=0.026..2195.460rows=10702190 loops=1) -> Seq Scan on search_terms_weekly_20140510 (cost=0.00..298773.53 rows=10878953 width=42) (actual time=8.244..3163.087rows=10878953 loops=1) -> Seq Scan on search_terms_weekly_20140517 (cost=0.00..288321.17 rows=10537717 width=41) (actual time=7.345..2520.531rows=10537717 loops=1) -> Seq Scan on search_terms_weekly_20140524 (cost=0.00..291290.60 rows=10646460 width=41) (actual time=8.543..2693.833rows=10646460 loops=1) Total runtime: 24333.830 ms (9 rows) dev=# explain analyze select count(distinct wid) from terms_weekly; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1278651.50..1278651.51 rows=1 width=42) (actual time=585774.511..585774.511 rows=1 loops=1) -> Append (cost=0.00..1171738.20 rows=42765321 width=42) (actual time=0.019..10656.782 rows=42765320 loops=1) -> Seq Scan on search_terms_weekly (cost=0.00..0.00 rows=1 width=516) (actual time=0.002..0.002 rows=0 loops=1) -> Seq Scan on search_terms_weekly_20140503 (cost=0.00..293352.90 rows=10702190 width=42) (actual time=0.017..2225.397rows=10702190 loops=1) -> Seq Scan on search_terms_weekly_20140510 (cost=0.00..298773.53 rows=10878953 width=42) (actual time=0.009..2244.918rows=10878953 loops=1) -> Seq Scan on search_terms_weekly_20140517 (cost=0.00..288321.17 rows=10537717 width=41) (actual time=0.008..1822.088rows=10537717 loops=1) -> Seq Scan on search_terms_weekly_20140524 (cost=0.00..291290.60 rows=10646460 width=41) (actual time=0.006..1561.229rows=10646460 loops=1) Total runtime: 585774.568 ms (8 rows)
В списке pgsql-performance по дате отправления: