Re: Performance Issues with count()
From:        Grant Johnson
Subject:     Re: Performance Issues with count()
Date:
Msg-id:      3CC856B0.8060402@amadensor.com
In reply to: Re: Performance Issues with count() ("S Grannis" <sjg@email.com>)
List:        pgsql-general
Index the yb column. It might help the sort and group steps run better:

    create index data_table_yb on data_table (yb);
    vacuum analyze;
    EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
    SELECT yb, count(1) FROM data_table GROUP BY yb;

Also try count(*) instead of count(1); unlike count(column), count(*) counts every row outright, with no per-row expression to evaluate. (A short illustration follows at the end of this message.)

S Grannis wrote:

>Thanks for all of the useful comments.
>
>Per recommendation, I set enable_seqscan=0 and re-ran the query. The time to
>count() yb went from 2 hours 26 minutes to 2 hours 17 minutes. That variation
>of 9 minutes is likely related to the load on the machine at the time.
>
>Others have suggested the "fix" is in the future.
>
>Stephan Szabo wrote:
>"I didn't see anything about your settings in postgresql.conf,
>but increasing the sort_mem parameter may help that really
>expensive sort step. I think the desired fix for this would
>probably be the TODO entry on hash based aggregates but that's
>still in the future..."
>
>The "non-default" postgresql.conf settings are as follows:
>
>shared_buffers = 240000   # uses ~2GB of shared mem
>sort_mem = 512
>wal_files = 64
>enable_seqscan = 0        # per a recommendation
>enable_indexscan = true
>enable_tidscan = true
>enable_sort = true
>enable_nestloop = true
>enable_mergejoin = true
>enable_hashjoin = true
>
>I think our work-around for now will be to SELECT the column we wish to
>analyze into a flat file and then run a Perl script to do the actual counting.
>
>Thanks again for the feedback,
>
>Shaun Grannis
>
>----- Original Message -----
>From: Michael Loftis <mloftis@wgops.com>
>Date: Wed, 24 Apr 2002 10:14:04 -0700
>To: asdf asdasfa <sjg@email.com>
>Subject: Re: [GENERAL] Performance Issues with count()
>
>>Can you humour me and
>>set enable_seqscan=0
>>and retry the query?
>>
>>Thanks :)
>>
>>S Grannis wrote:
>>
>>>Hi,
>>>
>>>I've found some performance issues with Postgres that I'm hoping people on
>>>this list can help resolve. We're working with a 65 million record table
>>>that includes year of birth (data type INT). Counting the frequency of
>>>birth years in the table takes 2 hours 26 minutes. (There's an
>>>approximately 100-year range of dates in the 65 million records.)
>>>
>>># EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
>>>
>>>NOTICE:  QUERY PLAN:
>>>
>>>Aggregate  (cost=16397434.27..16723548.69 rows=6522288 width=4)
>>>  ->  Group  (cost=16397434.27..16560491.48 rows=65222884 width=4)
>>>        ->  Sort  (cost=16397434.27..16397434.27 rows=65222884 width=4)
>>>              ->  Seq Scan on data_table  (cost=0.00..2368620.84 rows=65222884 width=4)
>>>
>>>I can count data from the flat text data file with this Perl script:
>>>
>>>#!/usr/bin/perl
>>># script to count YB frequencies in flat data file
>>>open (IN, "$ARGV[0]");
>>>open (OUT, ">$ARGV[0]\_cnt");
>>>while (<IN>) {
>>>    chomp;
>>>    $years{$_}++;
>>>}
>>>foreach $key (keys %years) {
>>>    print OUT "$key,$years{$key}\n";
>>>}
>>>
>>>The Perl script takes *1 minute*, 31 seconds to run. Why is there such a
>>>discrepancy in times? I've noticed that the Postgres count() function takes
>>>what seems to be "longer than it should" in other cases as well. For
>>>instance, counting the frequency of last names in the same 65 million
>>>record table took *1 hour* and 31 minutes:
>>>
>>># EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
>>>
>>>NOTICE:  QUERY PLAN:
>>>
>>>Aggregate  (cost=19538149.27..19864263.69 rows=6522288 width=19)
>>>  ->  Group  (cost=19538149.27..19701206.48 rows=65222884 width=19)
>>>        ->  Sort  (cost=19538149.27..19538149.27 rows=65222884 width=19)
>>>              ->  Seq Scan on data_table  (cost=0.00..2324610.84 rows=65222884 width=19)
>>>
>>>The last name (ln) and year of birth (yb) columns are indexed, but that
>>>shouldn't matter because it's doing a sequential scan, correct? Am I
>>>running into the limitations of Postgres? We'd like to eventually get this
>>>system into production, but if we can't get Postgres to count() faster, we
>>>may not be able to use it.
>>>
>>>Here's the data_table schema:
>>>
>>># \d data_table
>>>
>>>      Table "data_table"
>>> Column |     Type      | Modifiers
>>>--------+---------------+-----------
>>> ss     | character(9)  |
>>> ln     | character(15) |
>>> fn     | character(15) |
>>> mi     | character(1)  |
>>> ns     | character(15) |
>>> lny    | character(15) |
>>> fny    | character(15) |
>>> sny    | character(15) |
>>> g      | character(1)  |
>>> mb     | integer       |
>>> db     | integer       |
>>> yb     | integer       |
>>> md     | integer       |
>>> dd     | integer       |
>>> yd     | integer       |
>>>Indexes: ssdi_ss_idx
>>>         ssdi_ln_idx
>>>
>>>We're running Postgres v7.2. The machine is a dual-processor Athlon MP 1900
>>>(Tyan Tiger board) with 3GB of PC2100 DDR RAM and three 80GB IBM 120GXP
>>>hard drives configured in a software RAID 0 array, running under Red Hat
>>>Linux 7.2.
>>>
>>>We've VACUUM ANALYZE'd the tables after creating the indices. Is there
>>>something I'm missing here?
>>>
>>>Thanks for your suggestions.
>>>
>>>Shaun Grannis
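To illustrate the count(*) point above: count(*) counts every row, while count(column) skips rows where that column is NULL, so the two agree only when the column has no NULLs. A minimal sketch against the thread's data_table (NULLs in yb are hypothetical here):

    SELECT count(*)  AS all_rows,     -- every row, NULLs included
           count(yb) AS yb_not_null   -- only rows where yb IS NOT NULL
    FROM data_table;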
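Stephan Szabo's sort_mem suggestion can also be tried per session, without editing postgresql.conf. sort_mem is measured in kilobytes, so the 512 shown in the settings above gives the sort step only half a megabyte for a 65-million-row sort. A sketch, assuming the machine has memory to spare (the 256MB figure is an illustrative guess, not a tested value):

    SET sort_mem = 262144;  -- 256MB, this session only; units are KB
    EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
    SELECT yb, count(1) FROM data_table GROUP BY yb;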
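The flat-file work-around Shaun describes can be driven from psql as well. COPY in 7.2 only handles whole tables, so one approach is to stage the column in a temporary table first (tmp_yb and yb.txt are illustrative names):

    SELECT yb INTO TEMP tmp_yb FROM data_table;
    \copy tmp_yb to yb.txt

The resulting file can then be fed to the Perl counting script quoted above.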