Re: Performance Issues with count()

From: Grant Johnson
Subject: Re: Performance Issues with count()
Date:
Msg-id: 3CC856B0.8060402@amadensor.com
In response to: Re: Performance Issues with count()  ("S Grannis" <sjg@email.com>)
List: pgsql-general
Index the yb column. It might help the sort and group steps run faster.


-- index the grouping column, then refresh planner statistics
create index data_table_yb on data_table (yb);
vacuum analyze;

-- confirm the planner picks up the index, then run the real query
EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;

SELECT yb, count(1) FROM data_table GROUP BY yb;



Also try count(*) instead; it counts every row outright, regardless of NULLs.
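For comparison, count(column) is the variant that does skip NULLs; a quick illustrative query (not from the original thread) shows all three side by side:

SELECT count(*), count(1), count(yb) FROM data_table;
-- count(*) and count(1) both count every row; count(yb) ignores rows where yb is NULL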


S Grannis wrote:

>Thanks for all of the useful comments.
>
>Per recommendation, I set enable_seqscan=0 and re-ran the query. The time to count() yb went from 2 hours 26 minutes
>to 2 hours 17 minutes. That variation of 9 minutes is likely related to the load on the machine at the time.
>
>Others have suggested the "fix" is in the future.
>
>Stephan Szabo wrote:
>"I didn't see anything about your settings in postgresql.conf,
>but increasing the sort_mem parameter may help that really
>expensive sort step. I think the desired fix for this would
>probably be the TODO entry on hash based aggregates but that's
>still in the future..."
>
>The "non-default" postgresql.conf settings are as follows:
>
>shared_buffers = 240000 # uses ~2GB of shared mem
>sort_mem = 512
>wal_files = 64
>enable_seqscan = 0      # per a recommendation
>enable_indexscan = true
>enable_tidscan = true
>enable_sort = true
>enable_nestloop = true
>enable_mergejoin = true
>enable_hashjoin = true
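Looking at those settings, sort_mem = 512 means each sort gets only 512 KB, which fits Stephan's point about the expensive sort step. A cheap experiment, before touching postgresql.conf, is to raise it for a single session and re-check the plan (the value below is only an example, not a figure from the thread):

SET sort_mem = 65536;  -- 64 MB per sort, this session only
EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;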
>
>I think our work-around for now will be to SELECT the column we wish to analyze into a flat file and then run a Perl
>script to do the actual counting.
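That extraction can be sketched from inside psql itself: \a switches to unaligned output, \t suppresses headers and the row-count footer, and \o redirects query results to a file (yb.txt is an arbitrary example name):

\a
\t
\o yb.txt
SELECT yb FROM data_table;
\o

That leaves one year of birth per line in yb.txt, which is the input the Perl script quoted further down expects.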
>
>Thanks again for the feedback,
>
>Shaun Grannis
>
>----- Original Message -----
>From: Michael Loftis <mloftis@wgops.com>
>Date: Wed, 24 Apr 2002 10:14:04 -0700
>To: asdf asdasfa <sjg@email.com>
>Subject: Re: [GENERAL] Performance Issues with count()
>
>
>>Can you humour me and
>>set enable_seqscan=0
>>and retry the query?
>>
>>Thanks :)
>>
>>S Grannis wrote:
>>
>>>Hi,
>>>
>>>I've found some performance issues with Postgres that
>>>I'm hoping people on this list can help resolve. We're
>>>working with a 65 million record table that includes year
>>>of birth (data type INT). Counting the frequency of dates
>>>in the table takes 2 hours 26 minutes. (There's an
>>>approximately 100-year range of dates in the 65 million
>>>records.)
>>>
>>># EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
>>>
>>>NOTICE:  QUERY PLAN:
>>>
>>>Aggregate  (cost=16397434.27..16723548.69 rows=6522288 width=4)
>>>  ->  Group  (cost=16397434.27..16560491.48 rows=65222884 width=4)
>>>        ->  Sort  (cost=16397434.27..16397434.27 rows=65222884 width=4)
>>>              ->  Seq Scan on data_table  (cost=0.00..2368620.84 rows=65222884 width=4)
>>>
>>>I can count data from the flat text data file with this
>>>Perl script:
>>>
>>>#!/usr/bin/perl
>>># Count YB frequencies in a flat data file; writes "<file>_cnt".
>>>open(IN, $ARGV[0]) or die "cannot read $ARGV[0]: $!";
>>>open(OUT, ">$ARGV[0]_cnt") or die "cannot write $ARGV[0]_cnt: $!";
>>>while (<IN>) {
>>>    chomp;
>>>    $years{$_}++;
>>>}
>>>foreach $key (keys %years) {
>>>    print OUT "$key,$years{$key}\n";
>>>}
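Assuming the flat file prepared earlier, the script runs as follows (the script name is made up for the example):

$ perl count_yb.pl yb.txt     # writes year,count pairs to yb.txt_cnt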
>>>
>>>The Perl script takes *1 minute*, 31 seconds to run.
>>>Why is there such a discrepancy in times? I've noticed
>>>that the Postgres count() function takes what seems to
>>>be "longer than it should" in other cases as well. For
>>>instance, counting the frequency of last names in the
>>>same 65 million record table took *1 hour* and 31
>>>minutes:
>>>
>>># EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
>>>NOTICE:  QUERY PLAN:
>>>
>>>Aggregate  (cost=19538149.27..19864263.69 rows=6522288 width=19)
>>>  ->  Group  (cost=19538149.27..19701206.48 rows=65222884 width=19)
>>>        ->  Sort  (cost=19538149.27..19538149.27 rows=65222884 width=19)
>>>              ->  Seq Scan on data_table  (cost=0.00..2324610.84 rows=65222884 width=19)
>>>
>>>The last name (ln) and the year of birth (yb) columns are
>>>indexed, but that shouldn't matter because it's doing a
>>>sequential scan, correct? Am I running into the
>>>limitations of Postgres? We'd like to eventually get this
>>>system into production, but if we can't get Postgres to
>>>count() faster, we may not be able to use it.
>>>
>>>Here's the data_table schema:
>>>
>>># \d data_table
>>>
>>>Table "data_table"
>>> Column | Type          | Modifiers
>>>--------+---------------+-----------
>>>     ss | character(9)  |
>>>     ln | character(15) |
>>>     fn | character(15) |
>>>     mi | character(1)  |
>>>     ns | character(15) |
>>>    lny | character(15) |
>>>    fny | character(15) |
>>>    sny | character(15) |
>>>      g | character(1)  |
>>>     mb | integer       |
>>>     db | integer       |
>>>     yb | integer       |
>>>     md | integer       |
>>>     dd | integer       |
>>>     yd | integer       |
>>>Indexes: ssdi_ss_idx
>>>         ssdi_ln_idx
>>>
>>>We're working with Postgres v 7.2. The machine is a
>>>dual-processor Athlon MP1900 (Tyan Tiger board) with
>>>3GB of PC2100 DDR RAM, and 3-80GB IBM 120GXP hard drives
>>>configured in a software RAID 0 Array running under
>>>RedHat Linux v. 7.2.
>>>
>>>We've VACUUM ANALYZE'd the tables after creating the
>>>indices. Is there something I'm missing here?
>>>
>>>Thanks for your suggestions.
>>>
>>>Shaun Grannis
>>>


