Performance Issues

From: Shaun Grannis
Subject: Performance Issues
Date:
Msg-id: OE65WiqJmdTcjJxWY5p00000c87@hotmail.com
Replies: Re: Performance Issues  (Bruno Wolff III <bruno@wolff.to>)
Re: Performance Issues  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
Re: Performance Issues  (Jean-Michel POURE <jm.poure@freesurf.fr>)
List: pgsql-general

Hi,

I've found some performance issues with Postgres that I'm hoping people on this list can help resolve. We're working with a 65-million-record table that includes year of birth (data type INT). A query that counts the frequency of birth years in that table takes 2 hours 26 minutes to execute. (There's an approximately 100-year range of years in the 65 million records.)

 

# EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
NOTICE:  QUERY PLAN:

Aggregate  (cost=16397434.27..16723548.69 rows=6522288 width=4)
  ->  Group  (cost=16397434.27..16560491.48 rows=65222884 width=4)
        ->  Sort  (cost=16397434.27..16397434.27 rows=65222884 width=4)
              ->  Seq Scan on data_table  (cost=0.00..2368620.84 rows=65222884 width=4)
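The plan sorts all 65 million rows before grouping, so one thing I plan to try (a sketch, not a tested fix: I'm assuming the sort is spilling to disk because sort_mem, which defaults to only a few hundred KB in 7.2, is far too small) is giving the sort more memory for the session before re-running the query:

```sql
-- sort_mem is specified in kilobytes in PostgreSQL 7.2.
-- 262144 (256 MB) is an illustrative value for a 3 GB machine,
-- not a recommendation.
SET sort_mem = 262144;
SELECT yb, count(1) FROM data_table GROUP BY yb;
```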

 

I can count data from the flat text data file with this Perl script:

 

   #!/usr/bin/perl
   # Count YB (year of birth) frequencies in a flat data file,
   # one YB value per line; writes "<year>,<count>" lines.
   use strict;
   use warnings;

   open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
   open(my $out, '>', "$ARGV[0]_cnt") or die "Cannot open $ARGV[0]_cnt: $!";

   my %years;
   while (<$in>) {
       chomp;
       $years{$_}++;
   }
   foreach my $key (keys %years) {
       print $out "$key,$years{$key}\n";
   }
   close($in);
   close($out);

 

The Perl script takes *1 minute*, 31 seconds to run.  Why is there such a discrepancy in times?  I’ve noticed that the Postgres count() function takes what seems to be “longer than it should” in other cases as well.  For instance, counting the frequency of last names in the same 65 million record table took *1 hour* and 31 minutes:

 

# EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
NOTICE:  QUERY PLAN:

Aggregate  (cost=19538149.27..19864263.69 rows=6522288 width=19)
  ->  Group  (cost=19538149.27..19701206.48 rows=65222884 width=19)
        ->  Sort  (cost=19538149.27..19538149.27 rows=65222884 width=19)
              ->  Seq Scan on data_table  (cost=0.00..2324610.84 rows=65222884 width=19)

 

The last name (ln) and year of birth (yb) columns are indexed, but that shouldn't matter because the planner is doing a sequential scan, correct? Am I running into the limitations of Postgres? We'd like to eventually get this system into production, but if we can't get Postgres to count() faster, we may not be able to use it.
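One workaround I'm considering (a sketch only, relying on there being only about 100 distinct birth years in the data): issue one indexed count per year instead of one giant sort, e.g.:

```sql
-- Assumption: ~100 distinct yb values, and an index on yb
-- that the planner can use for an equality probe
SELECT count(*) FROM data_table WHERE yb = 1950;
SELECT count(*) FROM data_table WHERE yb = 1951;
-- ...one query per year in the range
```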

 

Here’s the data_table schema:

 

# \d data_table
         Table "data_table"
 Column |     Type      | Modifiers
--------+---------------+-----------
 ss     | character(9)  |
 ln     | character(15) |
 fn     | character(15) |
 mi     | character(1)  |
 ns     | character(15) |
 lny    | character(15) |
 fny    | character(15) |
 sny    | character(15) |
 g      | character(1)  |
 mb     | integer       |
 db     | integer       |
 yb     | integer       |
 md     | integer       |
 dd     | integer       |
 yd     | integer       |
Indexes: ssdi_ss_idx
         ssdi_ln_idx

 

 

We're working with Postgres v7.2. The machine is a dual-processor Athlon MP 1900 (Tyan Tiger board) with 3 GB of PC2100 DDR RAM and three 80 GB IBM 120GXP hard drives in a software RAID 0 array, running Red Hat Linux 7.2.

 

We’ve VACUUM ANALYZE’d the tables after creating the indices.  Is there something I’m missing here?

 

Thanks for your suggestions.

In the pgsql-general list, by sending date:

Previous
From: Gregory Seidman
Date:
Message: Re: pid gets overwritten in OSX
Next
From: Uros Gruber
Date:
Message: Help with performance