Re: Why does adding SUM and GROUP BY destroy performance?

From: Christopher Browne
Subject: Re: Why does adding SUM and GROUP BY destroy performance?
Date:
Msg-id: 60smmuylkj.fsf@dev6.int.libertyrms.info
In reply to: Re: Why does adding SUM and GROUP BY destroy performance?  (David Link <dvlink@yahoo.com>)
List: pgsql-general
dvlink@yahoo.com (David Link) writes:
> Thanks Ron, Thanks Christopher for your excellent feedback.
>
> I guess it's back to the drawing board.  This is a very late hour
> business requirement change.  And we need quick real-time results.
>
> Two things are being considered:
>
>    1. loading the non-aggregate query entirely into memory (using Perl
> CGI, currently, but looking at the possibility of doing this in the
> database with either PL/Perl or PL/pgSQL, though I don't know what
> would be gained by doing it that way), and handling the summing and
> then the sort ourselves in the program, -- or --

I wouldn't think that this would actually help, either way.

The problem with the aggregates is that it requires that the
individual records all get visited, and the stats added together.

Suppose there are a million records, each 100 bytes in size.

Doing the aggregate on the database means reading through 1M records
and having the DBMS summarize them.  That's pretty expensive.

If you push the work into an in-memory Perl program, that doesn't
help; it means that each time it runs, it has to pull 100,000,000
bytes of data across the network connection, in addition to all the
other work.  That should be SLOWER than using the aggregate.

I think you may be misunderstanding the nature of the problem.  It's
not that aggregates are inherently "performance destroying"; it's that
they require walking through all of the data, and that is not cheap
even though the result set may be quite small.

>    2. preprocessing a lot of the data at pre-determined times.
> Essentially doubling the size of our database.
>
> I'd be open to any other suggestions.

Well, if you have a field that allows handling it in a time-based
manner, and the aggregate is static/well-defined, you might create a
table that is used to collect daily summaries.

Every new day, you would do something like:

-- use a half-open range so a row stamped exactly at midnight
-- is not counted in two adjacent days' summaries
insert into summary_table (date, gparameter, count)
  (select '2003-09-16'::date, gparameter, count(*)
      from detail_table
      where trans_on >= '2003-09-16' and trans_on < '2003-09-17'
      group by gparameter);

Then set up a view...

create view summary as
  select gparameter, sum(count) as count from
    ((select gparameter, count from summary_table)  UNION ALL
     (select gparameter, count(*) from detail_table
      where trans_on >= '2003-09-17' group by gparameter)) as combined
  group by gparameter;

I am assuming that there is an index on trans_on; you might want to
handle that a bit differently.

I leave, as exercise for the reader, what to do about the '2003-09-17'
in the view.  That probably needs to be more dynamic.  Or perhaps not;
perhaps the process that augments the summary table would drop and
recreate the view.
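One way to avoid the literal date entirely is to keep the cutover
point in a tiny table that the nightly job updates.  A sketch, using a
hypothetical one-row table `summary_cutoff` (not from the original
post; column names are assumptions):

```sql
-- hypothetical helper table, maintained by the nightly summarization job
create table summary_cutoff (cutoff timestamp not null);
insert into summary_cutoff values ('2003-09-17');

-- the view joins against the cutoff instead of embedding a literal,
-- so only the nightly job's UPDATE of summary_cutoff moves the boundary
create view summary as
  select gparameter, sum(count) as count from
    ((select gparameter, count from summary_table) UNION ALL
     (select d.gparameter, count(*)
        from detail_table d, summary_cutoff c
       where d.trans_on >= c.cutoff
       group by d.gparameter)) as combined
  group by gparameter;
```

The nightly job then does its INSERT into summary_table and bumps the
cutoff in the same transaction, so the view never double-counts or
drops a day.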

Another improvement would be for the hourly/daily process to further
collapse summary_table so it only has one entry per 'gparameter'
grouping.
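That collapse step might look something like this (a sketch, assuming
the summary_table layout used above; the transaction keeps readers of
the view from seeing a half-collapsed table):

```sql
begin;

-- roll the per-day rows up into one row per gparameter,
-- keeping the earliest date as a marker of the covered range
create temp table collapsed as
  select min(date) as date, gparameter, sum(count) as count
    from summary_table
   group by gparameter;

delete from summary_table;
insert into summary_table
  select date, gparameter, count from collapsed;

commit;
```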
--
"cbbrowne","@","libertyrms.info"
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
