Re: Strategy for doing number-crunching

Поиск
Список
Период
Сортировка
От Sean Davis
Тема Re: Strategy for doing number-crunching
Дата
Msg-id CANeAVB=9dessVbx=p9TCtggZuj=LaGvDZqhKT_zV4zHS5tO7XQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Strategy for doing number-crunching  ("ktm@rice.edu" <ktm@rice.edu>)
Ответы Re: Strategy for doing number-crunching  (Joe Conway <mail@joeconway.com>)
Список pgsql-novice
On Wed, Jan 4, 2012 at 11:58 AM, ktm@rice.edu <ktm@rice.edu> wrote:
> On Wed, Jan 04, 2012 at 11:55:38AM -0500, Sean Davis wrote:
>> On Wed, Jan 4, 2012 at 11:36 AM, Matthew Foster <matthew.foster@noaa.gov> wrote:
>> > We have a database with approximately 130M rows, and we need to produce
>> > statistics (e.g. mean, standard deviation, etc.) on the data.  Right now,
>> > we're generating these stats via a single SELECT, and it is extremely
>> > slow...like it can take hours to return results.  The queries are variable,
>> > so there is no way I can think of to "pre-calculate" the stats.
>> >
>> > Here is an example query...
>> >
>> > SELECT count(mae) AS count, to_char(avg(mae), 'FM990.999') AS mae,
>> > to_char(avg(sqrt(rms)), 'FM990.999') AS rms, to_char(avg(bias), 'FM990.999')
>> > AS bias,
>> > to_char(max(mae), 'FM990.999') AS mae_max, to_char(min(mae), 'FM990.999') AS
>> > mae_min,
>> > to_char(max(bias), 'FM990.999') AS bias_max, to_char(min(bias), 'FM990.999')
>> > AS bias_min,
>> > to_char(max(sqrt(rms)), 'FM990.999') AS rms_max, to_char(min(sqrt(rms)),
>> > 'FM990.999') AS rms_min,
>> > to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev,
>> > to_char(stddev(mae), 'FM990.999') AS mae_stddev,
>> > to_char(stddev(bias), 'FM990.999') AS bias_stddev
>> > FROM verify_data.verification_data
>> > WHERE model_name='foo'
>> > AND...several other conditions...
>> >
>> > As one might imagine, this query pounds the CPU relentlessly for hours.
>> >
>> > I'm thinking there must be a better approach for doing large amounts of
>> > calculations on the data.  Any pointers would be greatly appreciated!
>>
>> I'm assuming that someone will point out a quicker way with postgres,
>> so I'll mention another solution.
>>
>> It may be quicker to pull the data out of the database and calculate
>> in-memory using something like R (take a look at the ff package if
>> memory is an issue).  In R, a numeric vector of length 130M uses about
>> 1Gb (a little less).  Calculating both a mean and stddev on this
>> vector takes a total of 750ms on my Mac laptop.
>>
>> Sean
>>
>
> Maybe pl/R would be an option as well.

Good point.  I do not know what the performance is for these types of
aggregates using Pl/R.  Perhaps Joe Conway or others could comment.

Sean

В списке pgsql-novice по дате отправления:

Предыдущее
От: "ktm@rice.edu"
Дата:
Сообщение: Re: Strategy for doing number-crunching
Следующее
От: "Jean-Yves F. Barbier"
Дата:
Сообщение: Re: Strategy for doing number-crunching