Discussion: How to avoid "Out of memory" using aggregate functions?


How to avoid "Out of memory" using aggregate functions?

From: Mark Dalphin
Date:
Hi,

I am working with a simple table similar to the following:

CREATE TABLE Seq (
    SeqID      int8 PRIMARY KEY,
    FileID     int2 NOT NULL,
    FileOffset int8 NOT NULL,
    SeqLength  int4,
    FOREIGN KEY (FileID) REFERENCES File (FileID)
);

CREATE INDEX Seq_SeqLength ON Seq (SeqLength);

The table contains facts about a DNA sequence (eg its length) and how to access
that sequence from an ordinary Unix file (FileID and FileOffset) where it is
concatenated with all the other DNA sequences.

Currently the TABLE Seq contains 4,586,379 (times 2) rows and the overall size
of the Unix file is 3,660,117,293 bytes (ie I expect sum(SeqLength) to be in the
region of 3.6 Gbytes).

I want to use the PostgreSQL aggregate functions to learn more about the
sequences.

The following query works well and quickly (I need to cast as int8 to avoid
overflow of the sum(), since the expected total of ~3.66e9 exceeds the int4
maximum of 2,147,483,647, and perhaps for avg()):

SELECT count(seqlength::int8),
       min(seqlength::int8),
       max(seqlength::int8),
       avg(seqlength::int8),
       sum(seqlength::int8)
FROM Seq
WHERE FileID=4 AND SeqLength>1000;

An EXPLAIN of the above query returns the following:
NOTICE:  QUERY PLAN:

Aggregate  (cost=202000.62 rows=918779 width=4)
  ->  Index Scan using seq_seqlength on seq  (cost=202000.62 rows=918779 width=4)

EXPLAIN


If I remove the SeqLength constraint, the query takes forever to execute and
then crashes with an out-of-memory error (on my SGI, I have 256 Mbyte RAM and
equal sized swap space; I run out of swap space during the query, which makes
a very unpleasant noise).

EXPLAIN
SELECT count(seqlength::int8),
       min(seqlength::int8),
       max(seqlength::int8),
       avg(seqlength::int8),
       sum(seqlength::int8)
FROM Seq
WHERE FileID=4;

NOTICE:  QUERY PLAN:

Aggregate  (cost=397916.69 rows=2756337 width=4)
  ->  Seq Scan on seq  (cost=397916.69 rows=2756337 width=4)

EXPLAIN

I do not understand how to interpret this Query Plan, except that I am doing a
sequential scan of the DB (hence slow), but, since all the elements requested
are aggregates, why do I run out of memory? I am not trying to load the whole
table into memory at once, am I?  Suggestions for how to make this query run
would be welcome; my database is rapidly growing larger!

Thanks,
Mark

--
Mark Dalphin                          email: mdalphin@amgen.com
Mail Stop: 29-2-A                     phone: +1-805-447-4951 (work)
One Amgen Center Drive                       +1-805-375-0680 (home)
Thousand Oaks, CA 91320                 fax: +1-805-499-9955 (work)





Re: [SQL] How to avoid "Out of memory" using aggregate functions?

From: Tom Lane
Date:
Mark Dalphin <mdalphin@amgen.com> writes:
> ... since all the elements requested
> are aggregates, why do I run out of memory?

The basic problem here is that Postgres' expression evaluator leaks
memory when dealing with expressions that use pass-by-reference data
types (which means practically any type other than bool, int2, int4).
The memory is reclaimed at the end of the query ... but that doesn't
help you if your allowable swap space fills with int8 values before
you get to the end :-(.
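To make the distinction concrete, here is an illustrative pair of queries
against the Seq table from the original post (the per-row allocation detail
is an inference about the 6.5 executor, not something stated above):

    -- int4 input is pass-by-value: no per-row allocation, no leak.
    SELECT max(seqlength) FROM Seq WHERE FileID = 4;

    -- int8 input is pass-by-reference: the cast allocates a fresh
    -- int8 Datum for every row, and 6.5 does not free them until
    -- the query ends.
    SELECT max(seqlength::int8) FROM Seq WHERE FileID = 4;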

Fixing this is on the TODO list.  I have made a start on it in current
sources: aggregate functions applied to simple field references do not
leak memory anymore.  That won't help your query as written:

SELECT min(seqlength::int8) ...

since cast-to-int8 is a runtime type conversion expression and the
result of that expression evaluation will be leaked.  But perhaps you
can store seqlength as an int8 field to begin with, and skip the
conversion.  If so, and if you're handy with C, you might want to look
at src/backend/executor/nodeAgg.c in a current snapshot and see whether
it's practical to back-patch the memory-releasing logic into 6.5 ...
or just run the current snapshot if you're feeling adventurous ...
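For instance, a minimal migration sketch (the column name SeqLength8 is
invented here, and this assumes you can afford the extra column while
migrating):

    -- Store the length as int8 up front, so the aggregate input is a
    -- simple field reference with no per-row cast.
    ALTER TABLE Seq ADD COLUMN SeqLength8 int8;
    UPDATE Seq SET SeqLength8 = SeqLength::int8;

    -- With the patched nodeAgg.c, this form no longer leaks:
    SELECT min(SeqLength8), max(SeqLength8), sum(SeqLength8)
    FROM Seq
    WHERE FileID = 4;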

Of course the whole problem only comes up because avg() and sum() use
accumulators of the same width as the source data type, so they are
uncomfortably prone to overflow on large tables.  They really should
be modified to use more reasonable choices of accumulation data type;
probably float8 for all datatypes except int8 and numeric, and numeric
for those two would work OK.  This wasn't practical in prior releases
because a float8 accumulator would leak memory, but I think it would
be a good change to make for 7.0.
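As a concrete illustration of the overflow risk, using the numbers from the
original post (an expected total around 3.66e9 against an int4 limit of
2,147,483,647):

    -- sum() over an int4 column accumulates in int4 in 6.5, so the
    -- ~3.66e9 total wraps past the 2^31-1 limit:
    SELECT sum(seqlength) FROM Seq WHERE FileID = 4;        -- overflows

    -- Casting the input widens the accumulator to int8:
    SELECT sum(seqlength::int8) FROM Seq WHERE FileID = 4;  -- correct, but leaks in 6.5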

Anyway, to get back to your immediate problem, what you probably want to
do while using 6.5 is tweak the query to minimize the problem as much as
you can.  First off, I see no reason to cast the inputs of count(),
min(), or max() to int8; leaving those inputs as the int4 field value
should work just as well and not leak memory.  You don't have much
choice for either avg() or sum(), but perhaps not trying to evaluate
both in one query will bring the memory usage down to where you can
live with it.  If not, the only answer I can see is to do sum()s over
subsets of the table and save the results to be added and averaged
later (hack hack ...)
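A sketch of both tweaks (the table name partial_sums and the SeqID bucket
boundaries are invented for illustration):

    -- Tweak 1: leave count/min/max as int4 (no leak), and run the
    -- leak-prone sum separately:
    SELECT count(seqlength), min(seqlength), max(seqlength)
    FROM Seq WHERE FileID = 4;

    SELECT sum(seqlength::int8) FROM Seq WHERE FileID = 4;

    -- Tweak 2: if even a lone sum() exhausts swap, accumulate partial
    -- sums over SeqID ranges and combine them afterwards:
    CREATE TABLE partial_sums (n int8, s int8);
    INSERT INTO partial_sums
        SELECT count(seqlength)::int8, sum(seqlength::int8)
        FROM Seq WHERE FileID = 4 AND SeqID < 1000000;
    INSERT INTO partial_sums
        SELECT count(seqlength)::int8, sum(seqlength::int8)
        FROM Seq WHERE FileID = 4 AND SeqID >= 1000000 AND SeqID < 2000000;
    -- ...one INSERT per range, then combine (integer division here
    -- truncates the average):
    SELECT sum(s) AS total, sum(s) / sum(n) AS average FROM partial_sums;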
        regards, tom lane