Re: Time Series on Postgres (HOWTO?)

Поиск
Список
Период
Сортировка
От Alban Hertroys
Тема Re: Time Series on Postgres (HOWTO?)
Дата
Msg-id 335844CB-E6BC-43E8-B436-CAB53A51C696@solfertje.student.utwente.nl
обсуждение исходный текст
Ответ на Time Series on Postgres (HOWTO?)  (bubba postgres <bubba.postgres@gmail.com>)
Список pgsql-general
On 15 Jan 2011, at 1:41, bubba postgres wrote:

> I've been googling, but haven't found a good answer to what I should do if I want to store time series in Postgres.
> My current solution is store serialized (compressed) blobs of data.
> (So for example store 1 day worth of 1 minute samples (~1440 samples) stored as one row in a bytea. (Plus meta data)
> It would be nice if I could use 1 sample per column,(because updating individual columns/samples is clear to me) but
postgresdoesn't compress the row (which is bad because of high amount of repetitive data.. Easily 10X bigger. 

Not an easy problem to solve without knowing more about your data, like what the nature of this repetition is, how
repetitiveit is, etc. It also much depends on how you intend to use this data later on. Are there gaps in your data
ranges?How accurate does your data need to be stored? Etc. 

For a decision you're probably the only person who'll be close enough to the data and how it will be used.

That said, assuming your data has significant _sequential_ repetition (That's just how I would phrase it; meaning that
arange of values in sequence are the same), it may be an idea to just store differences larger than 0 and the length of
asample of equal values. If such ranges are small, this would just add overhead though. 

You could extend that approach by fitting curves to ranges of values with a similar tendency and store those curves
instead.Retrieving single values from those is of course a little harder, plus you probably will get some inaccuracy
whenyou can't fit curves exactly. 

Hard to say how effective this would be though.

> I've been considering a Double[] array, which would get compressed, but before I start down that path (I suppose I
needto make some storedprocs to update individual samples), has anyone built anything like this? Any open source
projectsI should look at? 


This is the safest approach I think. You don't have to rely on any assumptions on how your data behaves over time. It's
alsoquite predictable in terms of database and storage requirements, you won't run into surprises here. Just the usual
time-relatedstuff (DST-changes and such). 

One change I'd probably make is to store them like this:
CREATE TABLE sample (
    start    timestamp,
    length    int,
    values    double[]
);

That way you don't need to reserve space for longer gaps. For example, if a day ends "early" because data stopped
comingin, you can just store a shorter day (length < 1440 minutes), or if you missed data at the start of the day you
canmake it start later (once the first sample arrives). 

Querying them isn't too tough either, although determining the array index you need based on a timestamp is somewhat
trickyI just found out. 
Finding the record is pretty easy though, that's just:

    WHERE now() BETWEEN start AND start + length * interval '1 minute';

I'd advise putting an index on the latter formula ;)


There is one possible problem I'm seeing with this approach though: Updating values requires the whole row to be
rewritten(MVCC). 
If your arrays are wide enough, then updating those rows will take longer and longer because the amount of data that
needsto be written each time is increasing. 
Seeing that your data-samples apparently arrive at 1-minute intervals you'll probably be safe, but if you're going to
addsignificant processing around updating a row, then you could run into a race where the previous version of the
recordis still being processed. 
There are ways around that, but they only complicate matters more and probably hurt performance, so that's probably
bestleft for if it's needed at all. 

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,4d318b0811872055413410!



В списке pgsql-general по дате отправления:

Предыдущее
От: Tim Uckun
Дата:
Сообщение: Re: Adding an "and is not null" on an indexed field slows the query down immensely.
Следующее
От: pasman pasmański
Дата:
Сообщение: Re: Adding an "and is not null" on an indexed field slows the query down immensely.