Using Postgres to store high volume streams of sensor readings

From: Ciprian Dorin Craciun
Subject: Using Postgres to store high volume streams of sensor readings
Date:
Msg-id: 8e04b5820811210450gce53e33vc7b693d15b1f78e0@mail.gmail.com
Replies: Re: Using Postgres to store high volume streams of sensor readings  (Gerhard Heift <ml-postgresql-20081012-3518@gheift.de>)
Re: Using Postgres to store high volume streams of sensor readings  (Rafael Martinez <r.m.guerrero@usit.uio.no>)
Re: Using Postgres to store high volume streams of sensor readings  (Sam Mason <sam@samason.me.uk>)
Re: Using Postgres to store high volume streams of sensor readings  ("Diego Schulz" <dschulz@gmail.com>)
List: pgsql-general
    Hello all!

    I would like to ask for some advice about the following problem
(related to the Dehems project: http://www.dehems.eu/ ):
    * there are some clients; (the clients are in fact households;)
    * each client has a number of sensors (about 10), and not all the
clients have the same sensors; also, sensors might appear and disappear
dynamically; (the sensors are appliances;)
    * for each client and each sensor a reading is produced (about
every 6 seconds); (the values could be power consumptions;)
    * I would like to store the following data: (client, sensor,
timestamp, value);
    * the usual queries are:
        * for a given client (and sensor), and time interval, I need
the min, max, and avg of the values;
        * for a given time interval (and sensor), I need the min, max,
and avg of the values;
        * other statistics; (a sketch of the first query follows this list;)
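
    (As a concrete example, here is a minimal sketch of the first
query against the schema given further below; the client / sensor ids
and the interval bounds are just placeholders:)

> select min(value), max(value), avg(value)
>     from sds_benchmark_data
>     where client = 42 and sensor = 3
>         and timestamp between 1227000000 and 1227600000;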

    Currently I'm benchmarking the following storage solutions for this:
    * Hypertable (http://www.hypertable.org/) -- which has a good
insert rate (about 250k inserts / s), but a slow read rate (about 150k
reads / s); (the aggregates are computed manually, as Hypertable does
not support any queries other than scanning (in fact min and max are
easy, being the first / last key in the ordered set, but avg must be
done by a sequential scan);)
    * BerkeleyDB -- a quite OK insert rate (about 50k inserts / s), but
a fabulous read rate (about 2M reads / s); (the same issue with
aggregates;)
    * Postgres -- which behaves quite poorly (see below)...
    * MySQL -- next to be tested;

    So what can I do / how could I optimize the use of Postgres for this usage?

    (I'm aware that there could be optimizations for this problem
(like computing the aggregates in memory and storing only these
aggregates every 10 minutes, or at some other interval), but I want to
keep the full history (for data mining tasks, for example);)
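
    (For reference, such a rollup could look something like the
sketch below; it assumes the int4 timestamp holds seconds, so a
10-minute bucket is 600 units wide:)

> select client, sensor, timestamp / 600 as bucket,
>        min(value), max(value), avg(value)
>     from sds_benchmark_data
>     group by client, sensor, bucket;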

    I would also like to describe how the Postgres benchmark was implemented:

    * test machine: Linux (Ubuntu 8.04 x64), IBM x3750, 2 x 500GB
SCSI disks in RAID 0, 8GB RAM, 2 x Intel Xeon 5160 (dual core, 3GHz,
4MB L2);
    * Postgres version: 8.3.3;

    * database schema:
> create table sds_benchmark_data (
>     client int4 not null,    -- the household
>     sensor int4 not null,    -- the appliance
>     timestamp int4 not null, -- the reading time
>     value int4               -- e.g. the power consumption
> );
>
> alter table sds_benchmark_data add primary key (client, sensor, timestamp);
>
> create index sds_benchmark_data__client_sensor__index on sds_benchmark_data (client, sensor);

    * postgresql.conf (the default values; I will list only what has
been changed):
> max_connections = 20
> shared_buffers = 24MB
> work_mem = 64MB
> maintenance_work_mem = 256MB
> fsync = off
> checkpoint_segments = 32
> effective_cache_size = 1024MB

    * the inserts are done like this:
        * I generated 100 million readings by using the following rule:
            * the client is randomly chosen between 0 and 10 thousand;
            * the sensor is randomly chosen between 0 and 10;
            * the timestamp always increases by one;
        * the inserts are done in batches of 500 thousand (I've also
tried 5, 25, 50 and 100 thousand, without much impact);
        * the batch inserts are done through COPY sds_benchmark_data
FROM STDIN via libpq (using UNIX (local) sockets); (see the sketch
after this list;)
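
    (A minimal sketch of what one such batch looks like from psql --
the benchmark itself drives the same COPY through libpq, and the
sample rows below are made up:)

> copy sds_benchmark_data (client, sensor, timestamp, value) from stdin;
> 1042	3	1000000	750
> 87	6	1000001	1320
> \.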

    What have I observed / tried:
    * I've tested without the primary key and the index, and the
results were the best for inserts (600k inserts / s), but the reads
were extremely slow (due to the lack of indexing);
    * with only the index (or only the primary key) the insert rate is
good at the start (for the first 2 million readings), but then it
drops to about 200 inserts / s; (see the sketch after this list;)
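
    (For what it's worth, the load-then-index pattern from the first
observation would look roughly like the sketch below; the primary key
constraint name assumes the Postgres default for this table:)

> -- load with no indexes in place, then build them once at the end
> alter table sds_benchmark_data drop constraint sds_benchmark_data_pkey;
> drop index sds_benchmark_data__client_sensor__index;
> -- ... run all the COPY batches here ...
> alter table sds_benchmark_data add primary key (client, sensor, timestamp);
> create index sds_benchmark_data__client_sensor__index
>     on sds_benchmark_data (client, sensor);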

    So could someone point out where I'm wrong, or what I can do to
optimize Postgres for this particular task?

    Thanks for your help,
    Ciprian Dorin Craciun.

    P.S.: I want to publish the benchmark results after they are
done, and I want to squeeze as much power out of Postgres as possible.
