Обсуждение: Performance issues with large amounts of time-series data

От:
Hrishikesh (हृषीकेश मेहेंदळे)
Дата:

Hi All,

We are improving our network appliance monitoring system, and are evaluating using PostgreSQL as the back-end traffic statistics database (we're currently running a home-grown Berkeley-DB based statistics database).

We log data from various network elements (it's mainly in/out bytes and packet counters, recorded for every port that we see traffic on). As such, the system  can expect to get data from 2000 devices (eventually, at the moment it's only about 250), and has a monitoring target of 100 ports (although this is not enforced at 100, in practice we've seen only about 20-30 ports in a given timeframe, and only about 50 distinct ports over a whole year of monitoring) -- this is akin to RRD (e.g. MRTG or Cacti) but with a lot more flexibility.

Our current monitoring system reports the data per device as

key = {device_id (uint64), identifier (uint32), sub_identifier (uint32), unix_time} (these four taken together are unique)
data = 4 x uint64 (BIGINT in PG tables)


My table structure in PG mirrors this format with a UNIQUE constraint across the four columns, and an index on each column separately. The data is received every 5 minutes, and stored at 5 minute, 1 hour and 1-day granularities into partitioned tables named like stats_300 -> (stats_300_begintime_endtime, stats_300_begintime_endtime) and so on. I have currently split the 5min tables at every 2 hours, 1 hour tables at  2 days, and 1-day tables at every month).

For this schema, the typical queries would be:

For timeseries graphs (graphed as bar/line graphs):
  SELECT TIMESTAMP, SUM(DATA_0), SUM(DATA_1), SUM(DATA_2), SUM(DATA_3)
    FROM <appropriate parent table> WHERE TIMESTAMP >= X AND TIMESTAMP < Y
    AND DEVICE IN (id1, id2, id3, ..... up to 2000 IDs can be here)
    GROUP BY TIMESTAMP;

For aggregate graphs (graphed as a pie chart):
  SELECT SUB_ID,  SUM(DATA_0), SUM(DATA_1), SUM(DATA_2), SUM(DATA_3)
    FROM <appropriate top table> WHERE TIMESTAMP >= X AND TIMESTAMP < Y
    AND DEVICE IN (id1, id2, id3, ..... up to 2000 IDs can be here)
    GROUP BY SUB_ID;

In my timing tests, the performance of PG is quite a lot worse than the equivalent BerkeleyDB implementation. Specifically, I get the following timing results:

For the longest-running queries:
BDB - 10-15 sec (cold transfer), <2 sec (warm - if I rerun the query immediately)
PG (command line) - 25 - 30 sec (cold), 25-30 sec (warm).
PG (via libpqxx) - ~40 sec (cold), 25-30 sec (warm)

The data is immutable once it goes in (unless I DROP TABLE), and I've VACUUM FULL ANALYZED the whole database *before* my timing queries.

An explain analyze looks like (the tables are prepopulated with data for 2000 devices and 100 sub_ids):

mydb=> explain analyze SELECT TIMESTAMP, SUM(DATA_0), SUM(DATA_1), SUM(DATA_2), SUM(DATA_3) FROM stats_3600 WHERE MAIN_ID = 1 AND SUB_ID = 0 AND TIMESTAMP >= 1251676859 AND TIMESTAMP <= 1251849659 GROUP BY TIMESTAMP;
                                                                                  QUERY PLAN                                                                                  
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=226659.20..226663.20 rows=200 width=36) (actual time=1709.651..1709.745 rows=48 loops=1)
   ->  Append  (cost=0.00..225288.47 rows=109659 width=36) (actual time=33.840..1264.328 rows=96000 loops=1)
         ->  Index Scan using uniq_3600 on stats_3600  (cost=0.00..8.28 rows=1 width=36) (actual time=0.019..0.019 rows=0 loops=1)
               Index Cond: ((main_id = 1) AND (sub_id = 0) AND ("timestamp" >= 1251676859) AND ("timestamp" <= 1251849659))
         ->  Bitmap Heap Scan on stats_3600_1251590400_1251763199 stats_3600  (cost=2131.71..112946.75 rows=60642 width=36) (actual time=33.816..495.239 rows=46000 loops=1)
               Recheck Cond: ((main_id = 1) AND (sub_id = 0) AND ("timestamp" >= 1251676859) AND ("timestamp" <= 1251849659))
               ->  Bitmap Index Scan on stats_3600_1251590400_1251763199_unique_check  (cost=0.00..2116.55 rows=60642 width=0) (actual time=21.415..21.415 rows=46000 loops=1)
                     Index Cond: ((main_id = 1) AND (sub_id = 0) AND ("timestamp" >= 1251676859) AND ("timestamp" <= 1251849659))
         ->  Bitmap Heap Scan on stats_3600_1251763200_1251935999 stats_3600  (cost=1727.24..112333.44 rows=49016 width=36) (actual time=38.169..526.578 rows=50000 loops=1)
               Recheck Cond: ((main_id = 1) AND (sub_id = 0) AND ("timestamp" >= 1251676859) AND ("timestamp" <= 1251849659))
               ->  Bitmap Index Scan on stats_3600_1251763200_1251935999_unique_check  (cost=0.00..1714.99 rows=49016 width=0) (actual time=24.059..24.059 rows=50000 loops=1)
                     Index Cond: ((main_id = 1) AND (sub_id = 0) AND ("timestamp" >= 1251676859) AND ("timestamp" <= 1251849659))
 Total runtime: 1710.844 ms
(13 rows)


mydb=> explain analyze SELECT SUB_ID, SUM(DATA_0), SUM(DATA_1), SUM(DATA_2), SUM(DATA_3) FROM stats_3600 WHERE MAIN_ID = 1 AND TIMESTAMP >= 1251676859 AND TIMESTAMP <= 1251849659  GROUP BY SUB_ID;
                                                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=881887.53..881891.53 rows=200 width=36) (actual time=82007.298..82007.493 rows=99 loops=1)
   ->  Append  (cost=0.00..771583.84 rows=8824295 width=36) (actual time=37.206..42504.106 rows=8819844 loops=1)
         ->  Index Scan using uniq_3600 on stats_3600  (cost=0.00..8.32 rows=1 width=36) (actual time=0.024..0.024 rows=0 loops=1)
               Index Cond: ((main_id = 1) AND ("timestamp" >= 1251676859) AND ("timestamp" <= 1251849659))
         ->  Index Scan using idx_ts_stats_3600_1251590400_1251763199 on stats_3600_1251590400_1251763199 stats_3600  (cost=0.00..369424.65 rows=4234747 width=36) (actual time=37.178..9776.530 rows=4226249 loops=1)
               Index Cond: (("timestamp" >= 1251676859) AND ("timestamp" <= 1251849659))
               Filter: (main_id = 1)
         ->  Index Scan using idx_ts_stats_3600_1251763200_1251935999 on stats_3600_1251763200_1251935999 stats_3600  (cost=0.00..402150.87 rows=4589547 width=36) (actual time=0.119..11339.277 rows=4593595 loops=1)
               Index Cond: (("timestamp" >= 1251676859) AND ("timestamp" <= 1251849659))
               Filter: (main_id = 1)
 Total runtime: 82007.762 ms

The corresponding table definition looks like:
mydb=> \d stats_3600_1251590400_1251763199
Table "public.stats_3600_1251590400_1251763199"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 main_id    | integer |
 sub_id | integer |
 timestamp   | integer |
 device      | bigint  |
 data_0      | bigint  |
 data_1      | bigint  |
 data_2      | bigint  |
 data_3      | bigint  |
Indexes:
    "stats_3600_1251590400_1251763199_unique_check" UNIQUE, btree (main_id, sub_id, "timestamp", device)
    "idx_cid_stats_3600_1251590400_1251763199" btree (main_id)
    "idx_scid_stats_3600_1251590400_1251763199" btree (sub_id)
    "idx_dev_stats_3600_1251590400_1251763199" btree (device)
    "idx_ts_stats_3600_1251590400_1251763199" btree ("timestamp")
Check constraints:
    "stats_3600_1251590400_1251763199_timestamp_check" CHECK ("timestamp" >= 1251590400 AND "timestamp" <= 1251763199)
Inherits: stats_3600

The table contains the following data (other tables are similar):
mydb=> select relname, relpages, reltuples from pg_class where relname like 'stats_%';
                    relname                     | relpages |  reltuples 
------------------------------------------------+----------+-------------
 stats_300_1251705600_1251712799                |    49532 |  4.8046e+06
  stats_3600_1251763200_1251935999               |   181861 | 1.76404e+07
  stats_86400_1244160000_1246751999              |    61845 | 5.99888e+06
[the rest truncated for brevity]


So my questions are:
1. Is there anything I can do to speed up performance for the queries? Even a warm performance comparable to the BDB version would be a big improvement from the current numbers.
2. Does the order in which data was received vs. data being queried matter? (If so, I can either cache the data before writing to DB, or rewrite the table when I rollover to the next one)


System Configuration:
 - 64-bit quad-core Xeon with 6 GB RAM
 - 4x250 GB SATA disks configured as RAID stripe+mirror
 - Linux 2.6.9-34 with some custom patches (CentOS 4.2 based)
 - postgres 8.3.7 (from sources, no special config options, installed to /var/opt/pgsql-8.3)
 - C++ interface using libpqxx-3.0 (also built from sources)Relevant parameters from postgresql.conf:
 - Relevant postgresql.conf parameters:
     data_directory = /data/pg (400 GB partition)
     max_connections = 8
     shared_buffers = 128MB
     work_mem = 256MB
     maintenance_work_mem=64MB
     effective_cache_size = 2048MB
     max_fsm_pages=204800
     default_statistics_target = 100
     constraint_exclusion = on


Thanks Much!
Hrishi
От:
Tom Lane
Дата:

=?UTF-8?B?SHJpc2hpa2VzaCAo4KS54KWD4KS34KWA4KSV4KWH4KS2IOCkruClh+CkueClh+CkguCkpuCksw==?= =?UTF-8?B?4KWHKQ==?=
<>writes: 
> In my timing tests, the performance of PG is quite a lot worse than the
> equivalent BerkeleyDB implementation.

Are you actually comparing apples to apples?  I don't recall that BDB
has any built-in aggregation functionality.  It looks to me like you've
moved some work out of the client into the database.

> 1. Is there anything I can do to speed up performance for the queries?

Do the data columns have to be bigint, or would int be enough to hold
the expected range?  SUM(bigint) is a *lot* slower than SUM(int),
because the former has to use "numeric" arithmetic whereas the latter
can sum in bigint.  If you want to keep the data on-disk as bigint,
but you know the particular values being summed here are not that
big, you could cast in the query (SUM(data_1::int) etc).

I'm also wondering if you've done something to force indexscans to be
used.  If I'm interpreting things correctly, some of these scans are
traversing all/most of a partition and would be better off as seqscans.

>      shared_buffers = 128MB

This is really quite lame for the size of machine and database you've
got.  Consider knocking it up to 1GB or so.

            regards, tom lane

От:
Hrishikesh (हृषीकेश मेहेंदळे)
Дата:

Hi Tom,

Thanks for your quick response.

2009/8/26 Tom Lane <>
> <> writes:
> > In my timing tests, the performance of PG is quite a lot worse than the
> > equivalent BerkeleyDB implementation.
>
> Are you actually comparing apples to apples?  I don't recall that BDB
> has any built-in aggregation functionality.  It looks to me like you've
> moved some work out of the client into the database.

I'm measuring end-to-end time, which includes the in-code aggregation
with BDB (post DB fetch) and the in-query aggregation in PG.

> > 1. Is there anything I can do to speed up performance for the queries?
>
> Do the data columns have to be bigint, or would int be enough to hold
> the expected range?  SUM(bigint) is a *lot* slower than SUM(int),
> because the former has to use "numeric" arithmetic whereas the latter
> can sum in bigint.  If you want to keep the data on-disk as bigint,
> but you know the particular values being summed here are not that
> big, you could cast in the query (SUM(data_1::int) etc).

For the 300-sec tables I probably can drop it to an integer, but for
3600 and 86400 tables (1 hr, 1 day) will probably need to be BIGINTs.
However, given that I'm on a 64-bit platform (sorry if I didn't
mention it earlier), does it make that much of a difference? How does
a float ("REAL") compare in terms of SUM()s ?

> I'm also wondering if you've done something to force indexscans to be
> used.  If I'm interpreting things correctly, some of these scans are
> traversing all/most of a partition and would be better off as seqscans.
One thing I noticed is that if I specify what devices I want the data
for (specifically, all of them, listed out as DEVICE IN (1,2,3,4,5...)
in the WHERE clause, PG uses a Bitmap heap scan, while if I don't
specify the list (which still gives me data for all the devices), PG
uses a sequential scan. (I might have missed the DEVICE IN (...) in my
earlier query). However, more often than not, the query _will_ be of
the form DEVICE IN (...). If I actually execute the queries (on the
psql command line), their runtimes are about the same (15s vs 16s)

> >      shared_buffers = 128MB
>
> This is really quite lame for the size of machine and database you've
> got.  Consider knocking it up to 1GB or so.

OK, I've bumped it up to 1 GB. However, that doesn't seem to make a
huge difference (unless I need to do the same on libpqxx's connection
object too).

Cheers,
Hrishi

От:
Tom Lane
Дата:

=?UTF-8?B?SHJpc2hpa2VzaCAo4KS54KWD4KS34KWA4KSV4KWH4KS2IOCkruClh+CkueClh+CkguCkpuCksw==?= =?UTF-8?B?4KWHKQ==?=
<>writes: 
> 2009/8/26 Tom Lane <>
>> Do the data columns have to be bigint, or would int be enough to hold
>> the expected range?

> For the 300-sec tables I probably can drop it to an integer, but for
> 3600 and 86400 tables (1 hr, 1 day) will probably need to be BIGINTs.
> However, given that I'm on a 64-bit platform (sorry if I didn't
> mention it earlier), does it make that much of a difference?

Even more so.

> How does a float ("REAL") compare in terms of SUM()s ?

Casting to float or float8 is certainly a useful alternative if you
don't mind the potential for roundoff error.  On any non-ancient
platform those will be considerably faster than numeric.  BTW,
I think that 8.4 might be noticeably faster than 8.3 for summing
floats, because of the switch to pass-by-value for them.

            regards, tom lane

От:
Greg Stark
Дата:

2009/8/26 Tom Lane <>:
>> How does a float ("REAL") compare in terms of SUM()s ?
>
> Casting to float or float8 is certainly a useful alternative if you
> don't mind the potential for roundoff error.  On any non-ancient
> platform those will be considerably faster than numeric.  BTW,
> I think that 8.4 might be noticeably faster than 8.3 for summing
> floats, because of the switch to pass-by-value for them.

It occurs to me we could build a special case state variable which
contains a bigint or a numeric only if it actually overflows. This
would be like my other suggestion with dates only it would never be
exposed. The final function would always convert to a numeric.

Alternatively we could change the numeric data type as was proposed
aeons ago but make it more general so it stores integers that fit in a
bigint as a 64-bit integer internally. That would be more work but be
more generally useful. I'm not sure it would be possible to avoid
generating palloc garbage for sum() that way though.

--
greg
http://mit.edu/~gsstark/resume.pdf

От:
Greg Smith
Дата:

On Wed, 26 Aug 2009, Hrishikesh (??????? ????????) wrote:

> key = {device_id (uint64), identifier (uint32), sub_identifier (uint32), unix_time} (these four taken together are
unique)

You should probably tag these fields as NOT NULL to eliminate needing to
consider that possibility during query planning.  As of V8.3 this isn't as
critical anymore, but it's still good practice.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

От:
Hrishikesh (हृषीकेश मेहेंदळे)
Дата:

Hi Tom, Greg,

Thanks for your helpful suggestions - switching the BIGINT to FLOAT
and fixing the postgresql.conf to better match my server configuration
gave me about 30% speedup on the queries.

Because of the fact that my data insert order was almost never the
data retrieval order, I also got a significant (about 3x - 10x)
speedup by CLUSTERing the tables on an index that represented the most
frequent query orders (main_id, timestamp, sub_id, device_id) - the
queries that were taking a few seconds earlier now complete in a few
hundred milliseconds (5s vs. 600ms in some instances).

Thanks Again,
Hrishikesh