Thread: Partitioning with a large number of partitions

Partitioning with a large number of partitions

From: Aleksej Trofimov
Date: Wed, Oct 12, 2011
Hello, I'm a novice Postgres user, so please be patient =)

At the moment we are using PostgreSQL 9.1.0. Our database is about 270GB,
and currently we have a problem with our "hot" table, which holds about
120GB of data and 20GB of indexes. By "hot" I mean that we have a lot of
insert/update/select operations on this table (about 30/30/120 per
second). This hot table is called coordinates. In our logic we have
several definitions: Coordinate - the basic packet of data; Object - the
basic unit, which sends coordinates; Client - the client which owns
objects. One client may have many objects, and each object may have many
coordinates. All coordinates are stored in the coordinates table.
So at the moment we have an issue with slow insert/select queries...
that is why we decided to partition this table. We have tested two types
of partitioning: by date (weekly) and by object.
These are our results:

SELECT QUERY (1 object):

-----------------------------------------------
Partition type    |   M    |   W   |    D
-----------------------------------------------
Non-partitioned   |  5830  |  460  |  2913
Date (weekly)     |  1000  |  440  |   106
Object            |  0.02  | 0.03  |  0.009
-----------------------------------------------

M - query over a month period
W - query over a week period
D - query over a day period
All times are in ms.
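
For reference, a minimal sketch of what the per-object inheritance setup
tested above might look like on 9.1 (DDL inferred from the plan further
below; the parent table's column definitions are assumed), shown for a
single object id:

CREATE TABLE coordinates_object.coordinates_722 (
    CHECK (object_id = 722)   -- lets constraint exclusion prune this child
) INHERITS (coordinates_object.coordinates);

CREATE INDEX coordinates_722_datetime_index
    ON coordinates_object.coordinates_722 (datetime);

-- constraint_exclusion must be 'partition' (the default) or 'on'
-- for the planner to skip non-matching child tables.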

So according to our results we get much better performance when
partitioning by object, BUT we have a lot of objects (about 5000, and we
plan to connect 2-5 times as many objects next year). These plans
lead to the first question:

1) How well will Postgres handle such a big number of partitions? Will it
decrease performance in the future? At the moment we have noticed several
performance issues with a big number of object ids in conditional
statements such as: SELECT * FROM coordinates_object.coordinates WHERE
object_id IN (HERE ARE SEVERAL OBJECTS);
EXPLAIN ANALYZE:
EXPLAIN ANALYZE SELECT * FROM coordinates_object.coordinates WHERE
datetime BETWEEN '2011-07-01' AND '2011-07-02' AND object_id IN
(722::bigint, 728::bigint, 727::bigint, 248::bigint);

Returns:

"Result  (cost=0.00..1887.23 rows=1455 width=655) (actual
time=49.266..695.483 rows=908 loops=1)"
"  ->  Append  (cost=0.00..1887.23 rows=1455 width=655) (actual
time=49.264..694.995 rows=908 loops=1)"
"        ->  Seq Scan on coordinates  (cost=0.00..0.00 rows=1 width=314)
(actual time=0.001..0.001 rows=0 loops=1)"
"              Filter: ((datetime >= '2011-07-01 00:00:00'::timestamp
without time zone) AND (datetime <= '2011-07-02 00:00:00'::timestamp
without time zone) AND (object_id = ANY ('{722,728,727,248}'::bigint[])))"
"        ->  Bitmap Heap Scan on coordinates_722 coordinates
(cost=10.90..451.49 rows=258 width=654) (actual time=49.261..195.291
rows=166 loops=1)"
"              Recheck Cond: ((datetime >= '2011-07-01
00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02
00:00:00'::timestamp without time zone))"
"              Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))"
"              ->  Bitmap Index Scan on coordinates_722_datetime_index
(cost=0.00..10.83 rows=258 width=0) (actual time=49.237..49.237 rows=166
loops=1)"
"                    Index Cond: ((datetime >= '2011-07-01
00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02
00:00:00'::timestamp without time zone))"
"        ->  Bitmap Heap Scan on coordinates_728 coordinates
(cost=13.00..568.65 rows=463 width=656) (actual time=108.833..203.139
rows=331 loops=1)"
"              Recheck Cond: ((datetime >= '2011-07-01
00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02
00:00:00'::timestamp without time zone))"
"              Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))"
"              ->  Bitmap Index Scan on coordinates_728_datetime_index
(cost=0.00..12.88 rows=463 width=0) (actual time=61.914..61.914 rows=332
loops=1)"
"                    Index Cond: ((datetime >= '2011-07-01
00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02
00:00:00'::timestamp without time zone))"
"        ->  Bitmap Heap Scan on coordinates_727 coordinates
(cost=19.56..824.53 rows=713 width=651) (actual time=147.295..251.406
rows=370 loops=1)"
"              Recheck Cond: ((datetime >= '2011-07-01
00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02
00:00:00'::timestamp without time zone))"
"              Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))"
"              ->  Bitmap Index Scan on coordinates_727_datetime_index
(cost=0.00..19.38 rows=713 width=0) (actual time=125.261..125.261
rows=370 loops=1)"
"                    Index Cond: ((datetime >= '2011-07-01
00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02
00:00:00'::timestamp without time zone))"
"        ->  Index Scan using coordinates_248_datetime_index on
coordinates_248 coordinates  (cost=0.00..42.56 rows=20 width=777)
(actual time=0.017..44.991 rows=41 loops=1)"
"              Index Cond: ((datetime >= '2011-07-01
00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02
00:00:00'::timestamp without time zone))"
"              Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))"
"Total runtime: 695.963 ms"

2) As you can see from the EXPLAIN output, Postgres uses an array for the
filter: Filter: (object_id = ANY ('{722,728,727,248}'::bigint[])) - why?

3) The last question is about some "unpredictable" behavior with such a query:

EXPLAIN ANALYZE SELECT * FROM coordinates_object.coordinates WHERE
datetime BETWEEN '2011-07-01' AND '2011-07-02' AND object_id IN (SELECT
id FROM object WHERE client_id = 318);

With such a query, Postgres does a Bitmap Heap Scan using the index of
every one of the partitions (remember, we have 5000 partitions). Such a
query makes our server think a lot =).
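
A plausible two-step workaround (a sketch, assuming the resolved id list
stays small): constraint exclusion can only prune partitions using values
known at plan time, so a subselect in the IN list forces the planner to
consider every child table. Resolving the ids first and reissuing the main
query with literal values, as in the four-id EXPLAIN above, lets the
planner touch only the matching partitions:

-- step 1: resolve the subquery on its own
SELECT id FROM object WHERE client_id = 318;

-- step 2: run the main query with the resulting ids as literals
SELECT * FROM coordinates_object.coordinates
WHERE datetime BETWEEN '2011-07-01' AND '2011-07-02'
  AND object_id IN (722, 728, 727, 248);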

So, if possible, please judge our decision and offer some other
solutions...





Re: Partitioning with a large number of partitions

From: Josh Kupershmidt
Date:
On Wed, Oct 12, 2011 at 10:09 AM, Aleksej Trofimov
<aleksej.trofimov@ruptela.lt> wrote:
> At the moment we are using PostgreSQL 9.1.0. Our database is about 270GB,
> and currently we have a problem with our "hot" table, which holds about
> 120GB of data and 20GB of indexes.

That sounds quite manageable with decent hardware.

> By "hot" I mean, that we have a lot of
> insert/update/select operations on this table (about 30/30/120 per second).
> This hot table is called coordinates. In our logic we have several
> definitions: Coordinate - basic packet of data; Object - basic unit, which
> sends coordinates; Client - client which own object.  One client may has a
> lot of object. and each object may has a lot  of coordinates. All
> coordinates are stored in coordinates table.
> So by the moment we have an issue with slow insert/select queries... this
> why we decided to partition this table. We have tested 2 types of
> partitioning: By date (weekly), by object;
> This are our  results:
>
> SELECT QUERY (1 object):
>
> -----------------------------------------------------
> Partition type         |  M     |  W    |    D    |
> -----------------------------------------------------
> Non-partitioned     | 5830 | 460  | 2913  |
> Date(weekly)         | 1000 | 440  | 106    |
> Object                    | 0.02 | 0.03 |  0.009 |
> ------------------------------------------------------
>
> M - query for month period
> W - query for week period
> D - query for day period
> All times in ms.
>
> So according to our results we get much better performance when
> partitioning by object,

Well hold on, your results only show the timings for "SELECT QUERY (1
object):" -- does this mean you're just selecting one row? If so, that
doesn't seem like a particularly useful performance benchmark. What
are the _real_ bottleneck queries for your application when you're not
using partitioning?

Also, 5830 ms seems like a lot for what amounts to an indexed (primary
key?) lookup of a single row -- how long does it take for you to run
SELECT [some object ID] FROM original_unpartitioned_big_table;  (try
the query a few times, with both the same and different object IDs, to
get a feel for cache effects).

BTW, if you do decide to partition based on objects, you don't have to
make a partition for each object ID if there are 5000 of them. You
can partition based on ranges of object IDs.
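
For example, a rough sketch of one such child table (range boundaries and
names assumed, following the inheritance scheme already in use):

CREATE TABLE coordinates_object.coordinates_0001_0100 (
    CHECK (object_id >= 1 AND object_id <= 100)
) INHERITS (coordinates_object.coordinates);

CREATE INDEX coordinates_0001_0100_datetime_index
    ON coordinates_object.coordinates_0001_0100 (datetime);

-- Roughly 50 children of 100 objects each, instead of 5000 single-object
-- children, keeps planner overhead far more manageable.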

> BUT we have a lot of objects (about 5000, and we plan
> to connect 2-5 times as many objects next year). These plans
> lead to the first question:
>
> 1) How well will Postgres handle such a big number of partitions? Will it
> decrease performance in the future?

5000 partitions is going to incur significant planner overhead for
each query; check the docs and archives (search term "large number of
partitions" or similar).

> At the moment we have noticed several performance
> issues with a big number of object ids in conditional statements such as:
> SELECT * FROM coordinates_object.coordinates WHERE object_id IN (HERE ARE
> SEVERAL OBJECTS);
[snip]
> "              Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))"
> "Total runtime: 695.963 ms"
>
> 2) As you can see from the EXPLAIN output, Postgres uses an array for the
> filter: Filter: (object_id = ANY ('{722,728,727,248}'::bigint[])) - why?

AFAIR, that's a normal transformation of an IN (...) query to = ANY
(...), and I bet you'll see the same thing without partitioning at
play.
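
For instance, a minimal sketch against a hypothetical unpartitioned table
(table and column names made up) shows the same rewrite:

CREATE TABLE plain_coords (object_id bigint);

EXPLAIN SELECT * FROM plain_coords WHERE object_id IN (722, 728, 727, 248);
--  Seq Scan on plain_coords  (cost and row estimates omitted)
--    Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))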

For a variety of reasons (performance, ease of maintenance), I think
partitioning is most useful for improving query times on large
data sets when you can separate out your "hot" and "cold" data. Or, if
you can't do that, then the next-best thing is to partition into sets
that will be queried together often (e.g. if you often query a cluster
of nearby object IDs together, partition by object ID, or the same for
your other types), so that your queries hit a small number of
partitions on average.
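
As one illustration of the hot/cold split (dates and names assumed), a
date-based child table holding only the current month keeps the actively
written and queried rows in one small, well-cached table, while older
months sit in their own "cold" children:

CREATE TABLE coordinates_2011_10 (
    CHECK (datetime >= '2011-10-01' AND datetime < '2011-11-01')
) INHERITS (coordinates);

CREATE INDEX coordinates_2011_10_datetime_index
    ON coordinates_2011_10 (datetime);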

Josh