Discussion: Partitioning with a large number of partitions
Hello, I'm a novice at using Postgres, so please be patient =) At the moment we are using PostgreSQL 9.1.0. Our database is about 270 GB, and currently we have a problem with our "hot" table, which is about 120 GB of data plus 20 GB of indexes. By "hot" I mean that we have a lot of insert/update/select operations on this table (about 30/30/120 per second). This hot table is called coordinates.

In our logic we have several definitions: Coordinate - a basic packet of data; Object - a basic unit, which sends coordinates; Client - a client which owns objects. One client may have many objects, and each object may have many coordinates. All coordinates are stored in the coordinates table.

At the moment we have an issue with slow insert/select queries, which is why we decided to partition this table. We have tested two types of partitioning: by date (weekly) and by object. These are our results:

SELECT QUERY (1 object):

-----------------------------------------------
Partition type  |    M    |   W   |    D    |
-----------------------------------------------
Non-partitioned |  5830   |  460  |  2913   |
Date (weekly)   |  1000   |  440  |   106   |
Object          |  0.02   |  0.03 |  0.009  |
-----------------------------------------------
M - query for a month period
W - query for a week period
D - query for a day period
All times in ms.

So according to our results we get much greater performance when partitioning by object, BUT we have a lot of objects (about 5000, and we are planning to connect 2-5 times as many next year). These plans lead to the first question:

1) How will Postgres "like" such a big number of partitions? Will it decrease performance in the future? We have already noticed several performance issues with a big count of object IDs in a conditional statement such as:

    SELECT * FROM coordinates_object.coordinates
    WHERE object_id IN (HERE ARE SEVERAL OBJECTS);

EXPLAIN ANALYZE:

    EXPLAIN ANALYZE
    SELECT * FROM coordinates_object.coordinates
    WHERE datetime BETWEEN '2011-07-01' AND '2011-07-02'
      AND object_id IN (722::bigint, 728::bigint, 727::bigint, 248::bigint);

Returns:

    Result  (cost=0.00..1887.23 rows=1455 width=655) (actual time=49.266..695.483 rows=908 loops=1)
      ->  Append  (cost=0.00..1887.23 rows=1455 width=655) (actual time=49.264..694.995 rows=908 loops=1)
            ->  Seq Scan on coordinates  (cost=0.00..0.00 rows=1 width=314) (actual time=0.001..0.001 rows=0 loops=1)
                  Filter: ((datetime >= '2011-07-01 00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02 00:00:00'::timestamp without time zone) AND (object_id = ANY ('{722,728,727,248}'::bigint[])))
            ->  Bitmap Heap Scan on coordinates_722 coordinates  (cost=10.90..451.49 rows=258 width=654) (actual time=49.261..195.291 rows=166 loops=1)
                  Recheck Cond: ((datetime >= '2011-07-01 00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02 00:00:00'::timestamp without time zone))
                  Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))
                  ->  Bitmap Index Scan on coordinates_722_datetime_index  (cost=0.00..10.83 rows=258 width=0) (actual time=49.237..49.237 rows=166 loops=1)
                        Index Cond: ((datetime >= '2011-07-01 00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02 00:00:00'::timestamp without time zone))
            ->  Bitmap Heap Scan on coordinates_728 coordinates  (cost=13.00..568.65 rows=463 width=656) (actual time=108.833..203.139 rows=331 loops=1)
                  Recheck Cond: ((datetime >= '2011-07-01 00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02 00:00:00'::timestamp without time zone))
                  Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))
                  ->  Bitmap Index Scan on coordinates_728_datetime_index  (cost=0.00..12.88 rows=463 width=0) (actual time=61.914..61.914 rows=332 loops=1)
                        Index Cond: ((datetime >= '2011-07-01 00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02 00:00:00'::timestamp without time zone))
            ->  Bitmap Heap Scan on coordinates_727 coordinates  (cost=19.56..824.53 rows=713 width=651) (actual time=147.295..251.406 rows=370 loops=1)
                  Recheck Cond: ((datetime >= '2011-07-01 00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02 00:00:00'::timestamp without time zone))
                  Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))
                  ->  Bitmap Index Scan on coordinates_727_datetime_index  (cost=0.00..19.38 rows=713 width=0) (actual time=125.261..125.261 rows=370 loops=1)
                        Index Cond: ((datetime >= '2011-07-01 00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02 00:00:00'::timestamp without time zone))
            ->  Index Scan using coordinates_248_datetime_index on coordinates_248 coordinates  (cost=0.00..42.56 rows=20 width=777) (actual time=0.017..44.991 rows=41 loops=1)
                  Index Cond: ((datetime >= '2011-07-01 00:00:00'::timestamp without time zone) AND (datetime <= '2011-07-02 00:00:00'::timestamp without time zone))
                  Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))
    Total runtime: 695.963 ms

2) As you can see from the EXPLAIN, Postgres is filtering with an array: Filter: (object_id = ANY ('{722,728,727,248}'::bigint[])). Why?

3) The last question is about some "unpredictable" behavior with a query such as:

    EXPLAIN ANALYZE
    SELECT * FROM coordinates_object.coordinates
    WHERE datetime BETWEEN '2011-07-01' AND '2011-07-02'
      AND object_id IN (SELECT id FROM object WHERE client_id = 318);

With this query, Postgres tries to Bitmap Heap Scan the index of every one of our partitions (remember, we have 5000 partitions). Such a query makes our server think a lot =). So, if possible, please judge our decision and suggest some other solutions...
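P.S. For reference, here is a simplified sketch of how our per-object layout is defined. The real table has many more columns, which I've collapsed into a single placeholder here; the child-table and index names match the EXPLAIN above:

    CREATE SCHEMA coordinates_object;

    CREATE TABLE coordinates_object.coordinates (
        object_id bigint    NOT NULL,
        datetime  timestamp NOT NULL,
        payload   bytea               -- placeholder for the real packet columns
    );

    -- One child table per object, e.g. for object 722:
    CREATE TABLE coordinates_object.coordinates_722 (
        CHECK (object_id = 722)       -- lets constraint exclusion skip this child
    ) INHERITS (coordinates_object.coordinates);

    CREATE INDEX coordinates_722_datetime_index
        ON coordinates_object.coordinates_722 (datetime);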
On Wed, Oct 12, 2011 at 10:09 AM, Aleksej Trofimov <aleksej.trofimov@ruptela.lt> wrote:
> At the moment we are using PostgreSQL 9.1.0. Our database is about 270 GB,
> and currently we have a problem with our "hot" table, which is about 120 GB
> of data plus 20 GB of indexes.

That sounds quite manageable with decent hardware.

> By "hot" I mean that we have a lot of insert/update/select operations on
> this table (about 30/30/120 per second). This hot table is called
> coordinates. In our logic we have several definitions: Coordinate - a
> basic packet of data; Object - a basic unit, which sends coordinates;
> Client - a client which owns objects. One client may have many objects,
> and each object may have many coordinates. All coordinates are stored in
> the coordinates table.
> At the moment we have an issue with slow insert/select queries, which is
> why we decided to partition this table. We have tested two types of
> partitioning: by date (weekly) and by object. These are our results:
>
> SELECT QUERY (1 object):
>
> -----------------------------------------------
> Partition type  |    M    |   W   |    D    |
> -----------------------------------------------
> Non-partitioned |  5830   |  460  |  2913   |
> Date (weekly)   |  1000   |  440  |   106   |
> Object          |  0.02   |  0.03 |  0.009  |
> -----------------------------------------------
> M - query for a month period
> W - query for a week period
> D - query for a day period
> All times in ms.
>
> So according to our results we get much greater performance when
> partitioning by object,

Well, hold on: your results only show timings for "SELECT QUERY (1 object)" -- does this mean you're just selecting one row? If so, that doesn't seem like a particularly useful performance benchmark. What are the _real_ bottleneck queries for your application when you're not using partitioning?

Also, 5830 ms seems like a lot for what amounts to an indexed (primary key?) lookup of a single row -- how long does

    SELECT * FROM original_unpartitioned_big_table
    WHERE object_id = [some object ID];

take? (Try the query a few times, with both the same and different object IDs, to get a feel for cache effects.)

BTW, if you do decide to partition based on objects, you don't have to make a partition for each object ID if there are 5000 of those. You can partition based on ranges of object IDs instead -- see the sketch below.

> BUT we have a lot of objects (about 5000, and we are planning to connect
> 2-5 times as many next year). These plans lead to the first question:
>
> 1) How will Postgres "like" such a big number of partitions? Will it
> decrease performance in the future?

5000 partitions is going to incur significant planner overhead for each query; check the docs and archives (search term "large number of partitions" or similar).

> We have already noticed several performance issues with a big count of
> object IDs in a conditional statement such as:
>
> SELECT * FROM coordinates_object.coordinates
> WHERE object_id IN (HERE ARE SEVERAL OBJECTS);
[snip]
> Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))
> Total runtime: 695.963 ms
>
> 2) As you can see from the EXPLAIN, Postgres is filtering with an array:
> Filter: (object_id = ANY ('{722,728,727,248}'::bigint[])). Why?

AFAIR, that's the normal transformation of an IN (...) list into = ANY (...), and I bet you'll see the same thing without partitioning at play.
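To make the range idea concrete, here is a minimal sketch using 9.1-style inheritance partitioning. The names, the 500-IDs-per-child width, and the (object_id, datetime) index are purely illustrative, so adjust them to your data:

    -- One child per range of object IDs instead of one per object:
    CREATE TABLE coordinates_object.coordinates_r_1_500 (
        CHECK (object_id >= 1 AND object_id <= 500)
    ) INHERITS (coordinates_object.coordinates);

    CREATE INDEX coordinates_r_1_500_idx
        ON coordinates_object.coordinates_r_1_500 (object_id, datetime);

    -- Route inserts on the parent to the matching child:
    CREATE OR REPLACE FUNCTION coordinates_object.coordinates_insert_trigger()
    RETURNS trigger AS $$
    BEGIN
        IF NEW.object_id BETWEEN 1 AND 500 THEN
            INSERT INTO coordinates_object.coordinates_r_1_500 VALUES (NEW.*);
        -- ELSIF ... one branch per additional range ...
        ELSE
            RAISE EXCEPTION 'no partition for object_id %', NEW.object_id;
        END IF;
        RETURN NULL;  -- the row has already been stored in a child table
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER coordinates_insert
        BEFORE INSERT ON coordinates_object.coordinates
        FOR EACH ROW
        EXECUTE PROCEDURE coordinates_object.coordinates_insert_trigger();

With a few hundred object IDs per child you end up with tens of partitions instead of 5000, which keeps the planner overhead mentioned above manageable, while an index leading with object_id should keep single-object lookups fast.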
For a variety of reasons (performance, ease of maintenance), I think partitioning is most useful for improving query times on large data sets when you can separate out your "hot" and "cold" data. Or, if you can't do that, the next-best thing is to partition into sets that will often be queried together (e.g. if you often query a cluster of nearby object IDs together, partition by ranges of object IDs; the same goes for your other keys), so that your queries hit a small number of partitions on average.
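Re your third question: constraint exclusion can only prune partitions using values known at plan time, so an IN (SELECT ...) subquery forces the planner to keep every child table in the plan. A common workaround (a sketch; it costs an extra round trip) is to resolve the IDs first and pass them back in as literals:

    -- Step 1: fetch the client's object IDs.
    SELECT id FROM object WHERE client_id = 318;

    -- Step 2: query with the IDs as literals, so that only the matching
    -- partitions survive constraint exclusion (compare the 4-partition
    -- plan you posted, which was pruned exactly this way).
    SELECT *
    FROM coordinates_object.coordinates
    WHERE datetime BETWEEN '2011-07-01' AND '2011-07-02'
      AND object_id IN (722::bigint, 728::bigint, 727::bigint, 248::bigint);

And double-check that constraint exclusion is active at all (on 9.1 the default is already 'partition'):

    SHOW constraint_exclusion;
    SET constraint_exclusion = partition;

Josh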