Discussion: Postgresql 14 partitioning advice


Postgresql 14 partitioning advice

From: Rick Otten
Date:
I'm spinning up a new Postgresql 14 database where I'll have to store a couple years worth of time series data at the rate of single-digit millions of rows per day.  Since this has to run in AWS Aurora, I can't use TimescaleDB.

I've been soliciting advice for best practices for building this.

One person I talked to said "try not to have more than 100 partitions": even with the latest PostgreSQL, you'll supposedly end up with a lot of lock contention if you go over 100 partitions.  This person also recommended manually kicking off vacuums on a regular schedule rather than trusting autovacuum to work reliably on the partitioned tables.

I've got several keys, besides the obvious time key, that I could partition on.  I could do a multi-key partitioning scheme.  Since the data is inbound at a relatively steady rate, if I partition on time, I can adjust the partitions to be reasonably similarly sized.  What is a good partition size?

Are there any tunables I should experiment with in particular for a database with only 2 or 3 tables in it but many partitions each with millions of rows?

Since the data most frequently queried would be recent data (say the past month or so), would it make sense to build an archiving strategy that rolled up older partitions into larger ones?  i.e., do daily partitions for the first four weeks, then come up with a process that rolled them up into monthly partitions for the next few months, then maybe quarterly partitions for the data older than a year?  (I'm thinking about ways to keep the partition count low - if that advice is justified.)

Or, should I just have a single 7 Trillion row table with a BRIN index on the timestamp and not mess with partitions at all?
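
For concreteness, that no-partition alternative would look something like the minimal sketch below; the table and column names here (and in the sketches later in this thread) are hypothetical:
```
-- A single big append-only table with a BRIN index on the timestamp.
-- BRIN stays tiny because it stores only per-block-range summaries,
-- but it relies on the physical row order tracking the timestamp
-- (i.e. insertion order ~ time order) to be effective.
CREATE TABLE metrics (
    ts    timestamptz NOT NULL,
    key   bigint      NOT NULL,
    value double precision
);

CREATE INDEX metrics_ts_brin ON metrics USING brin (ts);
```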

Re: Postgresql 14 partitioning advice

From: Justin Pryzby
Date:
On Wed, Jul 27, 2022 at 08:55:14AM -0400, Rick Otten wrote:
> I'm spinning up a new Postgresql 14 database where I'll have to store a
> couple years worth of time series data at the rate of single-digit millions
> of rows per day.  Since this has to run in AWS Aurora, I can't use
> TimescaleDB.

> One person I talked to said "try not to have more than 100 partitions",
> even with the latest postgresql you'll end up with a lot of lock contention
> if you go over 100 partitions.

I'm not familiar with this (but now I'm curious).  We have over 2000 partitions
in some tables.  No locking issue that I'm aware of.  One issue that I *have*
seen is if you have many partitions, you can end up with query plans with a
very large number of planner nodes, and it's hard to set
work_mem*hash_mem_multiplier to account for that.

> This person also recommended manually
> kicking off vacuums on a regular schedule rather than trusting autovacuum
> to work reliably on the partitioned tables.

They must mean *analyze*, which does not run automatically on the partitioned
tables (only the partitions).  The partitioned table itself is empty, so it
doesn't need to be vacuumed.
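
For example, a nightly cron job could run a manual ANALYZE against the parent; a minimal sketch, assuming the hypothetical partitioned table named metrics:
```
-- Autovacuum analyzes the leaf partitions but never the partitioned
-- parent itself, so the parent's inheritance statistics can go stale.
-- ANALYZE on the parent gathers those statistics and also recurses
-- into each partition.
ANALYZE metrics;
```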

> I've got several keys, besides the obvious time-key that I could partition
> on.   I could do a multi-key partitioning scheme.  Since the data is
> inbound at a relatively steady rate, if I partition on time, I can adjust
> the partitions to be reasonably similarly sized.  What is a good partition
> size?

Depends on 1) the target number of partitions; and 2) the target size for
indexes on those partitions.  More partition keys will lead to smaller indexes.
Depending on the type of index, and the index keys, to get good INSERT
performance, you may need to set shared_buffers to accommodate the sum of size
of all the indexes (but maybe not, if the leading column is timestamp).
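
For example, one way to estimate that sum (a sketch; the partition name is hypothetical):
```
-- Total size of all indexes on one partition.  Summing this over the
-- actively-written partitions gives a rough lower bound for
-- shared_buffers if the goal is to keep index pages cached during
-- INSERTs.
SELECT pg_size_pretty(pg_indexes_size('metrics_2022_07_27'));
```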

> Since the data most frequently queried would be recent data (say the past
> month or so) would it make sense to build an archiving strategy that rolled
> up older partitions into larger ones?  ie, do daily partitions for the
> first four weeks, then come up with a process that rolled them up into
> monthly partitions for the next few months, then maybe quarterly partitions
> for the data older than a year?  (I'm thinking about ways to keep the
> partition count low - if that advice is justified.)

I think it can make sense.  I do that myself in order to: 1) avoid having a
huge *total* number of tables (which causes pg_attribute to be large, since our
tables are also "wide"); and 2) make our backups of "recent data" smaller; and
3) make autoanalyze a bit more efficient (a monthly partition will be analyzed
numerous times during the 2nd half of the month, even though the historic data
hasn't changed at all).
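
A minimal sketch of such a roll-up, with hypothetical names and bounds, and ignoring error handling:
```
BEGIN;
-- Detach the daily partitions (plain DETACH; DETACH CONCURRENTLY
-- cannot run inside a transaction block).
ALTER TABLE metrics DETACH PARTITION metrics_2022_07_01;
-- ... repeat for the other days in the month ...

-- Build the wider partition, copy the data over, and attach it.
CREATE TABLE metrics_2022_07 (LIKE metrics INCLUDING ALL);
INSERT INTO metrics_2022_07 SELECT * FROM metrics_2022_07_01;
-- ... repeat for the other detached daily partitions ...
ALTER TABLE metrics ATTACH PARTITION metrics_2022_07
    FOR VALUES FROM ('2022-07-01') TO ('2022-08-01');

DROP TABLE metrics_2022_07_01;
-- ... drop the rest of the detached daily partitions ...
COMMIT;
```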

> Or, should I just have a single 7 Trillion row table with a BRIN index on
> the timestamp and not mess with partitions at all?

Are you going to need to DELETE data?  Then this isn't great, and DELETEing
data will inevitably cause a lower correlation, making BRIN less effective.

-- 
Justin



Re: Postgresql 14 partitioning advice

From: Jeff Janes
Date:
On Wed, Jul 27, 2022 at 8:55 AM Rick Otten <rottenwindfish@gmail.com> wrote:

One person I talked to said "try not to have more than 100 partitions", even with the latest postgresql you'll end up with a lot of lock contention if you go over 100 partitions.


It is hard to know how seriously to take the advice of anonymous people accompanied by such sparse justification.  Meanwhile, the people who actually wrote the code seem to think that this problem has been mostly overcome with declarative partitioning in the newer versions.

When you do decide to start removing the oldest data, how will you do it?  Your partitioning should probably be designed to align with this.
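
For illustration, the two removal strategies look roughly like this (a sketch with hypothetical names; DETACH ... CONCURRENTLY requires v14):
```
-- With time partitioning, retiring old data is a cheap metadata change:
ALTER TABLE metrics DETACH PARTITION metrics_2021_q1 CONCURRENTLY;
DROP TABLE metrics_2021_q1;

-- Without partitions, it is a bulk DELETE that leaves dead tuples,
-- bloats the table, and erodes the physical ordering BRIN depends on:
DELETE FROM metrics WHERE ts < now() - interval '2 years';
```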

Since the data most frequently queried would be recent data (say the past month or so)

Is this done specifically with a time clause, or just by criteria which happen to align with time, but have no formal relationship with it?

Cheers,

Jeff

Re: Postgresql 14 partitioning advice

From: Rick Otten
Date:
On Wed, Jul 27, 2022 at 8:55 AM Rick Otten <rottenwindfish@gmail.com> wrote:
I'm spinning up a new Postgresql 14 database where I'll have to store a couple years worth of time series data at the rate of single-digit millions of rows per day.  Since this has to run in AWS Aurora, I can't use TimescaleDB.

I thought I'd report back some of my findings from testing this week:

I took the same real-world two-week data set and created identical tables, except that I partitioned one by month, one by week, one by day, and one by hour.  I partitioned a little bit into the past and a little bit into the future.  I did this on a PG 14.2 RDS instance.  This gave me tables with 3, 13, 90, and 2136 partitions respectively, but otherwise the same data.
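
For reference, each variant was presumably declared along these lines (a sketch with hypothetical names; only the daily layout is shown):
```
CREATE TABLE metrics_daily (
    ts    timestamptz NOT NULL,
    key   bigint      NOT NULL,
    value double precision
) PARTITION BY RANGE (ts);

-- One partition per day, created slightly into the past and future:
CREATE TABLE metrics_daily_2022_07_27 PARTITION OF metrics_daily
    FOR VALUES FROM ('2022-07-27') TO ('2022-07-28');
```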

Insert times were equivalent.

Then I crafted a query that was one of the main use cases for the data and ran it a bunch of times.

I noticed a significant degradation in performance as the number of partitions increased.  The jump from 13 to 90, in particular, was very steep.  It didn't matter what I set work_mem or other tunables to.  I dug deeper...

Surprising to me: if you partition on a `timestamp with time zone` column, call it "ts", and your where clause looks like
```
where ts at time zone 'UTC' > '2022-07-01 00:00'::timestamp
```
you will NOT get partition pruning and it will do a sequential scan.  (The planner can only prune on the bare partition key; wrapping "ts" in an expression like `at time zone` means the comparison no longer matches the partition bounds, or the index.)
However, if you change it to (with an appropriately adjusted right-hand side if necessary):
```
where ts > '2022-07-01 00:00'::timestamp
```
it will do partition pruning and use an index scan.

When I made that change, the query performance was equivalent regardless of the number of partitions in play.
I did a quick test, and this happens with a regular timestamp index on a regular table as well.
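
A quick way to verify which behavior you are getting (a sketch against the hypothetical daily table):
```
-- The prunable form should show only the matching partitions, using
-- index scans; the `at time zone` form shows a Seq Scan on every
-- partition.
EXPLAIN
SELECT * FROM metrics_daily
WHERE ts > '2022-07-01 00:00'::timestamp;
```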

The other problem I ran into, which I'm still building a test case for and fear might be a bug if I can reproduce it easily:
if I did the original select in a CTE and then sorted outside of the CTE, even though the CTE found 0 rows, the database
still spent a _ton_ of time sorting those 0 rows:
```
   ->  Sort  (cost=70.03..72.53 rows=1000 width=112) (actual time=84848.452..84848.453 rows=0 loops=1)
```
Once I can reproduce this on test data, I'll be able to pin down more closely what is happening and tell whether I'm just
reading the explain plan wrong or something is broken.  It was getting mixed up with the lack-of-pruning/index-usage problem.

I'll report back again next week.  Anyway, it is looking to me like it doesn't really matter (within reason), from a performance
perspective, how many partitions we use for our data set and query patterns.  We should be able to pick whatever is most
convenient from an archiving and data-management perspective instead.


Re: Postgresql 14 partitioning advice

From: Nathan Ward
Date:


On 30/07/2022, at 9:44 AM, Rick Otten <rottenwindfish@gmail.com> wrote:

> [... full quote of Rick's test-results message above snipped ...]

Hi Rick,

I am working with data with a similar structure.  The most recent data is accessed significantly more often than older data, so my next step will be to have very recent data in hourly tables, then daily, and probably monthly tables for what is effectively archived data.  My data is a little different in that it’s stored by the start of an “interval”, which means I can do an = comparison on the start of the hour (or day, once the data is aggregated).

I found a similar interesting partitioning performance issue recently when partitioning by timestamp: if your where clause for a timestamp includes math, the planner runs very slowly.  In my case, something like:
```
select * from my_table where ts = some_time_variable - interval '1 hour';
```
is *much* slower than something like:
```
-- compute the offset once, e.g. in the application or a plpgsql variable
offset_time_variable := some_time_variable - interval '1 hour';
select * from my_table where ts = offset_time_variable;
```

Everything is `timestamp with time zone`.
I believe it was calculating that offset for each partition - of which there are a couple hundred - and that was what caused the planner to run so slowly.  Pruning works correctly once the planner has run.

This is on Postgres 13 - I have yet to try 14 to see if this issue persists there.

Changing my main query to the above structure significantly improved performance - I was previously having lots of performance issues when aggregation tasks ran and dropped partitions etc.


--
Nathan Ward


Re: Postgresql 14 partitioning advice

From: Rick Otten
Date:

> ... even though the CTE found 0 rows, the database still spent a _ton_ of
> time sorting those 0 rows ...  [rest of quote snipped; see the previous message]


This behavior is definitely consistent.  Zero rows end up slower than when I find some rows in my CTE:
```
   ->  Sort  (cost=109.44..113.19 rows=1500 width=112) (actual time=87110.841..87110.842 rows=0 loops=1)
   ->  Sort  (cost=109.44..113.19 rows=1500 width=112) (actual time=25367.867..25367.930 rows=840 loops=1)
```
The only thing I changed in the query was the date range.  It is actually the CTE scan step inside the Sort block that is slower when no rows are returned than when rows are returned.  It also only happens when all the partitions are sequentially scanned instead of being partition pruned.

I'm still writing up a test case that can demo this without using proprietary data.

 

Re: Postgresql 14 partitioning advice

From: Rick Otten
Date:


On Mon, Aug 1, 2022 at 10:16 AM Rick Otten <rottenwindfish@gmail.com> wrote:

> [... quote snipped: the slow sort over 0 CTE rows was consistent, and only
> occurred when all partitions were sequentially scanned ...]

After a bunch of experiments I can explain this now.  :-)

I had a `limit` clause in my test CTE.  When sequentially scanning a bunch of partitions, if the limit is reached, the remaining partitions are marked `(never executed)` in the plan and not scanned.  On the other hand, when no rows are found, all of the partitions are scanned.

Therefore, with many millions of rows in the partitions, and being forced to sequentially scan because I put the `at time zone` clause in the `where`, the case where rows are found is always noticeably faster than the case where none are found, as long as at least one partition is still unscanned when the limit is hit.
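
A sketch of what that looks like in the plan (hypothetical, abridged EXPLAIN ANALYZE output):
```
EXPLAIN ANALYZE
SELECT * FROM metrics_daily
WHERE ts AT TIME ZONE 'UTC' > '2022-07-01 00:00'::timestamp
LIMIT 10;
-- ->  Seq Scan on metrics_daily_2022_07_27 ... (actual ... rows=10 ...)
-- ->  Seq Scan on metrics_daily_2022_07_28 ... (never executed)
```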

I'm now satisfied this is a good thing, and will move on to other problems.  Thanks for hearing me out.  I was scratching my head for a while over that one.

 

Re: Postgresql 14 partitioning advice

From: Slava Mudry
Date:
> I'm spinning up a new Postgresql 14 database where I'll have to store a couple years worth of time series data at the rate of single-digit millions of rows per day.
I think you absolutely need to use partitioning for the following reasons:
1. maintenance and roll-off of older data
2. indexes are much smaller
3. performance is predictable (if partition pruning kicks in)

Postgres 14 improved partitioning quite a bit.  I used it in Postgres 9, and there was a lot of locking on the partition hierarchy when you added or dropped partition tables.
Having thousands of partitions shouldn't be a problem, BUT you will incur a cost at query-planning time, which is usually under 0.1 second on modern hardware.
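
EXPLAIN reports that planning cost separately from execution, so it is easy to measure against your own partition counts (a sketch against the hypothetical table from earlier):
```
-- "Planning Time" grows with the number of partitions that have to be
-- considered; "Execution Time" is unaffected once pruning kicks in.
EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM metrics_daily
WHERE ts = now() - interval '1 hour';
```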


On Tue, Aug 2, 2022 at 5:55 AM Rick Otten <rottenwindfish@gmail.com> wrote:


> [... previous messages quoted in full; snipped ...]


--
-slava

Re: Postgresql 14 partitioning advice

From: Justin Pryzby
Date:
On Mon, Aug 08, 2022 at 03:45:11PM -0700, Slava Mudry wrote:
> Postgres 14 improved partitioning quite a bit. I used it in Postgres 9 and
> there was a lot of locking on partition hierarchy when you add/drop
> partition tables.

Note that postgres 9 didn't have native/declarative partitioning, and most
improvements in native partitioning don't apply to legacy/inheritance
partitioning.

https://www.postgresql.org/docs/devel/ddl-partitioning.html

"Native" partitioning added in v10 tends to require stronger locks for add/drop
than legacy partitioning, since partitions have associated bounds, which cannot
overlap.  The locking is improved in v12 with CREATE+ATTACH and v14 with
DETACH CONCURRENTLY+DROP.
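
For illustration, those lock-friendlier patterns look roughly like this (a sketch with hypothetical names):
```
-- v12+: build the new partition as a plain table; with a matching
-- CHECK constraint in place, ATTACH can skip the validation scan.
CREATE TABLE metrics_2022_09 (LIKE metrics INCLUDING ALL);
ALTER TABLE metrics_2022_09
    ADD CONSTRAINT bounds CHECK (ts >= '2022-09-01' AND ts < '2022-10-01');
ALTER TABLE metrics ATTACH PARTITION metrics_2022_09
    FOR VALUES FROM ('2022-09-01') TO ('2022-10-01');

-- v14+: detach without blocking concurrent queries, then drop.
ALTER TABLE metrics DETACH PARTITION metrics_2022_01 CONCURRENTLY;
DROP TABLE metrics_2022_01;
```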

-- 
Justin