Обсуждение: Dealing with SeqScans when Time-based Partitions Cut Over

Поиск
Список
Период
Сортировка

Dealing with SeqScans when Time-based Partitions Cut Over

От
Matthew Planchard
Дата:
In a table with high insert frequency (~1.5k rows/s) and high query
frequency (~1k queries/s), partitioned by record creation time, we have
observed the following behavior:

* When the current time crosses a partition boundary, all new records
  are written to the new partition, which was previously empty, as
  expected

* Because the planner's latest knowledge of the partition was based on
  its state prior to the cutover, it assumes the partition is empty and
  creates plans that use sequential scans

* The table accumulates tens to hundreds of thousands of rows, and the
  sequentail scans start to use nearly 100% of available database CPU

* Eventually the planner updates thee stats and all is well, but the
  cycle repeats the next time the partitions cut over.

We have tried setting up a cron job that runs ANALYZE on the most recent
partition of the table every 15 seconds at the start of the hour, and
while this does help in reducing the magnitude and duration of the
problem, it is insufficient to fully resolve it (our engineers are still
getting daily pages for high DB CPU utilization).

We have considered maintaining a separate connection pool with
connections that have `enable_seqscan` set to `off`, and updating the
application to use that pool for these queries, but I was hoping the
community might have some better suggestions.

- Matthew Planchard



Re: Dealing with SeqScans when Time-based Partitions Cut Over

От
Ron Johnson
Дата:
On Thu, Dec 18, 2025 at 1:48 PM Matthew Planchard <msplanchard@gmail.com> wrote:

In a table with high insert frequency (~1.5k rows/s) and high query
frequency (~1k queries/s), partitioned by record creation time, we have
observed the following behavior:

* When the current time crosses a partition boundary, all new records
  are written to the new partition, which was previously empty, as
  expected

* Because the planner's latest knowledge of the partition was based on
  its state prior to the cutover, it assumes the partition is empty and
  creates plans that use sequential scans

* The table accumulates tens to hundreds of thousands of rows, and the
  sequentail scans start to use nearly 100% of available database CPU

* Eventually the planner updates thee stats and all is well, but the
  cycle repeats the next time the partitions cut over.

We have tried setting up a cron job that runs ANALYZE on the most recent
partition of the table every 15 seconds at the start of the hour, and
while this does help in reducing the magnitude and duration of the
problem, it is insufficient to fully resolve it (our engineers are still
getting daily pages for high DB CPU utilization).

What's autovacuum_analyze_scale_factor set to?   The default 20% is pretty high.
autovacuum_naptime might need to be dropped, too.

And maybe have the shell script that the cron job runs sleep only 5 seconds in the ANALY loop.
 
We have considered maintaining a separate connection pool with
connections that have `enable_seqscan` set to `off`, and updating the
application to use that pool for these queries, but I was hoping the
community might have some better suggestions.

How about just force seqscan off when the table is created?
ALTER TABLE <table_partition> SET (enable_seqscan  = off);

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

Re: Dealing with SeqScans when Time-based Partitions Cut Over

От
Matthew Planchard
Дата:
Thanks very much for the reply and the suggestions, a lot of good
stuff to look at.

> What's autovacuum_analyze_scale_factor set to?   The default 20% is pretty high.
> autovacuum_naptime might need to be dropped, too.

This is just whatever the default is, so I will look into updating
these settings.

> And maybe have the shell script that the cron job runs sleep only 5 seconds in the ANALY loop.

Good suggestion. Our job runner will not run duplicate jobs if one is
still running, so I may even just have this run every second.

> How about just force seqscan off when the table is created?
> ALTER TABLE <table_partition> SET (enable_seqscan  = off);

I didn't know this could be set on the table level! Our partitions are
created by partman, but we have a nightly job that does maintenance
stuff that could go through and update this setting on future partitions
before they come into use. I'm always hesitant to mess
too much with the planner, but this might be the ticket.

On Thu, Dec 18, 2025 at 2:32 PM Ron Johnson <ronljohnsonjr@gmail.com> wrote:
>
> On Thu, Dec 18, 2025 at 1:48 PM Matthew Planchard <msplanchard@gmail.com> wrote:
>>
>>
>> In a table with high insert frequency (~1.5k rows/s) and high query
>> frequency (~1k queries/s), partitioned by record creation time, we have
>> observed the following behavior:
>>
>> * When the current time crosses a partition boundary, all new records
>>   are written to the new partition, which was previously empty, as
>>   expected
>>
>> * Because the planner's latest knowledge of the partition was based on
>>   its state prior to the cutover, it assumes the partition is empty and
>>   creates plans that use sequential scans
>>
>> * The table accumulates tens to hundreds of thousands of rows, and the
>>   sequentail scans start to use nearly 100% of available database CPU
>>
>> * Eventually the planner updates thee stats and all is well, but the
>>   cycle repeats the next time the partitions cut over.
>>
>> We have tried setting up a cron job that runs ANALYZE on the most recent
>> partition of the table every 15 seconds at the start of the hour, and
>> while this does help in reducing the magnitude and duration of the
>> problem, it is insufficient to fully resolve it (our engineers are still
>> getting daily pages for high DB CPU utilization).
>
>
> What's autovacuum_analyze_scale_factor set to?   The default 20% is pretty high.
> autovacuum_naptime might need to be dropped, too.
>
> And maybe have the shell script that the cron job runs sleep only 5 seconds in the ANALY loop.
>
>>
>> We have considered maintaining a separate connection pool with
>> connections that have `enable_seqscan` set to `off`, and updating the
>> application to use that pool for these queries, but I was hoping the
>> community might have some better suggestions.
>
>
> How about just force seqscan off when the table is created?
> ALTER TABLE <table_partition> SET (enable_seqscan  = off);
>
> --
> Death to <Redacted>, and butter sauce.
> Don't boil me, I'm still alive.
> <Redacted> lobster!



Re: Dealing with SeqScans when Time-based Partitions Cut Over

От
David Rowley
Дата:
On Fri, 19 Dec 2025 at 07:48, Matthew Planchard <msplanchard@gmail.com> wrote:
> * Because the planner's latest knowledge of the partition was based on
>   its state prior to the cutover, it assumes the partition is empty and
>   creates plans that use sequential scans

If the table/partition was just created and not analyzed, the planner
shouldn't consider it empty. By default, an empty table is assumed to
have 10 pages. (Check what pg_class.relpages is set to. -1 means it's
not been analyzed, 0 would mean analyzed/vacuumed when empty)

create table a (a int primary key, b int not null);
create index on a(b);
explain select * from a where a = 1;

 Index Scan using a_pkey on a  (cost=0.15..8.17 rows=1 width=8)
   Index Cond: (a = 1)

explain select * from a where b = 1;

 Bitmap Heap Scan on a  (cost=4.24..14.91 rows=11 width=8)
   Recheck Cond: (b = 1)
   ->  Bitmap Index Scan on a_b_idx  (cost=0.00..4.24 rows=11 width=0)
         Index Cond: (b = 1)

Perhaps pg_partman thinks running an ANALYZE or VACUUM on the new
partitions is a good idea? Otherwise, if you're not doing this
yourself, I don't know why you're getting a Seq Scan. It might help if
you showed the problem query and schema.

David



Re: Dealing with SeqScans when Time-based Partitions Cut Over

От
David Rowley
Дата:
On Fri, 19 Dec 2025 at 08:55, Matthew Planchard <msplanchard@gmail.com> wrote:
> > How about just force seqscan off when the table is created?
> > ALTER TABLE <table_partition> SET (enable_seqscan  = off);
>
> I didn't know this could be set on the table level! Our partitions are
> created by partman, but we have a nightly job that does maintenance
> stuff that could go through and update this setting on future partitions
> before they come into use. I'm always hesitant to mess
> too much with the planner, but this might be the ticket.

You thought correctly. There's no such reloption.

David



Re: Dealing with SeqScans when Time-based Partitions Cut Over

От
Ron Johnson
Дата:
On Thu, Dec 18, 2025 at 3:21 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 19 Dec 2025 at 08:55, Matthew Planchard <msplanchard@gmail.com> wrote:
> > How about just force seqscan off when the table is created?
> > ALTER TABLE <table_partition> SET (enable_seqscan  = off);
>
> I didn't know this could be set on the table level! Our partitions are
> created by partman, but we have a nightly job that does maintenance
> stuff that could go through and update this setting on future partitions
> before they come into use. I'm always hesitant to mess
> too much with the planner, but this might be the ticket.

You thought correctly. There's no such reloption.

Ah well.  I should have known that would be too easy.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

Re: Dealing with SeqScans when Time-based Partitions Cut Over

От
Laurenz Albe
Дата:
On Thu, 2025-12-18 at 13:48 -0500, Matthew Planchard wrote:
> In a table with high insert frequency (~1.5k rows/s) and high query
> frequency (~1k queries/s), partitioned by record creation time, we have
> observed the following behavior:
>
> * When the current time crosses a partition boundary, all new records
>   are written to the new partition, which was previously empty, as
>   expected
>
> * Because the planner's latest knowledge of the partition was based on
>   its state prior to the cutover, it assumes the partition is empty and
>   creates plans that use sequential scans
>
> * The table accumulates tens to hundreds of thousands of rows, and the
>   sequentail scans start to use nearly 100% of available database CPU
>
> * Eventually the planner updates thee stats and all is well, but the
>   cycle repeats the next time the partitions cut over.
>
> We have tried setting up a cron job that runs ANALYZE on the most recent
> partition of the table every 15 seconds at the start of the hour, and
> while this does help in reducing the magnitude and duration of the
> problem, it is insufficient to fully resolve it (our engineers are still
> getting daily pages for high DB CPU utilization).
>
> We have considered maintaining a separate connection pool with
> connections that have `enable_seqscan` set to `off`, and updating the
> application to use that pool for these queries, but I was hoping the
> community might have some better suggestions.

I would try to tune autovacuum to check more often:

  autovacuum_naptime = 5s  # perhaps even less

Then hopefully the new partitions get analyzed early enough.

Yours,
Laurenz Albe



Re: Dealing with SeqScans when Time-based Partitions Cut Over

От
Greg Sabino Mullane
Дата:
If this is still an issue after you check David's theory about premature analyzing, another approach is to pre-populate and pre-analyze future tables. Something like this:

* disable autovac on the future table
* detach the table from the main partition
* insert a few hundred thousand rows into it, then run analyze on it
* can pull rows from a current table, or just use random data on a key column- whatever is enough to generate "good" stats
* delete the rows - the stats will remain
* reattach the table
* enable autovac if you like; I would not


Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support

Re: Dealing with SeqScans when Time-based Partitions Cut Over

От
Laurenz Albe
Дата:
On Fri, 2025-12-19 at 09:49 -0500, Greg Sabino Mullane wrote:
> If this is still an issue after you check David's theory about premature analyzing,
> another approach is to pre-populate and pre-analyze future tables. Something like this:
>
> * disable autovac on the future table
> * detach the table from the main partition
> * insert a few hundred thousand rows into it, then run analyze on it
> * can pull rows from a current table, or just use random data on a key column- whatever is enough to generate "good"
stats
> * delete the rows - the stats will remain
> * reattach the table
> * enable autovac if you like; I would not

I doubt that that is good advice.  For one, wrong statistics are not necessarily
better than no statistics.  Disabling autovacuum is dangerous - and re-enabling
it would trigger another autovacuum, which would undo your efforts.
*Not* re-enabling autovacuum is not an option, unless you schedule explicit
VACUUM runs on the partition.

Yours,
Laurenz Albe



Re: Dealing with SeqScans when Time-based Partitions Cut Over

От
Greg Sabino Mullane
Дата:
On Fri, Dec 19, 2025 at 12:42 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
I doubt that that is good advice.  For one, wrong statistics are not necessarily better than no statistics.

But clearly in this case, "wrong" statistics are better than none. Frankly, any plan is better than the one they are getting (full seq scans). Also, with a time-based highly active partition, we can be very sure that the stats for one partition are going to be similar to another, so doing some "pre-estimation" seems fine with me. At the end of the day, this is a query planner issue, and the goal is to prevent that. The same data will be returned.
 
  Disabling autovacuum is dangerous - and re-enabling it would trigger another autovacuum, which would undo your efforts.

No, re-enabling it will put it back in the pool of tables that may get vacuumed/analyzed once thresholds are reached. It will not immediately trigger another autovac.
 
*Not* re-enabling autovacuum is not an option, unless you schedule explicit VACUUM runs on the partition.

*shrug* Well, for an hourly partition, if the stats you set at the top of the hour are the same as the stats when you analyze at the end of the hour, and give the same plan, analyzing doesn't really matter. If this were a normal table, I might advocate cron-based or insert-based vacuum runs for all the other benefits vacuum provides, but with time-partitioning, the app usually only cares about very recent partitions, and there is little-to-no updating going on.

Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support