Discussion: Raising the checkpoint_timeout limit
Hi, is there any reason for the rather arbitrary and low checkpoint_timeout limit? Obviously it's not appropriate to have a 10h timeout on all that many systems. But if the reaction to a server crashing is going to be a failover to one of several standbys anyway, setting the timeout to a high value can make sense. A high timeout has the advantage that the total amount of full page writes reduces and, especially if the whole system fits into s_b, that the total amount of writes to disk is drastically reduced. I'm not sure what'd actually be a good upper limit. I'd be inclined to even go to as high as a week or so. A lot of our settings have upper/lower limits that aren't a good idea in general. I'm also wondering if it'd not make sense to raise the default timeout to 15min or so. The upper ceiling for that really is recovery time, and that has really shrunk rather drastically due to faster cpus and architectural improvements in postgres (bgwriter, separate checkpointer/bgwriter, restartpoints, ...). Greetings, Andres Freund
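For concreteness, the failover-oriented setup sketched above might look something like this in postgresql.conf. The values are illustrative assumptions only, and the 10h timeout is exactly the kind of setting today's 1h ceiling rejects:

```
# Hypothetical HA primary: crashes are handled by failing over to a
# standby, so local recovery time matters little (9.5-era parameters).
shared_buffers = 64GB              # working set fits entirely in s_b
checkpoint_timeout = 10h           # currently rejected: upper limit is 1h
checkpoint_completion_target = 0.9 # spread checkpoint writes out
max_wal_size = 100GB               # let WAL volume drive extra checkpoints
```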
On 2/1/16 6:13 PM, Andres Freund wrote: > I'm not sure what'd actually be a good upper limit. I'd be inclined to > even go to as high as a week or so. A lot of our settings have > upper/lower limits that aren't a good idea in general. The only reason I can see for the 1 hour limit is to try to prevent footguns. I think that's a valid goal, but there should be a way to override it. And if we don't want that kind of protection then I'd say just yank the upper limit. > I'm also wondering if it'd not make sense to raise the default timeout > to 15min or so. The upper ceiling for that really is recovery time, and > that has really shrunk rather drastically due to faster cpus and > architectural improvements in postgres (bgwriter, separate > checkpointer/bgwriter, restartpoints, ...). It would be interesting if someone had a large-ish 9.4 or 9.5 install that they could test recovery timing on. My suspicion is that as long as FPWs are on that you'd generally end up limited by how fast you could read WAL unless you exceeded the FS cache. (I'm assuming a BBU and that the FS and controller will do a nice job of ordering writes optimally so that you'll get performance similar to reads when it's time to fsync.) -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
On 01-02-2016 21:13, Andres Freund wrote: > is there any reason for the rather arbitrary and low checkpoint_timeout > limit? > AFAICS the only reason is to run recovery quickly. This setting has had the same value since day 1. > A high timeout has the advantage that the total amount of full page > writes reduces and, especially if the whole system fits into s_b, that > the total amount of writes to disk is drastically reduced. > This statement could be added to the documentation. Given this use case, I want to propose raising the c_t upper limit to one day or even a week. > I'm not sure what'd actually be a good upper limit. I'd be inclined to > even go to as high as a week or so. A lot of our settings have > upper/lower limits that aren't a good idea in general. > A week is an insane default value. However, I'm fine with 10 to 20 minutes (those are the most common values I use for c_t). -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
On Tue, Feb 02, 2016 at 01:13:20AM +0100, Andres Freund wrote: > is there any reason for the rather arbitrary and low checkpoint_timeout > limit? Not that I know, and it is inconvenient. > I'm not sure what'd actually be a good upper limit. I'd be inclined to > even go to as high as a week or so. A lot of our settings have > upper/lower limits that aren't a good idea in general. In general, I favor having limits reflect fundamental system limitations rather than paternalism. Therefore, I would allow INT_MAX (68 years). > I'm also wondering if it'd not make sense to raise the default timeout > to 15min or so. The upper ceiling for that really is recovery time, and > that has really shrunk rather drastically due to faster cpus and > architectural improvements in postgres (bgwriter, separate > checkpointer/bgwriter, restartpoints, ...). Have those recovery improvements outpaced the increases in max recovery time from higher core counts generating more WAL per minute? nm
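Noah's parenthetical ("68 years") follows from simple arithmetic; this throwaway sketch just checks that a 32-bit signed count of seconds tops out at roughly that:

```python
INT_MAX = 2**31 - 1                      # largest 32-bit signed integer
SECONDS_PER_YEAR = 365.2425 * 24 * 3600  # mean Gregorian year

years = INT_MAX / SECONDS_PER_YEAR
print(round(years, 2))  # roughly 68, as stated above
```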
On Tue, Feb 2, 2016 at 1:16 PM, Noah Misch <noah@leadboat.com> wrote: > On Tue, Feb 02, 2016 at 01:13:20AM +0100, Andres Freund wrote: >> is there any reason for the rather arbitrary and low checkpoint_timeout >> limit? > > Not that I know, and it is inconvenient. > >> I'm not sure what'd actually be a good upper limit. I'd be inclined to >> even go to as high as a week or so. A lot of our settings have >> upper/lower limits that aren't a good idea in general. > > In general, I favor having limits reflect fundamental system limitations > rather than paternalism. Therefore, I would allow INT_MAX (68 years). +1. This way users can play as they wish. >> I'm also wondering if it'd not make sense to raise the default timeout >> to 15min or so. The upper ceiling for that really is recovery time, and > that has really shrunk rather drastically due to faster cpus and >> architectural improvements in postgres (bgwriter, separate >> checkpointer/bgwriter, restartpoints, ...). > > Have those recovery improvements outpaced the increases in max recovery time > from higher core counts generating more WAL per minute? Perhaps having some numbers showing that the architecture improvements of Postgres really matter at constant checkpoint_segments for a pgbench load would help in settling on a better default value. I would tend to agree that as things speed up it would make sense to increase this value a bit though, even if Postgres is usually conservative enough in default parameters aimed at low-spec machines. -- Michael
On 2 February 2016 at 05:54, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Feb 2, 2016 at 1:16 PM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Feb 02, 2016 at 01:13:20AM +0100, Andres Freund wrote:
>> is there any reason for the rather arbitrary and low checkpoint_timeout
>> limit?
>
> Not that I know, and it is inconvenient.
>
>> I'm not sure what'd actually be a good upper limit. I'd be inclined to
>> even go to as high as a week or so. A lot of our settings have
>> upper/lower limits that aren't a good idea in general.
>
> In general, I favor having limits reflect fundamental system limitations
> rather than paternalism. Therefore, I would allow INT_MAX (68 years).
+1. This way users can play as they wish.
If people wish to turn off crash recovery, they can already set fsync=off. I don't wish to see us support a setting that causes problems for people that don't understand what checkpoints are and why everybody needs them.
The current code needs to act differently with regard to very low settings, so that when only a small number of blocks remain we don't spend hours performing them. Allowing very large values would make that even more strange.
I would put a limit of 100,000 seconds (about 28 hours).
Some systems offer a recovery_time_objective setting that is used to control how frequently checkpoints occur. That might be a more usable approach.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-02-01 23:16:16 -0500, Noah Misch wrote: > On Tue, Feb 02, 2016 at 01:13:20AM +0100, Andres Freund wrote: > > I'm not sure what'd actually be a good upper limit. I'd be inclined to > > even go to as high as a week or so. A lot of our settings have > > upper/lower limits that aren't a good idea in general. > > In general, I favor having limits reflect fundamental system limitations > rather than paternalism. Therefore, I would allow INT_MAX (68 years). I generally agree with that attitude - I'm disinclined to go just that high though. Going close to INT_MAX means having to care about overflow in trivial computations, in a scenario we're unlikely to ever test. Sure, we can use a debugger to adjust time or accelerate time progress, but that's all unrealistic if we're honest. So maybe go with a year? > > I'm also wondering if it'd not make sense to raise the default timeout > > to 15min or so. The upper ceiling for that really is recovery time, and > > that has really shrunk rather drastically due to faster cpus and > > architectural improvements in postgres (bgwriter, separate > > checkpointer/bgwriter, restartpoints, ...). > > Have those recovery improvements outpaced the increases in max recovery time > from higher core counts generating more WAL per minute? Mostly yes, imo. I think the biggest problem with max recovery time is in workloads that are considerably bigger than shared buffers: There the single-threaded, synchronously reading startup process (without the benefit of FPWs filling up pages) has to compete with a lot of processes having higher IO throughput, because of multiple processes, at the same time. But even that has considerably improved due to SSDs. Andres
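To illustrate the overflow hazard mentioned above, here is a hypothetical sketch that simulates C's 32-bit signed arithmetic in Python: converting a near-INT_MAX timeout in seconds to milliseconds (the kind of trivial computation a timer facility might perform) silently wraps to a negative value.

```python
INT_MAX = 2**31 - 1

def int32(n):
    """Wrap n the way 32-bit signed two's-complement arithmetic would."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

timeout_s = INT_MAX                   # a timeout at the proposed ceiling
timeout_ms = int32(timeout_s * 1000)  # seconds -> milliseconds, in 32-bit math
print(timeout_ms)                     # wraps around to a negative number
```

A one-year timeout in seconds (31,536,000) sits far from the edge, so intermediate computations on it are much less likely to overflow untested code paths.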
On 2016-02-02 11:37:15 +0100, Simon Riggs wrote: > If people wish to turn off crash recovery, they can already set fsync=off. > I don't wish to see us support a setting that causes problems for people > that don't understand what checkpoints are and why everybody needs them. I don't think fsync=off and very long checkpoints are really comparable. Many large modern machines, especially with directly attached storage and/or large amounts of memory, take a *long* while to boot. So any outage will be dealt with by a failover anyway. But at the same time, a database in the 10TB+ range can't easily be copied again. Thus running with fsync=off isn't something that you'd want in those scenarios - it'd prevent the previous master/other standbys from failing back/catching up; the databases could be arbitrarily corrupted after all. Additionally, a significant portion of the cost of checkpoints is full page writes - you can easily get into the situation where you have ~20MB/sec normal WAL without FPWs, but with them 300MB/s. That rate is rather expensive, regardless of fsync=off. > The current code needs to act differently with regard to very low settings, > so that when only a small number of blocks remain we don't spend hours > performing them. Allowing very large values would make that even more > strange. Why is that a good thing? Every checkpoint triggers a new round of full page writes. I don't see why you'd want to accelerate a checkpoint just because there are few writes remaining? Yes, the current code partially behaves that way, but that's imo more an implementation artifact or even a bug. > Some systems offer a recovery_time_objective setting that is used to > control how frequently checkpoints occur. That might be a more usable > approach. While desirable, I have no idea how to realistically calculate that :(. It's also a much bigger change than just adjusting a pointlessly low GUC limit. Regards, Andres
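The 20MB/s-versus-300MB/s contrast above follows from simple arithmetic: each checkpoint forces a fresh full-page image (~8kB by default) for every page dirtied afterwards, so checkpointing less often amortizes those images over more time. A rough back-of-envelope model, with entirely made-up workload numbers (the hot-page count and record rate are assumptions, not measurements):

```python
PAGE_BYTES = 8192  # default PostgreSQL block size

def wal_mb_per_hour(record_mb_per_hour, hot_pages, checkpoints_per_hour):
    """Ordinary WAL record volume plus one full-page image per distinct
    page re-dirtied after each checkpoint (FPW compression ignored)."""
    fpw_mb = hot_pages * PAGE_BYTES / 1024**2 * checkpoints_per_hour
    return record_mb_per_hour + fpw_mb

hot_pages = 1_000_000  # hypothetical: ~8GB of pages re-dirtied every cycle
print(wal_mb_per_hour(500, hot_pages, 12))  # 5-minute checkpoints: 94250.0
print(wal_mb_per_hour(500, hot_pages, 1))   # 1-hour checkpoints:    8312.5
```

Under these (invented) numbers, hourly checkpoints produce roughly an order of magnitude less WAL than 5-minute checkpoints, which is the effect driving the request for a higher checkpoint_timeout ceiling.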
On Tue, Feb 02, 2016 at 12:24:50PM +0100, Andres Freund wrote: > On 2016-02-01 23:16:16 -0500, Noah Misch wrote: > > On Tue, Feb 02, 2016 at 01:13:20AM +0100, Andres Freund wrote: > > > I'm not sure what'd actually be a good upper limit. I'd be inclined to > > > even go to as high as a week or so. A lot of our settings have > > > upper/lower limits that aren't a good idea in general. > > > > In general, I favor having limits reflect fundamental system limitations > > rather than paternalism. Therefore, I would allow INT_MAX (68 years). > > I generally agree with that attitude - I'm disinclined to go just that > high though. Going close to INT_MAX means having to care about overflow > in trivial computations, in a scenario we're unlikely to ever > test. Sure, we can use a debugger to adjust time or accelerate time > progress, but that's all unrealistic if we're honest. So maybe go with > a year? Okay.
On Tue, Feb 2, 2016 at 8:09 PM, Noah Misch <noah@leadboat.com> wrote: > On Tue, Feb 02, 2016 at 12:24:50PM +0100, Andres Freund wrote: >> On 2016-02-01 23:16:16 -0500, Noah Misch wrote: >> > On Tue, Feb 02, 2016 at 01:13:20AM +0100, Andres Freund wrote: >> > > I'm not sure what'd actually be a good upper limit. I'd be inclined to >> > > even go to as high as a week or so. A lot of our settings have >> > > upper/lower limits that aren't a good idea in general. >> > >> > In general, I favor having limits reflect fundamental system limitations >> > rather than paternalism. Therefore, I would allow INT_MAX (68 years). >> >> I generally agree with that attitude - I'm disinclined to go just that >> high though. Going close to INT_MAX means having to care about overflow >> in trivial computations, in a scenario we're unlikely to ever >> test. Sure, we can use a debugger to adjust time or accelerate time >> progress, but that's all unrealistic if we're honest. So maybe go with >> a year? > > Okay. Sounds good to me, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Noah Misch <noah@leadboat.com> writes: > On Tue, Feb 02, 2016 at 12:24:50PM +0100, Andres Freund wrote: >> On 2016-02-01 23:16:16 -0500, Noah Misch wrote: >>> In general, I favor having limits reflect fundamental system limitations >>> rather than paternalism. Therefore, I would allow INT_MAX (68 years). >> I generally agree with that attitude - I'm disinclined to go just that >> high though. Going close to INT_MAX means having to care about overflow >> in trivial computations, in a scenario we're unlikely to ever >> test. Sure, we can use a debugger to adjust time or accelerate time >> progress, but that's all unrealistic if we're honest. So maybe go with >> a year? > Okay. I've gotta go with the "paternalism" side of the argument here. Suppose you configure your system to checkpoint once a year --- what is going to happen when the year is up? Or when you try to shut it down? You *will* regret such a setting. I don't think we should allow the checkpoint distances to be so large that checkpoints don't happen in the normal course of events. I'd be okay with the max being a day, perhaps. regards, tom lane
On Tue, Feb 2, 2016 at 10:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I've gotta go with the "paternalism" side of the argument here. Suppose > you configure your system to checkpoint once a year --- what is going to > happen when the year is up? Or when you try to shut it down? You *will* > regret such a setting. > > I don't think we should allow the checkpoint distances to be so large that > checkpoints don't happen in the normal course of events. I'd be okay with > the max being a day, perhaps. If smart people[1] want to set checkpoint_timeout to a value higher than 1 day[2], then I think we should let them. I think what will happen if you set checkpoint_timeout to 1 year is that you will checkpoint solely based on WAL volume, which does not seem like a manifestly unreasonable thing to want. It's true that if you set BOTH max_wal_size AND checkpoint_timeout to $WAYTOOBIG then something bad might happen to you, but even such configurations are actually not totally crazy: for example, you could ingest data into a temporary PostgreSQL instance and then do logical replication from there to another cluster for permanent storage. You don't really need recovery or shutdown to happen in the lifetime of the cluster, so no harm, no foul. Now, you could also set such configuration settings in a situation where it will not work out well. But that is true of most configuration settings. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] like Andres [2] see original post
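Whether time or WAL volume is actually driving checkpoints on a given system, as discussed above, is observable: the statistics collector counts the two triggers separately. A quick look (column names as in the 9.x pg_stat_bgwriter view):

```sql
-- checkpoints_timed: started because checkpoint_timeout elapsed
-- checkpoints_req:   requested for other reasons, chiefly WAL volume
SELECT checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint
  FROM pg_stat_bgwriter;
```

With checkpoint_timeout set very high, one would expect checkpoints_req to dominate, i.e. checkpointing "solely based on WAL volume" as described above.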
On 2/2/16 11:10 PM, Robert Haas wrote: > On Tue, Feb 2, 2016 at 10:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I've gotta go with the "paternalism" side of the argument here. Suppose >> you configure your system to checkpoint once a year --- what is going to >> happen when the year is up? Or when you try to shut it down? You *will* >> regret such a setting. >> >> I don't think we should allow the checkpoint distances to be so large that >> checkpoints don't happen in the normal course of events. I'd be okay with >> the max being a day, perhaps. > > If smart people[1] want to set checkpoint_timeout to a value higher > than 1 day[2], then I think we should let them. Agreed - I have a specific instance where I would prefer the daily backups or checkpoint segments to be the primary source of checkpoints, with checkpoint_timeout set to 36 hours and used only as a fallback. A limit of 1 day would be lower than I like, though still better than what I have now, 1 hour. For this case I would probably configure: checkpoint_segments = 256 checkpoint_timeout = 36h So the daily backups would generally trigger the checkpoint unless there was unusually high volume, in which case it would be checkpoint segments. Finally, if the backups weren't working and volume was normal, then checkpoint_timeout would come along to save the day. In this case it's all about reducing full-page writes on a medium-sized system with heavy churn in order to make a long backup/archive retention more cost-effective. Recovery time is not as much of an issue for this application. -- -David david@pgmasters.net
On 2/2/16 10:10 PM, Robert Haas wrote: > Now, you could also set such configuration settings in > a situation where it will not work out well. But that is true of most > configuration settings. Yeah, if we're going to start playing parent then I think the first thing to do is remove the fsync GUC. The AWS team has done testing that shows it to be worthless from a performance standpoint now that we have synchronous commit, and it's an extremely large foot-bazooka to have lying around. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2016-02-03 15:07:12 -0600, Jim Nasby wrote: > On 2/2/16 10:10 PM, Robert Haas wrote: > >Now, you could also set such configuration settings in > >a situation where it will not work out well. But that is true of most > >configuration settings. By that argument we should probably raise the lower limit of a bunch of parameters :P > Yeah, if we're going to start playing parent then I think the first thing to > do is remove the fsync GUC. The AWS team has done testing that shows it to > be worthless from a performance standpoint now that we have synchronous > commit, and it's an extremely large foot-bazooka to have lying around. Meh, I don't buy that. There are workloads where that's the case, but also ones where it's not true. Try e.g. 2PC. And yes, there are definitely cases where 2PC makes sense, even if you don't need durability on a local basis. Andres
On Mon, Feb 1, 2016 at 8:16 PM, Noah Misch <noah@leadboat.com> wrote: >> I'm not sure what'd actually be a good upper limit. I'd be inclined to >> even go to as high as a week or so. A lot of our settings have >> upper/lower limits that aren't a good idea in general. > > In general, I favor having limits reflect fundamental system limitations > rather than paternalism. Therefore, I would allow INT_MAX (68 years). I agree. I'm in favor of having things be what is sometimes called foolproof, but I think that you can only take that so far, and it's mostly a matter of guiding a more or less reasonable user in the right direction. Making it easy to do the right thing and hard to do the wrong thing. I don't think you can effectively design anything around a user that makes perversely bad decisions at every turn. If you investigate why a user made a bad decision, there will usually be a chain of faulty but not outrageous reasoning behind it. -- Peter Geoghegan