Redesigning checkpoint_segments

From: Heikki Linnakangas
Subject: Redesigning checkpoint_segments
Date:
Msg-id: 51AF2C00.5020704@vmware.com
Responses: Re: Redesigning checkpoint_segments  (Fujii Masao <masao.fujii@gmail.com>)
List: pgsql-hackers
checkpoint_segments is awkward. From an admin's point of view, it 
controls two things:

1. it limits the amount of disk space needed for pg_xlog. (it's a soft 
limit, but still)
2. it limits the time required to recover after a crash.

For limiting the disk space needed for pg_xlog, checkpoint_segments is 
awkward because it's defined in terms of 16MB segments between 
checkpoints. It takes a fair amount of arithmetic to calculate the disk 
space required to hold the specified number of segments. The manual 
gives the formula: (2 + checkpoint_completion_target) * 
checkpoint_segments + 1, which amounts to about 1 GB per 20 segments as a
rule of thumb. We shouldn't impose that calculation on the user. It 
should be possible to just specify "checkpoint_segments=512MB", and the 
system would initiate checkpoints so that the total size of WAL in 
pg_xlog stays below 512MB.
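
To make that arithmetic concrete: with checkpoint_segments = 20 and
checkpoint_completion_target = 0.9 (just an example value), the formula
gives:

  (2 + 0.9) * 20 + 1 = 59 segments
  59 * 16 MB         = 944 MB, or roughly 1 GB

which lines up with the 1 GB per 20 segments rule of thumb.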

For limiting the time required to recover after a crash,
checkpoint_segments is awkward because it's difficult to calculate how 
long recovery will take, given checkpoint_segments=X. A bulk load can 
use up segments really fast, and recovery will be fast, while segments 
full of random deletions can need a lot of random I/O to replay, and 
take a long time. IMO checkpoint_timeout is a much better way to control 
that, although it's not perfect either.

A third point is that even if you have 10 GB of disk space reserved for 
WAL, you don't want to actually consume all that 10 GB, if it's not 
required to run the database smoothly. There are several reasons for
that: backups based on a filesystem-level snapshot are larger than
necessary if there are a lot of preallocated WAL segments; and in a
virtualized or shared system, there might be other VMs or applications
that could make use of the disk space. On the other hand, you don't want
to run out of disk space while writing WAL - that can lead to a PANIC in 
the worst case.


In VMware's vPostgres fork, we've hacked the way that works, so that 
there is a new setting, checkpoint_segments_max, that can be set by the
user, but checkpoint_segments is adjusted automatically, on the fly. The 
system counts how many segments were consumed during the last checkpoint 
cycle, and that becomes the checkpoint_segments setting for the next 
cycle. That means that in a system with a steady load, checkpoints are 
triggered by checkpoint_timeout, and the effective checkpoint_segments 
value converges to the exact number of segments needed for that. That's
simple but very effective. It doesn't behave too well with bursty load, 
however; during quiet times, checkpoint_segments is dialed way down, and 
when the next burst comes along, you get several checkpoints in quick 
succession, until checkpoint_segments is dialed back up again.
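
To illustrate the mechanism, here's a rough standalone sketch of that
feedback loop. The names and the clamping details are invented for
illustration; this is not the actual vPostgres code:

#include <stdio.h>

static int checkpoint_segments_max = 128;	/* user-settable ceiling */
static int checkpoint_segments = 3;		/* adjusted on the fly */

/* Run at the end of each checkpoint cycle. */
static void
adjust_checkpoint_segments(int segments_used_last_cycle)
{
	/*
	 * The next cycle is allowed what the last cycle actually consumed,
	 * clamped to [1, checkpoint_segments_max].
	 */
	checkpoint_segments = segments_used_last_cycle;
	if (checkpoint_segments > checkpoint_segments_max)
		checkpoint_segments = checkpoint_segments_max;
	if (checkpoint_segments < 1)
		checkpoint_segments = 1;
}

int
main(void)
{
	/*
	 * Steady load converges; a burst after a quiet spell shows the
	 * weakness: the first burst cycle runs with a dialed-down limit.
	 */
	int used[] = {40, 41, 40, 4, 4, 70};
	int i;

	for (i = 0; i < 6; i++)
	{
		adjust_checkpoint_segments(used[i]);
		printf("cycle %d: used %2d -> checkpoint_segments = %d\n",
			   i, used[i], checkpoint_segments);
	}
	return 0;
}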


I propose that we do something similar, but not exactly the same. Let's 
have a setting, max_wal_size, to control the maximum disk space reserved
for WAL. Once that's reached (or you get close enough, so that there are 
still some segments left to consume while the checkpoint runs), a 
checkpoint is triggered.
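
As a sketch of what the trigger could look like (max_wal_size is from
this proposal; everything else here is my assumption, not a settled
design): inverting the space formula quoted earlier suggests starting a
checkpoint once the WAL written since the last one began exceeds
max_wal_size / (2 + checkpoint_completion_target):

#include <stdbool.h>

static long max_wal_size = 512L * 1024 * 1024;		/* proposed GUC, in bytes */
static double checkpoint_completion_target = 0.5;	/* existing GUC */

static bool
checkpoint_needed(long wal_bytes_since_last_checkpoint)
{
	/*
	 * Keep total WAL under max_wal_size, leaving room for the WAL
	 * written while the checkpoint itself is in progress.
	 */
	long	trigger = (long) (max_wal_size /
							  (2.0 + checkpoint_completion_target));

	return wal_bytes_since_last_checkpoint >= trigger;
}

With max_wal_size = 512 MB and checkpoint_completion_target = 0.5, that
starts a checkpoint after about 205 MB (12-13 segments) of new WAL in a
cycle.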

In this proposal, the number of segments preallocated is controlled 
separately from max_wal_size, so that you can set max_wal_size high, 
without actually consuming that much space in normal operation. It's 
just a backstop, to avoid completely filling the disk, if there's a 
sudden burst of activity. The number of segments preallocated is 
auto-tuned, based on the number of segments used in previous checkpoint 
cycles.
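
As a strawman for the auto-tuning, a moving average over recent cycles
might be enough. The function name and the 0.75/0.25 weights below are
placeholders I made up; exactly this kind of formula is what I'm asking
about at the end:

static double avg_segments_per_cycle = 0.0;

/* Run once per checkpoint cycle; returns the new preallocation target. */
static int
update_prealloc_target(int segments_used_this_cycle)
{
	if (avg_segments_per_cycle == 0.0)
		avg_segments_per_cycle = segments_used_this_cycle;
	else
		avg_segments_per_cycle = 0.75 * avg_segments_per_cycle +
			0.25 * segments_used_this_cycle;

	/* Round to whole segments. */
	return (int) (avg_segments_per_cycle + 0.5);
}

The smoothing means a single quiet cycle doesn't immediately throw away
all the preallocated segments, which softens the bursty-load problem
described above, without eliminating it.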

I'll write up a patch to do that, but before I do, does anyone disagree 
on those tuning principles? How do you typically tune 
checkpoint_segments on your servers? If the system were to tune it
automatically, what formula should it use?

- Heikki


