Thread: PGSQL, checkpoints, and file system syncs

PGSQL, checkpoints, and file system syncs

From:
Reza Taheri
Date:

Hello PGSQL performance community,

You might remember that I pinged you in July 2012 to introduce the TPC-V benchmark. I am now back with more data, and a question about checkpoints. As for the plans for the benchmark, we are hoping to release a benchmarking kit for multi-VM servers this year (and of course one can always simply configure it to run on one database).


I am now dealing with a situation of performance dips when checkpoints complete. To simplify the discussion, I have reproduced the problem on a single VM/single database.


Complete config info is in the attached files. Briefly, it is a 6-vCPU VM with 91GB of memory and 70GB of PGSQL shared buffers. The host has 512GB of memory and 4 sockets of Westmere (E7-4870) processors with HT enabled.


The data tablespace is on an ext4 file system on a (virtual) disk that is striped across 16 SSD drives in RAID 0. This is obviously overkill for the load we are putting on this one VM, but in the usual benchmarking config, the 16 SSDs are shared by 24 VMs. The log is on an ext3 file system on 4 spinning drives in RAID 1.


We are running PGSQL version 9.2 on RHEL 6.4; here are some parameters of interest (postgresql.conf is in the attachment):

checkpoint_segments = 1200

checkpoint_timeout = 360s

checkpoint_completion_target = 0.8

wal_sync_method = open_datasync

wal_buffers = 16MB

wal_writer_delay = 10ms

effective_io_concurrency = 10

effective_cache_size = 1024MB


When running tests, I noticed that when a checkpoint completes, we have a big burst of writes to the data disk. The log disk has a very steady write rate that is not affected by checkpoints except for the known phenomenon of more bytes in each log write when a new checkpoint period starts. In a multi-VM config with all VMs sharing the same data disks, when these write bursts happen, all VMs take a hit.


So I set out to see what causes this write burst.  After playing around with PGSQL parameters and observing its behavior, it appears that the bursts aren’t produced by the database engine; they are produced by the file system. I suspect PGSQL has to issue a sync(2)/fsync(2)/sync_file_range(2) system call at the completion of the checkpoint to ensure that all blocks are flushed to disk before creating a checkpoint marker. To test this, I ran a loop to call sync(8) once a second.
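A minimal sketch of such a loop (any equivalent one-liner that syncs roughly once a second will do):

    # flush accumulated dirty pages to disk roughly once a second
    while true; do sync; sleep 1; done &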


The graphs in file “run280.mht” have the throughput, data disk activity, and checkpoint start/completion timestamps for the baseline case. You can see that the checkpoint completion, the write burst, and the throughput dip all occur at the same time, so much so that it is hard to see the checkpoint completion line under the graph of writes. It looks like the file system does a mini flush every 30 seconds. The file “run274.mht” is the case with sync commands running in the background. You can see that everything is much smoother.


Is there something I can set in the PGSQL parameters or in the file system parameters to force a steady flow of writes to disk rather than waiting for a sync system call? Mounting with “commit=1” did not make a difference.


Thanks,

Reza

Attachments

Re: PGSQL, checkpoints, and file system syncs

From:
Heikki Linnakangas
Date:
On 04/03/2014 08:39 PM, Reza Taheri wrote:
> Hello PGSQL performance community,
> You might remember that I pinged you in July 2012 to introduce the TPC-V
> benchmark. I am now back with more data, and a question about checkpoints.
> As for the plans for the benchmark, we are hoping to release a benchmarking
> kit for multi-VM servers this year (and of course one can always simply
> configure it to run on one database).
>
> I am now dealing with a situation of performance dips when checkpoints
> complete. To simplify the discussion, I have reproduced the problem on a
> single VM/single database.
>
> Complete config info is in the attached files. Briefly, it is a 6-vCPU VM
> with 91GB of memory and 70GB of PGSQL shared buffers. The host has 512GB
> of memory and 4 sockets of Westmere (E7-4870) processors with HT enabled.
>
> The data tablespace is on an ext4 file system on a (virtual) disk that is
> striped across 16 SSD drives in RAID 0. This is obviously overkill for the
> load we are putting on this one VM, but in the usual benchmarking config,
> the 16 SSDs are shared by 24 VMs. The log is on an ext3 file system on 4
> spinning drives in RAID 1.
>
> We are running PGSQL version 9.2 on RHEL 6.4; here are some parameters of
> interest (postgresql.conf is in the attachment):
> checkpoint_segments = 1200
> checkpoint_timeout = 360s
> checkpoint_completion_target = 0.8
> wal_sync_method = open_datasync
> wal_buffers = 16MB
> wal_writer_delay = 10ms
> effective_io_concurrency = 10
> effective_cache_size = 1024MB
>
> When running tests, I noticed that when a checkpoint completes, we have a
> big burst of writes to the data disk. The log disk has a very steady write
> rate that is not affected by checkpoints, except for the known phenomenon
> of more bytes in each log write when a new checkpoint period starts. In a
> multi-VM config with all VMs sharing the same data disks, when these write
> bursts happen, all VMs take a hit.
>
> So I set out to see what causes this write burst. After playing around
> with PGSQL parameters and observing its behavior, it appears that the
> bursts aren't produced by the database engine; they are produced by the
> file system. I suspect PGSQL has to issue a sync(2)/fsync(2)/
> sync_file_range(2) system call at the completion of the checkpoint to
> ensure that all blocks are flushed to disk before creating a checkpoint
> marker. To test this, I ran a loop to call sync(8) once a second.
>
> The graphs in file "run280.mht" have the throughput, data disk activity,
> and checkpoint start/completion timestamps for the baseline case. You can
> see that the checkpoint completion, the write burst, and the throughput
> dip all occur at the same time, so much so that it is hard to see the
> checkpoint completion line under the graph of writes. It looks like the
> file system does a mini flush every 30 seconds. The file "run274.mht" is
> the case with sync commands running in the background. You can see that
> everything is much smoother.
>
> Is there something I can set in the PGSQL parameters or in the file system
> parameters to force a steady flow of writes to disk rather than waiting
> for a sync system call? Mounting with "commit=1" did not make a difference.

Try setting the vm.dirty_bytes sysctl. Something like 256MB might be a
good starting point.
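For example (a sketch; 268435456 bytes = 256MB, adjust from there):

    # takes effect immediately, no restart needed
    sysctl -w vm.dirty_bytes=268435456

Note that setting vm.dirty_bytes to a non-zero value zeroes vm.dirty_ratio,
and vice versa; only one of the pair is in effect at a time.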

This comes up fairly often, see e.g.:
http://www.postgresql.org/message-id/flat/27C32FD4-0142-44FE-8488-9F366DC75966@mr-paradox.net

- Heikki


Re: PGSQL, checkpoints, and file system syncs

From:
Reza Taheri
Date:
> Try setting the vm.dirty_bytes sysctl. Something like 256MB might be a good
> starting point.
>
> This comes up fairly often, see e.g.:
> http://www.postgresql.org/message-id/flat/27C32FD4-0142-44FE-8488-9F366DC75966@mr-paradox.net
>
> - Heikki

Thanks, Heikki. That sounds like my problem alright. I will play with these parameters right away, and will report
back.

Cheers,
Reza


Re: PGSQL, checkpoints, and file system syncs

From:
Ilya Kosmodemiansky
Date:
Hi Reza,

vm.dirty_bytes indeed makes sense, but just in case: how exactly is
your ext4 file system mounted? In particular, have you disabled barriers?

Ilya

On Thu, Apr 3, 2014 at 8:11 PM, Reza Taheri <rtaheri@vmware.com> wrote:
>> Try setting the vm.dirty_bytes sysctl. Something like 256MB might be a good
>> starting point.



--
Ilya Kosmodemiansky,

PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com


Re: PGSQL, checkpoints, and file system syncs

From:
Reza Taheri
Date:
Hi Ilya,
I mount the ext4 file system with: nofail,noatime,nodiratime,nobarrier,commit=1
The ext3 file system for the log is mounted with: nofail,noatime,nodiratime,data=writeback
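As /etc/fstab entries these would look roughly like the following (device
names and mount points are placeholders, not the actual config):

    /dev/sdb1  /pgdata  ext4  nofail,noatime,nodiratime,nobarrier,commit=1  0 0
    /dev/sdc1  /pglog   ext3  nofail,noatime,nodiratime,data=writeback      0 0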

Attached is a tar file with charts from last night's runs. File 0_dirty_background_bytes.htm has the default with
dirty_background_bytes=0 and dirty_background_ratio=10 (it's a warm-up run, so it starts with low throughput and a high
read rate). I still see a little bit of burstiness with dirty_background_bytes=50MB, but no burstiness with
dirty_background_bytes from 10MB down to even 1MB.
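For anyone repeating these runs, the knob can be changed on the fly and made
persistent afterwards; a sketch with the 10MB value (10485760 bytes):

    # on the fly
    sysctl -w vm.dirty_background_bytes=10485760

    # persistent across reboots: add the same key to /etc/sysctl.conf
    # vm.dirty_background_bytes = 10485760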

Thanks everyone for the advice,
Reza

> -----Original Message-----
> From: Ilya Kosmodemiansky [mailto:ilya.kosmodemiansky@postgresql-
> consulting.com]
> Sent: Thursday, April 03, 2014 11:12 PM
> To: Reza Taheri
> Cc: Heikki Linnakangas; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] PGSQL, checkpoints, and file system syncs
>
> Hi Reza,
>
> vm.dirty_bytes indeed makes sense, but just in case: how exactly is your
> ext4 file system mounted? In particular, have you disabled barriers?
>
> Ilya
>
> On Thu, Apr 3, 2014 at 8:11 PM, Reza Taheri <rtaheri@vmware.com> wrote:
> >> Try setting the vm.dirty_bytes sysctl. Something like 256MB might be
> >> a good starting point.
>
>
>
> --
> Ilya Kosmodemiansky,
>
> PostgreSQL-Consulting.com
> tel. +14084142500
> cell. +4915144336040
> ik@postgresql-consulting.com

Attachments

Re: PGSQL, checkpoints, and file system syncs

From:
Bruce Momjian
Date:
On Thu, Apr 3, 2014 at 09:01:08PM +0300, Heikki Linnakangas wrote:
> >Is there something I can set in the PGSQL parameters or in the file
> >system parameters to force a steady flow of writes to disk rather
> >than waiting for a sync system call? Mounting with "commit=1" did not
> >make a difference.
>
> Try setting the vm.dirty_bytes sysctl. Something like 256MB might be a
> good starting point.
>
> This comes up fairly often, see e.g.:
> http://www.postgresql.org/message-id/flat/27C32FD4-0142-44FE-8488-9F366DC75966@mr-paradox.net

Uh, should he set vm.dirty_bytes or vm.dirty_background_bytes?  It is my
understanding that vm.dirty_background_bytes starts the I/O while still
accepting writes, while vm.dirty_bytes stops accepting writes during the
I/O, which isn't optimal.  See:

    https://www.kernel.org/doc/Documentation/sysctl/vm.txt

    dirty_bytes

    Contains the amount of dirty memory at which a process generating disk
    writes will itself start writeback.

    dirty_background_bytes

    Contains the amount of dirty memory at which the background kernel
    flusher threads will start writeback.

I think we want the flusher to be active, not necessarily the writing
process.
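A quick way to see which variant is in effect on a given box (the _bytes and
_ratio forms are mutually exclusive; setting one zeroes the other):

    grep -H . /proc/sys/vm/dirty_bytes /proc/sys/vm/dirty_background_bytes \
        /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio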

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +