Discussion: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: valgog@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      7494
Logged by:          Valentine Gogichashvili
Email address:      valgog@gmail.com
PostgreSQL version: 9.0.7
Operating system:   Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
Description:

We are experiencing strange(?) behavior on the replication slave machines.
The master machine has a very heavy update load, where many processes are
updating lots of data. It generates up to 30GB of WAL files per hour.
Normally it is not a problem for the slave machines to replay this amount of
WAL files on time and keep up with the master. But at some moments, the
slaves are “hanging” with 100% CPU usage on the WAL replay process and 3%
IOWait, needing up to 30 seconds to process one WAL file. If this tipping
point is reached, then a huge WAL replication lag is building up quite fast,
which also leads to overfilling of the XLOG directory on the slave machines,
as the WAL receiver is putting the WAL files it gets via streaming replication
into the XLOG directory (which, in many cases, is a separate disk partition of
quite limited size).

What we also noticed is that reducing the shared_buffers parameter from our
normal 20-32 GB for the slave machines to 2 GB increases the speed of WAL
replay dramatically. After restarting the slave machine with much lower
shared_buffers values, the replay becomes up to 10-20 times faster.

On the attached graph, there is a typical graph of WAL replication delay for
one of the slaves.
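
(For reference, the delay is derived from the standby's xlog positions, which
on 9.0 can be read with something like the following; the exact monitoring
query may differ:)

# Last WAL position received via streaming vs. last position replayed;
# the gap between the two is the replay lag shown in the graph.
psql -Atc "SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location()"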

In that graph, the small (up to 6GB) replication delay peaks during the night
are caused by some long running transactions stopping WAL replay on this
slave to prevent replication conflicts. The last, big peaks sometimes also
start because of such waiting for a long running transaction on the slave,
but then they grow as described above.

I know that there is only one process that replays the data generated by many
threads on the master machine. But why does the replay performance depend so
much on the shared_buffers parameter, and can it be optimized?

With best regards,

Valentine Gogichashvili

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: Bruce Momjian
Date:
On Wed, Aug 15, 2012 at 10:10:42AM +0000, valgog@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      7494
> Logged by:          Valentine Gogichashvili
> Email address:      valgog@gmail.com
> PostgreSQL version: 9.0.7
> Operating system:   Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
> Description:
>
> We are experiencing strange(?) behavior on the replication slave machines.
> The master machine has a very heavy update load, where many processes are
> updating lots of data. It generates up to 30GB of WAL files per hour.
> Normally it is not a problem for the slave machines to replay this amount of
> WAL files on time and keep on with the master. But at some moments, the
> slaves are “hanging” with 100% CPU usage on the WAL replay process and 3%
> IOWait, needing up to 30 seconds to process one WAL file. If this tipping
> point is reached, then a huge WAL replication lag is building up quite fast,
> that also leads to overfill of the XLOG directory on the slave machines, as
> the WAL receiver is putting the WAL files it gets via streaming replication
> the XLOG directory (that, in many cases are quite a limited size separate
> disk partition).
>
> What we noticed also, is that reducing shared_buffers parameter from our
> normal 20-32 GB for the slave machines, to 2 GB increases the speed of WAL
> replay dramatically. After restart of the slave machine with much lower
> shared_buffers values, the replay becomes up to 10-20 times faster.
>
> On the attached graph, there is a typical graph of WAL replication delay for
> one of the slaves.
>
> In that graph small (up to 6GB) replication delay peaks during the night are
> caused by some long running transactions, stopping WAL replay on this slave,
> to prevent replication collisions. But the last, big peaks are sometimes
> start because of that waiting for a long running transaction on the slave,
> but then they are growing as described above.
>
> I know, that there is only one process that replays data, generated by many
> threads on master machine. But why does the replay performance depend so
> much on the shared_buffers parameter and can it be optimized?

We warn against making shared buffers > 8GB, and this is perhaps another
good reason.  The problem is probably due to the shared buffers filling
up with lots of dirty data, and the kernel being unable to contain all
the data coming during a checkpoint.  It is also possible that the
buffer management overhead is just too high for that many buffers.
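
If the pg_buffercache contrib module happens to be installed, a rough way to
check how much dirty data is sitting in shared buffers would be something like
this (just a sketch; it needs superuser and briefly takes buffer locks, so use
it sparingly on a busy system):

# Count dirty pages currently held in shared buffers.
psql -Atc "SELECT count(*) FROM pg_buffercache WHERE isdirty"
# For scale, the configured size of the buffer pool:
psql -Atc "SHOW shared_buffers"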

It is also possible that Postgres 9.1 or the coming 9.2 would do
better in this regard.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: John R Pierce
Date:
On 08/15/12 6:36 AM, Bruce Momjian wrote:
> We warn against making shared buffers > 8GB, and this is perhaps another
> good reason.

I generally keep it at no more than 2gb as I've never found any
performance improvements going higher, on systems with 48gb ram, and had
more than a few performance degradations with larger shared buffers.



--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: Andres Freund
Date:
Hi,

On Wednesday, August 15, 2012 12:10:42 PM valgog@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      7494
> Logged by:          Valentine Gogichashvili
> Email address:      valgog@gmail.com
> PostgreSQL version: 9.0.7
> Operating system:   Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
> Description:
>
> We are experiencing strange(?) behavior on the replication slave machines.
> The master machine has a very heavy update load, where many processes are
> updating lots of data. It generates up to 30GB of WAL files per hour.
> Normally it is not a problem for the slave machines to replay this amount
> of WAL files on time and keep on with the master. But at some moments, the
> slaves are “hanging” with 100% CPU usage on the WAL replay process and 3%
> IOWait, needing up to 30 seconds to process one WAL file. If this tipping
> point is reached, then a huge WAL replication lag is building up quite
> fast, that also leads to overfill of the XLOG directory on the slave
> machines, as the WAL receiver is putting the WAL files it gets via
> streaming replication the XLOG directory (that, in many cases are quite a
> limited size separate disk partition).
Could you try to get a profile of that 100% cpu time?

Greetings,

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: Valentine Gogichashvili
Date:
Hello Andres,


> Could you try to get a profile of that 100% cpu time?
>
>
Yes I will try to do it.

-- Valentine

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: Valentine Gogichashvili
Date:
Hello John,

On Wed, Aug 15, 2012 at 3:40 PM, John R Pierce <pierce@hogranch.com> wrote:

> On 08/15/12 6:36 AM, Bruce Momjian wrote:
>
>> We warn against making shared buffers > 8GB, and this is perhaps another
>> good reason.
>>
>
> I generally keep it at no more than 2gb as I've never found any
> performance improvements going higher, on systems with 48gb ram, and had
> more than a few performance degradations with larger shared buffers.
>
>
we see up to 10x performance increase with bigger shared_buffers in case of
this database. Main database entities are about 20GB in size and we see
that performance drops considerably when running with
shared_buffers smaller than that.

Regards,

-- Valentin

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: John R Pierce
Date:
On 08/15/12 9:02 AM, Valentine Gogichashvili wrote:
>
> On Wed, Aug 15, 2012 at 3:40 PM, John R Pierce <pierce@hogranch.com
> <mailto:pierce@hogranch.com>> wrote:
>
>     On 08/15/12 6:36 AM, Bruce Momjian wrote:
>
>         We warn against making shared buffers > 8GB, and this is
>         perhaps another
>         good reason.
>
>
>     I generally keep it at no more than 2gb as I've never found any
>     performance improvements going higher, on systems with 48gb ram,
>     and had more than a few performance degradations with larger
>     shared buffers.
>
>
> we see up to 10x performance increase with bigger shared_buffers in
> case of this database. Main database entities are about 20GB in size
> and we see that performance drops considerably when running with
> shared_buffers smaller than that.
>

do you adjust effective_cache_size accordingly?   with the smaller
shared_buffers, we typically find at least half or more of physical
memory is available as OS level disk cache, as shown by the 'cached'
output of 'free' or whatever after the system has been running long
enough to fully populate its disk cache.  this parameter has a
significant performance impact on the planner's estimation of the best
way of executing given queries.  also, especially if you're executing
queries that process a lot of rows and have to do sorts and such,
increasing work_mem is quite helpful.
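
as a rough sketch of that split (the numbers below are placeholders, not
recommendations for this particular workload):

# See how much RAM the OS keeps as page cache (the "cached" column):
free -m

# Then, in postgresql.conf, tell the planner about that cache and give
# sorts some room (illustrative values only):
#   shared_buffers       = 2GB    # modest, per the advice above
#   effective_cache_size = 48GB   # roughly shared_buffers + the OS "cached" figure
#   work_mem             = 32MB   # per sort/hash node, so multiply by concurrency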



--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: Valentine Gogichashvili
Date:
Hello Andres,

here is the process, that now actually is not using CPU at all and the
shared_buffers are set to 2GB:

50978 postgres  20   0 2288m 2.0g 2.0g S  0.0  1.6   4225:34 postgres:
startup process   recovering 000000050000262E000000FD

It is hanging on that file for several minutes now.

and here is the strace:

$ strace -c -f -p 50978
Process 50978 attached - interrupt to quit
 Process 50978 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.82    0.007999          37       215           select
  2.73    0.000230           1       215           getppid
  2.45    0.000207           1       215       215 stat
------ ----------- ----------- --------- --------- ----------------
100.00    0.008436                   645       215 total

What kind of additional profiling information would you like to see?

Regards,

-- Valentin


On Wed, Aug 15, 2012 at 4:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> Hi,
>
> On Wednesday, August 15, 2012 12:10:42 PM valgog@gmail.com wrote:
> > The following bug has been logged on the website:
> >
> > Bug reference:      7494
> > Logged by:          Valentine Gogichashvili
> > Email address:      valgog@gmail.com
> > PostgreSQL version: 9.0.7
> > Operating system:   Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
> > Description:
> >
> > We are experiencing strange(?) behavior on the replication slave machines.
> > The master machine has a very heavy update load, where many processes are
> > updating lots of data. It generates up to 30GB of WAL files per hour.
> > Normally it is not a problem for the slave machines to replay this amount
> > of WAL files on time and keep on with the master. But at some moments, the
> > slaves are “hanging” with 100% CPU usage on the WAL replay process and 3%
> > IOWait, needing up to 30 seconds to process one WAL file. If this tipping
> > point is reached, then a huge WAL replication lag is building up quite
> > fast, that also leads to overfill of the XLOG directory on the slave
> > machines, as the WAL receiver is putting the WAL files it gets via
> > streaming replication the XLOG directory (that, in many cases are quite a
> > limited size separate disk partition).
> Could you try to get a profile of that 100% cpu time?
>
> Greetings,
>
> Andres
> --
> Andres Freund           http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: Valentine Gogichashvili
Date:
Hello again,

now I have restarted the slave with shared_buffers set to 32GB, and now the
recovery process is at 100% CPU:

$ strace -c -f -p 27076
Process 27076 attached - interrupt to quit
 Process 27076 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.65    0.002127           0     18127           lseek
 28.98    0.000911           0     16496           getppid
  2.58    0.000081           0       758           read
  0.80    0.000025           4         6         1 open
  0.00    0.000000           0         4           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0       493       493 unlink
------ ----------- ----------- --------- --------- ----------------
100.00    0.003144                 35885       494 total


$ strace -c -f -p 27076
Process 27076 attached - interrupt to quit
Process 27076 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 44.44    0.000240           0      5976           lseek
 38.15    0.000206           0      5522           getppid
 17.41    0.000094           1       172           read
  0.00    0.000000           0         4         1 open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0       159       159 unlink
------ ----------- ----------- --------- --------- ----------------
100.00    0.000540                 11836       160 total

Then restarted again with shared_buffers set back to 2GB:

now the replay process is relatively fast again and here is the strace:

$ strace -c -f -p 36450
Process 36450 attached - interrupt to quit
Process 39336 attached
Process 39337 attached
Process 39337 detached
Process 39338 attached
Process 39338 detached
Process 39339 attached
Process 39339 detached
Process 39336 detached
Process 39359 attached
Process 39360 attached
Process 39360 detached
Process 39361 attached
Process 39361 detached
Process 39362 attached
Process 39362 detached
Process 39359 detached
Process 40168 attached
Process 40169 attached
Process 40169 detached
Process 40170 attached
Process 40170 detached
Process 40171 attached
Process 40171 detached
Process 40168 detached
Process 36450 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 88.21    0.668159       31817        21         9 wait4
  4.69    0.035553           6      5970           read
  1.92    0.014572           0    120565           lseek
  1.91    0.014474           5      2957      2954 unlink
  1.55    0.011766           8      1545           write
  1.27    0.009582           0    109274           getppid
  0.31    0.002340           3       762           fcntl
  0.03    0.000245          20        12           clone
  0.02    0.000163           1       219           rt_sigprocmask
  0.02    0.000162           1       154        10 open
  0.01    0.000092           0       216           mmap
  0.01    0.000084           1       133        42 stat
  0.01    0.000066           6        12           getrlimit
  0.01    0.000046           0       178           close
  0.00    0.000037           1        27           munmap
  0.00    0.000035           6         6           pipe
  0.00    0.000035          12         3           set_tid_address
  0.00    0.000030           1        42           getegid
  0.00    0.000021           0       183           rt_sigaction
  0.00    0.000000           0       108           fstat
  0.00    0.000000           0       102           mprotect
  0.00    0.000000           0        36           brk
  0.00    0.000000           0         9           rt_sigreturn
  0.00    0.000000           0         9         9 ioctl
  0.00    0.000000           0        48        12 access
  0.00    0.000000           0         6           dup2
  0.00    0.000000           0         6           getpid
  0.00    0.000000           0        12           execve
  0.00    0.000000           0         9           uname
  0.00    0.000000           0        42           getuid
  0.00    0.000000           0        42           getgid
  0.00    0.000000           0        45           geteuid
  0.00    0.000000           0         6           getpgrp
  0.00    0.000000           0         6           statfs
  0.00    0.000000           0        12           arch_prctl
  0.00    0.000000           0         6         3 futex
  0.00    0.000000           0         3           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.757462                242786      3039 total

With best regards,

-- Valentine Gogichashvili


On Thu, Aug 16, 2012 at 4:53 PM, Valentine Gogichashvili
<valgog@gmail.com> wrote:

> Hello Andres,
>
> here is the process, that now actually is not using CPU at all and the
> shared_buffers are set to 2GB:
>
> 50978 postgres  20   0 2288m 2.0g 2.0g S  0.0  1.6   4225:34 postgres:
> startup process   recovering 000000050000262E000000FD
>
> It is hanging on that file for several minutes now.
>
> and here is the strace:
>
> $ strace -c -f -p 50978
> Process 50978 attached - interrupt to quit
>  Process 50978 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  94.82    0.007999          37       215           select
>   2.73    0.000230           1       215           getppid
>   2.45    0.000207           1       215       215 stat
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.008436                   645       215 total
>
> What kind of additional profiling information would you like to see?
>
> Regards,
>
> -- Valentin
>
>
>
>

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: Valentine Gogichashvili
Date:
Hello John,


>> we see up to 10x performance increase with bigger shared_buffers in case
>> of this database. Main database entities are about 20GB in size and we see
>> that performance drops considerably when running with
>> shared_buffers smaller than that.
>>
>>
> do you adjust effective_cache_size accordingly?   with the smaller
> shared_buffers, we typically find at least half or more of physical memory
> is available as OS level disk cache, as shown by the 'cached' output of
> 'free' or whatever after the system has been running long enough to fully
> populate its disk cache.  this parameter has a significant performance
> impact on the planner's estimation of the best way of executing given
> queries.  also, especially if you're executing queries that process a lot
> of rows and have to do sorts and such, increasing work_mem is quite helpful.
>
>
Yes, effective_cache_size is set to 50% of the RAM = 64GB, but as I
mentioned already, we measure a considerable performance increase when
increasing shared_buffers to values at which the most important tables fit
completely.

Regards,

-- Valentine

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: Andres Freund
Date:
Hi Valentine,

On Thursday, August 16, 2012 05:05:11 PM Valentine Gogichashvili wrote:
> Hello again,
>
> now I have restarted the slave with shared_buffers set to 32GB, and now the
> recovery process is at 100% CPU:

> $ strace -c -f -p 27076
> Process 27076 attached - interrupt to quit
> Process 27076 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  44.44    0.000240           0      5976           lseek
>  38.15    0.000206           0      5522           getppid
>  17.41    0.000094           1       172           read
>   0.00    0.000000           0         4         1 open
>   0.00    0.000000           0         2           close
>   0.00    0.000000           0         1           stat
>   0.00    0.000000           0       159       159 unlink
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.000540                 11836       160 total

> $ strace -c -f -p 36450
> Process 36450 attached - interrupt to quit
> Process 39336 attached
> Process 39337 attached
> Process 39337 detached
> Process 39338 attached
> Process 39338 detached
> Process 39339 attached
> Process 39339 detached
> Process 39336 detached
> Process 39359 attached
> Process 39360 attached
> Process 39360 detached
> Process 39361 attached
> Process 39361 detached
> Process 39362 attached
> Process 39362 detached
> Process 39359 detached
> Process 40168 attached
> Process 40169 attached
> Process 40169 detached
> Process 40170 attached
> Process 40170 detached
> Process 40171 attached
> Process 40171 detached
> Process 40168 detached
> Process 36450 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  88.21    0.668159       31817        21         9 wait4
>   4.69    0.035553           6      5970           read
>   1.92    0.014572           0    120565           lseek
>   1.91    0.014474           5      2957      2954 unlink
>   1.55    0.011766           8      1545           write
>   1.27    0.009582           0    109274           getppid
>   0.31    0.002340           3       762           fcntl
>   0.03    0.000245          20        12           clone
>   0.02    0.000163           1       219           rt_sigprocmask
>   0.02    0.000162           1       154        10 open
>   0.01    0.000092           0       216           mmap
>   0.01    0.000084           1       133        42 stat
>   0.01    0.000066           6        12           getrlimit
>   0.01    0.000046           0       178           close
>
Those straces don't indicate contention in the kernel to me.

Could you get a profile with 'perf' or, if you're more familiar with it, with
'oprofile'? With the former it's just something like:
perf record -F 10000 -p 27076
perf report > somefile

And then send in somefile.
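
If your perf is recent enough, recording call graphs over a bounded interval
can make the report easier to read; an optional variant of the above would be:

# Sample the startup process for 30 seconds, including call chains.
perf record -g -F 1000 -p 27076 -- sleep 30
perf report --stdio > somefile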

Greetings,

Andres

Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: Andres Freund
Date:
On Friday, August 17, 2012 12:51:44 PM Valentine Gogichashvili wrote:
> Hello Andres,
>
> here are the results of perf profiling:
> https://gist.github.com/3b8cb0c15661da439632
> Also attached the files to that mail.
>
> For a problematic case of big shared_buffers:
>
> # Events: 320K cycles
> #
> # Overhead   Command      Shared Object                              Symbol
> # ........  ........  .................  ..................................
> #
>     98.70%  postgres  postgres           [.] DropRelFileNodeBuffers
>      0.18%  postgres  postgres           [.] RecordIsValid
>      0.11%  postgres  [kernel.kallsyms]  [k] native_write_msr_safe
>      0.07%  postgres  [kernel.kallsyms]  [k] dyntick_save_progress_counter
>      0.06%  postgres  [kernel.kallsyms]  [k] scheduler_tick
>      0.03%  postgres  [kernel.kallsyms]  [k] _spin_lock
>      0.03%  postgres  [kernel.kallsyms]  [k] __do_softirq
>      0.03%  postgres  [kernel.kallsyms]  [k] rcu_process_callbacks
>      0.03%  postgres  postgres           [.] hash_search_with_hash_value
>      0.03%  postgres  [kernel.kallsyms]  [k] native_read_msr_safe
>      0.02%  postgres  libc-2.12.so       [.] memcpy
>      0.02%  postgres  [kernel.kallsyms]  [k] rcu_process_gp_end
>      0.02%  postgres  [kernel.kallsyms]  [k] apic_timer_interrupt
>      0.02%  postgres  [kernel.kallsyms]  [k] run_timer_softirq
>      0.02%  postgres  [kernel.kallsyms]  [k] system_call
Ok, that explains it. You're frequently dropping/truncating tables? That
currently requires walking through shared buffers and losing all buffered
pages related to that table. That obviously scales linearly with shared
buffers and is particularly expensive on multi-socket machines.

Unless you're running 9.3, that will even lock each single page, which is a
relatively expensive and slow operation. Depending on how adventurous you are,
you could try backporting e8d029a30b5a5fb74b848a8697b1dfa3f66d9697 and see how
big the benefits are for you.
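
If DDL statements are logged on the master (log_statement = 'ddl' or 'all'),
something along these lines would give a rough idea of how often that path is
hit (the log location below is just an example; adjust for your setup):

# Count DROP TABLE / TRUNCATE statements per server log file on the master.
grep -cE 'DROP TABLE|TRUNCATE' /var/log/postgresql/postgresql-*.log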

Greetings,

Andres