Discussion: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
The following bug has been logged on the website:

Bug reference: 7494
Logged by: Valentine Gogichashvili
Email address: valgog@gmail.com
PostgreSQL version: 9.0.7
Operating system: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
Description:

We are experiencing strange(?) behavior on the replication slave machines. The master machine has a very heavy update load, where many processes are updating lots of data. It generates up to 30GB of WAL files per hour. Normally it is not a problem for the slave machines to replay this amount of WAL files on time and keep up with the master. But at some moments, the slaves are "hanging" with 100% CPU usage on the WAL replay process and 3% IOWait, needing up to 30 seconds to process one WAL file. Once this tipping point is reached, a huge WAL replication lag builds up quite fast, which also overfills the XLOG directory on the slave machines, as the WAL receiver puts the WAL files it gets via streaming replication into the XLOG directory (which, in many cases, is a separate disk partition of quite limited size).

We also noticed that reducing the shared_buffers parameter from our normal 20-32 GB on the slave machines to 2 GB increases the speed of WAL replay dramatically. After a restart of a slave machine with a much lower shared_buffers value, replay becomes up to 10-20 times faster.

The attached graph shows a typical WAL replication delay curve for one of the slaves. In that graph, the small (up to 6GB) replication delay peaks during the night are caused by long-running transactions on the slave, which stop WAL replay to prevent replication conflicts. The last, big peaks sometimes also start with such a wait for a long-running transaction, but then keep growing as described above.

I know that there is only one process replaying the data generated by many backends on the master machine. But why does replay performance depend so much on the shared_buffers parameter, and can it be optimized?

With best regards,

Valentine Gogichashvili
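The report's closing point is worth unpacking: WAL produced by many concurrent backends is replayed by a single startup process, in strict log order. Below is a deliberately tiny, hypothetical model of that loop — not PostgreSQL's actual xlog.c code, and all names in it are illustrative — meant only to show why any per-record CPU cost turns directly into replication lag:

/*
 * Hypothetical, heavily simplified model of the startup process's
 * redo loop -- NOT the real xlog.c code.  WAL records must be applied
 * one by one, in LSN order, by this single process, however many
 * backends generated them on the master.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct WalRecord {
    uint64_t lsn;   /* position of the record in the WAL stream */
    int      rmid;  /* which "resource manager" must replay it */
} WalRecord;

/* Stub WAL stream: in reality this comes from WAL segment files. */
static const WalRecord wal_stream[] = {
    {0x1000, 0}, {0x1028, 1}, {0x1060, 0}, {0x10a0, 2},
};

static void
redo_record(const WalRecord *rec)
{
    /* The real redo routines modify pages in shared buffers here;
     * any buffer-management overhead is paid on every record. */
    printf("replayed lsn=%#lx rmid=%d\n",
           (unsigned long) rec->lsn, rec->rmid);
}

int
main(void)
{
    size_t n = sizeof(wal_stream) / sizeof(wal_stream[0]);

    /* Single-threaded by design: records must apply in order. */
    for (size_t i = 0; i < n; i++)
        redo_record(&wal_stream[i]);

    return 0;
}

Because this loop cannot be parallelized, any operation inside redo_record() whose cost grows with shared_buffers rather than with the amount of data touched acts as a direct multiplier on replay time.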
On Wed, Aug 15, 2012 at 10:10:42AM +0000, valgog@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference: 7494
> Logged by: Valentine Gogichashvili
> Email address: valgog@gmail.com
> PostgreSQL version: 9.0.7
> Operating system: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
> Description:
>
> We are experiencing strange(?) behavior on the replication slave machines.
> The master machine has a very heavy update load, where many processes are
> updating lots of data. It generates up to 30GB of WAL files per hour.
> Normally it is not a problem for the slave machines to replay this amount
> of WAL files on time and keep up with the master. But at some moments, the
> slaves are "hanging" with 100% CPU usage on the WAL replay process and 3%
> IOWait, needing up to 30 seconds to process one WAL file. Once this
> tipping point is reached, a huge WAL replication lag builds up quite fast,
> which also overfills the XLOG directory on the slave machines, as the WAL
> receiver puts the WAL files it gets via streaming replication into the
> XLOG directory (which, in many cases, is a separate disk partition of
> quite limited size).
>
> We also noticed that reducing the shared_buffers parameter from our normal
> 20-32 GB on the slave machines to 2 GB increases the speed of WAL replay
> dramatically. After a restart of a slave machine with a much lower
> shared_buffers value, replay becomes up to 10-20 times faster.
>
> The attached graph shows a typical WAL replication delay curve for one of
> the slaves. In that graph, the small (up to 6GB) replication delay peaks
> during the night are caused by long-running transactions on the slave,
> which stop WAL replay to prevent replication conflicts. The last, big
> peaks sometimes also start with such a wait for a long-running
> transaction, but then keep growing as described above.
>
> I know that there is only one process replaying the data generated by many
> backends on the master machine. But why does replay performance depend so
> much on the shared_buffers parameter, and can it be optimized?

We warn against making shared_buffers > 8GB, and this is perhaps another good reason. The problem is probably the shared buffers filling up with lots of dirty data, and the kernel being unable to absorb all the data arriving during a checkpoint. It is also possible that the buffer management overhead is simply too high with that many buffers. It is also possible that Postgres 9.1 or the coming 9.2 would do better in this regard.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
On 08/15/12 6:36 AM, Bruce Momjian wrote:
> We warn against making shared_buffers > 8GB, and this is perhaps another
> good reason.

I generally keep it at no more than 2GB, as I've never found any performance improvement from going higher on systems with 48GB of RAM, and I have had more than a few performance degradations with larger shared buffers.

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast
Hi,

On Wednesday, August 15, 2012 12:10:42 PM valgog@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference: 7494
> Logged by: Valentine Gogichashvili
> Email address: valgog@gmail.com
> PostgreSQL version: 9.0.7
> Operating system: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
> Description:
>
> We are experiencing strange(?) behavior on the replication slave machines.
> The master machine has a very heavy update load, where many processes are
> updating lots of data. It generates up to 30GB of WAL files per hour.
> Normally it is not a problem for the slave machines to replay this amount
> of WAL files on time and keep up with the master. But at some moments, the
> slaves are "hanging" with 100% CPU usage on the WAL replay process and 3%
> IOWait, needing up to 30 seconds to process one WAL file. Once this
> tipping point is reached, a huge WAL replication lag builds up quite fast,
> which also overfills the XLOG directory on the slave machines, as the WAL
> receiver puts the WAL files it gets via streaming replication into the
> XLOG directory (which, in many cases, is a separate disk partition of
> quite limited size).

Could you try to get a profile of that 100% cpu time?

Greetings,

Andres
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
From: Valentine Gogichashvili
Hallo Andres,

> Could you try to get a profile of that 100% cpu time?

Yes, I will try to do it.

-- Valentine
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
From: Valentine Gogichashvili
Hallo John,

On Wed, Aug 15, 2012 at 3:40 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 08/15/12 6:36 AM, Bruce Momjian wrote:
>> We warn against making shared_buffers > 8GB, and this is perhaps another
>> good reason.
>
> I generally keep it at no more than 2GB, as I've never found any
> performance improvement from going higher on systems with 48GB of RAM,
> and I have had more than a few performance degradations with larger
> shared buffers.

We see up to a 10x performance increase with bigger shared_buffers in the case of this database. The main database entities are about 20GB in size, and we see that performance drops considerably when running with shared_buffers smaller than that.

Regards,

-- Valentin
On 08/15/12 9:02 AM, Valentine Gogichashvili wrote:
> We see up to a 10x performance increase with bigger shared_buffers in the
> case of this database. The main database entities are about 20GB in size,
> and we see that performance drops considerably when running with
> shared_buffers smaller than that.

Do you adjust effective_cache_size accordingly? With the smaller shared_buffers, we typically find at least half or more of physical memory available as OS-level disk cache, as shown by the 'cached' output of 'free' (or similar) once the system has been running long enough to fully populate its disk cache. This parameter has a significant impact on the planner's estimate of the best way to execute a given query. Also, especially if you're executing queries that process a lot of rows and have to do sorts and such, increasing work_mem is quite helpful.

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
From: Valentine Gogichashvili
Hello Andres,

here is the process, which right now is actually not using any CPU at all; shared_buffers is set to 2GB:

50978 postgres  20   0 2288m 2.0g 2.0g S  0.0  1.6   4225:34 postgres: startup process   recovering 000000050000262E000000FD

It has been hanging on that file for several minutes now, and here is the strace:

$ strace -c -f -p 50978
Process 50978 attached - interrupt to quit
Process 50978 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.82    0.007999          37       215           select
  2.73    0.000230           1       215           getppid
  2.45    0.000207           1       215       215 stat
------ ----------- ----------- --------- --------- ----------------
100.00    0.008436                   645       215 total

What kind of additional profiling information would you like to see?

Regards,

-- Valentin

On Wed, Aug 15, 2012 at 4:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On Wednesday, August 15, 2012 12:10:42 PM valgog@gmail.com wrote:
> > [...]
> Could you try to get a profile of that 100% cpu time?
>
> Greetings,
>
> Andres
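One plausible reading of that idle trace — a sketch under assumptions, not PostgreSQL source — is the shape of a recovery wait loop: while the startup process has nothing to replay, or is deliberately paused, it sleeps via select(), checks that its parent postmaster is still alive via getppid(), and polls with stat() for the file it is waiting on; the 215 failing stat() calls fit a file that does not exist yet:

/*
 * Hypothetical recovery wait loop matching the select/getppid/stat
 * mix in the strace above.  A plausible reconstruction for
 * illustration only, not PostgreSQL source.
 */
#include <stdbool.h>
#include <sys/select.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

static bool
postmaster_is_alive(pid_t postmaster_pid)
{
    /* A direct child can test this cheaply via getppid(). */
    return getppid() == postmaster_pid;
}

static void
sleep_ms(long ms)
{
    /* select() with no fds is a classic portable sub-second sleep. */
    struct timeval tv = { ms / 1000, (ms % 1000) * 1000 };
    select(0, NULL, NULL, NULL, &tv);
}

/* Poll until 'path' appears, giving up if the postmaster dies. */
static bool
wait_for_file(const char *path, pid_t postmaster_pid)
{
    struct stat st;

    while (stat(path, &st) != 0)            /* the failing stat() calls */
    {
        if (!postmaster_is_alive(postmaster_pid))
            return false;                   /* the getppid() calls */
        sleep_ms(100);                      /* the select() calls */
    }
    return true;
}

int
main(void)
{
    /* Demo only: wait for a hypothetical file, treating our parent
     * process (e.g. the shell) as the "postmaster". */
    return wait_for_file("/tmp/next_wal_segment", getppid()) ? 0 : 1;
}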
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
From: Valentine Gogichashvili
Hello again,

now I have restarted the slave with shared_buffers set to 32GB, and now the recovery process is at 100% CPU:

$ strace -c -f -p 27076
Process 27076 attached - interrupt to quit
Process 27076 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.65    0.002127           0     18127           lseek
 28.98    0.000911           0     16496           getppid
  2.58    0.000081           0       758           read
  0.80    0.000025           4         6         1 open
  0.00    0.000000           0         4           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0       493       493 unlink
------ ----------- ----------- --------- --------- ----------------
100.00    0.003144                 35885       494 total

$ strace -c -f -p 27076
Process 27076 attached - interrupt to quit
Process 27076 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 44.44    0.000240           0      5976           lseek
 38.15    0.000206           0      5522           getppid
 17.41    0.000094           1       172           read
  0.00    0.000000           0         4         1 open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0       159       159 unlink
------ ----------- ----------- --------- --------- ----------------
100.00    0.000540                 11836       160 total

Then I restarted again with shared_buffers set back to 2GB; now the replay process is relatively fast again, and here is the strace:

$ strace -c -f -p 36450
Process 36450 attached - interrupt to quit
Process 39336 attached
Process 39337 attached
Process 39337 detached
Process 39338 attached
Process 39338 detached
Process 39339 attached
Process 39339 detached
Process 39336 detached
Process 39359 attached
Process 39360 attached
Process 39360 detached
Process 39361 attached
Process 39361 detached
Process 39362 attached
Process 39362 detached
Process 39359 detached
Process 40168 attached
Process 40169 attached
Process 40169 detached
Process 40170 attached
Process 40170 detached
Process 40171 attached
Process 40171 detached
Process 40168 detached
Process 36450 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 88.21    0.668159       31817        21         9 wait4
  4.69    0.035553           6      5970           read
  1.92    0.014572           0    120565           lseek
  1.91    0.014474           5      2957      2954 unlink
  1.55    0.011766           8      1545           write
  1.27    0.009582           0    109274           getppid
  0.31    0.002340           3       762           fcntl
  0.03    0.000245          20        12           clone
  0.02    0.000163           1       219           rt_sigprocmask
  0.02    0.000162           1       154        10 open
  0.01    0.000092           0       216           mmap
  0.01    0.000084           1       133        42 stat
  0.01    0.000066           6        12           getrlimit
  0.01    0.000046           0       178           close
  0.00    0.000037           1        27           munmap
  0.00    0.000035           6         6           pipe
  0.00    0.000035          12         3           set_tid_address
  0.00    0.000030           1        42           getegid
  0.00    0.000021           0       183           rt_sigaction
  0.00    0.000000           0       108           fstat
  0.00    0.000000           0       102           mprotect
  0.00    0.000000           0        36           brk
  0.00    0.000000           0         9           rt_sigreturn
  0.00    0.000000           0         9         9 ioctl
  0.00    0.000000           0        48        12 access
  0.00    0.000000           0         6           dup2
  0.00    0.000000           0         6           getpid
  0.00    0.000000           0        12           execve
  0.00    0.000000           0         9           uname
  0.00    0.000000           0        42           getuid
  0.00    0.000000           0        42           getgid
  0.00    0.000000           0        45           geteuid
  0.00    0.000000           0         6           getpgrp
  0.00    0.000000           0         6           statfs
  0.00    0.000000           0        12           arch_prctl
  0.00    0.000000           0         6         3 futex
  0.00    0.000000           0         3           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.757462                242786      3039 total

With best regards,

-- Valentine Gogichashvili

On Thu, Aug 16, 2012 at 4:53 PM, Valentine Gogichashvili <valgog@gmail.com> wrote:
> Hello Andres,
>
> here is the process, which right now is actually not using any CPU at
> all; shared_buffers is set to 2GB:
>
> 50978 postgres  20   0 2288m 2.0g 2.0g S  0.0  1.6   4225:34 postgres:
> startup process   recovering 000000050000262E000000FD
>
> It has been hanging on that file for several minutes now, [...]
>
> What kind of additional profiling information would you like to see?
>
> Regards,
>
> -- Valentin
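A possible interpretation of those traces, sketched under assumptions rather than taken from PostgreSQL's source: the storage layer determines a relation's length with lseek(SEEK_END) (essentially what md.c does), and in 9.0 every actual block read is an lseek()+read() pair, while a block already present in shared buffers needs no read() at all. That would let size probes vastly outnumber reads, as in the 32GB traces; the erroring unlink() calls are consistent with replaying relation drops and cleaning up old files. A toy demonstration of the two access patterns:

/*
 * Toy demonstration of why WAL replay can be lseek-heavy with few
 * reads.  Illustrative only; the constants and helpers are modeled
 * loosely on PostgreSQL's smgr/md layer, not copied from it.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192   /* PostgreSQL's default page size */

/* Number of pages in an open relation file: one lseek(), no read(). */
static off_t
rel_nblocks(int fd)
{
    off_t end = lseek(fd, 0, SEEK_END);
    return end < 0 ? -1 : end / BLCKSZ;
}

/* Fetch one block the pre-pread() way: lseek() to it, then read(). */
static ssize_t
read_block(int fd, off_t blockno, char *buf)
{
    if (lseek(fd, blockno * BLCKSZ, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, BLCKSZ);
}

int
main(int argc, char **argv)
{
    char buf[BLCKSZ];
    int  fd;

    if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;

    /* A replay step may need the relation size even when the target
     * page is already cached -- paying the lseek() but not the read(). */
    printf("%ld blocks\n", (long) rel_nblocks(fd));

    /* Only a shared-buffers miss also pays the lseek()+read() below. */
    if (rel_nblocks(fd) > 0 && read_block(fd, 0, buf) == BLCKSZ)
        printf("read block 0\n");

    close(fd);
    return 0;
}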
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
From: Valentine Gogichashvili
Hello John,

>> We see up to a 10x performance increase with bigger shared_buffers in
>> the case of this database. The main database entities are about 20GB in
>> size, and we see that performance drops considerably when running with
>> shared_buffers smaller than that.
>
> Do you adjust effective_cache_size accordingly? [...] Also, especially
> if you're executing queries that process a lot of rows and have to do
> sorts and such, increasing work_mem is quite helpful.

Yes, effective_cache_size is set to 50% of the RAM, which is 64GB. But as I mentioned already, we measure a considerable performance increase when increasing shared_buffers to values at which the most important tables fit completely.

Regards,

-- Valentine
Hi Valentine,

On Thursday, August 16, 2012 05:05:11 PM Valentine Gogichashvili wrote:
> Hello again,
>
> now I have restarted the slave with shared_buffers set to 32GB, and now
> the recovery process is at 100% CPU:
>
> $ strace -c -f -p 27076
> [...]

Those straces don't indicate contention in the kernel to me. Could you get a profile with 'perf' or, if you're more familiar with it, with 'oprofile'? With the former it's just something like:

perf record -F 10000 -p 27076
perf report > somefile

And then send in somefile.

Greetings,

Andres
On Friday, August 17, 2012 12:51:44 PM Valentine Gogichashvili wrote:
> Hello Andres,
>
> here are the results of the perf profiling:
> https://gist.github.com/3b8cb0c15661da439632
> I also attached the files to that mail.
>
> For the problematic case of big shared_buffers:
>
> # Events: 320K cycles
> #
> # Overhead  Command   Shared Object      Symbol
> # ........  ........  .................  ..................................
> #
>     98.70%  postgres  postgres           [.] DropRelFileNodeBuffers
>      0.18%  postgres  postgres           [.] RecordIsValid
>      0.11%  postgres  [kernel.kallsyms]  [k] native_write_msr_safe
>      0.07%  postgres  [kernel.kallsyms]  [k] dyntick_save_progress_counter
>      0.06%  postgres  [kernel.kallsyms]  [k] scheduler_tick
>      0.03%  postgres  [kernel.kallsyms]  [k] _spin_lock
>      0.03%  postgres  [kernel.kallsyms]  [k] __do_softirq
>      0.03%  postgres  [kernel.kallsyms]  [k] rcu_process_callbacks
>      0.03%  postgres  postgres           [.] hash_search_with_hash_value
>      0.03%  postgres  [kernel.kallsyms]  [k] native_read_msr_safe
>      0.02%  postgres  libc-2.12.so       [.] memcpy
>      0.02%  postgres  [kernel.kallsyms]  [k] rcu_process_gp_end
>      0.02%  postgres  [kernel.kallsyms]  [k] apic_timer_interrupt
>      0.02%  postgres  [kernel.kallsyms]  [k] run_timer_softirq
>      0.02%  postgres  [kernel.kallsyms]  [k] system_call

Ok, that explains it. You're frequently dropping/truncating tables? That currently requires walking through shared buffers, losing all buffered pages related to that table. That obviously scales linearly with shared_buffers and is particularly expensive on multi-socket machines. Unless you're running 9.3, it will even lock each single page, which is a relatively expensive and slow operation.

Depending on how adventurous you are, you could try backporting e8d029a30b5a5fb74b848a8697b1dfa3f66d9697 and see how big the benefits are for you.

Greetings,

Andres
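The profile makes the mechanism concrete. Below is a hypothetical, heavily simplified rendition of the pattern Andres describes, modeled on the idea behind DropRelFileNodeBuffers in bufmgr.c rather than on its actual source: invalidating a dropped or truncated relation's cached pages requires scanning the entire buffer pool, so the cost is O(NBuffers) per drop, no matter how few of its pages were actually cached:

/*
 * Hypothetical sketch of the O(NBuffers) scan behind the 98.70%
 * DropRelFileNodeBuffers sample above -- illustrative, not the
 * actual bufmgr.c code.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct BufferDesc {
    uint32_t rel_id;    /* which relation this buffer caches */
    uint32_t block_no;  /* which block of that relation */
    bool     valid;     /* does the buffer hold a page at all? */
} BufferDesc;

/* Kept tiny here; with 32GB of shared_buffers and 8kB pages the real
 * pool has roughly 4 million of these. */
static BufferDesc buffer_descs[4];
static const size_t NBuffers = sizeof(buffer_descs) / sizeof(buffer_descs[0]);

/* Invalidate every cached page of one relation.  The pages can only be
 * found by inspecting all buffers, so the loop always runs NBuffers
 * iterations -- the work scales with shared_buffers, not with the size
 * of the dropped relation. */
static void
drop_rel_buffers(uint32_t rel_id)
{
    for (size_t i = 0; i < NBuffers; i++)
    {
        /* (Before 9.3, each iteration also took a per-buffer lock,
         * making the scan considerably more expensive still.) */
        if (buffer_descs[i].valid && buffer_descs[i].rel_id == rel_id)
            buffer_descs[i].valid = false;
    }
}

int
main(void)
{
    /* Each DROP or TRUNCATE replayed by the single startup process
     * pays one full pass over the pool. */
    drop_rel_buffers(42);
    return 0;
}

Under this model, replay with 32GB of shared_buffers does roughly sixteen times the scanning work per dropped table as with 2GB, which lines up with the reported 10-20x slowdown; the commit Andres points to reduces the per-buffer cost of that scan.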