Discussion: pgbench vs. wait events
Hi,

I decided to do some testing on hydra (IBM-provided community resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using the newly-enhanced wait event stuff to try to get an idea of what we're waiting for during pgbench. I did 30-minute pgbench runs with various configurations, but all had max_connections = 200, shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = on. During each run, I ran this psql script in another window and captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid != pg_backend_pid()
\watch 0.5

Then, I used a little shell-scripting to count up the number of times each wait event occurred in the output. First, I tried scale factor 3000 with 32 clients and got these results:

1 LWLockTranche | buffer_mapping
9 LWLockNamed | CLogControlLock
14 LWLockNamed | ProcArrayLock
16 Lock | tuple
25 LWLockNamed | CheckpointerCommLock
49 LWLockNamed | WALBufMappingLock
122 LWLockTranche | clog
182 Lock | transactionid
287 LWLockNamed | XidGenLock
1300 Client | ClientRead
1375 LWLockTranche | buffer_content
3990 Lock | extend
21014 LWLockNamed | WALWriteLock
28497 |
58279 LWLockTranche | wal_insert

tps = 1150.803133 (including connections establishing)

What jumps out here, at least to me, is that there is furious contention on both the wal_insert locks and on WALWriteLock. Apparently, the system simply can't get WAL on disk fast enough to keep up with this workload. Relation extension locks and buffer_content locks are also pretty common, both ahead of ClientRead, a relatively uncommon wait event on this test. The load average on the system was only about 3 during this test, indicating that most processes are in fact spending most of their time off-CPU.
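[The "little shell-scripting" used for the counting above wasn't posted; a minimal sketch of the kind of pipeline that would produce counts like these, assuming the \watch output was captured to a file named waits.log, might be:]

```shell
# Hypothetical reconstruction of the counting step; the file name and the
# sample contents below are stand-ins, not the actual captured data.
cat > waits.log <<'EOF'
 LWLockNamed   | WALWriteLock
 Client        | ClientRead
 LWLockNamed   | WALWriteLock
EOF

# Count how often each (wait_event_type, wait_event) pair was sampled,
# least-frequent first, matching the ordering shown in the results above.
grep '|' waits.log | sort | uniq -c | sort -n
```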
The first thing I tried was switching to unlogged tables, which produces these results:

1 BufferPin | BufferPin
1 LWLockTranche | lock_manager
2 LWLockTranche | buffer_mapping
8 LWLockNamed | ProcArrayLock
9 LWLockNamed | CheckpointerCommLock
9 LWLockNamed | CLogControlLock
11 LWLockTranche | buffer_content
37 LWLockTranche | clog
153 Lock | tuple
388 LWLockNamed | XidGenLock
827 Lock | transactionid
1267 Client | ClientRead
20631 Lock | extend
91767 |

tps = 1223.239416 (including connections establishing)

If you don't look at the TPS number, these results look like a vast improvement. The overall amount of time spent not waiting for anything is now much higher, and the problematic locks have largely disappeared from the picture. However, the load average now shoots up to about 30, because most of the time that the backends are "not waiting for anything" they are in fact in kernel wait state D; that is, they're stuck doing I/O. This suggests that we might want to consider advertising a wait state when a backend is doing I/O, so we can measure this sort of thing.

Next, I tried lowering the scale factor to something that fits in shared buffers. Here are the results at scale factor 300:

14 Lock | tuple
22 LWLockTranche | lock_manager
39 LWLockNamed | WALBufMappingLock
331 LWLockNamed | CLogControlLock
461 LWLockNamed | ProcArrayLock
536 Lock | transactionid
552 Lock | extend
716 LWLockTranche | buffer_content
763 LWLockNamed | XidGenLock
2113 LWLockNamed | WALWriteLock
6190 LWLockTranche | wal_insert
25002 Client | ClientRead
78466 |

tps = 27651.562835 (including connections establishing)

Obviously, there's a vast increase in TPS, and the backends seem to spend most of their time actually doing work. ClientRead is now the overwhelmingly dominant wait event, although wal_insert and WALWriteLock contention is clearly still a significant problem. Contention on other locks is apparently quite rare.
Notice that client reads are really significant here - more than 20% of the time we sample what a backend is doing, it's waiting for the next query. It seems unlikely that performance on this workload can be improved very much by optimizing anything other than WAL writing, because no other wait event appears in more than 1% of the samples. It's not clear how much of the WAL-related stuff is artificial lock contention and how much of it is the limited speed at which the disk can rotate.

However, I was curious about what's going on with CLogControlLock, pursuant to the other thread on that topic, so I reran this with unlogged tables. At scale factor 300, 32 clients, unlogged tables, I get these results:

6 LWLockTranche | wal_insert
12 Lock | tuple
26 LWLockTranche | lock_manager
39 LWLockTranche | buffer_content
353 LWLockNamed | CLogControlLock
377 Lock | extend
491 Lock | transactionid
500 LWLockNamed | ProcArrayLock
696 LWLockNamed | XidGenLock
27685 Client | ClientRead
84930 |

WAL contention is eliminated, and CLog contention doesn't increase. Next, I tried ramping up the client count to see what effect that had. These next three results are all at scale factor 300 with unlogged tables.
64 clients:

1 BufferPin | BufferPin
2 LWLockTranche | buffer_mapping
51 LWLockTranche | wal_insert
52 Lock | relation
132 LWLockTranche | lock_manager
154 LWLockTranche | buffer_content
209 Lock | tuple
940 Lock | extend
1412 LWLockNamed | XidGenLock
2337 LWLockNamed | CLogControlLock
2813 LWLockNamed | ProcArrayLock
3859 Lock | transactionid
83760 Client | ClientRead
133840 |

tps = 36672.484524 (including connections establishing)

96 clients:

1 LWLockTranche | proc
53 LWLockTranche | wal_insert
305 LWLockTranche | buffer_content
391 LWLockTranche | lock_manager
797 Lock | tuple
1074 Lock | extend
2492 LWLockNamed | XidGenLock
4185 LWLockNamed | ProcArrayLock
6003 LWLockNamed | CLogControlLock
11348 Lock | transactionid
142881 Client | ClientRead
174011 |

tps = 37214.257661 (including connections establishing)

192 clients:

1 LWLockTranche | buffer_mapping
2 BufferPin | BufferPin
4 LWLockTranche | clog
8 LWLockTranche | proc
87 LWLockTranche | wal_insert
708 LWLockTranche | buffer_content
1892 LWLockTranche | lock_manager
2225 Lock | extend
4796 LWLockNamed | ProcArrayLock
5105 LWLockNamed | XidGenLock
7982 Lock | tuple
11882 LWLockNamed | CLogControlLock
64091 Lock | transactionid
275481 |
308627 Client | ClientRead

tps = 35018.906863 (including connections establishing)

There are a couple of interesting things here. First, CLogControlLock contention becomes more significant as the number of clients goes up. At 32 clients, it is the 6th-most frequent wait event; at 192 clients, it's moved up to 4th, and it's clearly growing at a faster rate than some of the others. As we go from 32 to 64 to 96 to 192 clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples, but CLogControlLock is captured in 353, 2337, 6003, 11882 samples, so it goes from being only 2/3 of ProcArrayLock to being more than twice ProcArrayLock. However, in absolute terms, it's still not very significant.
Even at 192 clients, where it's the fourth-most-common wait event, it appears in only 1.7% of samples, which means backends are only waiting for it about 1.7% of the time. It appears that, on this system, this workload is NEVER going to get meaningfully faster by improving CLogControlLock.

Second, ClientRead becomes a bigger and bigger issue as the number of clients increases; by 192 clients, it appears in 45% of the samples. That basically means that pgbench is increasingly unable to keep up with the server; for whatever reason, it suffers more than the server does from the increasing lack of CPU resources.

Third, Lock/transactionid and Lock/tuple become more and more common as the number of clients increases; these happen when two different pgbench threads decide to hit the same row at the same time. Due to the birthday paradox this increases pretty quickly as the number of clients ramps up, but it's not really a server issue: there's nothing the server can do about the possibility that two or more clients pick the same random number at the same time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
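[A back-of-the-envelope check of the birthday-paradox point above, assuming the row collisions are concentrated on the pgbench_branches rows (scale factor 300 gives 300 branch rows; that attribution is an assumption, not something measured in the thread). With k clients picking uniformly among n rows, P(some collision) ~ 1 - exp(-k(k-1)/2n):]

```shell
# Rough birthday-paradox estimate for the client counts tested above.
# n=300 assumes contention lands on the pgbench_branches table at scale 300.
for k in 32 64 96 192; do
  awk -v k="$k" -v n=300 'BEGIN {
    printf "%3d clients: P(some collision) ~ %.2f\n", k, 1 - exp(-k*(k-1)/(2*n))
  }'
done
```

Even at 32 clients a collision is already more likely than not, which fits the steady climb of Lock/transactionid and Lock/tuple in the samples.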
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Hi,
I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench. I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on. During each run, I ran this psql script in another window and
captured the output:
\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5
Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output. First, I tried scale factor
3000 with 32 clients and got these results:
Scale factor 3000 obviously doesn't fit in shared_buffers. But does it fit in RAM? That is, are the backends doing real IO, or are they just doing fake IO to the kernel's fs cache?
Cheers,
Jeff
On Thu, Oct 6, 2016 at 4:40 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Scale factor 3000 obviously doesn't fit in shared_buffers. But does it fit
> in RAM? That is, are the backends doing real IO, or are they just doing fake IO
> to the kernel's fs cache?

That's a good question.

[rhaas@hydra ~]$ free -g
             total       used       free     shared    buffers     cached
Mem:            61         26         34          0          0         24
-/+ buffers/cache:          2         58
Swap:           19          4         15

rhaas=# select pg_size_pretty(pg_relation_size('pgbench_accounts'));
 pg_size_pretty
----------------
 38 GB
(1 row)

rhaas=# select pg_size_pretty(pg_database_size(current_database()));
 pg_size_pretty
----------------
 44 GB
(1 row)

That's pretty tight, especially since I now notice Andres also left a postmaster running on this machine back in April, with shared_buffers=8GB. 44GB for this database plus 8GB for shared_buffers plus 8GB for the other postmaster's shared_buffers leaves basically no slack, so it was probably doing quite a bit of real I/O, especially after the database got a bit of bloat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 7, 2016 at 3:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I decided to do some testing on hydra (IBM-provided community
> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
> the newly-enhanced wait event stuff to try to get an idea of what
> we're waiting for during pgbench. I did 30-minute pgbench runs with
> various configurations, but all had max_connections = 200,
> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
> on. During each run, I ran this psql script in another window and
> captured the output:

Nice. Thanks for sharing.

> \t
> select wait_event_type, wait_event from pg_stat_activity where pid !=
> pg_backend_pid()
> \watch 0.5
>
> Then, I used a little shell-scripting to count up the number of times
> each wait event occurred in the output.

Or an INSERT SELECT on an unlogged table? No need of extra maths then.

> First, I tried scale factor
> 3000 with 32 clients and got these results:
>
> 1 LWLockTranche | buffer_mapping
> [...]
> 21014 LWLockNamed | WALWriteLock
> 28497 |
> 58279 LWLockTranche | wal_insert
>
> tps = 1150.803133 (including connections establishing)
>
> What jumps out here, at least to me, is that there is furious
> contention on both the wal_insert locks and on WALWriteLock.
> Apparently, the system simply can't get WAL on disk fast enough to
> keep up with this workload. Relation extension locks and
> buffer_content locks are also pretty common, both ahead of
> ClientRead, a relatively uncommon wait event on this test. The load
> average on the system was only about 3 during this test, indicating
> that most processes are in fact spending most of their time off-CPU.

Increasing the number of WAL insert slots would help? With your tests this is at 8 all the time.
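[Michael's INSERT ... SELECT idea could look something like the sketch below; the wait_samples table name is invented here for illustration, and the INSERT would be run under \watch 0.5 during the benchmark, replacing the shell counting:]

```sql
-- Sketch only: accumulate samples server-side in an unlogged table.
CREATE UNLOGGED TABLE wait_samples (wait_event_type text, wait_event text);

-- Run this under \watch 0.5 while the benchmark is running:
INSERT INTO wait_samples
SELECT wait_event_type, wait_event
FROM pg_stat_activity
WHERE pid != pg_backend_pid();

-- Afterwards, aggregate in SQL ("no need of extra maths"):
SELECT wait_event_type, wait_event, count(*)
FROM wait_samples
GROUP BY 1, 2
ORDER BY count(*);
```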
> The first thing I tried was switching to unlogged tables, which
> produces these results:
>
> 1 BufferPin | BufferPin
> 1 LWLockTranche | lock_manager
> 2 LWLockTranche | buffer_mapping
> 8 LWLockNamed | ProcArrayLock
> 9 LWLockNamed | CheckpointerCommLock
> 9 LWLockNamed | CLogControlLock
> 11 LWLockTranche | buffer_content
> 37 LWLockTranche | clog
> 153 Lock | tuple
> 388 LWLockNamed | XidGenLock
> 827 Lock | transactionid
> 1267 Client | ClientRead
> 20631 Lock | extend
> 91767 |
>
> tps = 1223.239416 (including connections establishing)
>
> If you don't look at the TPS number, these results look like a vast
> improvement. The overall amount of time spent not waiting for
> anything is now much higher, and the problematic locks have largely
> disappeared from the picture. However, the load average now shoots up
> to about 30, because most of the time that the backends are "not
> waiting for anything" they are in fact in kernel wait state D; that
> is, they're stuck doing I/O. This suggests that we might want to
> consider advertising a wait state when a backend is doing I/O, so we
> can measure this sort of thing.

Maybe something in fd.c... It may be a good idea to actually have a look at perf output on Linux to see where this contention is happening, use this conclusion to decide the place of a wait point, and then see if other OSes share a similar pattern.

-- 
Michael
Robert,

This contention on WAL reminds me of another scenario I've heard about that was similar. To fix things, what happened was that the first process to block would become responsible for writing out the buffers of everyone blocked behind it. For example, if you have many threads A, B, C, D: while A is writing to WAL and holds the lock, B arrives and of course blocks, then C comes along and blocks as well, then D. Finally A finishes its write, and then...

Now you have two options for resolving this, either:

1) A drops its lock, B picks up the lock, writes its buffer, and then drops the lock. Then C gets the lock, writes its buffer, drops the lock, and finally D gets the lock, writes its buffer, and drops the lock.

2) A writes out B's, C's, and D's buffers as well, then A drops the lock; B, C, and D wake up, note that their respective buffers are already written, and just return.

The second greatly speeds up the system. (Just be careful to make sure A doesn't do "too much work", otherwise you can get a sort of livelock if too many threads are blocked behind it; generally, only issue one additional flush on behalf of other threads, do not "loop until the queue is empty".)

I'm not sure if this is actually possible with the way WAL is implemented (or perhaps this strategy is already implemented), but it's definitely worth investigating if not done already, as it can speed things up enormously.

On 10/6/16 11:38 AM, Robert Haas wrote:
> Hi,
>
> I decided to do some testing on hydra (IBM-provided community
> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
> the newly-enhanced wait event stuff to try to get an idea of what
> we're waiting for during pgbench.
Robert, This contention on WAL reminds me of another scenario I've heard about that was similar. To fix things what happened was that anyone that the first person to block would be responsible for writing out all buffers for anyone blocked behind "him". The for example if you have many threads, A, B, C, D If while A is writing to WAL and hold the lock, then B arrives, B of course blocks, then C comes along and blocks as well, then D. Finally A finishes its write and then .... Now you have two options for resolving this, either 1) A drops its lock, B picks up the lock... B writes its buffer and then drops the lock. Then C gets the lock, writes its buffer, drops the lock, then finally D gets the lock, writes its buffer and then drops the lock. 2) A then writes out B's, C's, and D's buffers, then A drops the lock, B, C and D wake up, note that their respective buffers are written and just return. This greatly speeds up the system. (just be careful to make sure A doesn't do "too much work" otherwise you can get a sort of livelock if too many threads are blocked behind it, generally only issue one additional flush on behalf of other threads, do not "loop until the queue is empty") I'm not sure if this is actually possible with the way WAL is implemented, (or perhaps if this strategy is already implemented) but it's definitely worth if not done already as it can speed things up enormously. On 10/6/16 11:38 AM, Robert Haas wrote: > Hi, > > I decided to do some testing on hydra (IBM-provided community > resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using > the newly-enhanced wait event stuff to try to get an idea of what > we're waiting for during pgbench. 
I did 30-minute pgbench runs with > various configurations, but all had max_connections = 200, > shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = > off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, > log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = > on. During each run, I ran this psql script in another window and > captured the output: > > \t > select wait_event_type, wait_event from pg_stat_activity where pid != > pg_backend_pid() > \watch 0.5 > > Then, I used a little shell-scripting to count up the number of times > each wait event occurred in the output. First, I tried scale factor > 3000 with 32 clients and got these results: > > 1 LWLockTranche | buffer_mapping > 9 LWLockNamed | CLogControlLock > 14 LWLockNamed | ProcArrayLock > 16 Lock | tuple > 25 LWLockNamed | CheckpointerCommLock > 49 LWLockNamed | WALBufMappingLock > 122 LWLockTranche | clog > 182 Lock | transactionid > 287 LWLockNamed | XidGenLock > 1300 Client | ClientRead > 1375 LWLockTranche | buffer_content > 3990 Lock | extend > 21014 LWLockNamed | WALWriteLock > 28497 | > 58279 LWLockTranche | wal_insert > > tps = 1150.803133 (including connections establishing) > > What jumps out here is, at least to me, is that there is furious > contention on both the wal_insert locks and on WALWriteLock. > Apparently, the system simply can't get WAL on disk fast enough to > keep up with this workload. Relation extension locks and > buffer_content locks also are also pretty common, both ahead of > ClientRead, a relatively uncommon wait event on this test. The load > average on the system was only about 3 during this test, indicating > that most processes are in fact spending most of their time off-CPU. 
> The first thing I tried was switching to unlogged tables, which > produces these results: > > 1 BufferPin | BufferPin > 1 LWLockTranche | lock_manager > 2 LWLockTranche | buffer_mapping > 8 LWLockNamed | ProcArrayLock > 9 LWLockNamed | CheckpointerCommLock > 9 LWLockNamed | CLogControlLock > 11 LWLockTranche | buffer_content > 37 LWLockTranche | clog > 153 Lock | tuple > 388 LWLockNamed | XidGenLock > 827 Lock | transactionid > 1267 Client | ClientRead > 20631 Lock | extend > 91767 | > > tps = 1223.239416 (including connections establishing) > > If you don't look at the TPS number, these results look like a vast > improvement. The overall amount of time spent not waiting for > anything is now much higher, and the problematic locks have largely > disappeared from the picture. However, the load average now shoots up > to about 30, because most of the time that the backends are "not > waiting for anything" they are in fact in kernel wait state D; that > is, they're stuck doing I/O. This suggests that we might want to > consider advertising a wait state when a backend is doing I/O, so we > can measure this sort of thing. > > Next, I tried lowering the scale factor to something that fits in > shared buffers. Here are the results at scale factor 300: > > 14 Lock | tuple > 22 LWLockTranche | lock_manager > 39 LWLockNamed | WALBufMappingLock > 331 LWLockNamed | CLogControlLock > 461 LWLockNamed | ProcArrayLock > 536 Lock | transactionid > 552 Lock | extend > 716 LWLockTranche | buffer_content > 763 LWLockNamed | XidGenLock > 2113 LWLockNamed | WALWriteLock > 6190 LWLockTranche | wal_insert > 25002 Client | ClientRead > 78466 | > > tps = 27651.562835 (including connections establishing) > > Obviously, there's a vast increase in TPS, and the backends seem to > spend most of their time actually doing work. ClientRead is now the > overwhelmingly dominant wait event, although wal_insert and > WALWriteLock contention is clearly still a significant problem. 
> Contention on other locks is apparently quite rare. Notice that > client reads are really significant here - more than 20% of the time > we sample what a backend is doing, it's waiting for the next query. > It seems unlikely that performance on this workload can be improved > very much by optimizing anything other than WAL writing, because no > other wait event appears in more than 1% of the samples. It's not > clear how much of the WAL-related stuff is artificial lock contention > and how much of it is the limited speed at which the disk can rotate. > > However, I was curious about what's going on with CLogControlLock, > persuant to the other thread on that topic, so I reran this with > unlogged tables. At scale factor 300, 32 clients, unlogged tables, I > get these results: > > 6 LWLockTranche | wal_insert > 12 Lock | tuple > 26 LWLockTranche | lock_manager > 39 LWLockTranche | buffer_content > 353 LWLockNamed | CLogControlLock > 377 Lock | extend > 491 Lock | transactionid > 500 LWLockNamed | ProcArrayLock > 696 LWLockNamed | XidGenLock > 27685 Client | ClientRead > 84930 | > > WAL contention is eliminated, and CLog contention doesn't increase. > Next, I tried ramping up the client count to see what effect that had. > These next three results are all at scale factor 300 with unlogged > tables. 
> > 64 clients: > > 1 BufferPin | BufferPin > 2 LWLockTranche | buffer_mapping > 51 LWLockTranche | wal_insert > 52 Lock | relation > 132 LWLockTranche | lock_manager > 154 LWLockTranche | buffer_content > 209 Lock | tuple > 940 Lock | extend > 1412 LWLockNamed | XidGenLock > 2337 LWLockNamed | CLogControlLock > 2813 LWLockNamed | ProcArrayLock > 3859 Lock | transactionid > 83760 Client | ClientRead > 133840 | > > tps = 36672.484524 (including connections establishing) > > 96 clients: > > 1 LWLockTranche | proc > 53 LWLockTranche | wal_insert > 305 LWLockTranche | buffer_content > 391 LWLockTranche | lock_manager > 797 Lock | tuple > 1074 Lock | extend > 2492 LWLockNamed | XidGenLock > 4185 LWLockNamed | ProcArrayLock > 6003 LWLockNamed | CLogControlLock > 11348 Lock | transactionid > 142881 Client | ClientRead > 174011 | > > tps = 37214.257661 (including connections establishing) > > 192 clients: > > 1 LWLockTranche | buffer_mapping > 2 BufferPin | BufferPin > 4 LWLockTranche | clog > 8 LWLockTranche | proc > 87 LWLockTranche | wal_insert > 708 LWLockTranche | buffer_content > 1892 LWLockTranche | lock_manager > 2225 Lock | extend > 4796 LWLockNamed | ProcArrayLock > 5105 LWLockNamed | XidGenLock > 7982 Lock | tuple > 11882 LWLockNamed | CLogControlLock > 64091 Lock | transactionid > 275481 | > 308627 Client | ClientRead > > tps = 35018.906863 (including connections establishing) > > There are a couple of interesting things here. First, CLogControlLock > contention becomes more significant as the number of the clients goes > up. At 32 clients, it is the 6th-most frequent wait event; at 192 > clients, it's moved up to 4th, and it's clearly growing at a faster > rate than some of the others. As we go from 32 to 64 to 96 to 192 > clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples, > but CLogControLock is captured in 353, 2337, 6003, 11882 samples, so > it goes from being only 2/3 of ProcArrayLock to being more than twice > ProcArrayLock. 
However, in absolute terms, it's still not very > significant. Even at 192 clients, where it's the fourth-most-common > wait event, it appears in only 1.7% of samples, which means backends > are only waiting for it about 1.7% of the time. It appears that, on > this system, this workload is NEVER going to get meaningfully faster > by improving CLogControlLock. > > Second, ClientRead becomes a bigger and bigger issue as the number of > clients increases; by 192 clients, it appears in 45% of the samples. > That basically means that pgbench is increasingly unable to keep up > with the server; for whatever reason, it suffers more than the server > does from the increasing lack of CPU resources. Third, > Lock/transactionid and Lock/tuple become more and more common as the > number of clients increases; these happen when two different pgbench > threads decide to hit the same row at the same time. Due to the > birthday paradox this increases pretty quickly as the number of > clients ramps up, but it's not really a server issue: there's nothing > the server can do about the possibility that two or more clients pick > the same random number at the same time. >
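The tallying step Robert describes ("a little shell-scripting to count up the number of times each wait event occurred") was not posted, but it can be sketched roughly like this; the file name and exact formatting are assumptions:

```shell
# Assumed setup: the psql \watch output was captured to waits.log, one
# "type | event" pair per line. Create a tiny stand-in sample here.
cat > waits.log <<'EOF'
Client | ClientRead
LWLockNamed | WALWriteLock
Client | ClientRead
EOF

# Trim whitespace, then count occurrences of each distinct wait event,
# least common first -- the same shape as the tables in the mail.
sed 's/^ *//; s/ *$//' waits.log | sort | uniq -c | sort -n
```

With the real 0.5-second samples over a 30-minute run, the counts are effectively a poor man's wait-event profile of the whole system.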
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Next, I tried lowering the scale factor to something that fits in
shared buffers. Here are the results at scale factor 300:
14 Lock | tuple
22 LWLockTranche | lock_manager
39 LWLockNamed | WALBufMappingLock
331 LWLockNamed | CLogControlLock
461 LWLockNamed | ProcArrayLock
536 Lock | transactionid
552 Lock | extend
716 LWLockTranche | buffer_content
763 LWLockNamed | XidGenLock
2113 LWLockNamed | WALWriteLock
6190 LWLockTranche | wal_insert
25002 Client | ClientRead
78466 |
tps = 27651.562835 (including connections establishing)
Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work. ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare. Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples. It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.
What happens if you turn fsync off? Once an xlog file is fully written, it is immediately fsynced, even if the backend is holding WALWriteLock or wal_insert (or both) at the time, and even if synchronous_commit is off. Assuming this machine has a BBU so that it doesn't have to wait for disk rotation, fsyncs are still expensive because the kernel has to find all the data and get it sent over to the BBU, while holding locks.
....
Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources.
I would be careful about that interpretation. If you asked pgbench, it would probably have the opposite opinion.
The backend tosses its response at the kernel (which will never block, because the pgbench responses are all small and the kernel will buffer them) and then goes into ClientRead. After the backend goes into ClientRead, the kernel needs to find and wake up pgbench, deliver the response, and pgbench has to receive and process the response. Only then does it create a new query (I've toyed before with having pgbench construct the next query while it is waiting for a response on the previous one, but that didn't seem promising, and much of pgbench has been rewritten since then) and pass the query back to the kernel. Then the kernel has to find and wake up the backend and deliver the new query. So for a reasonable chunk of the time that the server thinks it is waiting for the client, the client also thinks it is waiting for the server.
I think we need to come up with some benchmarking queries which get more work done per round-trip to the database. And build them into the binary, because otherwise people won't use them as much as they should if they have to pass "-f" files around mailing lists and blog postings. For example, we could enclose the 5 statements of the TPC-B-like transaction in a single function which takes aid, bid, tid, and delta as arguments. And presumably we could drop the other two statements (BEGIN and COMMIT) as well, and rely on autocommit to get that job done. So we could go from 7 statements to 1.
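One possible shape of such a function (a sketch only; the name and the exact body are guesses at the idea, not the implementation Jeff later attached):

```sql
-- Hypothetical sketch: fold the five data-modifying statements of the
-- TPC-B-like script into one server-side function, so each transaction
-- becomes a single round-trip under autocommit.
CREATE FUNCTION pgbench_transaction(p_aid int, p_bid int, p_tid int, p_delta int)
RETURNS int LANGUAGE plpgsql AS $$
DECLARE
    v_abalance int;
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + p_delta WHERE aid = p_aid;
    SELECT abalance INTO v_abalance FROM pgbench_accounts WHERE aid = p_aid;
    UPDATE pgbench_tellers  SET tbalance = tbalance + p_delta WHERE tid = p_tid;
    UPDATE pgbench_branches SET bbalance = bbalance + p_delta WHERE bid = p_bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
        VALUES (p_tid, p_bid, p_aid, p_delta, CURRENT_TIMESTAMP);
    RETURN v_abalance;
END;
$$;
```

The client side would then issue a single `SELECT pgbench_transaction(:aid, :bid, :tid, :delta);` per transaction instead of seven statements.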
Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads decide to hit the same row at the same time. Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.
What I have done in the past is chop a zero off from:
#define naccounts 100000
and recompile pgbench. Then you can increase the scale factor so that you have less contention on pgbench_branches while still fitting the data in shared_buffers, or in RAM.
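That edit-and-recompile trick can be sketched as follows; the file here is a stand-in containing just the relevant line, since the real path depends on the source tree (src/bin/pgbench/pgbench.c in recent trees):

```shell
# Stand-in for the pgbench source file (path is an assumption).
cat > pgbench.c <<'EOF'
#define naccounts 100000
EOF

# Chop a zero off naccounts (100000 -> 10000); after this, rebuilding
# pgbench makes each unit of scale factor cover one tenth as many
# accounts, so you can raise the scale factor (less pgbench_branches
# contention) while keeping the data small enough for shared_buffers.
sed -i 's/#define naccounts 100000/#define naccounts 10000/' pgbench.c
grep '#define naccounts' pgbench.c
```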
Cheers,
Jeff
Hi, On 2016-10-06 14:38:56 -0400, Robert Haas wrote: > Obviously, there's a vast increase in TPS, and the backends seem to > spend most of their time actually doing work. ClientRead is now the > overwhelmingly dominant wait event, although wal_insert and > WALWriteLock contention is clearly still a significant problem. > Contention on other locks is apparently quite rare. Notice that > client reads are really significant here - more than 20% of the time > we sample what a backend is doing, it's waiting for the next query. > It seems unlikely that performance on this workload can be improved > very much by optimizing anything other than WAL writing, because no > other wait event appears in more than 1% of the samples. I don't think you meant that, but this sounds a bit like there's nothing to do performance-wise - I don't think that's true. There's no significant lock-contention, yes. But that obviously doesn't mean we can't use resources more efficiently. > Second, ClientRead becomes a bigger and bigger issue as the number of > clients increases; by 192 clients, it appears in 45% of the samples. > That basically means that pgbench is increasingly unable to keep up > with the server; for whatever reason, it suffers more than the server > does from the increasing lack of CPU resources. Isn't that more a question of pgbench threads being preempted, so server processes can run? Since there aren't enough hardware threads for both servers and clients to always run, you *have* to switch between them. Usually a read from the client after processing a query will cause a directed wakeup of the other thread (which is why it's not a very frequently observed state), but if the queue of to-be-run tasks is very long that'll not happen. Greetings, Andres Freund
Hi, On 2016-10-06 18:15:58 -0400, Robert Haas wrote: > That's pretty tight, especially since I now notice Andres also left a > postmaster running on this machine back in April, with > shared_buffers=8GB. Oops, sorry for that. - Andres
Hi, On 2016-10-06 20:52:22 -0700, Alfred Perlstein wrote: > This contention on WAL reminds me of another scenario I've heard about that > was similar. > > To fix things, what happened was that the first person to block > would be responsible for writing out all buffers for anyone blocked behind > "him". We pretty much do that already. But while that's happening, the other would-be-writers show up as blocking on the lock. We use kind of an odd locking model for the waiters (LWLockAcquireOrWait()), which waits for the lock to be released, but doesn't try to acquire it afterwards. Instead the wal position is rechecked, and in many cases we'll be done afterwards, because enough has been written out. Greetings, Andres Freund
On 10/7/16 10:42 AM, Andres Freund wrote: > Hi, > > On 2016-10-06 20:52:22 -0700, Alfred Perlstein wrote: >> This contention on WAL reminds me of another scenario I've heard about that >> was similar. >> >> To fix things, what happened was that the first person to block >> would be responsible for writing out all buffers for anyone blocked behind >> "him". > We pretty much do that already. But while that's happening, the other > would-be-writers show up as blocking on the lock. We use kind of > an odd locking model for the waiters (LWLockAcquireOrWait()), which > waits for the lock to be released, but doesn't try to acquire it > afterwards. Instead the wal position is rechecked, and in many cases > we'll be done afterwards, because enough has been written out. > > Greetings, > > Andres Freund > > Are the batched writes all done before fsync is called? Are you sure that A only calls fsync after flushing all the buffers from B, C, and D? Or will it fsync twice? Is there instrumentation to show that? I know there's a tremendous level of skill involved in this code, but I'm simply asking in case there are some tricks. Another strategy that may work is intentionally waiting/buffering a few ms between flushes/fsyncs, for example, making sure that the number of flushes per second doesn't exceed some configurable amount, because each flush likely eats at least one IOP from the disk and there is a maximum IOPS per disk, so you might as well buffer more if you're exceeding that IOPS count. You'll trade some latency, but gain throughput for doing that. Does this make sense? Again, apologies if this has been covered. Is there a whitepaper or blog post or clear way I can examine the algorithm wrt locks/buffering for flushing WAL logs? -Alfred
On Fri, Oct 7, 2016 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-10-06 14:38:56 -0400, Robert Haas wrote: >> Obviously, there's a vast increase in TPS, and the backends seem to >> spend most of their time actually doing work. ClientRead is now the >> overwhelmingly dominant wait event, although wal_insert and >> WALWriteLock contention is clearly still a significant problem. >> Contention on other locks is apparently quite rare. Notice that >> client reads are really significant here - more than 20% of the time >> we sample what a backend is doing, it's waiting for the next query. >> It seems unlikely that performance on this workload can be improved >> very much by optimizing anything other than WAL writing, because no >> other wait event appears in more than 1% of the samples. > > I don't think you meant that, but this sounds a bit like there's > nothing to do performance-wise - I don't think that's true. There's no > significant lock-contention, yes. But that obviously doesn't mean we can't > use resources more efficiently. Yeah, right. Doing the same stuff faster will surely help. It just doesn't look like we can get anywhere by improving the locking, unless it's something related to WAL writing. >> Second, ClientRead becomes a bigger and bigger issue as the number of >> clients increases; by 192 clients, it appears in 45% of the samples. >> That basically means that pgbench is increasingly unable to keep up >> with the server; for whatever reason, it suffers more than the server >> does from the increasing lack of CPU resources. > > Isn't that more a question of pgbench threads being preempted, so server > processes can run? Since there aren't enough hardware threads for both > servers and clients to always run, you *have* to switch between them.
> Usually a read from the client after processing a query will cause a > directed wakeup of the other thread (which is why it's not a very > frequently observed state), but if the queue of to-be-run tasks is very > long that'll not happen. Yeah, maybe I shouldn't have said it suffers more than the server, but rather along with the server. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 7, 2016 at 11:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > What happens if you turn fsync off? Once an xlog file is fully written, it > is immediately fsynced, even if the backend is holding WALWriteLock or > wal_insert (or both) at the time, and even if synchronous_commit is off. > Assuming this machine has a BBU so that it doesn't have to wait for disk > rotation, fsyncs are still expensive because the kernel has to find all the > data and get it sent over to the BBU, while holding locks. Scale factor 300, 32 clients, fsync=off: 5 Lock | tuple 18 LWLockTranche | lock_manager 24 LWLockNamed | WALWriteLock 88 LWLockTranche | buffer_content 265 LWLockTranche | wal_insert 373 LWLockNamed | CLogControlLock 496 LWLockNamed | ProcArrayLock 532 Lock | extend 540 LWLockNamed | XidGenLock 545 Lock | transactionid 27067 Client | ClientRead 85364 | But I'm not sure you're right about the way the fsync=off code works. I think pg_fsync()/pg_fsync_writethrough()/pg_fsync_no_writethrough() look at enableFsync and just do nothing if it's false. >> Second, ClientRead becomes a bigger and bigger issue as the number of >> clients increases; by 192 clients, it appears in 45% of the samples. >> That basically means that pgbench is increasingly unable to keep up >> with the server; for whatever reason, it suffers more than the server >> does from the increasing lack of CPU resources. > > I would be careful about that interpretation. If you asked pgbench, it > would probably have the opposite opinion. Yeah, that's possible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 7, 2016 at 11:34 PM, Alfred Perlstein <alfred@freebsd.org> wrote: > > Are the batched writes all done before fsync is called? > In most cases, but if, while writing, the wal segment (16MB in size) is finished, we do fsync, then do the remaining writes, and at the end again perform fsync. > Are you sure that A only calls fsync after flushing all the buffers from B, > C, and D? Or will it fsync twice? > I don't think it is guaranteed; however, if B, C and D have finished writing to wal buffers before A starts writing out buffers, then A will ensure to combine all those writes and call fsync just once. > Is there instrumentation to show that? > Currently, I don't think we have instrumentation to show that, but you can add it if you want. In XLogFlush(), we do print the write request values (if XLOG_DEBUG is set and we have compiled the code with the WAL_DEBUG option); you can put one print in XLogWrite once the write/flush is done and another print when the request is satisfied without calling XLogWrite. > I know there's a tremendous level of skill involved in this code, but simply > asking in case there's some tricks. > > Another strategy that may work is actually intentionally waiting/buffering > some few ms between flushes/fsync, We do that before attempting to write if the user has set the "commit_delay" and "commit_siblings" GUC parameters. Now here, we can't buffer the fsync requests, as currently we are doing both writes and fsync under one lock. However, if we can split the work such that writes are done under one lock and fsync under a separate lock, then probably we can try to buffer fsync requests and, after fsyncing the current pending requests, recheck if there are more pending requests and try to flush them. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 7, 2016 at 1:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Oct 7, 2016 at 11:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> What happens if you turn fsync off? Once an xlog file is fully written, it
> is immediately fsynced, even if the backend is holding WALWriteLock or
> wal_insert (or both) at the time, and even if synchronous_commit is off.
> Assuming this machine has a BBU so that it doesn't have to wait for disk
> rotation, still fsyncs are expensive because the kernel has to find all the
> data and get it sent over to the BBU, while holding locks.
Scale factor 300, 32 clients, fsync=off:
5 Lock | tuple
18 LWLockTranche | lock_manager
24 LWLockNamed | WALWriteLock
88 LWLockTranche | buffer_content
265 LWLockTranche | wal_insert
373 LWLockNamed | CLogControlLock
496 LWLockNamed | ProcArrayLock
532 Lock | extend
540 LWLockNamed | XidGenLock
545 Lock | transactionid
27067 Client | ClientRead
85364 |
Did the TPS go up appreciably?
But I'm not sure you're right about the way the fsync=off code works.
I think pg_fsync()/pg_fsync_writethrough()/pg_fsync_no_writethrough()
look at enableFsync and just do nothing if it's false.
I think we are in agreement. I don't know which part you think is wrong. When I said immediately, I didn't mean unconditionally.
Anyway, based on the reduced wait events, I think this shows that if we need to do something in the xlog area, probably what it would be is to add a queue of fully written but un-synced xlog files, so that the syncing can be delegated to the background wal writer process. And of course anyone needing to actually flush their xlog would have to start by flushing the queue.
(Or perhaps just make the xlog files bigger, and call it a day)
Cheers,
Jeff
On Fri, Oct 7, 2016 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Another strategy that may work is actually intentionally waiting/buffering
> some few ms between flushes/fsync,
We do that before attempting to write if the user has set the "commit_delay"
and "commit_siblings" GUC parameters.
If you have a fast, high-resolution timer, then one thing you can do is keep track of when the previous xlog sync finished. Then instead of having commit_delay be an absolute amount of time to sleep, it would mean "wait until that amount of time has passed since the previous sync finished." So you would set it based on the RPM of your drive, so that the time it is sleeping to allow more work to happen from other processes is time it would have to spend waiting on rotational delay anyway.
But I dropped this, because it would be hard to tune, hard to implement in a cross-platform way, and because anyone with such high performance needs is just going to buy a nonvolatile write-cache and be done with it.
Now here, we can't buffer the fsync requests, as currently we are doing
both writes and fsync under one lock. However, if we can split the
work such that writes are done under one lock and fsync under a separate
lock, then probably we can try to buffer fsync requests and after
fsyncing the current pending requests, we can recheck if there are
more pending requests and try to flush them.
What I implemented at one point was:
(Already have the lock before getting here)
Write whatever is ready to be written.
Update the shared structure to reflect the written-up-to position.
Drop the lock.
fsync.
Take the lock again.
Update the shared structure to reflect the flushed-up-to position.
Drop the lock again.
This way, multiple processes could all be waiting on the kernel's fsync response, rather than on each other's locks. What I was hoping would happen is that if one process wrote everything that was ready and called fsync, then while it was waiting for the platter to come around to the writing head, more processes could make more data ready, write that data, and call an fsync of their own. And the kernel would be smart enough to amalgamate them together. But the kernel evidently was not that smart, and performance did not improve.
Cheers,
Jeff
On Fri, Oct 7, 2016 at 8:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
I think we need to come up with some benchmarking queries which get more work done per round-trip to the database. And build them into the binary, because otherwise people won't use them as much as they should if they have to pass "-f" files around mailing lists and blog postings. For example, we could enclose the 5 statements of the TPC-B-like transaction in a single function which takes aid, bid, tid, and delta as arguments. And presumably we could drop the other two statements (BEGIN and COMMIT) as well, and rely on autocommit to get that job done. So we could go from 7 statements to 1.
Here is an implementation of that. I've included the calling code as a patch to pgbench, because if I make it a separate -f file then it is a pain to get the correct scale and settings of naccounts, etc., into it.
The create script could be integrated into pgbench -i if this is something we might want to commit.
This gives me an almost 3-fold increase in performance on a system with fsync turned off:
pgbench -b tpcb-func -T30 -c8 -j8
tps = 24193.197877 (excluding connections establishing)
pgbench -b tpcb-like -T30 -c8 -j8
tps = 8434.746300 (excluding connections establishing)
Cheers,
Jeff
Attachments
On Thu, Oct 6, 2016 at 02:38:56PM -0400, Robert Haas wrote: > Hi, > > I decided to do some testing on hydra (IBM-provided community > resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using > the newly-enhanced wait event stuff to try to get an idea of what > we're waiting for during pgbench. I did 30-minute pgbench runs with > various configurations, but all had max_connections = 200, > shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = > off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, > log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = > on. During each run, I ran this psql script in another window and > captured the output: This is a great study that shows how the new instrumentation has given us a new window into performance. I am frankly surprised we got as far as we did in finding performance bottlenecks before we had this instrumentation. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Mon, Oct 10, 2016 at 9:44 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Thu, Oct 6, 2016 at 02:38:56PM -0400, Robert Haas wrote: >> I decided to do some testing on hydra (IBM-provided community >> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using >> the newly-enhanced wait event stuff to try to get an idea of what >> we're waiting for during pgbench. I did 30-minute pgbench runs with >> various configurations, but all had max_connections = 200, >> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = >> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, >> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = >> on. During each run, I ran this psql script in another window and >> captured the output: > > This is a great study that shows how the new instrumentation has given > us a new window into performance. I am frankly surprised we got as far > as we did in finding performance bottlenecks before we had this > instrumentation. Thanks, and +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company