[HACKERS] An attempt to reduce WALWriteLock contention

From: Kuntal Ghosh
Subject: [HACKERS] An attempt to reduce WALWriteLock contention
Date:
Msg-id: CAGz5QCLUZKRezjnhu2VtU5K-1-JGeGf+aJk8iqvF80z4QNywAw@mail.gmail.com
Responses: Re: [HACKERS] An attempt to reduce WALWriteLock contention  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
           Re: [HACKERS] An attempt to reduce WALWriteLock contention  (jasrajd <jasrajd@microsoft.com>)
List: pgsql-hackers
Hello all,

In a recent post[1], Robert analyzed the wait events for the various LWLocks. The results clearly indicate significant contention on WALWriteLock. To get an idea of this overhead, we ran the following two tests.

1. Hacked the code to comment out the WAL write and flush calls, to see the overhead of WAL writing as a whole. The TPS for read-write pgbench at scale factor 300 with 64 clients increased from 27871 to 45068.

2. Hacked the code to comment out only the WAL flush calls, to see the overhead of WAL flushing (effectively fsync=off). The TPS for read-write pgbench at scale factor 300 with 64 clients increased from 27871 to 41835.

All the tests were run for 15 minutes with the following configuration:
max_wal_size: 40GB
checkpoint_timeout: 15mins
maintenance_work_mem: 4GB
checkpoint_completion_target: 0.9
shared_buffers: 8GB
(Other settings have default values)
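
For reference, both hacks target the point where WAL is handed to the kernel and then made durable. The standalone toy below (not PostgreSQL code; file name and loop count are arbitrary) mimics that write-then-fsync pattern and shows what each test removed: build with -DSKIP_FLUSH to mimic test 2, and with -DSKIP_WRITE -DSKIP_FLUSH to mimic test 1. Timing it against the unmodified version gives a rough analogue of the overheads above.

/*
 * Standalone toy, not PostgreSQL code: a WAL-style append is an 8KB write()
 * followed by an fsync().
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define XLOG_BLCKSZ 8192

int
main(void)
{
	char		page[XLOG_BLCKSZ];
	int			fd = open("fake_wal_segment", O_WRONLY | O_CREAT, 0600);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	memset(page, 0, sizeof(page));

	for (int i = 0; i < 1000; i++)
	{
#ifndef SKIP_WRITE
		if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))	/* the OS write */
		{
			perror("write");
			return 1;
		}
#endif
#ifndef SKIP_FLUSH
		if (fsync(fd) != 0)		/* the flush */
		{
			perror("fsync");
			return 1;
		}
#endif
	}
	close(fd);
	return 0;
}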

From the above experiments it is clear that the flush is the main cost of WAL writing, which is no surprise, but the data shows the exact overhead of the flush. Robert and Amit suggested (in offline discussions) using a separate WALFlushLock for flushing the WAL data: take the flush calls out of WALWriteLock and introduce a new lock (WALFlushLock) under which the data is flushed. This should allow OS writes to proceed while an fsync is in progress. LWLockAcquireOrWait is used for the newly introduced WALFlushLock so that flush requests get accumulated. We ran pgbench read/write tests (scale factor 300) with the above configuration for various client counts, but saw no performance improvement; rather, throughput decreased by 10%-12%. Hence, to measure the wait events, we performed a 30-minute run with 64 clients (the query and the sampled wait-event counts follow the sketch below).
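
To illustrate the idea, the control flow with such a patch looks roughly like the fragment below. This is a simplified sketch, not the attached patch: WALFlushLock is the new lock, the other identifiers are from the existing xlog.c, and error handling and LogwrtResult bookkeeping are omitted.

/* Phase 1: push WAL from wal_buffers to the kernel under WALWriteLock. */
LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
XLogWrite(WriteRqst, false);		/* now issues only the OS write()s */
LWLockRelease(WALWriteLock);

/*
 * Phase 2: flush under the new WALFlushLock.  LWLockAcquireOrWait() makes a
 * backend sleep while someone else is flushing and returns false once that
 * flush finishes; the caller then re-checks whether its LSN is already
 * flushed and retries only if it is not (re-check not shown).  This is how
 * flush requests are meant to accumulate.
 */
if (LWLockAcquireOrWait(WALFlushLock, LW_EXCLUSIVE))
{
	issue_xlog_fsync(openLogFile, openLogSegNo);
	LWLockRelease(WALFlushLock);
}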

\t
select wait_event_type, wait_event from pg_stat_activity where pid != pg_backend_pid();
\watch 0.5
HEAD
------------------------
48642 LWLockNamed | WALWriteLock

With Patch
----------------------------------
31889 LWLockNamed | WALFlushLock
25212 LWLockNamed | WALWriteLock

The contention on WALWriteLock was reduced, but together with WALFlushLock the total contention increased. We also measured the number of times fsync() and write() were called during a 10-minute pgbench read/write test with 16 clients, and noticed a huge increase in write() system calls; this happens precisely because we have reduced the contention on WALWriteLock.

Due to the reduced contention on WALWriteLock, a lot of backends issue small OS writes, sometimes for the same 8KB page, i.e., the write calls are not properly accumulated. For example,
backend 1 - 1 KB write() - 15-20 micro secs
backend 2 - 1 KB write() - 15-20 micro secs
backend 3 - 1 KB write() - 15-20 micro secs
backend 4 - 1 KB write() - 15-20 micro secs
But if we accumulate these 4 requests, the 4KB can be written in 50-60 microseconds. Apart from that, we also pay for a lock acquire and release around every OS write and lseek(). For the same reason, while an fsync is in progress we fail to accumulate sufficient data for the next fsync, which further increases the contention on WALFlushLock. So we tried adding a delay (pg_usleep) before the flush/write to accumulate data, but that severely increased the contention on WALFlushLock.
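
The delay we tried amounts to something like the fragment below (the sleep duration shown is just an example value, and the surrounding code is the same simplified sketch as above):

/*
 * Attempted mitigation (sketch): sleep briefly before the write/flush so
 * other backends can append more WAL, hoping to turn many ~1KB writes into
 * fewer, larger ones.  In practice this only pushed the contention onto
 * WALFlushLock.
 */
pg_usleep(100L);					/* e.g. 100 microseconds */
LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
XLogWrite(WriteRqst, false);
LWLockRelease(WALWriteLock);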

To reduce the contention on WALWriteLock further, Amit suggested the following change on top of the existing patch:
Backend as Write Leader:
Except for one proc, all procs wait for their write location to be written to the OS buffers. Each proc advertises its write location and waits on its semaphore until that location has been written. Only the leader competes for WALWriteLock. After the data is written, the leader wakes all the procs for which it has written WAL, and once done waking them it releases the WALWriteLock. Ashutosh and Amit helped a lot with the implementation of this idea. Even with this change we did not see any noticeable performance improvement in synchronous_commit=on mode, though there was no regression either. Again, to measure the wait events, we performed a 30-minute run with 64 clients (pgbench read/write test, scale factor 300).

\t
select wait_event_type, wait_event from pg_stat_activity where pid != pg_backend_pid();
\watch 0.5
HEAD
------------------------
48642  LWLockNamed | WALWriteLock

With Patch
----------------------------------
38952 LWLockNamed | WALFlushLock
1679 LWLockNamed | WALWriteLock

This greatly reduced the contention on WALWriteLock, since only the group leader competes for the write lock on behalf of a group of procs. Still, the number of small write requests is not reduced.
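
For reference, this leader/follower scheme is similar in spirit to the existing group-XID-clearing machinery in procarray.c. The fragment below only illustrates the flow; WalWriteGroupFirst, myWalWriteRqst, walWriteNext, GroupMaxRqst and the SemWait()/SemWake() helpers are made-up names standing in for per-PGPROC state and the per-proc semaphore, and are not taken from the actual patch.

/* Advertise our write request and push ourselves onto a lock-free list. */
uint32		head;

MyProc->myWalWriteRqst = WriteRqst;			/* illustrative field */
do
{
	head = pg_atomic_read_u32(&WalWriteGroupFirst);
	MyProc->walWriteNext = head;			/* illustrative field */
} while (!pg_atomic_compare_exchange_u32(&WalWriteGroupFirst,
										 &head, (uint32) MyProc->pgprocno));

if (head != INVALID_PGPROCNO)
{
	/* Follower: sleep until the leader has written our WAL. */
	SemWait(MyProc);						/* per-proc semaphore wait */
}
else
{
	/* Leader: write on behalf of the whole group. */
	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
	XLogWrite(GroupMaxRqst, false);			/* max of the group's requests */

	/* Detach the group, wake everyone we wrote for, then release. */
	uint32		next = pg_atomic_exchange_u32(&WalWriteGroupFirst,
											  INVALID_PGPROCNO);

	while (next != INVALID_PGPROCNO)
	{
		PGPROC	   *proc = &ProcGlobal->allProcs[next];

		next = proc->walWriteNext;
		SemWake(proc);						/* per-proc semaphore wake-up */
	}
	LWLockRelease(WALWriteLock);
}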

Finally, we performed some tests with synchronous_commit=off and data that does not fit in shared buffers. In this mode the data should accumulate properly for writes without backends waiting on locks or semaphores; besides, write and fsync can proceed simultaneously. The following results are for various scale factors and shared_buffers settings (please see the end of the mail for the system configuration):

./pgbench -c $threads -j $threads -T 900 -M prepared postgres
non-default parameters:
Scale Factor=1000
shared_buffers=10GB
max_connections=200

threads     HEAD        PATCH   %diff
48          18585       18618   +0.1
64          19631       19735   +0.5
128         19332       20556   +6.3

./pgbench -c $threads -j $threads -T 900 -M prepared postgres
non-default parameters:
Scale Factor=1000
shared_buffers=14GB
max_connections=200

threads     HEAD        PATCH   %diff
48          42156       47398   +12.4
64          41737       45288   +8.36
128         37983       47443   +24.9

./pgbench -c $threads -j $threads -T 900 -M prepared postgres
non-default parameters:
Scale Factor=300
shared_buffers=4GB
max_connections=200

threads     HEAD        PATCH   %diff
48          48151       48665   +1.06
64          52651       52789   +0.2
128         56656       60691   +7.1

We see a good improvement when most of the data fits in shared buffers; apart from that, the improvements are not significant. That may be due to the high I/O caused by buffer evictions when shared buffers are smaller.

In conclusion, we tried to take the flush calls out of WALWriteLock so that OS writes can proceed while an fsync is in progress. For synchronous_commit=off, this improves performance significantly. For the other cases, the reason may be that we are not accumulating write calls properly and are therefore issuing a lot of small write requests; another possibility is the overhead of the extra lock.

Thanks to Amit Kapila, Ashutosh Sharma and Robert Haas for helping me throughout the process with their valuable inputs.

I've attached the prototype patch (PFA). Any suggestions or comments would be really helpful.

System Configuration:

Model name:            Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU(s):                56
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Kernel:          3.10.0-327.36.1.el7.x86_64
pg_wal on /mnt/ssd type ext4 (rw,relatime,data=ordered)

[1] [HACKERS] pgbench vs. wait events


--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com