Re: checkpointer continuous flushing

From: Robert Haas
Subject: Re: checkpointer continuous flushing
Date:
Msg-id: CA+TgmoYepS5ts9Qn_6HpAC-F8qWkvn6jT8duZF5=HtOH-2G2MQ@mail.gmail.com
In reply to: Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>)
Responses: Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Mon, Jan 18, 2016 at 11:39 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
>> Hello Andres,
>>
>> >I measured it in a different number of cases, both on SSDs and spinning
>> >rust. I just reproduced it with:
>> >
>> >postgres-ckpt14 \
>> >       -D /srv/temp/pgdev-dev-800/ \
>> >       -c maintenance_work_mem=2GB \
>> >       -c fsync=on \
>> >       -c synchronous_commit=off \
>> >       -c shared_buffers=2GB \
>> >       -c wal_level=hot_standby \
>> >       -c max_wal_senders=10 \
>> >       -c max_wal_size=100GB \
>> >       -c checkpoint_timeout=30s
>> >
>> >Using a fresh cluster each time (copied from a "template" to save time)
>> >and using
>> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1
>
> So, I've analyzed the problem further, and I think I found something
> rather interesting. I'd profiled the kernel looking where it blocks in
> the IO request queues, and found that the wal writer was involved
> surprisingly often.
>
> So, in a workload where everything (checkpoint, bgwriter, backend
> writes) is flushed:                             2995 tps
> After I kill the wal writer with -STOP:         10887 tps
>
> Stracing the wal writer shows:
>
> 17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
> 17:29:02.001538 rt_sigreturn({mask=[]}) = 0
> 17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
> 17:29:02.001671 fdatasync(3)            = 0
> 17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
> 17:29:02.005043 rt_sigreturn({mask=[]}) = 0
> 17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
> 17:29:02.005147 fdatasync(3)            = 0
> 17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
> 17:29:02.008705 rt_sigreturn({mask=[]}) = 0
> 17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0        "..., 98304) = 98304
> 17:29:02.008822 fdatasync(3)            = 0
> 17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.016141 rt_sigreturn({mask=[]}) = 0
> 17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
> 17:29:02.016281 fdatasync(3)            = 0
> 17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.019199 rt_sigreturn({mask=[]}) = 0
> 17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0        "..., 73728) = 73728
> 17:29:02.019355 fdatasync(3)            = 0
> 17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.022696 rt_sigreturn({mask=[]}) = 0
>
> I.e. we're fdatasync()ing a small number of pages at a time, roughly 500
> times a second. As soon as the wal writer is stopped, it's much bigger chunks,
> on the order of 50-130 pages. And, not that surprisingly, that improves
> performance, because there are far fewer cache flushes submitted to the
> hardware.
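
To make that effect concrete, here is a standalone micro-benchmark sketch -
purely hypothetical, not code from PostgreSQL, and the file name, chunk size,
and chunk count are arbitrary choices - that contrasts issuing fdatasync()
after every 8 kB write with a single fdatasync() over the whole batch.  On
most storage the batched variant wins by a wide margin, simply because far
fewer cache-flush commands reach the device:

/*
 * Hypothetical micro-benchmark, not PostgreSQL code: compare fdatasync()
 * after every 8 kB write against a single fdatasync() for the whole batch.
 * The file name, chunk size, and chunk count are arbitrary.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define CHUNK   8192            /* one WAL-page-sized write */
#define NCHUNKS 512             /* 4 MB of data in total */

static double
elapsed_sec(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

static double
run(const char *path, int sync_every_chunk)
{
    char        buf[CHUNK];
    struct timespec t0, t1;
    int         fd;

    memset(buf, 'x', sizeof(buf));
    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
    {
        perror("open");
        exit(1);
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NCHUNKS; i++)
    {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        {
            perror("write");
            exit(1);
        }
        if (sync_every_chunk)
            (void) fdatasync(fd);       /* flush after every small write */
    }
    if (!sync_every_chunk)
        (void) fdatasync(fd);           /* one flush for the whole batch */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    unlink(path);
    return elapsed_sec(t0, t1);
}

int
main(void)
{
    printf("per-chunk fdatasync: %.3f s\n", run("flushtest.tmp", 1));
    printf("single fdatasync:    %.3f s\n", run("flushtest.tmp", 0));
    return 0;
}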

This seems like a problem with the WAL writer quite independent of
anything else.  It seems likely to be inadvertent fallout from this
patch:

Author: Simon Riggs <simon@2ndQuadrant.com>
Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
    Wakeup WALWriter as needed for asynchronous commit performance.

    Previously we waited for wal_writer_delay before flushing WAL. Now
    we also wake WALWriter as soon as a WAL buffer page has filled.
    Significant effect observed on performance of asynchronous commits
    by Robert Haas, attributed to the ability to set hint bits on tuples
    earlier and so reducing contention caused by clog lookups.

If I understand correctly, prior to that commit, the WAL writer woke up 5
times per second and flushed just that often (unless you changed the
default settings).  But as the commit message explained, that turned
out to suck - you could make performance go up very significantly by
radically decreasing wal_writer_delay.  This commit basically lets it
flush at maximum velocity - as fast as we finish one flush, we can
start the next.  That must have seemed like a win at the time from the
way the commit message was written, but you seem to now be seeing the
opposite effect, where performance is suffering because flushes are
too frequent rather than too infrequent.  I wonder if there's an ideal
flush rate and what it is, and how much it depends on what hardware
you have got.
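
To make that notion of a bounded flush rate concrete, here is a toy,
self-contained simulation - not walwriter.c, and the 10 ms floor plus the
evenly spaced wakeups are invented numbers - of how enforcing a minimum
interval between flushes would coalesce per-page wakeups into fewer, larger
flushes:

/*
 * Toy simulation, not walwriter.c: show how enforcing a minimum interval
 * between flushes coalesces per-page wakeups into fewer, larger flushes.
 * The 10 ms floor and the evenly spaced wakeups are invented numbers.
 */
#include <stdio.h>

#define NWAKEUPS     500                /* ~500 wakeups/second, as in the strace above */
#define USEC_PER_SEC 1000000L

int
main(void)
{
    long    min_flush_interval_us = 10000;      /* hypothetical 10 ms floor */
    long    last_flush_us = -min_flush_interval_us;
    int     flushes = 0;
    int     pages_pending = 0;

    for (int i = 0; i < NWAKEUPS; i++)
    {
        long    now_us = (long) i * USEC_PER_SEC / NWAKEUPS;

        pages_pending++;                /* each wakeup means another WAL page filled */

        if (now_us - last_flush_us >= min_flush_interval_us)
        {
            /* Enough time has passed: flush everything accumulated so far. */
            flushes++;
            printf("flush %3d at %6ld us, %d pages\n", flushes, now_us, pages_pending);
            pages_pending = 0;
            last_flush_us = now_us;
        }
        /* else: too soon after the last flush, leave the pages for later */
    }

    printf("%d wakeups -> %d flushes\n", NWAKEUPS, flushes);
    return 0;
}

With these made-up numbers, 500 wakeups collapse into 100 flushes of about
5 pages each; what the right floor actually is on real hardware is exactly
the open question.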

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


