Re: Handing off SLRU fsyncs to the checkpointer

From: Thomas Munro
Subject: Re: Handing off SLRU fsyncs to the checkpointer
Date:
Msg-id: CA+hUKGKErSJ-p8=XGwy2T9DHGqDiLr2Mj+-1nMfq2MjU=_aJjQ@mail.gmail.com
In reply to: Handing off SLRU fsyncs to the checkpointer  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Handing off SLRU fsyncs to the checkpointer  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On Wed, Feb 12, 2020 at 9:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> In commit 3eb77eba we made it possible for any subsystem that wants a
> file to be flushed as part of the next checkpoint to ask the
> checkpointer to do that, as previously only md.c could do.

Hello,

While working on recovery performance, I found my way back to this
idea and rebased the patch.

Problem statement:

Every time we have to write out a page of pg_commit_ts, pg_multixact
or pg_xact due to cache pressure, we immediately call fsync().  This
runs serially in the recovery process, and it's quite bad for
pg_commit_ts, because we need to dump out a page for every ~800
transactions (track_commit_timestamps is not enabled by default).  If
we ask the checkpointer to do it, it collapses the 2048 fsync calls
for each SLRU segment into one, and the kernel can write out the data
with larger I/Os, maybe even ahead of time, and update the inode only
once.
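
The shape of the handoff is roughly: instead of fsync'ing on the spot,
the writer builds a FileTag describing the segment file and calls
RegisterSyncRequest(), and the checkpointer absorbs those tags into its
pending-ops table and syncs each file once at the next checkpoint.  A
minimal sketch of what that could look like at the point where slru.c
currently fsyncs the segment it has just written (the handler value,
tag encoding and helper name here are illustrative, not taken from the
patch):

  #include "postgres.h"
  #include "storage/sync.h"

  static void
  queue_slru_segment_sync(int segno)
  {
      FileTag     tag;

      /* Identify the segment file for the checkpointer's pending-ops table. */
      memset(&tag, 0, sizeof(tag));
      tag.handler = SYNC_HANDLER_CLOG;   /* hypothetical per-SLRU handler */
      tag.segno = (uint32) segno;

      /*
       * Hand the fsync to the checkpointer.  If its request queue is full
       * and we don't want to wait, we have to sync the file ourselves,
       * the same way md.c's register_dirty_segment() falls back today.
       */
      if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */))
      {
          /* fall back: open the segment and pg_fsync() it here */
      }
  }

Since the pending-ops table de-duplicates by tag, it doesn't matter how
many pages of a segment we evict between checkpoints: the checkpointer
opens and fsyncs the file only once.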

Experiment:

Run crash recovery for 1 million pgbench transactions:

  postgres -D pgdata \
    -c synchronous_commit=off \
    -c track_commit_timestamps=on \
    -c max_wal_size=10GB \
    -c checkpoint_timeout=60min

  # in another shell
  pgbench -i -s10 postgres
  psql postgres -c checkpoint
  pgbench -t1000000 -Mprepared postgres
  killall -9 postgres

  # save the crashed pgdata dir for repeated experiments
  mv pgdata pgdata-save

  # now run experiments like this and see how long recovery takes
  rm -fr pgdata
  cp -r pgdata-save pgdata
  postgres -D pgdata

What I see on a system that has around 2.5ms latency for fsync:

  master: 16.83 seconds
  patched: 4.00 seconds

It's harder to see without commit timestamps enabled, since we only
need to flush a pg_xact page every 32k transactions (and multixacts
are more complicated to test), but you can still see the effect.  With
8x more transactions to make it clearer what's going on, I could
measure a speedup of around 6% from this patch, which I suppose scales
up fairly obviously with storage latency (every million transactions =
at least 30 fsync stalls, so you can multiply that by your fsync
latency and work out how much time your recovery process will be
asleep at the wheel instead of applying your records).

From a syscall overhead point of view, it's a bit unfortunate that we
open and close SLRU segments every time we write, but it's probably
not really enough to complain about... except for the (small) risk of
an inode dropping out of kernel caches in the time between closing it
and the checkpointer opening it.  Hmm.

