Re: Refactoring the checkpointer's fsync request queue

From: Shawn Debnath
Subject: Re: Refactoring the checkpointer's fsync request queue
Date:
Msg-id: 20190220232739.GA8280@f01898859afd.ant.amazon.com
In reply to: Re: Refactoring the checkpointer's fsync request queue  (Shawn Debnath <sdn@amazon.com>)
Responses: Re: Refactoring the checkpointer's fsync request queue  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
As promised, here's a patch that addresses the points discussed by 
Andres and Thomas at FOSDEM. As a result of how we want the checkpointer 
to track which files to fsync, the pending ops table now integrates the 
forknum and segno as part of the hash key, eliminating the need for the 
bitmapsets or vectors from the previous iterations. We reconstruct the 
pathnames from the RelFileNode, ForkNumber and SegmentNumber and use 
PathNameOpenFile to get the file descriptor to use for fsync.
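
To illustrate the shape of this (the struct and field names below are 
mine, not necessarily what the patch uses), the per-segment hash key 
looks roughly like:

    /* Sketch only: per-segment pending-ops hash key, names illustrative */
    typedef struct PendingFsyncKey
    {
        RelFileNode   rnode;      /* tablespace, database, relation */
        ForkNumber    forknum;    /* MAIN_FORKNUM, FSM_FORKNUM, ... */
        SegmentNumber segno;      /* segment number within the fork */
    } PendingFsyncKey;

    /*
     * At checkpoint time the path is rebuilt from the key and the file
     * is reopened, roughly:
     *
     *     path = <smgr component callback>(key.rnode, key.forknum, key.segno);
     *     file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
     *     FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
     */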

Apart from that, this patch moves the system for requesting and 
processing fsyncs out of md.c into smgr.c, allowing us to call on smgr 
component specific callbacks to retrieve metadata like relation and 
segment paths. This allows each smgr component to maintain how 
relfilenodes, forks and segments map to specific files without exposing 
this knowledge to the generic smgr layer.  It redefines smgrsync() 
behavior to be closer to that of smgrimmedsync(), i.e., if a regular 
sync is required for a particular file, enqueue it locally or forward it 
to the checkpointer.  smgrimmedsync() retains the existing behavior and 
fsyncs the file right away. The processing of fsync requests has been 
moved from mdsync() to a new ProcessFsyncRequests() function.
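
For illustration, the per-component hook could look something like the 
sketch below; the callback name and exact signature are my own 
invention, only the idea of component-specific path lookup comes from 
the description above:

    /* Sketch only: a per-component callback in the smgr function table
     * that lets the checkpointer map a pending request back to a path. */
    typedef struct f_smgr
    {
        /* ... existing callbacks: smgr_read, smgr_write, smgr_extend ... */
        char   *(*smgr_segmentpath) (RelFileNode rnode, ForkNumber forknum,
                                     SegmentNumber segno);
    } f_smgr;

The enqueue-locally-or-forward logic for smgrsync() can then live in 
generic code, presumably with ForwardFsyncRequest() doing the hand-off 
to the checkpointer as it does today.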

Testing
-------

Checkpointer stats didn't cover what I wanted to verify, i.e., time 
spent dealing with the pending operations table. So I added temporary 
instrumentation to get the numbers by timing the code in 
ProcessFsyncRequests, which starts by absorbing fsync requests from the 
checkpointer queue, then processes them and finally issues syncs on the 
files. Similarly, I added the same instrumentation to the mdsync code on 
the master branch. The time to actually execute FileSync is irrelevant for 
this patch.
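
Concretely, the measurement is just the usual instr_time pattern wrapped 
around the absorb-and-process path, along these lines (a sketch of the 
idea, not the exact attached patch):

    /* Sketch of the temporary timing instrumentation; measures absorbing
     * the request queue plus processing the pending-ops hash table. */
    instr_time  start, duration;

    INSTR_TIME_SET_CURRENT(start);

    AbsorbFsyncRequests();   /* drain the checkpointer's request queue */
    /* ... walk the pending-ops table, issuing FileSync() per segment ... */

    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);

    elog(LOG, "fsync request processing took %.3f ms",
         INSTR_TIME_GET_MILLISEC(duration));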

I did two separate runs for 30 mins, both with scale=10,000 on 
i3.8xlarge instances [1] with default params to force frequent 
checkpoints:

1. A single pgbench run with 1000 clients updating 4 tables; as a result 
we get 4 relations, their forks, and several segments in each being 
synced.

2. 10 parallel pgbench runs on 10 separate databases with 200 clients 
each. This results in more relations and more segments being touched, 
letting us better compare against the bitmapset optimizations.

Results
--------

The important metric to look at is the total time spent absorbing and 
processing the fsync requests, as that is what the changes revolve 
around. The other metrics are here for posterity. The new code is about 
6% faster in total time taken to process the queue for the single 
pgbench run. For the 10x parallel pgbench run, we are seeing drops of up 
to 70% with the patch.

It would be great if some other folks could verify this. The temporary 
instrumentation patches for the master branch and one that applies on 
top of the main patch are attached. Enable log_checkpoints and then use 
grep and cut to extract the numbers from the log file after the runs.

[Requests Absorbed]

single pgbench run
            Min     Max     Average    Median   Mode    Std Dev
 -------- ------- -------- ---------- -------- ------- ----------
  patch    15144   144961   78628.84    76124   58619   24135.69
  master   25728   138422   81455.04    80601   25728   21295.83

10 parallel pgbench runs
            Min      Max      Average    Median    Mode    Std Dev
 -------- -------- -------- ----------- -------- -------- ----------
  patch     45098   282158    155969.4   151603   153049   39990.91
  master   191833   602512   416533.86   424946   191833   82014.48


[Files Synced]

single pgbench run
           Min   Max   Average   Median   Mode   Std Dev
 -------- ----- ----- --------- -------- ------ ---------
  patch    153   166    158.11      158    159      1.86
  master   154   166    158.29      159    159     10.29

10 parallel pgbench runs
           Min    Max    Average   Median   Mode   Std Dev
 -------- ------ ------ --------- -------- ------ ---------
  patch    1540   1662   1556.42     1554   1552     11.12
  master   1546   1546      1546     1559   1553     12.79


[Total Time in ProcessFsyncRequests/mdsync]

single pgbench run
           Min     Max     Average   Median   Mode   Std Dev
 -------- ----- --------- --------- -------- ------ ---------
  patch    500   3833.51   2305.22     2239    500    510.08
  master   806   4430.32   2458.77     2382    806    497.01

10 parallel pgbench runs
           Min     Max    Average    Median   Mode   Std Dev
 -------- ------ ------- ---------- -------- ------ ---------
  patch     908    6927    3022.58     2863    908    939.09
  master   4323   17858   10982.15    11154   4322   2760.47


 
[1] https://aws.amazon.com/ec2/instance-types/i3/

-- 
Shawn Debnath
Amazon Web Services (AWS)
