Re: design for parallel backup

From: Andres Freund
Subject: Re: design for parallel backup
Date:
Msg-id: 20200503023643.x2o2244sa4vkikyb@alap3.anarazel.de
In response to: Re: design for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: design for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2020-05-01 16:32:15 -0400, Robert Haas wrote:
> On Thu, Apr 30, 2020 at 6:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Apr 30, 2020 at 3:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > Why 8kb? That's smaller than what we currently do in pg_basebackup,
> > > afaictl, and you're actually going to be bottlenecked by syscall
> > > overhead at that point (unless you disable / don't have the whole intel
> > > security mitigation stuff).
> >
> > I just picked something. Could easily try other things.
> 
> I tried changing the write size to 64kB, keeping the rest the same.
> Here are the results:
> 
> filesystem media 1@64GB 2@32GB 4@16GB 8@8GB 16@4GB
> xfs mag 65 53 64 74 79
> ext4 mag 96 68 75 303 437
> xfs ssd 75 43 29 33 38
> ext4 ssd 96 68 63 214 254
> spread spread n/a n/a 43 38 40
> 
> And here again are the previous results with an 8kB write size:
> 
> xfs mag 97 53 60 67 71
> ext4 mag 94 68 66 335 549
> xfs ssd 97 55 33 27 25
> ext4 ssd 116 70 66 227 450
> spread spread n/a n/a 48 42 44
> 
> Generally, those numbers look better than the previous numbers, but
> parallelism still looks fairly appealing on the SSD storage - less so
> on magnetic disks, at least in this test.

I spent a fair bit of time analyzing this, and my conclusion is that you
might largely be seeing numa effects. Yay.

I don't have as large a numa machine at hand, but here's what I'm
seeing on my local machine during a run writing out 400GiB (this
particular run had other noise on the machine; the benchmarks below do
not). The machine has 192GiB of RAM, evenly distributed across two
sockets / numa domains.


At the start I see:
numastat -m | grep -E 'MemFree|MemUsed|Dirty|Writeback|Active\(file\)|Inactive\(file\)'
MemFree                 91908.20        92209.85       184118.05
MemUsed                  3463.05         4553.33         8016.38
Active(file)              105.46          328.52          433.98
Inactive(file)             68.29          190.14          258.43
Dirty                       0.86            0.90            1.76
Writeback                   0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00

For a while there's pretty decent IO throughput (all 10s samples):
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz
d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
 
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00 1955.67   2299.32     0.00   0.00   42.48  1203.94
0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00   82.10  89.33
 

Then it starts to be slower on a sustained basis:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz
d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
 
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00 1593.33   1987.85     0.00   0.00   42.90  1277.55
0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00   67.55  76.53
 

And then performance tanks completely:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz
d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
 
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00  646.33    781.85     0.00   0.00  132.68  1238.70
0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00   85.43  58.63
 


That amount of degradation confused me for a while, especially because
it got harder to reproduce the more controlled I made the setup. In
particular, I stopped seeing the same magnitude of issues after pinning
processes to one numa socket (both CPU and memory).

After a few seconds:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz
d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
 
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00 1882.00   2320.07     0.00   0.00   42.50  1262.35
0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00   79.05  88.07
 

MemFree                 35356.50        80986.46       116342.96
MemUsed                 60014.75        15776.72        75791.47
Active(file)              179.44          163.28          342.72
Inactive(file)          58293.18        13385.15        71678.33
Dirty                   18407.50          882.00        19289.50
Writeback                 235.78          335.43          571.21
WritebackTmp                0.00            0.00            0.00

A bit later io starts to get slower:

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz
d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
 
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00 1556.30   1898.70     0.00   0.00   40.92  1249.29
0.00     0.00     0.00   0.00    0.00     0.00    0.20   24.00   62.90  72.01
 

MemFree                   519.56        36086.14        36605.70
MemUsed                 94851.69        60677.04       155528.73
Active(file)              303.84          212.96          516.80
Inactive(file)          92776.70        58133.28       150909.97
Dirty                   10913.20         5374.07        16287.27
Writeback                 812.94          331.96         1144.90
WritebackTmp                0.00            0.00            0.00


And then later it gets worse:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz
d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
 
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00 1384.70   1671.25     0.00   0.00   40.87  1235.91
0.00     0.00     0.00   0.00    0.00     0.00    0.20    7.00   55.89  63.45
 

MemFree                   519.54          242.98          762.52
MemUsed                 94851.71        96520.20       191371.91
Active(file)              175.82          246.03          421.85
Inactive(file)          92820.19        93985.79       186805.98
Dirty                   10482.75         4140.72        14623.47
Writeback                   0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00

When using a 1s iostat interval instead of 10s, it's noticeable that
performance swings widely between very slow (<100MB/s) and very high
throughput (>2500MB/s).

It's clearly visible that performance degrades substantially first when
all of a numa node's free memory is exhausted, then when the second numa
node's is.

Looking at a profile I see a lot of cacheline bouncing between the kernel
threads that "reclaim" pages (i.e. make them available for reuse), the
kernel threads that write out dirty pages, the kernel threads where the
IO completes (i.e. where the dirty bit can be flipped / locks get
released), and the writing process.

I think there's a lot that can be improved on the kernel side - but it's
not too surprising that letting the kernel cache the data / forcing it
to make caching decisions for a large streaming write has substantial costs.


I changed Robert's test program to optionally use fallocate(),
sync_file_range(SYNC_FILE_RANGE_WRITE) and posix_fadvise(POSIX_FADV_DONTNEED),
to avoid a large footprint in the page cache (a sketch of that write
pattern follows the example runs below). The performance
differences are quite substantial:

gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
    rm -ff /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
    /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) \
        /srv/dev/bench/test1

running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
[/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220

Comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1:
running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
[/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161
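
For reference, here is a minimal sketch of the kind of write loop those
options enable. This is not the attached program - the path, the 128MB
flush window and the (lack of) error reporting are just placeholders to
illustrate the pattern:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	size_t		blocksize = 8192;
	off_t		filesize = (off_t) 400 * 1024 * 1024 * 1024;
	off_t		window = 128 * 1024 * 1024;	/* flush granularity, made up */
	char	   *buf = calloc(1, blocksize);
	int			fd = open("/srv/dev/bench/test1",
						  O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || buf == NULL)
		return 1;

	/* --fallocate: reserve the space up front, so writes don't have to
	 * allocate blocks as they go */
	if (fallocate(fd, 0, 0, filesize) != 0)
		return 1;

	for (off_t off = 0; off < filesize; off += blocksize)
	{
		if (write(fd, buf, blocksize) != (ssize_t) blocksize)
			return 1;

		if (off > 0 && off % window == 0)
		{
			/* --sync_file_range: kick off writeback of the just-written
			 * window, so dirty memory doesn't pile up */
			sync_file_range(fd, off - window, window, SYNC_FILE_RANGE_WRITE);

			/* --fadvise: tell the kernel we won't read these pages again,
			 * so clean pages can be dropped from the page cache (a real
			 * implementation would wait for writeback of the previous
			 * window to finish before dropping it) */
			posix_fadvise(fd, off - window, window, POSIX_FADV_DONTNEED);
		}
	}

	fsync(fd);
	close(fd);
	return 0;
}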

Below are the results of running the program with a variety of
parameter combinations (both the file and the results are attached).

I used perf stat in this run to measure the difference in CPU
usage.

ref_cycles is the number of CPU cycles, across all 20 cores / 40
threads, during which the CPUs were doing *something*. It is not
affected by CPU frequency scaling, just by the time the CPUs were not
"halted", whereas cycles is affected by frequency scaling.

A high ref_cycles_sec, combined with a decent number of total
instructions/cycles, is *good*, because it indicates that fewer CPUs
were used; whereas a very high ref_cycles_tot means that more CPUs were
busy doing something for the duration of the benchmark.
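
(As a cross-check of how those columns relate, using the first row
below: 1,497,048,950,014 ref_cycles over 248.43 seconds across the 40
CPU threads is 1,497,048,950,014 / (248.43 * 40) ~= 150.65M/sec, which
matches the reported ref_cycles_sec - i.e. the per-second columns are
normalized per CPU-second.)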

The run-to-run variation between the runs without cache control is
pretty large, so these are probably not the be-all-end-all numbers. But
I think the trends are pretty clear.

test                                                                            time            ref_cycles_tot
ref_cycles_sec cycles_tot           cycles_sec      instructions_tot      ipc
 
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0     248.430736196   1,497,048,950,014
150.653M/sec   1,226,822,167,960    0.123GHz        705,950,461,166      0.54
 
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1     310.275952938   1,921,817,571,226
154.849M/sec   1,499,581,687,133    0.121GHz        944,243,167,053      0.59
 
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1     164.175492485   913,991,290,231
139.183M/sec   762,359,320,428      0.116GHz        678,451,556,273      0.84
 
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0     243.609959554   1,802,385,405,203
184.970M/sec   1,449,560,513,247    0.149GHz        855,426,288,031      0.56
 
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0     230.880100449   1,328,417,418,799
143.846M/sec   1,148,924,667,393    0.124GHz        723,158,246,628      0.63
 
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1     253.591234992   1,548,485,571,798
152.658M/sec   1,229,926,994,613    0.121GHz        1,117,352,436,324    0.95
 
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1     164.488835158   911,974,902,254
138.611M/sec   760,756,011,483      0.116GHz        672,105,046,261      0.84
 
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0     164.052510134   1,561,521,537,336
237.972M/sec   1,404,761,167,120    0.214GHz        715,274,337,015      0.51
 
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0     192.151682414   1,526,440,715,456
198.603M/sec   1,037,135,756,007    0.135GHz        802,754,964,096      0.76
 
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1     242.648245159   1,782,637,416,163
183.629M/sec   1,463,696,313,881    0.151GHz        1,000,100,694,932    0.69
 
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1     188.772193248   1,418,274,870,697
187.803M/sec   923,133,958,500      0.122GHz        799,212,291,243      0.92
 
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0     421.580487642   2,756,486,952,728
163.449M/sec   1,387,708,033,752    0.082GHz        990,478,650,874      0.72
 
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0     169.854206542   1,333,619,626,854
196.282M/sec   1,036,261,531,134    0.153GHz        666,052,333,591      0.64
 
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1     305.078100578   1,970,042,289,192
161.445M/sec   1,505,706,462,812    0.123GHz        954,963,240,648      0.62
 
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1     166.295223626   1,290,699,256,763
194.044M/sec   857,873,391,283      0.129GHz        761,338,026,415      0.89
 
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0     455.096916715   2,808,715,616,077
154.293M/sec   1,366,660,063,053    0.075GHz        888,512,073,477      0.66
 
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0     256.156100686   2,407,922,637,215
235.003M/sec   1,133,311,037,956    0.111GHz        748,666,206,805      0.65
 
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1     215.255015340   1,977,578,120,924
229.676M/sec   1,461,504,758,029    0.170GHz        1,005,270,838,642    0.68
 
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1     158.262790654   1,720,443,307,097
271.769M/sec   1,004,079,045,479    0.159GHz        826,905,592,751      0.84
 
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0     334.932246893   2,366,388,662,460
176.628M/sec   1,216,049,589,993    0.091GHz        796,698,831,717      0.68
 
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0     161.697270285   1,866,036,713,483
288.576M/sec   1,068,181,502,433    0.165GHz        739,559,279,008      0.70
 
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1     231.440889430   1,965,389,749,057
212.391M/sec   1,407,927,406,358    0.152GHz        997,199,361,968      0.72
 
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1     214.433248700   2,232,198,239,769
260.300M/sec   1,073,334,918,389    0.125GHz        861,540,079,120      0.80
 
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0   644.521613661   3,688,449,404,537
143.079M/sec   2,020,128,131,309    0.078GHz        961,486,630,359      0.48
 
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0   243.830464632   1,499,608,983,445
153.756M/sec   1,227,468,439,403    0.126GHz        691,534,661,654      0.59
 
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1   292.866419420   1,753,376,415,877
149.677M/sec   1,483,169,463,392    0.127GHz        860,035,914,148      0.56
 
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1   162.152397194   925,643,754,128
142.719M/sec   743,208,501,601      0.115GHz        554,462,585,110      0.70
 
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0   211.369510165   1,558,996,898,599
184.401M/sec   1,359,343,408,200    0.161GHz        766,769,036,524      0.57
 
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=0   233.315094908   1,427,133,080,540
152.927M/sec   1,166,000,868,597    0.125GHz        743,027,329,074      0.64
 
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1   290.698155820   1,732,849,079,701
149.032M/sec   1,441,508,612,326    0.124GHz        835,039,426,282      0.57
 
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1   159.945462440   850,162,390,626
132.892M/sec   724,286,281,548      0.113GHz        670,069,573,150      0.90
 
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0   163.244592275   1,524,807,507,173
233.531M/sec   1,398,319,581,978    0.214GHz        689,514,058,243      0.46
 
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0   231.795934322   1,731,030,267,153
186.686M/sec   1,124,935,745,020    0.121GHz        736,084,922,669      0.70
 
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1   315.564163702   1,958,199,733,216
155.128M/sec   1,405,115,546,716    0.111GHz        1,000,595,890,394    0.73
 
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1   210.945487961   1,527,169,148,899
180.990M/sec   906,023,518,692      0.107GHz        700,166,552,207      0.80
 
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0   161.759094088   1,468,321,054,671
226.934M/sec   1,221,167,105,510    0.189GHz        735,855,415,612      0.59
 
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0   158.578248952   1,354,770,825,277
213.586M/sec   936,436,363,752      0.148GHz        654,823,079,884      0.68
 
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1   274.628500801   1,792,841,068,080
163.209M/sec   1,343,398,055,199    0.122GHz        996,073,874,051      0.73
 
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1   179.140070123   1,383,595,004,328
193.095M/sec   850,299,722,091      0.119GHz        706,959,617,654      0.83
 
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0   445.496787199   2,663,914,572,687
149.495M/sec   1,267,340,496,930    0.071GHz        787,469,552,454      0.62
 
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0   261.866083604   2,325,884,820,091
222.043M/sec   1,094,814,208,219    0.105GHz        649,479,233,453      0.57
 
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1   172.963505544   1,717,387,683,260
248.228M/sec   1,356,381,335,831    0.196GHz        822,256,638,370      0.58
 
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1   157.934678897   1,650,503,807,778
261.266M/sec   970,705,561,971      0.154GHz        637,953,927,131      0.66
 
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0   225.623143601   1,804,402,820,599
199.938M/sec   1,086,394,788,362    0.120GHz        656,392,112,807      0.62
 
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0   157.930900998   1,797,506,082,342
284.548M/sec   1,001,509,813,741    0.159GHz        644,107,150,289      0.66
 
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1   165.772265335   1,805,895,001,689
272.353M/sec   1,514,173,918,970    0.228GHz        823,435,044,810      0.54
 
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1   187.664764448   1,964,118,348,429
261.660M/sec   978,060,510,880      0.130GHz        668,316,194,988      0.67
 


Greetings,

Andres Freund
