Re: design for parallel backup

From: Andres Freund
Subject: Re: design for parallel backup
Date:
Msg-id: 20200503174922.mfzzdafa5g4rlhez@alap3.anarazel.de
In reply to: Re: design for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: design for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2020-05-03 09:12:59 -0400, Robert Haas wrote:
> On Sat, May 2, 2020 at 10:36 PM Andres Freund <andres@anarazel.de> wrote:
> > I changed Robert's test program to optionally fallocate,
> > sync_file_range(WRITE), and posix_fadvise(DONTNEED), to avoid a large
> > footprint in the page cache. The performance
> > differences are quite substantial:
> >
> > gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
> >     rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
> >     /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1
> >
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
> > [/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220
> >
> > comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
> > [/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161
>
> Ah, nice.

Btw, I forgot to include the result for 0 / 0 / 0 in the results
(off-by-one error in a script :))

numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0     220.210155081   1,569,524,602,961   178.188M/sec     1,363,686,761,705   0.155GHz     833,345,334,408     0.68

> > The run-to-run variations between the runs without cache control are
> > pretty large. So this is probably not the end-all-be-all numbers. But I
> > think the trends are pretty clear.
>
> Could you be explicit about what you think those clear trends are?

Largely that concurrency can help a bit, but can also hurt
tremendously. Below is some more detailed analysis; it'll be a bit
long...

Taking the run with no concurrency and no cache management as a baseline:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0     220.210155081   1,569,524,602,961   178.188M/sec     1,363,686,761,705   0.155GHz     833,345,334,408     0.68

and comparing cache management with using some concurrency:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1     164.175492485   913,991,290,231     139.183M/sec     762,359,320,428     0.116GHz     678,451,556,273     0.84
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0     164.052510134   1,561,521,537,336   237.972M/sec     1,404,761,167,120   0.214GHz     715,274,337,015     0.51

we can see very similar timing. Which makes sense, because that's
roughly the device's max speed. But then going to higher concurrency,
there are clear regressions:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0     455.096916715   2,808,715,616,077   154.293M/sec     1,366,660,063,053   0.075GHz     888,512,073,477     0.66

And I think it is instructive to look at
ref_cycles_tot/cycles_tot/instructions_tot vs
ref_cycles_sec/cycles_sec/ipc. The units are confusing because they are
aggregated across all cores, and most cores are idle. But it's pretty
obvious that numprocs=1 sfr=1 fadvise=1 has cores running for a much
shorter time (reference cycles basically count the time cores were
running, on an absolute time scale). Compared to numprocs=2 sfr=0
fadvise=0, which has the same resulting performance, it's clear that
cores were busier, but less efficient (lower ipc).

With cache management there's very little benefit, and some risk (note
the 1->2 regression), from increasing concurrency in this workload:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1     164.175492485   913,991,290,231     139.183M/sec     762,359,320,428     0.116GHz     678,451,556,273     0.84
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1     188.772193248   1,418,274,870,697   187.803M/sec     923,133,958,500     0.122GHz     799,212,291,243     0.92
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1     158.262790654   1,720,443,307,097   271.769M/sec     1,004,079,045,479   0.159GHz     826,905,592,751     0.84


And there's good benefit, but tremendous risk, from concurrency in the
no-cache-control case:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0     220.210155081   1,569,524,602,961   178.188M/sec     1,363,686,761,705   0.155GHz     833,345,334,408     0.68
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0     164.052510134   1,561,521,537,336   237.972M/sec     1,404,761,167,120   0.214GHz     715,274,337,015     0.51
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0     455.096916715   2,808,715,616,077   154.293M/sec     1,366,660,063,053   0.075GHz     888,512,073,477     0.66


sync_file_range without fadvise isn't a benefit at low concurrency, but prevents bad regressions at high concurrency:
> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0     220.210155081   1,569,524,602,961   178.188M/sec     1,363,686,761,705   0.155GHz     833,345,334,408     0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0     248.430736196   1,497,048,950,014   150.653M/sec     1,226,822,167,960   0.123GHz     705,950,461,166     0.54

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0     164.052510134   1,561,521,537,336   237.972M/sec     1,404,761,167,120   0.214GHz     715,274,337,015     0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0     192.151682414   1,526,440,715,456   198.603M/sec     1,037,135,756,007   0.135GHz     802,754,964,096     0.76

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0     455.096916715   2,808,715,616,077   154.293M/sec     1,366,660,063,053   0.075GHz     888,512,073,477     0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0     256.156100686   2,407,922,637,215   235.003M/sec     1,133,311,037,956   0.111GHz     748,666,206,805     0.65

fadvise alone is similar:
> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0     220.210155081   1,569,524,602,961   178.188M/sec     1,363,686,761,705   0.155GHz     833,345,334,408     0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1     310.275952938   1,921,817,571,226   154.849M/sec     1,499,581,687,133   0.121GHz     944,243,167,053     0.59

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0     164.052510134   1,561,521,537,336   237.972M/sec     1,404,761,167,120   0.214GHz     715,274,337,015     0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1     242.648245159   1,782,637,416,163   183.629M/sec     1,463,696,313,881   0.151GHz     1,000,100,694,932   0.69

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0     455.096916715   2,808,715,616,077   154.293M/sec     1,366,660,063,053   0.075GHz     888,512,073,477     0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1     215.255015340   1,977,578,120,924   229.676M/sec     1,461,504,758,029   0.170GHz     1,005,270,838,642   0.68


There does not appear to be a huge benefit from fallocate in this
workload; the OS's delayed allocation works well. Compare:

numprocs=1
> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0     220.210155081   1,569,524,602,961   178.188M/sec     1,363,686,761,705   0.155GHz     833,345,334,408     0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0     243.609959554   1,802,385,405,203   184.970M/sec     1,449,560,513,247   0.149GHz     855,426,288,031     0.56

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0     248.430736196   1,497,048,950,014   150.653M/sec     1,226,822,167,960   0.123GHz     705,950,461,166     0.54
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0     230.880100449   1,328,417,418,799   143.846M/sec     1,148,924,667,393   0.124GHz     723,158,246,628     0.63

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1     310.275952938   1,921,817,571,226   154.849M/sec     1,499,581,687,133   0.121GHz     944,243,167,053     0.59
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1     253.591234992   1,548,485,571,798   152.658M/sec     1,229,926,994,613   0.121GHz     1,117,352,436,324   0.95

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1     164.175492485   913,991,290,231     139.183M/sec     762,359,320,428     0.116GHz     678,451,556,273     0.84
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1     164.488835158   911,974,902,254     138.611M/sec     760,756,011,483     0.116GHz     672,105,046,261     0.84

numprocs=2
> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0     164.052510134   1,561,521,537,336   237.972M/sec     1,404,761,167,120   0.214GHz     715,274,337,015     0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0     421.580487642   2,756,486,952,728   163.449M/sec     1,387,708,033,752   0.082GHz     990,478,650,874     0.72

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0     192.151682414   1,526,440,715,456   198.603M/sec     1,037,135,756,007   0.135GHz     802,754,964,096     0.76
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0     169.854206542   1,333,619,626,854   196.282M/sec     1,036,261,531,134   0.153GHz     666,052,333,591     0.64

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1     242.648245159   1,782,637,416,163   183.629M/sec     1,463,696,313,881   0.151GHz     1,000,100,694,932   0.69
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1     305.078100578   1,970,042,289,192   161.445M/sec     1,505,706,462,812   0.123GHz     954,963,240,648     0.62

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1     188.772193248   1,418,274,870,697   187.803M/sec     923,133,958,500     0.122GHz     799,212,291,243     0.92
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1     166.295223626   1,290,699,256,763   194.044M/sec     857,873,391,283     0.129GHz     761,338,026,415     0.89

numprocs=4
> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0     455.096916715   2,808,715,616,077   154.293M/sec     1,366,660,063,053   0.075GHz     888,512,073,477     0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0     334.932246893   2,366,388,662,460   176.628M/sec     1,216,049,589,993   0.091GHz     796,698,831,717     0.68

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0     256.156100686   2,407,922,637,215   235.003M/sec     1,133,311,037,956   0.111GHz     748,666,206,805     0.65
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0     161.697270285   1,866,036,713,483   288.576M/sec     1,068,181,502,433   0.165GHz     739,559,279,008     0.70

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1     215.255015340   1,977,578,120,924   229.676M/sec     1,461,504,758,029   0.170GHz     1,005,270,838,642   0.68
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1     231.440889430   1,965,389,749,057   212.391M/sec     1,407,927,406,358   0.152GHz     997,199,361,968     0.72

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1     158.262790654   1,720,443,307,097   271.769M/sec     1,004,079,045,479   0.159GHz     826,905,592,751     0.84
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1     214.433248700   2,232,198,239,769   260.300M/sec     1,073,334,918,389   0.125GHz     861,540,079,120     0.80

I would say that fallocate seems to help concurrent cases without cache
control, but not particularly reliably so. At higher concurrency it
seems to hurt with cache control; I'm not sure I understand why.


I was at first confused why 128kb write sizes hurt (128kb is probably on
the higher end of useful, but I wanted to see a more extreme
difference):

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0     220.210155081   1,569,524,602,961   178.188M/sec     1,363,686,761,705   0.155GHz     833,345,334,408     0.68
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0   644.521613661   3,688,449,404,537   143.079M/sec     2,020,128,131,309   0.078GHz     961,486,630,359     0.48

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0     248.430736196   1,497,048,950,014   150.653M/sec     1,226,822,167,960   0.123GHz     705,950,461,166     0.54
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0   243.830464632   1,499,608,983,445   153.756M/sec     1,227,468,439,403   0.126GHz     691,534,661,654     0.59

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1     310.275952938   1,921,817,571,226   154.849M/sec     1,499,581,687,133   0.121GHz     944,243,167,053     0.59
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1   292.866419420   1,753,376,415,877   149.677M/sec     1,483,169,463,392   0.127GHz     860,035,914,148     0.56

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1     164.175492485   913,991,290,231     139.183M/sec     762,359,320,428     0.116GHz     678,451,556,273     0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1   162.152397194   925,643,754,128     142.719M/sec     743,208,501,601     0.115GHz     554,462,585,110     0.70

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0     243.609959554   1,802,385,405,203   184.970M/sec     1,449,560,513,247   0.149GHz     855,426,288,031     0.56
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0   211.369510165   1,558,996,898,599   184.401M/sec     1,359,343,408,200   0.161GHz     766,769,036,524     0.57

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0     230.880100449   1,328,417,418,799   143.846M/sec     1,148,924,667,393   0.124GHz     723,158,246,628     0.63
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=0   233.315094908   1,427,133,080,540   152.927M/sec     1,166,000,868,597   0.125GHz     743,027,329,074     0.64

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1     253.591234992   1,548,485,571,798   152.658M/sec     1,229,926,994,613   0.121GHz     1,117,352,436,324   0.95
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1   290.698155820   1,732,849,079,701   149.032M/sec     1,441,508,612,326   0.124GHz     835,039,426,282     0.57

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1     164.488835158   911,974,902,254     138.611M/sec     760,756,011,483     0.116GHz     672,105,046,261     0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1   159.945462440   850,162,390,626     132.892M/sec     724,286,281,548     0.113GHz     670,069,573,150     0.90

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0     164.052510134   1,561,521,537,336   237.972M/sec     1,404,761,167,120   0.214GHz     715,274,337,015     0.51
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0   163.244592275   1,524,807,507,173   233.531M/sec     1,398,319,581,978   0.214GHz     689,514,058,243     0.46

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0     192.151682414   1,526,440,715,456   198.603M/sec     1,037,135,756,007   0.135GHz     802,754,964,096     0.76
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0   231.795934322   1,731,030,267,153   186.686M/sec     1,124,935,745,020   0.121GHz     736,084,922,669     0.70

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1     242.648245159   1,782,637,416,163   183.629M/sec     1,463,696,313,881   0.151GHz     1,000,100,694,932   0.69
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1   315.564163702   1,958,199,733,216   155.128M/sec     1,405,115,546,716   0.111GHz     1,000,595,890,394   0.73

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1     188.772193248   1,418,274,870,697   187.803M/sec     923,133,958,500     0.122GHz     799,212,291,243     0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1   210.945487961   1,527,169,148,899   180.990M/sec     906,023,518,692     0.107GHz     700,166,552,207     0.80

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0     421.580487642   2,756,486,952,728   163.449M/sec     1,387,708,033,752   0.082GHz     990,478,650,874     0.72
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0   161.759094088   1,468,321,054,671   226.934M/sec     1,221,167,105,510   0.189GHz     735,855,415,612     0.59

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0     169.854206542   1,333,619,626,854   196.282M/sec     1,036,261,531,134   0.153GHz     666,052,333,591     0.64
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0   158.578248952   1,354,770,825,277   213.586M/sec     936,436,363,752     0.148GHz     654,823,079,884     0.68

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1     305.078100578   1,970,042,289,192   161.445M/sec     1,505,706,462,812   0.123GHz     954,963,240,648     0.62
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1   274.628500801   1,792,841,068,080   163.209M/sec     1,343,398,055,199   0.122GHz     996,073,874,051     0.73

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1     166.295223626   1,290,699,256,763   194.044M/sec     857,873,391,283     0.129GHz     761,338,026,415     0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1   179.140070123   1,383,595,004,328   193.095M/sec     850,299,722,091     0.119GHz     706,959,617,654     0.83

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0     455.096916715   2,808,715,616,077   154.293M/sec     1,366,660,063,053   0.075GHz     888,512,073,477     0.66
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0   445.496787199   2,663,914,572,687   149.495M/sec     1,267,340,496,930   0.071GHz     787,469,552,454     0.62

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0     256.156100686   2,407,922,637,215   235.003M/sec     1,133,311,037,956   0.111GHz     748,666,206,805     0.65
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0   261.866083604   2,325,884,820,091   222.043M/sec     1,094,814,208,219   0.105GHz     649,479,233,453     0.57

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1     215.255015340   1,977,578,120,924   229.676M/sec     1,461,504,758,029   0.170GHz     1,005,270,838,642   0.68
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1   172.963505544   1,717,387,683,260   248.228M/sec     1,356,381,335,831   0.196GHz     822,256,638,370     0.58

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1     158.262790654   1,720,443,307,097   271.769M/sec     1,004,079,045,479   0.159GHz     826,905,592,751     0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1   157.934678897   1,650,503,807,778   261.266M/sec     970,705,561,971     0.154GHz     637,953,927,131     0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0     334.932246893   2,366,388,662,460   176.628M/sec     1,216,049,589,993   0.091GHz     796,698,831,717     0.68
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0   225.623143601   1,804,402,820,599   199.938M/sec     1,086,394,788,362   0.120GHz     656,392,112,807     0.62

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0     161.697270285   1,866,036,713,483   288.576M/sec     1,068,181,502,433   0.165GHz     739,559,279,008     0.70
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0   157.930900998   1,797,506,082,342   284.548M/sec     1,001,509,813,741   0.159GHz     644,107,150,289     0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1     231.440889430   1,965,389,749,057   212.391M/sec     1,407,927,406,358   0.152GHz     997,199,361,968     0.72
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1   165.772265335   1,805,895,001,689   272.353M/sec     1,514,173,918,970   0.228GHz     823,435,044,810     0.54

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1     214.433248700   2,232,198,239,769   260.300M/sec     1,073,334,918,389   0.125GHz     861,540,079,120     0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1   187.664764448   1,964,118,348,429   261.660M/sec     978,060,510,880     0.130GHz     668,316,194,988     0.67

It's pretty clear that the larger write block size can hurt quite
badly. I was somewhat confused by this at first, but after thinking
about it for a while longer it actually makes sense: For the OS to
finish an 8k write it needs to find two free pagecache pages; for a
128k write it needs to find 32. That means it's much more likely
that kernel threads and the writes are going to fight over locks /
cachelines: in the 8k case the kernel threads will often do so
while the memcpy() from userland is happening, but that's less the
case when 32 pages need to be acquired before the memcpy() can
happen.

With cache control that problem doesn't exist, which is why the larger
block size is beneficial:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1     164.175492485   913,991,290,231     139.183M/sec     762,359,320,428     0.116GHz     678,451,556,273     0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1   162.152397194   925,643,754,128     142.719M/sec     743,208,501,601     0.115GHz     554,462,585,110     0.70

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1     164.488835158   911,974,902,254     138.611M/sec     760,756,011,483     0.116GHz     672,105,046,261     0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1   159.945462440   850,162,390,626     132.892M/sec     724,286,281,548     0.113GHz     670,069,573,150     0.90

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1     188.772193248   1,418,274,870,697   187.803M/sec     923,133,958,500     0.122GHz     799,212,291,243     0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1   210.945487961   1,527,169,148,899   180.990M/sec     906,023,518,692     0.107GHz     700,166,552,207     0.80

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1     166.295223626   1,290,699,256,763   194.044M/sec     857,873,391,283     0.129GHz     761,338,026,415     0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1   179.140070123   1,383,595,004,328   193.095M/sec     850,299,722,091     0.119GHz     706,959,617,654     0.83

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1     158.262790654   1,720,443,307,097   271.769M/sec     1,004,079,045,479   0.159GHz     826,905,592,751     0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1   157.934678897   1,650,503,807,778   261.266M/sec     970,705,561,971     0.154GHz     637,953,927,131     0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1     214.433248700   2,232,198,239,769   260.300M/sec     1,073,334,918,389   0.125GHz     861,540,079,120     0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1   187.664764448   1,964,118,348,429   261.660M/sec     978,060,510,880     0.130GHz     668,316,194,988     0.67

Note how especially in the first few cases the total number of
instructions required is improved (although due to the way I did the
perf stat the sampling error is pretty large).


I haven't run that test yet, but after looking at all this I would bet
that reducing the block size to 4kb (i.e. a single os/hw page) would
help the no-cache-control case significantly, in particular in the
concurrent case.

And conversely, I'd expect that CPU efficiency will be improved by a
larger block size in the cache control case, for just about any
realistic block size.


I'd love to have a faster storage available (faster NVMes, or multiple
ones I can use for benchmarking) to see what the cutoff point for
actually benefiting from concurrency is.


Also worthwhile to note that even the "best case" from a CPU usage point
of view here absolutely *pales* against using direct IO. It's not an
apples-to-apples comparison, but here's buffered IO using
write_and_fsync vs unbuffered IO using fio:

128KiB blocksize:

write_and_fsync:
echo 3 | sudo tee /proc/sys/vm/drop_caches && /usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions \
    /tmp/write_and_fsync --blocksize $((128*1024)) --sync_file_range=1 --fallocate=1 --fadvise=1 --sequential=0 \
    --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1

 Performance counter stats for 'system wide':

      6,377,903.65 msec cpu-clock                 #   39.999 CPUs utilized
   628,014,590,200      ref-cycles                #   98.467 M/sec
   634,468,623,514      cycles                    #    0.099 GHz
   795,771,756,320      instructions              #    1.25  insn per cycle

     159.451492209 seconds time elapsed

fio:
rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && /usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions \
    fio --name=test --iodepth=512 --iodepth_low=8 --iodepth_batch_submit=8 --iodepth_batch_complete_min=8 --iodepth_batch_complete_max=128 \
    --ioengine=libaio --rw=write --bs=128k --filesize=$((400*1024*1024*1024)) --direct=1 --numjobs=1

 Performance counter stats for 'system wide':

      6,313,522.71 msec cpu-clock                 #   39.999 CPUs utilized
   458,476,185,800      ref-cycles                #   72.618 M/sec
   196,148,015,054      cycles                    #    0.031 GHz
   158,921,457,853      instructions              #    0.81  insn per cycle

     157.842080440 seconds time elapsed

CPU usage most of the time was around 98% for write_and_fsync and 40%
for fio.

I.e. system-wide CPUs were active 0.73x the time, and 0.2x as many
instructions had to be executed in the DIO case.

Greetings,

Andres Freund


