Re: checkpointer continuous flushing
| From | Fabien COELHO |
|---|---|
| Subject | Re: checkpointer continuous flushing |
| Date | |
| Msg-id | alpine.DEB.2.10.1508171431580.28260@sto |
| In response to | Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>) |
| List | pgsql-hackers |
Hello Andres,
> On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote:
>> sort/flush : tps avg & stddev (percent of time beyond 10.0 tps)
>> on on : 631 +- 131 (0.1%)
>> on off : 564 +- 303 (12.0%)
>> off on : 167 +- 315 (76.8%) # stuck...
>> off off : 177 +- 305 (71.2%) # ~ current pg
>
> What exactly do you mean with 'stuck'?
I mean that during the I/O storms induced by the checkpoint, pgbench
sometimes gets stuck, i.e. it does not report its progress every second (I
run with "-P 1"). This occurs when sort is off, either with or without
flush. For instance, here is an extract from the off/off medium run:
  progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977
  progress: 574.0 s, 777.1 tps, lat 7.161 ms stddev 37.059
  progress: 575.0 s, 148.9 tps, lat 4.597 ms stddev 10.708
  progress: 814.4 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 815.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 816.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 817.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 818.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 819.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 820.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 821.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 822.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 823.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 824.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 825.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 826.0 s, 0.0 tps, lat -nan ms stddev -nan
There is a 239.4-second gap in the pgbench output. This occurs from time to
time and may represent a significant part of the run, and I count these
"stuck" times as 0 tps. Sometimes pgbench is stuck performance-wise but
nevertheless manages to report "0.0 tps" every second, as above after it
becomes unstuck.
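The accounting described above can be sketched in a few lines of Python. This is only an illustration, not the script used to produce the figures in this message: the helper names (`parse_progress`, `largest_gap`, `percent_below`) are hypothetical, and the rule that each report covers at most the last second while any earlier unreported time counts as 0 tps is an assumption about how "stuck" time is charged.

```python
import re

# Matches pgbench progress lines such as
# "progress: 574.0 s, 777.1 tps, lat 7.161 ms stddev 37.059".
PROGRESS_RE = re.compile(r"progress:\s*([\d.]+)\s*s,\s*([\d.]+)\s*tps")


def parse_progress(lines):
    """Extract (elapsed seconds, tps) pairs from pgbench -P output lines."""
    matches = (PROGRESS_RE.search(line) for line in lines)
    return [(float(m.group(1)), float(m.group(2))) for m in matches if m]


def largest_gap(samples):
    """Longest time between two consecutive progress reports, in seconds."""
    times = [t for t, _ in samples]
    return max((b - a for a, b in zip(times, times[1:])), default=0.0)


def percent_below(samples, threshold=10.0, interval=1.0):
    """Percentage of elapsed time spent below `threshold` tps.

    Assumption: each report describes at most the last `interval` seconds;
    any earlier unreported time is a "stuck" period counted as 0 tps.
    The first report is assumed to cover one full interval.
    """
    total = slow = 0.0
    prev = samples[0][0] - interval if samples else 0.0
    for t, tps in samples:
        span = t - prev
        covered = min(span, interval)
        total += span
        slow += span - covered  # unreported time counts as 0 tps
        if tps < threshold:
            slow += covered
        prev = t
    return 100.0 * slow / total if total else 0.0


# A few lines taken from the off/off medium run quoted in this message.
extract = [
    "progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977",
    "progress: 574.0 s, 777.1 tps, lat 7.161 ms stddev 37.059",
    "progress: 575.0 s, 148.9 tps, lat 4.597 ms stddev 10.708",
    "progress: 814.4 s, 0.0 tps, lat -nan ms stddev -nan",
    "progress: 815.0 s, 0.0 tps, lat -nan ms stddev -nan",
]
samples = parse_progress(extract)
print(f"largest gap: {largest_gap(samples):.1f} s")
print(f"below 10 tps: {percent_below(samples):.1f}% of the extract")
```

Only the time and tps fields are parsed, so the "-nan" latency values in the stuck lines do not need special handling.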
The actual origin of the issue with a stuck client (pgbench, libpq, OS,
postgres...) is unclear to me, but the whole system does not behave well
under an I/O storm anyway, and I have not succeeded in understanding where
pgbench is stuck when it does not report its progress. I tried some runs
under gdb, but then pgbench did not get stuck and instead reported a lot of
"0.0 tps" during the storms.
Here are a few more figures with the v8 version of the patch, on a host
with 8 cores, 16 GB of RAM, RAID 1 HDD, running Ubuntu precise. I already
reported the medium case; the small case was run afterwards.
  small postgresql.conf:
    shared_buffers = 2GB
    checkpoint_timeout = 300s # this is the default
    checkpoint_completion_target = 0.8
    # initialization: pgbench -i -s 120

  medium postgresql.conf: ## ALREADY REPORTED
    shared_buffers = 4GB
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.8
    max_wal_size = 4GB
    # initialization: pgbench -i -s 250

warmup> pgbench -T 1200 -M prepared -S -j 2 -c 4

  # 400 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 400 -L 100 -j 2 -c 4

  options    / percent of skipped/late transactions
  sort/flush / small  medium
  on   on    :   3.5     2.7
  on   off   :  24.6    16.2
  off  on    :  66.1    68.4
  off  off   :  63.2    68.7

  # 200 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 200 -L 100 -j 2 -c 4

  options    / percent of skipped/late transactions
  sort/flush / small  medium
  on   on    :   1.9     2.7
  on   off   :  14.3     9.5
  off  on    :  45.6    47.4
  off  off   :  47.4    48.8

  # 100 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 100 -L 100 -j 2 -c 4

  options    / percent of skipped/late transactions
  sort/flush / small  medium
  on   on    :   0.9     1.8
  on   off   :   9.3     7.9
  off  on    :   5.0    13.0
  off  off   :  31.2    31.9

  # full speed 1 client
  sh> pgbench -M prepared -N -P 1 -T 4000

  options    / tps avg & stddev (percent of time below 10.0 tps)
  sort/flush / small                 medium
  on   on    : 564 +- 148 ( 0.1%)    631 +- 131 ( 0.1%)
  on   off   : 470 +- 340 (21.7%)    564 +- 303 (12.0%)
  off  on    : 157 +- 296 (66.2%)    167 +- 315 (76.8%)
  off  off   : 154 +- 251 (61.5%)    177 +- 305 (71.2%)

  # full speed 2 threads 4 clients
  sh> pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4

  options    / tps avg & stddev (percent of time below 10.0 tps)
  sort/flush / small                 medium
  on   on    : 757 +- 417 ( 0.1%)    1058 +- 455 ( 0.1%)
  on   off   : 752 +- 893 (48.4%)    1056 +- 942 (32.8%)
  off  on    : 173 +- 521 (83.0%)     170 +- 500 (88.3%)
  off  off   : 199 +- 512 (82.5%)     209 +- 506 (82.0%)

In all cases, the "sort on & flush on" setting provides the best results,
with a tps speedup by a factor of 3 to 5 relative to "off off", and overall
high responsiveness (and lower latency).
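As a quick arithmetic check of that speedup range, the ratios can be recomputed from the average tps in the two full-speed tables. This is a minimal sketch: the dictionary layout and variable names are illustrative, and taking "off off" (roughly current pg) as the baseline is an assumption about which comparison the speedup refers to.

```python
# Average tps copied from the two full-speed tables in this message:
# (test, size) -> (tps with sort on / flush on, tps with sort off / flush off)
results = {
    ("1 client", "small"): (564, 154),
    ("1 client", "medium"): (631, 177),
    ("4 clients", "small"): (757, 199),
    ("4 clients", "medium"): (1058, 209),
}

# Speedup of "on on" over the assumed "off off" baseline.
speedups = {key: on / off for key, (on, off) in results.items()}

for (test, size), ratio in speedups.items():
    print(f"{test:9s} {size:6s}: x{ratio:.2f}")
```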
--
Fabien.