Re: Spread checkpoint sync

From: Greg Smith
Subject: Re: Spread checkpoint sync
Msg-id: 4D54AD64.3010306@2ndquadrant.com
In reply to: Re: Spread checkpoint sync  (Greg Smith <greg@2ndquadrant.com>)
Responses: Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Looks like it's time to close the book on this one for 9.1 
development...the unfortunate results are at 
http://www.2ndquadrant.us/pgbench-results/index.htm  Test set #12 is the 
one with spread sync, which I was hoping would turn out better than #9, 
the reference I was trying to improve on.  With sync spread out, TPS is 
about 5% slower on the scale=500 tests and 15% slower on the scale=1000 
tests.  
Even worse, maximum latency went up a lot. 

I am convinced of a couple of things now:

1) Most of the benefit we were seeing from the original patch I 
submitted was simply from doing much better at absorbing fsync requests 
from backends while the checkpoint sync was running.  The already 
committed fsync compaction patch effectively removes that problem, 
though, to the extent it's possible to do so, which leaves the remaining 
pieces here not as useful in its wake.

2) I need to start over testing here with something that isn't 100% 
write all of the time the way pgbench is.  It's really hard to isolate 
out latency improvements when the test program guarantees all associated 
write caches will be completely filled at every moment.  Also, with a 
workload like that I can't see any benefit from changes that only improve 
performance for readers, and 100% write is quite unrealistic relative to 
real-world workloads.

3) The existing write spreading code in the background writer needs to 
be overhauled, too, before spreading the syncs around is going to give 
the benefits I was hoping for.
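
For context on that last point: the existing spreading logic paces the 
write phase against checkpoint_completion_target, roughly "write a 
buffer, then nap whenever the fraction of buffers written is ahead of the 
fraction of the target window that has elapsed".  Here's a toy sketch of 
that pacing idea; this is only an illustration under my simplifications, 
not the actual CheckpointWriteDelay()/IsCheckpointOnSchedule() code, 
which also tracks WAL consumption rather than just elapsed time:

# Toy illustration of spread-checkpoint write pacing: keep the fraction
# of buffers written from getting ahead of the fraction of the target
# window (checkpoint_timeout * checkpoint_completion_target) that has
# elapsed.  Ignores the WAL-consumption half of the schedule check.
import time

def spread_writes(buffers, write_one, checkpoint_timeout=300.0,
                  completion_target=0.5, nap=0.1):
    if not buffers:
        return
    target_window = checkpoint_timeout * completion_target  # seconds
    start = time.time()
    for i, buf in enumerate(buffers):
        write_one(buf)                                # write one dirty buffer
        progress = (i + 1) / float(len(buffers))      # fraction written
        elapsed = (time.time() - start) / target_window
        while progress > elapsed:                     # ahead of schedule: nap
            time.sleep(nap)
            elapsed = (time.time() - start) / target_window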

Given all that, I'm going to take my feedback and give the test server a 
much-deserved break.  I'm happy that the fsync compaction patch has made 
9.1 much more tolerant of write-heavy loads than earlier versions, so 
it's not like no progress was made in this release.

For anyone who wants more details here...the news on this spread sync 
implementation is not all bad.  If you compare this result from HEAD, 
with scale=1000 and clients=256:

http://www.2ndquadrant.us/pgbench-results/611/index.html

Against its identically configured result with spread sync:

http://www.2ndquadrant.us/pgbench-results/708/index.html

There are actually significantly fewer transactions in the >2000 ms 
latency range.  That shows up as a reduction in the 90th percentile 
latency figures I compute, and you can see it in the graph if you look at 
how much denser the points are in the 2000 - 4000 ms area on #611.  But 
that's a pretty weak change.
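
The 90th percentile numbers come from the per-transaction latency logs.  
Here's a minimal sketch of that sort of computation, assuming a 
pgbench -l style log where the third field on each line is the 
transaction latency in microseconds; check that assumption against the 
format your pgbench version actually writes before trusting the field 
index:

# Sketch: 90th percentile latency and count of transactions slower than
# 2000 ms, from a per-transaction latency log.  Assumes the third
# whitespace-separated field is the latency in microseconds.
import sys

def latency_stats(path, slow_ms=2000.0):
    latencies_ms = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3:
                latencies_ms.append(float(fields[2]) / 1000.0)
    latencies_ms.sort()
    p90 = latencies_ms[int(0.9 * (len(latencies_ms) - 1))]
    slow = sum(1 for l in latencies_ms if l > slow_ms)
    return p90, slow

if __name__ == "__main__":
    p90, slow = latency_stats(sys.argv[1])
    print("90th percentile: %.1f ms, over 2000 ms: %d" % (p90, slow))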

But the most disappointing part here relative to what I was hoping is 
what happens with bigger buffer caches.  The main idea driving this 
approach was that it would enable larger values of shared_buffers 
without the checkpoint spikes being as bad.  Test set #13 tries that out 
by increasing shared_buffers from 256MB to 4GB, along with a big enough 
increase in checkpoint_segments to make most checkpoints time-based.  
Not only did TPS at the smaller scale drop by half, but all kinds of bad 
things happened to latency.  Here's a sample of the sort of dysfunctional 
checkpoint that came out of that:

2011-02-10 02:41:17 EST: LOG:  checkpoint starting: xlog
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync:  estimated segments=22
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync: number=1 file=base/16384/16768 time=150.008 msec
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync: number=2 file=base/16384/16749 time=0.002 msec
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync: number=3 file=base/16384/16749_fsm time=0.001 msec
2011-02-10 02:53:23 EST: DEBUG:  checkpoint sync: number=4 file=base/16384/16761 time=8014.102 msec
2011-02-10 02:53:23 EST: DEBUG:  checkpoint sync: number=5 file=base/16384/16752_vm time=0.002 msec
2011-02-10 02:53:35 EST: DEBUG:  checkpoint sync: number=6 file=base/16384/16761.5 time=11739.038 msec
2011-02-10 02:53:37 EST: DEBUG:  checkpoint sync: number=7 file=base/16384/16761.6 time=2205.721 msec
2011-02-10 02:53:45 EST: DEBUG:  checkpoint sync: number=8 file=base/16384/16761.2 time=8273.849 msec
2011-02-10 02:54:06 EST: DEBUG:  checkpoint sync: number=9 file=base/16384/16766 time=20874.167 msec
2011-02-10 02:54:06 EST: DEBUG:  checkpoint sync: number=10 file=base/16384/16762 time=0.002 msec
2011-02-10 02:54:08 EST: DEBUG:  checkpoint sync: number=11 file=base/16384/16761.3 time=2440.441 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=12 file=base/16384/16766.1 time=635.839 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=13 file=base/16384/16752_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=14 file=base/16384/16764 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=15 file=base/16384/16768_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=16 file=base/16384/16761_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=17 file=base/16384/16761.4 time=150.702 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=18 file=base/16384/16752 time=0.002 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=19 file=base/16384/16761_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=20 file=base/16384/16749_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=21 file=base/16384/16385 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=22 file=base/16384/16761.1 time=175.575 msec
2011-02-10 02:54:10 EST: LOG:  checkpoint complete: wrote 242614 buffers 
(46.3%); 0 transaction log file(s) added, 0 removed, 34 recycled; 
write=716.637 s, sync=54.659 s, total=772.976 s; sync files=22, 
longest=20.874 s, average=2.484 s
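
Those sync files/longest/average figures can be cross-checked against the 
individual DEBUG entries.  Here's a quick sketch that tallies lines in 
the format shown above; this is just throwaway analysis code for 
illustration, not something from the patch or pgbench-tools:

# Sketch: tally "checkpoint sync: number=N file=... time=X msec" DEBUG
# lines from a log file and report count, total, longest, and average,
# which should line up with the sync files= / longest= / average=
# figures in the "checkpoint complete" message.
import re
import sys

SYNC_RE = re.compile(r"checkpoint sync: number=\d+ file=(\S+) time=([\d.]+) msec")

def summarize(path):
    times = []
    with open(path) as f:
        for line in f:
            m = SYNC_RE.search(line)
            if m:
                times.append((m.group(1), float(m.group(2))))
    if not times:
        return
    total_ms = sum(t for _, t in times)
    longest_file, longest_ms = max(times, key=lambda x: x[1])
    print("sync files=%d, total=%.3f s, longest=%.3f s (%s), average=%.3f s"
          % (len(times), total_ms / 1000.0, longest_ms / 1000.0,
             longest_file, total_ms / len(times) / 1000.0))

if __name__ == "__main__":
    summarize(sys.argv[1])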

That's 12 minutes for the write phase, even though checkpoints should be 
happening every 5 minutes here.  With that bad of a write phase overrun, 
spread sync had no room to work, so no net improvement at all.  What is 
happening here is similar to the behavior I described seeing on my 
client system but didn't have an example to share until now.  During the 
write phase, the "Dirty:" value shown in /proc/meminfo peaked at over 
1GB, and eventually the background writer process wasn't getting any 
serious CPU time compared to the backends; this is what it looked like 
via ps:

%CPU     %MEM        TIME+     COMMAND
4    0    01:51.28     /home/gsmith/pgwork/inst/spread-sync/bin/pgbench -f /home/gsmith/pgbench-tools
2    8.1    00:39.71     postgres: gsmith pgbench ::1(43871) UPDATE
2    8    00:39.28     postgres: gsmith pgbench ::1(43875) UPDATE
2    8.1    00:39.92     postgres: gsmith pgbench ::1(43865) UPDATE
2    8.1    00:39.54     postgres: gsmith pgbench ::1(43868) UPDATE
2    8    00:39.36     postgres: gsmith pgbench ::1(43870) INSERT
2    8.1    00:39.47     postgres: gsmith pgbench ::1(43877) UPDATE
1    8    00:39.39     postgres: gsmith pgbench ::1(43864) COMMIT
1    8.1    00:39.78     postgres: gsmith pgbench ::1(43866) UPDATE
1    8    00:38.99     postgres: gsmith pgbench ::1(43867) UPDATE
1    8.1    00:39.55     postgres: gsmith pgbench ::1(43872) UPDATE
1    8.1    00:39.90     postgres: gsmith pgbench ::1(43873) UPDATE
1    8.1    00:39.64     postgres: gsmith pgbench ::1(43876) UPDATE
1    8.1    00:39.93     postgres: gsmith pgbench ::1(43878) UPDATE
1    8.1    00:39.83     postgres: gsmith pgbench ::1(43863) UPDATE
1    8    00:39.47     postgres: gsmith pgbench ::1(43869) UPDATE
1    8.1    00:40.11     postgres: gsmith pgbench ::1(43874) UPDATE
1    0    00:11.91     [flush-9:1]
0    0    27:43.75     [xfsdatad/6]
0    9.4    00:02.21     postgres: writer process
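
To watch for the same thing elsewhere, the "Dirty:" figure can simply be 
polled during a run.  Here's a minimal sketch of that kind of monitoring 
loop, offered as an illustration only and not part of any of the tools 
used here:

# Sketch: sample the "Dirty:" line from /proc/meminfo once a second and
# track the peak amount of dirty memory seen during a test run.
import time

def dirty_kb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1])  # /proc/meminfo reports kB
    return 0

if __name__ == "__main__":
    peak = 0
    try:
        while True:
            now = dirty_kb()
            peak = max(peak, now)
            print("dirty=%d kB  peak=%d kB" % (now, peak))
            time.sleep(1)
    except KeyboardInterrupt:
        pass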

I want to make this problem go away, but as you can see, spreading the 
sync calls around isn't enough.  I think the main write loop needs to 
get spread out more, too, so that the background writer is trying to 
work at a more reasonable pace.  I am pleased I've been able to 
reproduce this painful behavior at home using test data, because that 
much improves my odds of being able to isolate its cause and test 
solutions.  But it's a tricky problem, and I'm certainly not going to
fix it in the next week.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


