Re: WIP(!) Double Writes

From: Greg Smith
Subject: Re: WIP(!) Double Writes
Msg-id: 4F0BBB82.50703@2ndQuadrant.com
In reply to: WIP(!) Double Writes  (David Fetter <david@fetter.org>)
List: pgsql-hackers
On 1/5/12 1:19 AM, David Fetter wrote:
> To achieve efficiency, the checkpoint writer and bgwriter should batch
> writes to multiple pages together.  Currently, there is an option
> "batched_buffer_writes" that specifies how many buffers to batch at a
> time.  However, we may want to remove that option from view, and just
> force batched_buffer_writes to a default (32) if double_writes is
> enabled.

The idea that PostgreSQL has better information about how to batch
writes than the layers below it is controversial, and it has failed to
match my expectations in many cases.  The nastiest regressions I ran
into here were in VACUUM, where the ring buffer implementation means
the database has extremely limited room to work.  Just dumping the
whole write mess into a large OS cache as quickly as possible, and
letting the OS sort things out, was dramatically faster in some of my
test cases.  If you don't have one already, I'd recommend adding to
your test suite a performance test that dirties a lot of pages and
then runs VACUUM against them.  Since you're not crippling the OS
cache to the extent I was, the problem may not be as bad, but it's
worth checking.
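
In case it's useful, here's roughly the shape of test I have in mind,
sketched with libpq.  The connection string, table name, and sizes are
just placeholders, and you'd want to time the VACUUM step:

/*
 * Rough sketch of a dirty-then-VACUUM test.  Connection string, table
 * name, and row count are placeholders; adjust for the test machine.
 */
#include <stdio.h>
#include <libpq-fe.h>

static void
run(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
    PQclear(res);
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");  /* placeholder DSN */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    run(conn, "DROP TABLE IF EXISTS dw_vacuum_test");
    run(conn, "CREATE TABLE dw_vacuum_test AS "
              "SELECT i, repeat('x', 500) AS filler "
              "FROM generate_series(1, 1000000) AS i");
    /* Dirty most of the table's pages and leave dead tuples behind. */
    run(conn, "UPDATE dw_vacuum_test SET filler = repeat('y', 500)");
    /* The interesting part: how fast VACUUM can push all that back out. */
    run(conn, "VACUUM dw_vacuum_test");

    PQfinish(conn);
    return 0;
}

Build it with -lpq and compare the VACUUM time with and without the
patch applied.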

I scribbled some notes on this problem area at
http://blog.2ndquadrant.com/en/2011/01/tuning-linux-for-low-postgresq.html
; the links there that broke when our web site was rearranged now point
to http://highperfpostgres.com/pgbench-results/index.htm (test summary)
and http://www.highperfpostgres.com/pgbench-results/435/index.html (a
really bad latency spike example).

> Given the batching functionality, double writes by the checkpoint
> writer (and bgwriter) is implemented efficiently by writing a batch of
> pages to the double-write file and fsyncing, and then writing the
> pages to the appropriate data files, and fsyncing all the necessary
> data files.  While the data fsyncing might be viewed as expensive, it
> does help eliminate a lot of the fsync overhead at the end of
> checkpoints.  FlushRelationBuffers() and FlushDatabaseBuffers() can be
> similarly batched.
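
For anyone following along, the ordering being described works out to
something like this POSIX-level sketch of one batch.  It's only an
illustration of the sequence, not the patch's code; the struct and the
constants are made up for the example, and a real version also needs
per-batch checksums so a torn batch can be detected at recovery time:

#include <unistd.h>

#define BLCKSZ      8192
#define BATCH_SIZE  32          /* the proposed batched_buffer_writes default */

typedef struct
{
    int         datafd;         /* data file this buffer belongs to */
    off_t       offset;         /* byte offset of the block in that file */
    char        page[BLCKSZ];   /* page image to write */
} BatchEntry;

static int
flush_batch(int dwfd, BatchEntry *batch, int n)
{
    /* 1. Write the whole batch sequentially to the double-write file. */
    for (int i = 0; i < n; i++)
        if (pwrite(dwfd, batch[i].page, BLCKSZ, (off_t) i * BLCKSZ) != BLCKSZ)
            return -1;

    /* 2. One fsync makes the batch durable before any data file is touched. */
    if (fsync(dwfd) != 0)
        return -1;

    /* 3. Now the in-place writes; a torn page here is recoverable from 1. */
    for (int i = 0; i < n; i++)
        if (pwrite(batch[i].datafd, batch[i].page, BLCKSZ,
                   batch[i].offset) != BLCKSZ)
            return -1;

    /* 4. fsync each data file the batch touched (deduplication omitted). */
    for (int i = 0; i < n; i++)
        if (fsync(batch[i].datafd) != 0)
            return -1;

    return 0;
}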

There's a fundamental struggle here between latency and throughput.  The 
longer you delay between writes and their subsequent sync, the more the 
OS gets a chance to reorder and combine them for better throughput. 
Ditto for any storage-level optimizations, controller write caches,
and the like.  All that increases throughput, and more batching helps
move in that direction.  But when you overload those caches and writes
won't squeeze into them anymore...now there's a latency spike.  And as
throughput increases, so does the amount of dirty cache that needs to
be cleared per unit of time.

Eventually, all this disk I/O turns into a series of random writes.
You can postpone those in various ways and resequence them in ways
that help some tests.  But if they're the true bottleneck, eventually
all the caches will fill, and clients will be stuck waiting on them.
It's hard to imagine how anything that increases the amount of data
written could ever move that problem in the right direction for the
worst case.  Adjusting the sync sequence just moves the problem
somewhere else.  If you get lucky, that's a better place most of the
time; how that bet turns out will be very workload dependent, though.
I've lost a lot of those bets while trying to resequence syncs over
the last two years, where the benefits were extremely test dependent.
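
If you want to watch that spike happen outside the database, something
as crude as this shows it.  The file name and sizes here are arbitrary;
run it against the same filesystem and write cache setup you actually
care about:

/*
 * Dirty an increasing amount of OS page cache, then time how long a
 * single fsync takes to push it out.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)   /* write in 1MB chunks */

static double
elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int
main(void)
{
    char       *buf = malloc(CHUNK);
    int         fd = open("fsync-test.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0 || buf == NULL)
        return 1;
    memset(buf, 'x', CHUNK);

    /* Let successively larger amounts of dirty data pile up before syncing. */
    for (int mb = 64; mb <= 1024; mb *= 2)
    {
        struct timespec start, stop;

        for (int i = 0; i < mb; i++)
            if (write(fd, buf, CHUNK) != CHUNK)
                return 1;

        clock_gettime(CLOCK_MONOTONIC, &start);
        fsync(fd);              /* here is where the stall shows up */
        clock_gettime(CLOCK_MONOTONIC, &stop);

        printf("%4d MB dirty: fsync took %.1f ms\n",
               mb, elapsed_ms(start, stop));
    }

    close(fd);
    unlink("fsync-test.dat");
    free(buf);
    return 0;
}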

> We have some other code (not included) that sorts buffers to be
> checkpointed in file/block order -- this can reduce fsync overhead
> further by ensuring that each batch writes to only one or a few data
> files.

Again, the database doesn't necessarily have the information to make
this level of decision better than the underlying layers do.  We've
already been through two runs at this idea that ended inconclusively.
You can see the one I did last year at
http://highperfpostgres.com/pgbench-results/index.htm ; sets 9 and 11
are the same test without (9) and with (11) write sorting.  If there's
really a difference there, it's below the noise floor as far as I
could see.  Whether sorting helps or hurts is both workload and
hardware dependent.
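
For reference, the sort in question is nothing more exotic than a
comparator over the buffer tags, roughly like this sketch; the tag
layout is simplified relative to the backend's real buffer tags, and
the names are just for the example:

#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    uint32_t    tablespace;
    uint32_t    relation;
    uint32_t    fork;
    uint32_t    block;
    int         buf_id;         /* which shared buffer holds the page */
} CkptSortItem;

static int
ckpt_buffer_cmp(const void *a, const void *b)
{
    const CkptSortItem *x = a;
    const CkptSortItem *y = b;

    if (x->tablespace != y->tablespace)
        return x->tablespace < y->tablespace ? -1 : 1;
    if (x->relation != y->relation)
        return x->relation < y->relation ? -1 : 1;
    if (x->fork != y->fork)
        return x->fork < y->fork ? -1 : 1;
    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    return 0;
}

/* Sort the to-be-written list once, then hand out batches in order. */
static void
sort_checkpoint_buffers(CkptSortItem *items, size_t n)
{
    qsort(items, n, sizeof(CkptSortItem), ckpt_buffer_cmp);
}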

> As Jignesh has mentioned on this list, we see significant performance
> gains when enabling double writes & disabling full_page_writes for
> OLTP runs with sufficient buffer cache size.  We are now trying to
> measure some runs where the dirty buffer eviction rate by the backends
> is high.

We'd need positive results published along with a publicly
reproducible benchmark to move forward usefully on this.  I aimed for a much
smaller goal than this in a similar area, around this same time last 
year.  I didn't get very far down that path before 9.1 development 
closed; it just takes too long to run enough benchmarks to really 
validate performance code in the write path.  This is a pretty obtrusive 
change to drop into the codebase for 9.2 at this point in the 
development cycle.

P.S. I got the impression you're testing these changes primarily against 
a modified 9.0.  One of the things that came out of the 9.1 performance 
testing was the "compact fsync queue" modification.  Once that
significant improvement was committed, it rippled out enough that
several things which used to matter in my tests no longer did.  If
your baseline doesn't already include that feature, you may have an
uphill battle proving that the performance gains you've been seeing
will still show up in the current 9.2 code.  Performance there has
advanced even further, in ways 9.0 can't emulate.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

