Re: Controlling Load Distributed Checkpoints

From: Greg Smith
Subject: Re: Controlling Load Distributed Checkpoints
Date:
Msg-id: Pine.GSO.4.64.0706110316020.9600@westnet.com
In reply to: Re: Controlling Load Distributed Checkpoints  (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>)
Responses: Sorted writes in checkpoint  (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>)
List: pgsql-hackers
On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:

> If the kernel can treat sequential writes better than random writes, is 
> it worth sorting dirty buffers in block order per file at the start of 
> checkpoints?

I think it has the potential to improve things.  There are three obvious 
arguments against it that I can think of, plus one subtle one:

1) Extra complexity for something that may not help.  This would need to 
show a good, robust improvement in benchmarks to justify its use.

2) Block number ordering may not reflect actual order on disk.  While 
true, it's got to be better correlated with it than writing at random.

3) The OS disk elevator should be dealing with this issue, particularly 
because it may really know the actual disk ordering.

Here's the subtle thing:  by writing in the same order the LRU scan occurs 
in, you are writing dirty buffers in the optimal fashion to eliminate 
client backend writes during BufferAlloc.  This makes the checkpoint a 
really effective LRU clearing mechanism.  Writing in block order will 
change that.
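
To make the sorted-writes idea concrete, here's a minimal sketch of what 
the sort step might look like.  This is not the actual buffer manager 
code; DirtyBufferEntry and its fields are made up for illustration:

#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical entry for a dirty buffer queued at checkpoint time.
 * Field names are illustrative only, not PostgreSQL internals.
 */
typedef struct DirtyBufferEntry
{
    unsigned int rel_file_id;   /* which relation file the block belongs to */
    unsigned int block_num;     /* block offset within that file */
    int          buf_id;        /* index into the shared buffer pool */
} DirtyBufferEntry;

/* Sort by file first, then by block number within each file. */
static int
dirty_entry_cmp(const void *a, const void *b)
{
    const DirtyBufferEntry *x = (const DirtyBufferEntry *) a;
    const DirtyBufferEntry *y = (const DirtyBufferEntry *) b;

    if (x->rel_file_id != y->rel_file_id)
        return (x->rel_file_id < y->rel_file_id) ? -1 : 1;
    if (x->block_num != y->block_num)
        return (x->block_num < y->block_num) ? -1 : 1;
    return 0;
}

int
main(void)
{
    /* A few dirty buffers in the order the checkpoint scan found them */
    DirtyBufferEntry queue[] = {
        {7, 9123, 2}, {3, 15, 0}, {7, 12, 1}, {3, 8800, 3}
    };
    size_t n = sizeof(queue) / sizeof(queue[0]);
    size_t i;

    qsort(queue, n, sizeof(DirtyBufferEntry), dirty_entry_cmp);

    for (i = 0; i < n; i++)
        printf("file %u block %u (buffer %d)\n",
               queue[i].rel_file_id, queue[i].block_num, queue[i].buf_id);
    return 0;
}

The point of the per-file grouping is that consecutive block numbers 
within one file are at least likely to be physically adjacent; across 
files there is no such expectation.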

I spent some time trying to optimize the elevator part of this operation, 
since I knew that on the system I was using, block order was the actual 
disk order. 
I found that under Linux, the behavior of the pdflush daemon that manages 
dirty memory had a more serious impact on writing behavior at checkpoint 
time than playing with the elevator scheduling method did.  The way 
pdflush works actually has several interesting implications for how to 
optimize this patch.  For example, the way writes get blocked once dirty 
memory reaches certain thresholds means you may not get the full benefit 
of the disk elevator at checkpoint time, the way most would expect.

Since much of that was basically undocumented, I had to write my own 
analysis of the actual workings, which is now available at 
http://www.westnet.com/~gsmith/content/linux-pdflush.htm  Anyone who 
wants more information about how Linux kernel parameters like 
dirty_background_ratio actually work, and how they impact the writing 
strategy, should find that article uniquely helpful.
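
For anyone who wants to see where their own system sits relative to 
those thresholds, here's a trivial sketch that just prints the two main 
tunables.  The /proc paths are the standard Linux ones; everything else 
is illustrative:

#include <stdio.h>

/* Print one /proc/sys/vm tunable, if it exists on this kernel. */
static void
show(const char *path)
{
    FILE *f = fopen(path, "r");
    char  buf[64];

    if (f == NULL)
    {
        printf("%s: unavailable\n", path);
        return;
    }
    if (fgets(buf, sizeof(buf), f) != NULL)
        printf("%s = %s", path, buf);   /* value includes its own newline */
    fclose(f);
}

int
main(void)
{
    /* Percentage of memory dirty before background writeback starts */
    show("/proc/sys/vm/dirty_background_ratio");
    /* Percentage of memory dirty before writing processes get blocked */
    show("/proc/sys/vm/dirty_ratio");
    return 0;
}

Once the Dirty figure in /proc/meminfo crosses that second threshold, 
processes doing writes get throttled, which is exactly the situation a 
checkpoint can push the system into.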

> Some kernels or storage subsystems treat all I/Os too fairly, so that 
> user transactions waiting for reads are blocked by checkpoint writes.

In addition to that (which I've seen happen quite a bit), in the Linux 
case another fairness issue is that the code that handles writes allows a 
single process writing a lot of data to block writes for everyone else. 
That means that in addition to being blocked on actual reads, a client 
backend that starts a write in order to complete a buffer allocation for 
new data can also grind to a halt because of the checkpoint process.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

