Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Greg Smith
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date:
Msg-id: 51EDF01A.4050006@2ndQuadrant.com
In reply to: Re: Improvement of checkpoint IO scheduler for stable transaction responses  (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
Responses: Re: Improvement of checkpoint IO scheduler for stable transaction responses  (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
List: pgsql-hackers
On 7/22/13 4:52 AM, KONDO Mitsumasa wrote:
> The writeback source code I pointed to is almost the same as in the
> community kernel (2.6.32.61). I also read Linux kernel 3.9.7, and this
> part is almost the same there.

The main source code difference comes from going back to the Red Hat 5 
kernel, which means 2.6.18.  For many of these versions you are right 
that only the tuning parameters changed in the newer releases.

Optimizing performance for the old RHEL5 kernel isn't the most important 
thing, but it's helpful to know the things it does very badly.
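
If you want to compare what a particular kernel is actually doing, the 
writeback knobs are all exposed under /proc/sys/vm.  Here is a minimal 
sketch for dumping them (Python; these are the standard Linux sysctl 
names, shown for illustration rather than anything from the patch):

    # Print the kernel writeback tunables whose defaults changed
    # the most between kernel generations.
    knobs = (
        "dirty_background_ratio",    # % of memory before background writeback
        "dirty_ratio",               # % of memory before writers are throttled
        "dirty_expire_centisecs",    # age at which dirty data must be written
        "dirty_writeback_centisecs", # how often the flusher threads wake up
    )
    for knob in knobs:
        with open("/proc/sys/vm/" + knob) as f:
            print("vm.%s = %s" % (knob, f.read().strip()))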

> My fsync patch only sleeps after fsync returns successfully, and the
> maximum sleep time is set to 10 seconds. It does not make this problem
> worse.

It's easy to have hundreds of relations that are getting fsync calls 
during a checkpoint.  If you have 100 relations getting a 10 second 
sleep each, that's 1000 seconds, so you could potentially delay a 
checkpoint by almost 17 minutes this way.  I regularly see systems where 
shared_buffers=8GB and there are 200 to 400 relation segments that need 
a sync during a checkpoint.
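
To make the worst case concrete, here is a back-of-the-envelope sketch 
(Python; the file counts and the 10 second cap are just the numbers from 
above, not anything measured from the patch itself):

    # Worst-case extra checkpoint time if every file that needs an
    # fsync is followed by the maximum sleep.
    max_sleep_s = 10                # patch's maximum post-fsync sleep
    for nfiles in (100, 200, 400):  # relation segments needing a sync
        print("%d files -> up to %.1f extra minutes"
              % (nfiles, nfiles * max_sleep_s / 60.0))
    # 100 files -> up to 16.7 extra minutes
    # 400 files -> up to 66.7 extra minutes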

This is the biggest problem with your submission.  Once you give up 
following the checkpoint schedule carefully, it is very easy to end up 
with large checkpoint deadline misses on production servers.  If someone 
thinks they are doing a checkpoint every 5 minutes, but your patch makes 
them take 20 minutes instead, that is bad.  They will not expect that a 
crash might have to replay that much activity before the server is 
useful again.

>> You also don't seem afraid yet of how exceeding the checkpoint
>> timeout is a very bad thing.
> I think it is important to understand why this problem is caused. We
> should try to find which program has the bug or problem.

The checkpointer process is the problem.  There are no filesystem bugs 
or complicated issues involved in many of the bad cases.  Here is a 
simple example that shows how the toughest problem cases happen:

-64GB of RAM
-10% dirty_background_ratio = 6GB of dirty writes = 6144MB
-2MB/s random I/O when concurrent reads are heavy
-3072 seconds to clear the cache = 51 minutes
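
The same arithmetic as a minimal sketch (Python; these are the example 
numbers above, with 10% of 64GB rounded down to an even 6GB):

    # How long the kernel needs to drain the dirty page backlog
    # once writeback kicks in at dirty_background_ratio.
    dirty_mb = 6 * 1024   # 6GB of dirty writes = 6144MB
    write_mb_per_s = 2    # random I/O rate under heavy concurrent reads
    seconds = dirty_mb / write_mb_per_s
    print("%d seconds = %.0f minutes" % (seconds, seconds / 60))
    # 3072 seconds = 51 minutes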

That's how you get to an example like the one in my slides:

LOG: checkpoint complete: wrote 33282 buffers (3.2%); 0 transaction log 
file(s) added, 60 removed, 129 recycled; write=228.848 s, sync=4628.879 
s, total=4858.859 s

It's very hard to do better on these, and I don't expect any change to 
help this a lot.  But I don't want to see a change committed that makes 
this sort of checkpoint 17 minutes longer when there are 100 relations 
involved either.

> My patch not only improves throughput but also achieves a stable
> response time during the fsync phase of a checkpoint.

The main reason your patch improves latency and throughput is that it 
makes checkpoints farther apart.  That's why I drew you a graph showing 
how the time between checkpoints lined up perfectly with TPS.  If it was 
only a small problem it would be worth considering, but I think it's 
likely to end up with the >15 minute delays I've outlined here instead.

> And I surveyed the ext3 file system.

I wouldn't worry too much about the problems ext3 has.  Like the old 
RHEL5 kernel I commented on above, there are a lot of ext3 systems out 
there, but we can't do a lot about getting good performance from them.  
It's only important to test that you're not making them a lot worse 
with a change.

> My system block size is 4096, but 8192 or more seems to be better. It
> would decrease the number of inodes and produce larger sequential disk
> regions.

I normally increase read-ahead on Linux systems to get faster speed on 
sequential disk throughput.  Changing the block size might work better 
in some cases, but not many people are willing to do that.  Read-ahead 
is very easy to change at any time.
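
For reference, here is a minimal sketch of inspecting and raising 
read-ahead through sysfs (Python; the device name "sda" and the 4096KB 
target are made-up illustration values, writing the file needs root, and 
"blockdev --setra" accomplishes the same thing):

    # Show and then raise the read-ahead window for one block device.
    dev = "sda"   # hypothetical device; adjust to your system
    path = "/sys/block/%s/queue/read_ahead_kb" % dev
    with open(path) as f:
        print("current read-ahead: %sKB" % f.read().strip())
    with open(path, "w") as f:    # requires root
        f.write("4096")           # raise to 4MB for sequential scans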

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


