Re: Spread checkpoint sync

From:        Robert Haas
Subject:     Re: Spread checkpoint sync
Date:
Msg-id:      AANLkTim2uE36E1oZ0aWt4XN_Bin-==noh7QUwVKusaX_@mail.gmail.com
In reply to: Re: Spread checkpoint sync  (Greg Smith <greg@2ndquadrant.com>)
Responses:   Re: Spread checkpoint sync
             Re: Spread checkpoint sync
             Re: Spread checkpoint sync
List:        pgsql-hackers
On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I've attached an updated version of the initial sync spreading patch here,
> one that applies cleanly on top of HEAD and over top of the sync
> instrumentation patch too. The conflict that made that hard before is gone
> now.

With the fsync queue compaction patch applied, I think most of this is now not needed. Attached please find an attempt to isolate the portion that looks like it might still be useful. The basic idea of what remains here is to make the background writer still do its normal stuff even when it's checkpointing. In particular, with this patch applied, PG will:

1. Absorb fsync requests a lot more often during the sync phase.
2. Still try to run the cleaning scan during the sync phase.
3. Pause for 3 seconds after every fsync.

I suspect that #1 is probably a good idea. It seems pretty clear based on your previous testing that the fsync compaction patch should be sufficient to prevent us from hitting the wall, but if we're going to do any kind of nontrivial work here then cleaning the queue is a sensible thing to do along the way, and there's little downside.

I also suspect #2 is a good idea. The fact that we're checkpointing doesn't mean that the system suddenly doesn't require clean buffers, and the experimentation I've done recently (see: limiting hint bit I/O) convinces me that it's pretty expensive from a performance standpoint when backends have to start writing out their own buffers, so continuing to do that work during the sync phase of a checkpoint, just as we do during the write phase, seems pretty sensible.

I think something along the lines of #3 is probably a good idea, but the current coding doesn't take checkpoint_completion_target into account. The underlying problem here is that it's at least somewhat reasonable to assume that if we write() a whole bunch of blocks, each write() will take approximately the same amount of time. But this is not true at all with respect to fsync() - they neither take the same amount of time as each other, nor is there any fixed ratio between write() time and fsync() time to go by. So if we want the checkpoint to finish in, say, 20 minutes, we can't know whether the write phase needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.

One idea I have is to try to get some of the fsyncs out of the queue at times other than end-of-checkpoint. Even if this resulted in some modest increase in the total number of fsync() calls, it might improve performance by causing data to be flushed to disk in smaller chunks. For example, suppose we kept an LRU list of pending fsync requests - every time we remember an fsync request for a particular relation, we move it to the head (hot end) of the LRU. And periodically we pull the tail entry off the list and fsync it - say, after checkpoint_timeout / (# of items in the list). That way, when we arrive at the end of the checkpoint and start syncing everything, the syncs hopefully complete more quickly because we've already forced a bunch of the data down to disk. That algorithm may well be too stupid or just not work in real life, but perhaps there's some variation that would be sensible. The point is: instead of or in addition to trying to spread out the sync phase, we might want to investigate whether it's possible to reduce its size.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
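
For concreteness, here is a minimal C sketch of the kind of sync-phase loop the three numbered behaviors describe. The loop structure and the PendingSync type are hypothetical, not the actual patch; AbsorbFsyncRequests(), BgBufferSync(), pg_fsync(), and pg_usleep() are real backend routines.

    /*
     * Illustrative sketch only: a sync-phase loop with the three
     * behaviors described above.  The loop and the PendingSync type
     * are made up for this example.
     */
    typedef struct PendingSync
    {
        int         fd;             /* file to sync */
        const char *path;           /* its name, for error reporting */
    } PendingSync;

    static void
    sync_phase(PendingSync *syncs, int nsyncs)
    {
        int         i;

        for (i = 0; i < nsyncs; i++)
        {
            if (pg_fsync(syncs[i].fd) < 0)
                elog(ERROR, "could not fsync file \"%s\": %m",
                     syncs[i].path);

            /* #1: drain the shared-memory fsync request queue often */
            AbsorbFsyncRequests();

            /* #2: keep the cleaning scan supplying clean buffers */
            BgBufferSync();

            /* #3: fixed 3-second pause; note this takes no account of
             * checkpoint_completion_target */
            pg_usleep(3000000L);
        }
    }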
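And a self-contained sketch of the LRU idea from the final paragraph, with made-up names rather than actual backend structures: remembering a request moves its entry to the hot end, and a periodic tick, spaced at roughly checkpoint_timeout / npending, syncs and removes the coldest entry.

    #include <stdbool.h>
    #include <unistd.h>

    typedef struct PendingFsync
    {
        int                  fd;
        struct PendingFsync *hotter;   /* toward most recently dirtied */
        struct PendingFsync *colder;   /* toward least recently dirtied */
    } PendingFsync;

    static PendingFsync *hot_end;      /* most recently dirtied */
    static PendingFsync *cold_end;     /* least recently dirtied */
    static int           npending;

    /* Unlink pf from wherever it currently is in the list. */
    static void
    unlink_entry(PendingFsync *pf)
    {
        if (pf->hotter)
            pf->hotter->colder = pf->colder;
        else
            hot_end = pf->colder;
        if (pf->colder)
            pf->colder->hotter = pf->hotter;
        else
            cold_end = pf->hotter;
        npending--;
    }

    /* On every fsync request: move (or add) the entry to the hot end. */
    static void
    remember_fsync_request(PendingFsync *pf, bool already_listed)
    {
        if (already_listed)
            unlink_entry(pf);
        pf->hotter = NULL;
        pf->colder = hot_end;
        if (hot_end)
            hot_end->hotter = pf;
        hot_end = pf;
        if (cold_end == NULL)
            cold_end = pf;
        npending++;
    }

    /*
     * Periodic tick: sync the coldest entry.  The caller sleeps about
     * checkpoint_timeout / npending between ticks, so even a full list
     * drains within roughly one checkpoint interval.
     */
    static void
    sync_one_cold_entry(void)
    {
        PendingFsync *victim = cold_end;

        if (victim == NULL)
            return;
        unlink_entry(victim);
        (void) fsync(victim->fd);   /* error handling omitted */
    }

Whether pre-syncing the cold end actually shrinks the end-of-checkpoint sync phase, rather than just adding fsync() traffic, is exactly the open question raised above.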
Attachments