Re: Spreading full-page writes

From Greg Stark
Subject Re: Spreading full-page writes
Date
Msg-id CAM-w4HPnbzEP0QZrc7ELkAWUEyEmYfGrE0164dEqmt7KhP4a9A@mail.gmail.com
In reply to Re: Spreading full-page writes  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Spreading full-page writes
Re: Spreading full-page writes
List pgsql-hackers
On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>
> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>
>> Another idea would be to have separate checkpoints for each buffer
>> partition. You would have to start recovery from the oldest checkpoint of
>> any of the partitions.
>
> Yeah. Simon suggested that when we talked about this, but I didn't
> understand how that works at the time. I think I do now. The key to making
> it work is distinguishing, when starting recovery from the latest
> checkpoint, whether a record for a given page can be replayed safely. I
> used flags on WAL records in my proposal to achieve this, but using buffer
> partitions is simpler.

Interesting. I just thought of it independently.

Incidentally you wouldn't actually want to use the buffer partitions
per se since the new server might start up with a different number of
partitions. You would want an algorithm for partitioning the block
space that xlog replay can reliably reproduce regardless of the size
of the buffer lock partition table. It might make sense to set it up
so it coincidentally ensures all the buffers being flushed are in the
same partition or maybe the reverse would be better. Probably it
doesn't actually matter.
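
As a rough sketch of such a partitioning, the function below maps a page to a checkpoint partition from the page's identity alone, so replay can recompute it no matter how the buffer lock partition table is sized. Everything here is hypothetical and for illustration only: `PageId`, its fields, `CKPT_PARTITIONS`, and the choice of a 32-bit FNV-1a hash are assumptions, not PostgreSQL code.

```c
#include <stdint.h>

/* Hypothetical: a fixed partition count that would be stored with the
 * cluster, independent of the buffer lock partition table size. */
#define CKPT_PARTITIONS 16

/* Hypothetical page identity; field names are illustrative only. */
typedef struct PageId {
    uint32_t rel;    /* relation file identifier */
    uint32_t fork;   /* fork number */
    uint32_t block;  /* block number within the fork */
} PageId;

/* Fold one 32-bit value into a 32-bit FNV-1a hash, least significant
 * byte first, so the result does not depend on host endianness. */
static uint32_t fnv1a_u32(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;            /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Partition of a page: depends only on the page identity and the
 * fixed partition count, never on runtime server settings. */
static unsigned ckpt_partition(PageId p)
{
    uint32_t h = 2166136261u;      /* FNV-1a 32-bit offset basis */
    h = fnv1a_u32(h, p.rel);
    h = fnv1a_u32(h, p.fork);
    h = fnv1a_u32(h, p.block);
    return h % CKPT_PARTITIONS;
}
```

Both the checkpointer and xlog replay would call `ckpt_partition()` with the same inputs and get the same answer, which is the only property the scheme needs.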

> For simplicity, let's imagine that we have two Redo-pointers for each
> checkpoint record: one for even-numbered pages, and another for
> odd-numbered pages. When checkpoint begins, we first update the Even-redo
> pointer to the current WAL insert location, and then flush all the
> even-numbered buffers in the buffer cache. Then we do the same for Odd.

Hm, I had convinced myself that the LSN on the pages would mean you
skip the replay anyway, but I think I was wrong. You would need to
keep a bitmap of which partitions are in recovery mode as you replay,
keep adding partitions until they're all in recovery mode, and then
keep going until you've seen the checkpoint record for all of them.
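
The bitmap idea could be sketched roughly as follows. This assumes one redo pointer per partition is available at startup; `ReplayState`, `Lsn`, `CKPT_PARTITIONS`, and the function names are all hypothetical, and the sketch covers only the decision of whether to apply a record, not the end-of-recovery condition.

```c
#include <stdbool.h>
#include <stdint.h>

#define CKPT_PARTITIONS 4          /* hypothetical partition count */

typedef uint64_t Lsn;              /* hypothetical WAL position type */

typedef struct ReplayState {
    Lsn      redo[CKPT_PARTITIONS]; /* per-partition redo pointers */
    uint32_t active;                /* bit p set: partition p is in recovery mode */
} ReplayState;

/* Recovery has to start from the oldest redo pointer of any partition. */
static Lsn replay_start(const ReplayState *st)
{
    Lsn min = st->redo[0];
    for (int p = 1; p < CKPT_PARTITIONS; p++)
        if (st->redo[p] < min)
            min = st->redo[p];
    return min;
}

/* Called for each record in LSN order. Partitions whose redo pointer
 * has been reached are added to the bitmap; the record is applied only
 * if its page's partition is already in recovery mode. */
static bool should_replay(ReplayState *st, Lsn lsn, unsigned part)
{
    for (unsigned p = 0; p < CKPT_PARTITIONS; p++)
        if (lsn >= st->redo[p])
            st->active |= (1u << p);
    return (st->active >> part) & 1u;
}

/* True once every partition is in recovery mode. */
static bool all_active(const ReplayState *st)
{
    return st->active == (1u << CKPT_PARTITIONS) - 1u;
}
```

With redo pointers 100, 200, 300, 400, replay starts at 100; a record at LSN 150 is applied if it touches partition 0 and skipped for partitions 1 to 3, and from LSN 400 onward everything is applied.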

I'm assuming you would keep N checkpoint positions in the control
file. That also means we can double the checkpoint timeout with only a
marginal increase in the worst-case recovery time, since the worst
case will be (1 + 1/n)*timeout's worth of WAL to replay rather than
2*timeout's. The amount of time for recovery would be much more predictable.
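
To put numbers on that, here is a small sketch of the arithmetic. It assumes the single-checkpoint worst case is about 2*timeout's worth of WAL and takes the (1 + 1/n)*timeout figure at face value; the function names are hypothetical and the result is in the same time unit as the timeout.

```c
/* Worst-case WAL to replay with one checkpoint covering all pages
 * (assumed here to be about two timeouts' worth). */
static double worst_case_single(double timeout)
{
    return 2.0 * timeout;
}

/* Worst-case WAL to replay with n partitions checkpointed in turn. */
static double worst_case_partitioned(double timeout, int n)
{
    return (1.0 + 1.0 / n) * timeout;
}
```

For example, with a 300-second timeout the single-checkpoint worst case is 600 seconds' worth of WAL. Doubling the timeout to 600 seconds with 16 partitions gives (1 + 1/16) * 600 = 637.5 seconds' worth, only slightly more.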

> Recovery begins at the Even-redo pointer. Replay works as normal, but
> until you reach the Odd-pointer, you refrain from replaying any changes to
> Odd-numbered pages. After reaching the odd-pointer, you replay everything
> as normal.
>
> Hmm, that seems actually doable...



--
greg


