Discussion: Spreading full-page writes
Here's an idea I tried to explain to Andres and Simon at the pub last night, on how to reduce the spikes in the amount of WAL written at the beginning of a checkpoint that full-page writes cause. I'm just writing this down for the sake of the archives; I'm not planning to work on this myself.

When you are replaying a WAL record that lies between the Redo-pointer of a checkpoint and the checkpoint record itself, there are two possibilities:

a) You started WAL replay at that checkpoint's Redo-pointer.

b) You started WAL replay at some earlier checkpoint, and are already in a consistent state.

In case b), you wouldn't need to replay any full-page images; normal differential WAL records would be enough. In case a), you do, and you won't be consistent until replaying all the WAL up to the checkpoint record.

We can exploit those properties to spread out the spike. When you modify a page and you're about to write a WAL record, check if the page has the BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page against the *previous* checkpoint's redo-pointer, instead of the one that's currently in progress. If no full-page image is required based on that comparison, IOW if the page was modified and a full-page image was already written after the earlier checkpoint, write a normal WAL record without a full-page image and set a new flag in the buffer header (BM_NEEDS_FPW). Also set a new flag on the WAL record, XLR_FPW_SKIPPED.

When the checkpointer (or any other backend that needs to evict a buffer) is about to flush a page from the buffer cache that has the BM_NEEDS_FPW flag set, write a new WAL record, containing a full-page image of the page, before flushing the page.

Here's how this works out during replay:

a) You start WAL replay from the latest checkpoint's Redo-pointer.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't replay that record at all.
It's OK because we know that there will be a separate record containing the full-page image of the page later in the stream.

b) You are continuing WAL replay that started from an earlier checkpoint, and have already reached consistency.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, replay it normally. It's OK, because the flag means that the page was already modified after the earlier checkpoint, and hence we must have seen a full-page image of it already. When you see one of the WAL records containing a separate full-page image, ignore it.

This scheme makes the b-case behave just as if the new checkpoint was never started. The regular WAL records in the stream are identical to what they would've been if the redo-pointer pointed to the earlier checkpoint. And the additional FPW records are simply ignored.

In the a-case, it's not safe to replay the records marked with XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the usual torn-page hazards that come with that. However, the separate FPW records that come later in the stream will fix up those pages.

Now, I'm sure there are issues with this scheme I haven't thought about, but I wanted to get this written down. Note this does not reduce the overall WAL volume - on the contrary - but it ought to reduce the spike.

- Heikki
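The two replay cases above boil down to one decision rule per record. The sketch below is illustrative Python, not PostgreSQL code; the function name and boolean flags are invented for the example:

```python
# Hypothetical sketch of the replay-time decision rule described above.
# A record marked XLR_FPW_SKIPPED carries no full-page image; a separate
# FPW record for the same page appears later in the stream.

def should_apply(is_fpw_skipped, is_separate_fpw, started_at_latest_redo):
    """Return True if the record should be replayed.

    started_at_latest_redo -- case (a): replay began at the latest
    checkpoint's redo pointer, so we are not yet consistent and torn
    pages are possible.
    """
    if started_at_latest_redo:
        # Case (a): skip the differential record; the separate FPW
        # later in the stream will overwrite the page in full.
        if is_fpw_skipped:
            return False
        return True   # separate FPW records (and all others) apply
    else:
        # Case (b): already consistent, so the differential record is
        # safe, and the redundant separate FPW can be ignored.
        if is_separate_fpw:
            return False
        return True
```

The rule is symmetric: exactly one of the two record variants is applied in each case, so the page ends up correct either way.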
On Mon, May 26, 2014 at 6:52 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Here's an idea I tried to explain to Andres and Simon at the pub last night,
> on how to reduce the spikes in the amount of WAL written at beginning of a
> checkpoint that full-page writes cause. I'm just writing this down for the
> sake of the archives; I'm not planning to work on this myself.
>
> When you are replaying a WAL record that lies between the Redo-pointer of a
> checkpoint and the checkpoint record itself, there are two possibilities:
>
> a) You started WAL replay at that checkpoint's Redo-pointer.
>
> b) You started WAL replay at some earlier checkpoint, and are already in a
> consistent state.
>
> In case b), you wouldn't need to replay any full-page images, normal
> differential WAL records would be enough. In case a), you do, and you won't
> be consistent until replaying all the WAL up to the checkpoint record.
>
> We can exploit those properties to spread out the spike. When you modify a
> page and you're about to write a WAL record, check if the page has the
> BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page
> against the *previous* checkpoint's redo-pointer, instead of the one that's
> currently in-progress. If no full-page image is required based on that
> comparison, IOW if the page was modified and a full-page image was already
> written after the earlier checkpoint, write a normal WAL record without
> full-page image and set a new flag in the buffer header (BM_NEEDS_FPW). Also
> set a new flag on the WAL record, XLR_FPW_SKIPPED.
>
> When checkpointer (or any other backend that needs to evict a buffer) is
> about to flush a page from the buffer cache that has the BM_NEEDS_FPW flag
> set, write a new WAL record, containing a full-page-image of the page,
> before flushing the page.

How does this mechanism work during base backup? pg_stop_backup needs
to flush all buffers with the BM_NEEDS_FPW flag?
>
> Here's how this works out during replay:
>
> a) You start WAL replay from the latest checkpoint's Redo-pointer.
>
> When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't
> replay that record at all. It's OK because we know that there will be a
> separate record containing the full-page image of the page later in the
> stream.
>
> b) You are continuing WAL replay that started from an earlier checkpoint,
> and have already reached consistency.
>
> When you see a WAL record that's been marked with XLR_FPW_SKIPPED, replay it
> normally. It's OK, because the flag means that the page was modified after
> the earlier checkpoint already, and hence we must have seen a full-page
> image of it already. When you see one of the WAL records containing a
> separate full-page-image, ignore it.
>
> This scheme makes the b-case behave just as if the new checkpoint was never
> started. The regular WAL records in the stream are identical to what they
> would've been if the redo-pointer pointed to the earlier checkpoint. And the
> additional FPW records are simply ignored.
>
> In the a-case, it's not safe to replay the records marked with
> XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the usual
> torn-page hazards that come with that. However, the separate FPW records
> that come later in the stream will fix up those pages.
>
> Now, I'm sure there are issues with this scheme I haven't thought about, but
> I wanted to get this written down. Note this does not reduce the overall WAL
> volume - on the contrary - but it ought to reduce the spike.

ISTM that this can increase WAL volume because one data change can
generate both normal WAL and FPW. No?

Regards,

--
Fujii Masao
On May 25, 2014, at 5:52 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Here's how this works out during replay:
>
> a) You start WAL replay from the latest checkpoint's Redo-pointer.
>
> When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't
> replay that record at all. It's OK because we know that there will be a
> separate record containing the full-page image of the page later in the
> stream.

I don't think we know that. The server might have crashed before that
second record got generated. (This appears to be an unfixable flaw in
this proposal.)

...Robert
On 26 May 2014 20:16:33 EEST, Robert Haas <robertmhaas@gmail.com> wrote:
>On May 25, 2014, at 5:52 PM, Heikki Linnakangas
><hlinnakangas@vmware.com> wrote:
>> Here's how this works out during replay:
>>
>> a) You start WAL replay from the latest checkpoint's Redo-pointer.
>>
>> When you see a WAL record that's been marked with XLR_FPW_SKIPPED,
>don't replay that record at all. It's OK because we know that there
>will be a separate record containing the full-page image of the page
>later in the stream.
>
>I don't think we know that. The server might have crashed before that
>second record got generated. (This appears to be an unfixable flaw in
>this proposal.)

The second record is generated before the checkpoint is finished and the checkpoint record is written. So it will be there.

(if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and won't be used)

- Heikki
On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
The second record is generated before the checkpoint is finished and the checkpoint record is written. So it will be there.
(if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and won't be used)
Another idea would be to have separate checkpoints for each buffer partition. You would have to start recovery from the oldest checkpoint of any of the partitions.
greg
On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>>I don't think we know that. The server might have crashed before that
>>second record got generated. (This appears to be an unfixable flaw in
>>this proposal.)
>
> The second record is generated before the checkpoint is finished and the
> checkpoint record is written. So it will be there.
>
> (if you crash before the checkpoint is finished, the in-progress
> checkpoint is no good for recovery anyway, and won't be used)

Hmm, I see.

It's not great to have to generate WAL at buffer-eviction time,
though. Normally, when we go to evict a buffer, the WAL is already
written. We might have to wait for it to be flushed, but if the WAL
writer is doing its job, hopefully not. But here we'll definitely
have to wait for the WAL flush.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 25 May 2014 17:52, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Here's an idea I tried to explain to Andres and Simon at the pub last night,
> on how to reduce the spikes in the amount of WAL written at beginning of a
> checkpoint that full-page writes cause. I'm just writing this down for the
> sake of the archives; I'm not planning to work on this myself.
...

Thanks for that idea, and dinner. It looks useful.

I'll call this idea "Background FPWs".

> Now, I'm sure there are issues with this scheme I haven't thought about, but
> I wanted to get this written down. Note this does not reduce the overall WAL
> volume - on the contrary - but it ought to reduce the spike.

The requirements we were discussing were around

A) reducing WAL volume
B) reducing foreground overhead of writing FPWs - which spikes badly
after checkpoint and the overhead is paid by the user processes
themselves
C) need for FPWs during base backup

So that gives us a few approaches

* Compressing FPWs gives A
* Background FPWs gives us B
which look like we can combine both ideas

* Double-buffering would give us A and B, but not C
and would be incompatible with other two ideas

Will think some more.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 27, 2014 at 3:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 May 2014 17:52, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
>> Here's an idea I tried to explain to Andres and Simon at the pub last night,
>> on how to reduce the spikes in the amount of WAL written at beginning of a
>> checkpoint that full-page writes cause. I'm just writing this down for the
>> sake of the archives; I'm not planning to work on this myself.
> ...
>
> Thanks for that idea, and dinner. It looks useful.
>
> I'll call this idea "Background FPWs"
>
>> Now, I'm sure there are issues with this scheme I haven't thought about, but
>> I wanted to get this written down. Note this does not reduce the overall WAL
>> volume - on the contrary - but it ought to reduce the spike.
>
> The requirements we were discussing were around
>
> A) reducing WAL volume
> B) reducing foreground overhead of writing FPWs - which spikes badly
> after checkpoint and the overhead is paid by the user processes
> themselves
> C) need for FPWs during base backup
>
> So that gives us a few approaches
>
> * Compressing FPWs gives A
> * Background FPWs gives us B
> which look like we can combine both ideas
>
> * Double-buffering would give us A and B, but not C
> and would be incompatible with other two ideas

Double-buffering would allow us to disable FPW safely, but it would make
recovery slow. So if we adopt double-buffering, I think that we would also
need to overhaul the recovery.

Regards,

--
Fujii Masao
On 05/26/2014 11:15 PM, Robert Haas wrote:
> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> I don't think we know that. The server might have crashed before that
>>> second record got generated. (This appears to be an unfixable flaw in
>>> this proposal.)
>>
>> The second record is generated before the checkpoint is finished and the
>> checkpoint record is written. So it will be there.
>>
>> (if you crash before the checkpoint is finished, the in-progress
>> checkpoint is no good for recovery anyway, and won't be used)
>
> Hmm, I see.
>
> It's not great to have to generate WAL at buffer-eviction time,
> though. Normally, when we go to evict a buffer, the WAL is already
> written. We might have to wait for it to be flushed, but if the WAL
> writer is doing its job, hopefully not. But here we'll definitely
> have to wait for the WAL flush.

Yeah. You would want to batch the flushes somehow, instead of flushing
the WAL for every buffer being flushed. For example, after writing the
FPW WAL record, just continue with the checkpoint without flushing the
buffer, and do a second pass later doing the buffer flushes.

- Heikki
On 05/26/2014 02:26 PM, Greg Stark wrote:
> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
> wrote:
>
>> The second record is generated before the checkpoint is finished and the
>> checkpoint record is written. So it will be there.
>>
>> (if you crash before the checkpoint is finished, the in-progress
>> checkpoint is no good for recovery anyway, and won't be used)
>
> Another idea would be to have separate checkpoints for each buffer
> partition. You would have to start recovery from the oldest checkpoint of
> any of the partitions.

Yeah. Simon suggested that when we talked about this, but I didn't
understand how that works at the time. I think I do now. The key to
making it work is distinguishing, when starting recovery from the latest
checkpoint, whether a record for a given page can be replayed safely. I
used flags on WAL records in my proposal to achieve this, but using
buffer partitions is simpler.

For simplicity, let's imagine that we have two Redo-pointers for each
checkpoint record: one for even-numbered pages, and another for
odd-numbered pages. When checkpoint begins, we first update the Even-redo
pointer to the current WAL insert location, and then flush all the
even-numbered buffers in the buffer cache. Then we do the same for Odd.

Recovery begins at the Even-redo pointer. Replay works as normal, but
until you reach the Odd-pointer, you refrain from replaying any changes
to odd-numbered pages. After reaching the Odd-pointer, you replay
everything as normal.

Hmm, that seems actually doable...

- Heikki
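The two-redo-pointer recovery rule above can be sketched as a replay filter. This is illustrative Python with invented names and plain-integer LSNs, not actual recovery code:

```python
# Hypothetical sketch of the even/odd redo-pointer scheme: replay starts
# at the Even redo pointer, and changes to odd-numbered pages are
# suppressed until the Odd redo pointer is reached.

def replay(records, even_redo, odd_redo):
    """records: list of (lsn, page_no) tuples in WAL order.
    Returns the records that actually get applied."""
    applied = []
    for lsn, page in records:
        if lsn < even_redo:
            continue                      # before the replay start point
        if page % 2 == 1 and lsn < odd_redo:
            continue                      # odd page flushed later; unsafe
        applied.append((lsn, page))
    return applied
```

An odd page's record between the two pointers is skipped because the checkpoint flushed that page after the record was written, so the on-disk copy already includes the change.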
On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>
> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>
>>> Another idea would be to have separate checkpoints for each buffer
>> partition. You would have to start recovery from the oldest checkpoint of
>> any of the partitions.
>
> Yeah. Simon suggested that when we talked about this, but I didn't
> understand how that works at the time. I think I do now. The key to making
> it work is distinguishing, when starting recovery from the latest
> checkpoint, whether a record for a given page can be replayed safely. I
> used flags on WAL records in my proposal to achieve this, but using buffer
> partitions is simpler.

Interesting. I just thought of it independently.

Incidentally you wouldn't actually want to use the buffer partitions per
se, since the new server might start up with a different number of
partitions. You would want an algorithm for partitioning the block space
that xlog replay can reliably reproduce regardless of the size of the
buffer lock partition table. It might make sense to set it up so it
coincidentally ensures all the buffers being flushed are in the same
partition, or maybe the reverse would be better. Probably it doesn't
actually matter.

> For simplicity, let's imagine that we have two Redo-pointers for each
> checkpoint record: one for even-numbered pages, and another for
> odd-numbered pages. When checkpoint begins, we first update the Even-redo
> pointer to the current WAL insert location, and then flush all the
> even-numbered buffers in the buffer cache. Then we do the same for Odd.

Hm, I had convinced myself that the LSN on the pages would mean you skip
the replay anyways, but I think I was wrong: you would need to keep a
bitmap of which partitions were in recovery mode as you replay, keep
adding partitions until they're all in recovery mode, and then keep going
until you've seen the checkpoint record for all of them.
I'm assuming you would keep N checkpoint positions in the control file.
That also means we can double the checkpoint timeout with only a marginal
increase in the worst-case recovery time, since the worst case will be
(1 + 1/n)*timeout's worth of WAL to replay rather than 2*n. The amount of
time for recovery would be much more predictable.

> Recovery begins at the Even-redo pointer. Replay works as normal, but
> until you reach the Odd-pointer, you refrain from replaying any changes to
> odd-numbered pages. After reaching the Odd-pointer, you replay everything
> as normal.
>
> Hmm, that seems actually doable...

--
greg
On 05/27/2014 02:42 PM, Greg Stark wrote:
> On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>
>> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>>
>>>> Another idea would be to have separate checkpoints for each buffer
>>> partition. You would have to start recovery from the oldest checkpoint of
>>> any of the partitions.
>>
>> Yeah. Simon suggested that when we talked about this, but I didn't
>> understand how that works at the time. I think I do now. The key to making
>> it work is distinguishing, when starting recovery from the latest
>> checkpoint, whether a record for a given page can be replayed safely. I
>> used flags on WAL records in my proposal to achieve this, but using buffer
>> partitions is simpler.
>
> Interesting. I just thought of it independently.
>
> Incidentally you wouldn't actually want to use the buffer partitions
> per se since the new server might start up with a different number of
> partitions. You would want an algorithm for partitioning the block
> space that xlog replay can reliably reproduce regardless of the size
> of the buffer lock partition table. It might make sense to set it up
> so it coincidentally ensures all the buffers being flushed are in the
> same partition or maybe the reverse would be better. Probably it
> doesn't actually matter.

Since you will be flushing the buffers one "redo partition" at a time,
you would want to allow the OS to merge the writes within a partition as
much as possible. So my even-odd split would in fact be pretty bad. Some
sort of striping, e.g. mapping each contiguous 1 MB chunk to the same
partition, would be better.

> I'm assuming you would keep N checkpoint positions in the control
> file. That also means we can double the checkpoint timeout with only a
> marginal increase in the worst case recovery time. Since the worst
> case will be (1 + 1/n)*timeout's worth of wal to replay rather than
> 2*n. The amount of time for recovery would be much more predictable.

Good point.
- Heikki
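The striping idea can be sketched as a mapping from block numbers to redo partitions. Illustrative Python: the 8 kB page size matches PostgreSQL's default BLCKSZ, but the function name and the notion of a fixed partition count are invented for the example:

```python
# Hypothetical sketch: map each contiguous 1 MB chunk of a relation
# (128 pages of 8 kB) to one "redo partition", so buffers flushed for
# one partition's mini-checkpoint stay physically clustered, unlike the
# even/odd split.

BLCKSZ = 8192
PAGES_PER_CHUNK = (1024 * 1024) // BLCKSZ   # 128 pages per 1 MB chunk

def redo_partition(block_no, n_partitions):
    """Deterministic mapping that replay can reproduce regardless of
    the server's buffer-lock partition table size."""
    return (block_no // PAGES_PER_CHUNK) % n_partitions
```

Because the mapping depends only on the block number and a fixed partition count, a standby or a restarted server computes the same partition for every page, which is the property Greg asked for.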
On 27 May 2014 03:49, Fujii Masao <masao.fujii@gmail.com> wrote:
>> So that gives us a few approaches
>>
>> * Compressing FPWs gives A
>> * Background FPWs gives us B
>> which look like we can combine both ideas
>>
>> * Double-buffering would give us A and B, but not C
>> and would be incompatible with other two ideas
>
> Double-buffering would allow us to disable FPW safely but which would make
> a recovery slow. So if we adopt double-buffering, I think that we would also
> need to overhaul the recovery.

Which is also true of Background FPWs.

So our options are

1. Compressed FPWs only
2. Compressed FPWs plus Background FPWs plus Recovery Buffer Prefetch
3. Double Buffering plus Recovery Buffer Prefetch

IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
the reason you also discussed changing the WAL record format to allow
us to identify the blocks touched by recovery more easily?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 05/27/2014 03:18 PM, Simon Riggs wrote:
> IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
> the reason you also discussed changing the WAL record format to allow
> us to identify the blocks touched by recovery more easily?

Yeah, that was one use case I had in mind for the WAL format changes. See
http://www.postgresql.org/message-id/533D6CBF.6080203@vmware.com.

- Heikki
On 27 May 2014 07:42, Greg Stark <stark@mit.edu> wrote:
> On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>
>> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>>
>>>> Another idea would be to have separate checkpoints for each buffer
>>> partition. You would have to start recovery from the oldest checkpoint of
>>> any of the partitions.
>>
>> Yeah. Simon suggested that when we talked about this, but I didn't
>> understand how that works at the time. I think I do now. The key to making
>> it work is distinguishing, when starting recovery from the latest
>> checkpoint, whether a record for a given page can be replayed safely. I
>> used flags on WAL records in my proposal to achieve this, but using buffer
>> partitions is simpler.
>
> Interesting. I just thought of it independently.

Actually, I heard it from Doug Tolbert in 2005, based on how another DBMS
coped with that issue.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 26, 2014 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> I don't think we know that. The server might have crashed before that
>>> second record got generated. (This appears to be an unfixable flaw in
>>> this proposal.)
>>
>> The second record is generated before the checkpoint is finished and the
>> checkpoint record is written. So it will be there.
>>
>> (if you crash before the checkpoint is finished, the in-progress
>> checkpoint is no good for recovery anyway, and won't be used)
>
> Hmm, I see.
>
> It's not great to have to generate WAL at buffer-eviction time,
> though. Normally, when we go to evict a buffer, the WAL is already
> written. We might have to wait for it to be flushed, but if the WAL
> writer is doing its job, hopefully not. But here we'll definitely
> have to wait for the WAL flush.
I'm not sure we do need to flush it. If the checkpoint finishes, then the WAL surely got flushed as part of the process of recording the end of the checkpoint. If the checkpoint does not finish, recovery will start from the previous checkpoint, which does contain the FPW (because if it didn't, the page would not be eligible for this treatment) and so the possibly torn page will get overwritten in full.
Cheers,
Jeff
On Tue, May 27, 2014 at 1:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, May 27, 2014 at 3:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > The requirements we were discussing were around
> >
> > A) reducing WAL volume
> > B) reducing foreground overhead of writing FPWs - which spikes badly
> > after checkpoint and the overhead is paid by the user processes
> > themselves
> > C) need for FPWs during base backup
> >
> > So that gives us a few approaches
> >
> > * Compressing FPWs gives A
> > * Background FPWs gives us B
> > which look like we can combine both ideas
> >
> > * Double-buffering would give us A and B, but not C
> > and would be incompatible with other two ideas
>
> Double-buffering would allow us to disable FPW safely but which would make
> a recovery slow.
Is it due to the fact that during recovery, it needs to check the
contents of the double buffer as well as the page in the original
location for consistency, or is there something else that will lead
to slow recovery?

Won't DBW (double buffer write) reduce the number of pages that need
to be read from disk as compared to FPW, which would offset the
performance degradation due to any other impact?

IIUC, in the DBW mechanism we need to have a temporary sequential
log file of fixed size which will be used to write data before the data
gets written to its actual location in the tablespace. Now, as the
temporary log file is of fixed size, the number of pages that need to
be read during recovery should be less as compared to FPW, because with
FPW it needs to read all the pages written in the WAL log after the
last successful checkpoint.
On 27 May 2014 13:20, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 05/27/2014 03:18 PM, Simon Riggs wrote:
>>
>> IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
>> the reason you also discussed changing the WAL record format to allow
>> us to identify the blocks touched by recovery more easily?
>
> Yeah, that was one use case I had in mind for the WAL format changes. See
> http://www.postgresql.org/message-id/533D6CBF.6080203@vmware.com.

Those proposals suggest some very big changes to the way WAL works.

Prefetch can work easily enough for most records - do we really need
that much churn?

You mentioned Btree vacuum records, but I'm planning to optimize those
another way.

Why don't we just have the prefetch code in core and forget the WAL
format changes?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 27 May 2014 18:18, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, May 26, 2014 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>> >>I don't think we know that. The server might have crashed before that
>> >>second record got generated. (This appears to be an unfixable flaw in
>> >>this proposal.)
>> >
>> > The second record is generated before the checkpoint is finished and the
>> > checkpoint record is written. So it will be there.
>> >
>> > (if you crash before the checkpoint is finished, the in-progress
>> > checkpoint is no good for recovery anyway, and won't be used)
>>
>> Hmm, I see.
>>
>> It's not great to have to generate WAL at buffer-eviction time,
>> though. Normally, when we go to evict a buffer, the WAL is already
>> written. We might have to wait for it to be flushed, but if the WAL
>> writer is doing its job, hopefully not. But here we'll definitely
>> have to wait for the WAL flush.
>
> I'm not sure we do need to flush it. If the checkpoint finishes, then the
> WAL surely got flushed as part of the process of recording the end of the
> checkpoint. If the checkpoint does not finish, recovery will start from the
> previous checkpoint, which does contain the FPW (because if it didn't, the
> page would not be eligible for this treatment) and so the possibly torn page
> will get overwritten in full.

I think Robert is correct: you would need to flush WAL before writing
the disk buffer. That is the current invariant of WAL before data.

However, we don't need to do this in the simple way,
FPW-flush-buffer; we can do it with more buffering. So it seems like a
reasonable idea to do this using a 64-buffer BulkAccessStrategy object
and flush the WAL every 64 buffers.

That's beginning to look more like double buffering though...

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
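The batching idea above can be sketched as follows. Illustrative Python with invented names, not PostgreSQL code; it only models how many (expensive) WAL flush calls the batched scheme needs, with the actual FPW emission and page writes left as comments:

```python
# Hypothetical sketch of Simon's batching: instead of one WAL flush per
# evicted buffer, accumulate FPW records for a batch of buffers (e.g. 64),
# flush the WAL once up to the highest FPW LSN, then write the data pages.

BATCH_SIZE = 64

def flush_batched(buffers_needing_fpw):
    """Return the number of WAL flushes needed for the given buffers."""
    wal_flushes = 0
    for i in range(0, len(buffers_needing_fpw), BATCH_SIZE):
        batch = buffers_needing_fpw[i:i + BATCH_SIZE]
        # 1. emit an FPW WAL record for each buffer in the batch
        # 2. one WAL flush covers the whole batch (WAL-before-data holds,
        #    because no data page is written until this flush returns)
        wal_flushes += 1
        # 3. now the data-page writes for the whole batch are safe
    return wal_flushes
```

So 130 dirty buffers cost three WAL flushes instead of 130, while the WAL-before-data invariant Robert pointed out is preserved for every page.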
On 05/28/2014 09:41 AM, Simon Riggs wrote:
> On 27 May 2014 13:20, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> On 05/27/2014 03:18 PM, Simon Riggs wrote:
>>>
>>> IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
>>> the reason you also discussed changing the WAL record format to allow
>>> us to identify the blocks touched by recovery more easily?
>>
>> Yeah, that was one use case I had in mind for the WAL format changes. See
>> http://www.postgresql.org/message-id/533D6CBF.6080203@vmware.com.
>
> Those proposals suggest some very big changes to the way WAL works.
>
> Prefetch can work easily enough for most records - do we really need
> that much churn?
>
> You mentioned Btree vacuum records, but I'm planning to optimize those
> another way.
>
> Why don't we just have the prefetch code in core and forget the WAL
> format changes?

Well, the prefetching was just one example of why the proposed WAL
format changes are a good idea. The changes will make life easier for
any external (or internal, for that matter) tool that wants to read WAL
records. The thing that finally really got me into doing that was
pg_rewind. For pg_rewind it's not enough to cover most records; you have
to catch all modifications to data pages for correctness, and that's
difficult to maintain as new WAL record types are added and old ones are
modified in every release.

Also, the changes make WAL-logging and -replaying code easier to write,
which reduces the potential for bugs.

- Heikki
On Tue, May 27, 2014 at 8:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Since you will be flushing the buffers one "redo partition" at a time, you
> would want to allow the OS to merge the writes within a partition as much
> as possible. So my even-odd split would in fact be pretty bad. Some sort of
> striping, e.g. mapping each contiguous 1 MB chunk to the same partition,
> would be better.

I suspect you'd actually want to stripe by segment (1GB partition). If
you striped by 1MB partitions, there might still be writes to the parts
of the file you weren't checkpointing that would be flushed by the
fsync(). That would lead to more physical I/O overall, if those pages
were written again before you did the next half-checkpoint.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 28, 2014 at 1:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, May 27, 2014 at 1:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Tue, May 27, 2014 at 3:57 PM, Simon Riggs <simon@2ndquadrant.com>
>> wrote:
>> > The requirements we were discussing were around
>> >
>> > A) reducing WAL volume
>> > B) reducing foreground overhead of writing FPWs - which spikes badly
>> > after checkpoint and the overhead is paid by the user processes
>> > themselves
>> > C) need for FPWs during base backup
>> >
>> > So that gives us a few approaches
>> >
>> > * Compressing FPWs gives A
>> > * Background FPWs gives us B
>> > which look like we can combine both ideas
>> >
>> > * Double-buffering would give us A and B, but not C
>> > and would be incompatible with other two ideas
>>
>> Double-buffering would allow us to disable FPW safely but which would make
>> a recovery slow.
>
> Is it due to the fact that during recovery, it needs to check the
> contents of double buffer as well as the page in original location
> for consistency or there is something else also which will lead
> to slow recovery?
>
> Won't DBW (double buffer write) reduce the need for number of
> pages that needs to be read from disk as compare to FPW which
> will suffice the performance degradation due to any other impact?
>
> IIUC in DBW mechanism, we need to have a temporary sequential
> log file of fixed size which will be used to write data before the data
> gets written to its actual location in tablespace. Now as the temporary
> log file is of fixed size, the number of pages that needs to be read
> during recovery should be less as compare to FPW because in FPW
> it needs to read all the pages written in WAL log after last successful
> checkpoint.

Hmm... maybe I'm misunderstanding how WAL replay works in the DBW case.
Imagine the case where we try to replay two WAL records for page A, and
the page has not been cached in shared_buffers yet.
If FPW is enabled, the first WAL record is a FPW, and its full-page image is
simply restored into shared_buffers. The page doesn't need to be read from
disk. Then the second WAL record is applied.

OTOH, in the DBW case, how does this example work? I was thinking that first
we try to apply the first WAL record but find that page A doesn't exist in
shared_buffers yet. We read the page from disk, check whether its CRC is
valid, and read the same page from the double buffer if it's invalid. After
reading the page into shared_buffers, the first WAL record can be applied.
Then the second WAL record is applied. Is my understanding right?

Regards,

-- 
Fujii Masao
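As a minimal illustration of the FPW side of this example (a sketch, not
PostgreSQL's actual replay code; all names are made up), the first record's
full-page image replaces the whole buffer, so no disk read is needed before
the second, differential record is applied:

```c
#include <assert.h>
#include <string.h>

#define BLCKSZ 8192

typedef struct
{
    char data[BLCKSZ];
    int  valid;          /* is the buffer contents usable? */
} Buffer;

/* Replay a full-page-image record: the whole page is reconstructed
 * from WAL, so the on-disk (possibly torn) copy is never read. */
static void replay_fpw(Buffer *buf, const char *image)
{
    memcpy(buf->data, image, BLCKSZ);
    buf->valid = 1;
}

/* Replay a differential record: requires the page to be present. */
static void replay_update(Buffer *buf, int off, char byte)
{
    assert(buf->valid);
    buf->data[off] = byte;
}
```

In the DBW case there is no FPW record, so that first replay step would have
to read the page from disk (falling back to the double buffer on a CRC
mismatch) before the differential record could be applied.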
On Mon, Jun 2, 2014 at 6:04 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, May 28, 2014 at 1:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > IIUC in DBW mechanism, we need to have a temporary sequential
> > log file of fixed size which will be used to write data before the data
> > gets written to its actual location in tablespace. Now as the temporary
> > log file is of fixed size, the number of pages that needs to be read
> > during recovery should be less as compared to FPW because in FPW
> > it needs to read all the pages written in WAL log after last successful
> > checkpoint.
>
> Hmm... maybe I'm misunderstanding how WAL replay works in DBW case.
> Imagine the case where we try to replay two WAL records for the page A and
> the page has not been cached in shared_buffers yet. If FPW is enabled,
> the first WAL record is FPW and firstly it's just read to shared_buffers.
> The page doesn't need to be read from the disk. Then the second WAL record
> will be applied.
>
> OTOH, in DBW case, how does this example case work? I was thinking that
> firstly we try to apply the first WAL record but find that the page A doesn't
> exist in shared_buffers yet. We try to read the page from the disk, check
> whether its CRC is valid or not, and read the same page from double buffer
> if it's invalid. After reading the page into shared_buffers, the first WAL
> record can be applied. Then the second WAL record will be applied. Is my
> understanding right?
I think the way DBW works is that before reading WAL, it first makes the
data pages consistent. It checks the doublewrite buffer contents against
the pages in their original locations: if a page is inconsistent (torn)
inside the doublewrite buffer, it is simply discarded; if it is
inconsistent in the tablespace, it is recovered from the doublewrite
buffer. After reaching the end of the doublewrite buffer, it starts
reading WAL.
So in the above example, it reads the first record from WAL; if the page
is already in shared_buffers it applies the WAL change, otherwise it reads
the page into shared_buffers and then applies the WAL. For the second
record, it doesn't need to read the page.
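A rough sketch of that recovery pass as described above (illustrative C
with a toy checksum standing in for the CRC; none of these names are
actual PostgreSQL or InnoDB APIs): torn copies inside the doublewrite
buffer are discarded, torn tablespace pages are restored from intact
copies, and only then does WAL replay begin.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ    64    /* shrunken page size for the sketch */
#define DWB_SLOTS 4     /* the doublewrite buffer is fixed-size */

typedef struct
{
    uint32_t blkno;         /* original location in the tablespace */
    uint8_t  data[BLCKSZ];
    uint32_t crc;           /* checksum of data */
} Page;

/* Toy checksum standing in for the page CRC. */
static uint32_t page_crc(const uint8_t *d)
{
    uint32_t c = 0;
    for (int i = 0; i < BLCKSZ; i++)
        c = c * 31 + d[i];
    return c;
}

/* Make tablespace pages consistent before WAL replay starts. */
static void dwb_recover(Page dwb[], Page table[], int nslots_used)
{
    for (int i = 0; i < nslots_used; i++)
    {
        if (page_crc(dwb[i].data) != dwb[i].crc)
            continue;                 /* torn DWB copy: discard it */
        Page *t = &table[dwb[i].blkno];
        if (page_crc(t->data) != t->crc)
            *t = dwb[i];              /* torn table page: restore it */
    }
}
```

A page cannot be torn in both places: the tablespace write only starts
after its doublewrite copy is durable, so at least one intact copy exists.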
The saving during recovery comes from the fact that with DBW it does not
read the FPI from WAL, just the two much smaller records (it has to read a
WAL page, but that page will contain many records). So it seems to be a
net win.
Also, with DBW the extra work done (reading the doublewrite buffer and
checking its consistency against the actual pages) is bounded, because the
size of the doublewrite buffer is fixed, so its impact should be much less
than reading all the FPIs written to WAL after the last successful
checkpoint.
If my understanding above is right, recovery performance should be better
with DBW in most cases.
I think the case where DBW needs care is when there are a lot of backend
evictions: in such scenarios a backend might itself need to write both to
the doublewrite buffer and to the actual page. This can have a larger
impact during bulk reads (when hint bits have to be set) and during
Vacuum, which runs in a ring buffer.
One improvement that could be made here is to change the buffer eviction
algorithm so that it passes over buffers that would need to be written to
the doublewrite buffer. There could be other improvements as well,
depending on the DBW implementation.
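A hypothetical sketch of that eviction tweak (illustrative C; the flag,
struct, and function are made up, not PostgreSQL's buffer manager): a clock
sweep that prefers victims not flagged as needing a doublewrite, so a
backend avoids paying the extra DWB write on eviction when it can, with a
fallback pass so eviction never fails outright.

```c
#include <assert.h>
#include <stdbool.h>

#define NBUFFERS 8

typedef struct
{
    bool needs_dwb;   /* would require a doublewrite before flushing */
    bool pinned;      /* cannot be evicted at all */
} BufDesc;

/* Return a victim buffer index: skip DWB-pending buffers on the first
 * full sweep, fall back to any unpinned buffer on the second. */
static int choose_victim(BufDesc bufs[NBUFFERS], int *clock_hand)
{
    for (int pass = 0; pass < 2; pass++)
        for (int i = 0; i < NBUFFERS; i++)
        {
            int idx = (*clock_hand + i) % NBUFFERS;
            if (bufs[idx].pinned)
                continue;
            if (pass == 0 && bufs[idx].needs_dwb)
                continue;            /* first pass: avoid DWB write */
            *clock_hand = (idx + 1) % NBUFFERS;
            return idx;
        }
    return -1;  /* no evictable buffer */
}
```

The fallback pass matters for the bulk-read and vacuum cases above, where
most buffers in the ring may be DWB-pending at once.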