Обсуждение: PageIsAllVisible()'s trustworthiness in Hot Standby
 261     /*
 262      * If the all-visible flag indicates that all tuples on the page are
 263      * visible to everyone, we can skip the per-tuple visibility tests. But
 264      * not in hot standby mode. A tuple that's already visible to all
 265      * transactions in the master might still be invisible to a read-only
 266      * transaction in the standby.
 267      */
 268     all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;
Isn't the check for !snapshot->takenDuringRecovery redundant now in master or whenever since we added crash-safety for VM ? In fact, this comment made me think if we are really handling index-only scans correctly or not on the Hot Standby. But apparently we are by forcing conflicting transactions to abort before redoing VM bit set operation on the standby. The same mechanism should protect us against the above case. Now I concede that the entire magic around setting and clearing the page level all-visible bit and the VM bit and our ability to keep them in sync is something I don't fully understand, but I see that every operation that sets the page level PD_ALL_VISIBLE flag also sets the VM bit while holding the buffer lock and emits a WAL record. So AFAICS  the conflict resolution logic will take care of the above too.
			
		On 2012-12-04 18:38:11 +0530, Pavan Deolasee wrote: > I was looking at the following code in heapam.c: > > 261 /* > 262 * If the all-visible flag indicates that all tuples on the page > are > 263 * visible to everyone, we can skip the per-tuple visibility > tests. But > 264 * not in hot standby mode. A tuple that's already visible to all > 265 * transactions in the master might still be invisible to a > read-only > 266 * transaction in the standby. > 267 */ > 268 all_visible = PageIsAllVisible(dp) && > !snapshot->takenDuringRecovery; > > Isn't the check for !snapshot->takenDuringRecovery redundant now in master > or whenever since we added crash-safety for VM ? In fact, this comment made > me think if we are really handling index-only scans correctly or not on the > Hot Standby. But apparently we are by forcing conflicting transactions to > abort before redoing VM bit set operation on the standby. The same > mechanism should protect us against the above case. Now I concede that the > entire magic around setting and clearing the page level all-visible bit and > the VM bit and our ability to keep them in sync is something I don't fully > understand, but I see that every operation that sets the page > level PD_ALL_VISIBLE flag also sets the VM bit while holding the buffer > lock and emits a WAL record. So AFAICS the conflict resolution logic will > take care of the above too. Yes, that really seems strange. As you say, were already relying on the VM to be correct. I think it was simply missed that we can rely on this since the conflict handling on all-visible was added in 3424bff90f40532527b9cf4f2ad9eaff750682f7. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 4, 2012 at 8:08 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > > I was looking at the following code in heapam.c: > > 261 /* > 262 * If the all-visible flag indicates that all tuples on the page > are > 263 * visible to everyone, we can skip the per-tuple visibility tests. > But > 264 * not in hot standby mode. A tuple that's already visible to all > 265 * transactions in the master might still be invisible to a > read-only > 266 * transaction in the standby. > 267 */ > 268 all_visible = PageIsAllVisible(dp) && > !snapshot->takenDuringRecovery; > > Isn't the check for !snapshot->takenDuringRecovery redundant now in master > or whenever since we added crash-safety for VM ? In fact, this comment made > me think if we are really handling index-only scans correctly or not on the > Hot Standby. But apparently we are by forcing conflicting transactions to > abort before redoing VM bit set operation on the standby. The same mechanism > should protect us against the above case. Now I concede that the entire > magic around setting and clearing the page level all-visible bit and the VM > bit and our ability to keep them in sync is something I don't fully > understand, but I see that every operation that sets the page level > PD_ALL_VISIBLE flag also sets the VM bit while holding the buffer lock and > emits a WAL record. So AFAICS the conflict resolution logic will take care > of the above too. I wasn't sure whether that could be safely changed. There's a subtle distinction here: the PD_ALL_VISIBLE bit isn't the same as the visibility map bit. And, technically, the WAL record only fully protects the setting of *the visibility map bit* not the PD_ALL_VISIBLE page-level bit. The purpose of that WAL logging is to make sure that the page-level bit is never clear while the visibility-map bit is set; it does not guarantee that the page-level bit can never be set without issuing a WAL record. So, for example, it's perfectly possible for a crash on the master might leave the page-level bit set while the VM bit is clear. Now, if that page somehow makes its way to the standby - via a base backup or a full-page image - before the tuples it contains are all-visible according to the standby's xmin horizon, we've got a problem. Can that happen? It seems unlikely, but can we prove it's not possible? Perhaps, but I wasn't sure. Index-only scans are safe, because they're looking at the visibility map itself, not the page-level bit, but the analysis is a little murkier for sequential scans. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2012-12-04 08:38:48 -0500, Robert Haas wrote:
> On Tue, Dec 4, 2012 at 8:08 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> >
> > I was looking at the following code in heapam.c:
> >
> >  261     /*
> >  262      * If the all-visible flag indicates that all tuples on the page
> > are
> >  263      * visible to everyone, we can skip the per-tuple visibility tests.
> > But
> >  264      * not in hot standby mode. A tuple that's already visible to all
> >  265      * transactions in the master might still be invisible to a
> > read-only
> >  266      * transaction in the standby.
> >  267      */
> >  268     all_visible = PageIsAllVisible(dp) &&
> > !snapshot->takenDuringRecovery;
> >
> > Isn't the check for !snapshot->takenDuringRecovery redundant now in master
> > or whenever since we added crash-safety for VM ? In fact, this comment made
> > me think if we are really handling index-only scans correctly or not on the
> > Hot Standby. But apparently we are by forcing conflicting transactions to
> > abort before redoing VM bit set operation on the standby. The same mechanism
> > should protect us against the above case. Now I concede that the entire
> > magic around setting and clearing the page level all-visible bit and the VM
> > bit and our ability to keep them in sync is something I don't fully
> > understand, but I see that every operation that sets the page level
> > PD_ALL_VISIBLE flag also sets the VM bit while holding the buffer lock and
> > emits a WAL record. So AFAICS  the conflict resolution logic will take care
> > of the above too.
>
> I wasn't sure whether that could be safely changed.  There's a subtle
> distinction here: the PD_ALL_VISIBLE bit isn't the same as the
> visibility map bit.  And, technically, the WAL record only fully
> protects the setting of *the visibility map bit* not the
> PD_ALL_VISIBLE page-level bit.  The purpose of that WAL logging is to
> make sure that the page-level bit is never clear while the
> visibility-map bit is set; it does not guarantee that the page-level
> bit can never be set without issuing a WAL record.  So, for example,
> it's perfectly possible for a crash on the master might leave the
> page-level bit set while the VM bit is clear.  Now, if that page
> somehow makes its way to the standby - via a base backup or a
> full-page image - before the tuples it contains are all-visible
> according to the standby's xmin horizon, we've got a problem.  Can
> that happen?  It seems unlikely, but can we prove it's not possible?
> Perhaps, but I wasn't sure.
Youre right, it currently seems to be possible, there's no LSN interlock
prohibiting this as far as I can see.
in lazy_scan_heap we do:
        if (!PageIsAllVisible(page))        {            PageSetAllVisible(page);            MarkBufferDirty(buf);
     visibilitymap_set(onerel, blkno, InvalidXLogRecPtr, vmbuffer,                              visibility_cutoff_xid);
      }
 
So buf can be written out independently from the xlog record that
visibilitymap_set emits. ISTM we should set the LSN of the heap page as
well as the one of the visibilitymap page. Not sure about the
performance impact of that.
Greetings,
Andres Freund
-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services
			
		On Tue, Dec 4, 2012 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Youre right, it currently seems to be possible, there's no LSN interlock
> prohibiting this as far as I can see.
Yeah, there certainly isn't that.  Now you could perhaps make an
argument that no operation that can propagate a set bit from master to
standby can arrive until after the standby's xmin horizon is
sufficiently current, but that does feel a little fragile to me...
even if it's true now, new WAL record types might break it, for
example.
> in lazy_scan_heap we do:
>
>                         if (!PageIsAllVisible(page))
>                         {
>                                 PageSetAllVisible(page);
>                                 MarkBufferDirty(buf);
>                                 visibilitymap_set(onerel, blkno, InvalidXLogRecPtr, vmbuffer,
>                                                                   visibility_cutoff_xid);
>                         }
>
> So buf can be written out independently from the xlog record that
> visibilitymap_set emits. ISTM we should set the LSN of the heap page as
> well as the one of the visibilitymap page. Not sure about the
> performance impact of that.
It would result in a massive increase in WAL traffic when vacuuming an
insert-only table; that's why we didn't do it originally.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
			
		On 2012-12-04 09:33:28 -0500, Robert Haas wrote:
> On Tue, Dec 4, 2012 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Youre right, it currently seems to be possible, there's no LSN interlock
> > prohibiting this as far as I can see.
>
> Yeah, there certainly isn't that.  Now you could perhaps make an
> argument that no operation that can propagate a set bit from master to
> standby can arrive until after the standby's xmin horizon is
> sufficiently current, but that does feel a little fragile to me...
> even if it's true now, new WAL record types might break it, for
> example.
I wouldn't want to rely on it...
> > in lazy_scan_heap we do:
> >
> >                         if (!PageIsAllVisible(page))
> >                         {
> >                                 PageSetAllVisible(page);
> >                                 MarkBufferDirty(buf);
> >                                 visibilitymap_set(onerel, blkno, InvalidXLogRecPtr, vmbuffer,
> >                                                                   visibility_cutoff_xid);
> >                         }
> >
> > So buf can be written out independently from the xlog record that
> > visibilitymap_set emits. ISTM we should set the LSN of the heap page as
> > well as the one of the visibilitymap page. Not sure about the
> > performance impact of that.
>
> It would result in a massive increase in WAL traffic when vacuuming an
> insert-only table; that's why we didn't do it originally.
I wonder if we could solve that by having an in-memory-only LSN that
only interlocks the hint bit writes, but doesn't cause full page
writes...
Greetings,
Andres Freund
--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services
			
		On Tue, Dec 4, 2012 at 10:38 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I wonder if we could solve that by having an in-memory-only LSN that > only interlocks the hint bit writes, but doesn't cause full page > writes... It's not really a hint bit, because if it fails to get set when the visibility map bit gets set, you've got queries returning wrong answers, because the next insert/update/delete on the heap page will fail to clear the visibility-map bit. But leaving that aside, I think that might work. You'd essentially be preventing the page from being written out of shared_buffers until the WAL record has hit the disk, and it seems like that should be sufficient. Whether it's worth adding that much mechanism for this problem, I'm less sure about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Dec 4, 2012 at 8:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 4, 2012 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:Yeah, there certainly isn't that. Now you could perhaps make an
> Youre right, it currently seems to be possible, there's no LSN interlock
> prohibiting this as far as I can see.
argument that no operation that can propagate a set bit from master to
standby can arrive until after the standby's xmin horizon is
sufficiently current, but that does feel a little fragile to me...
even if it's true now, new WAL record types might break it, for
example.
Another piece of code that caught my attention is this (sorry for digging up topics that probably have been discussed to death before and I might not have paid attention to then):
 Whenever we clear PD_ALL_VISIBLE flag, we do that while holding the heap page lock and also WAL log that action (in fact, piggy-back with insert/update/delete). The only exception that I saw is in lazy_scan_heap()
 916         else if (PageIsAllVisible(page) && has_dead_tuples)
 917         {
 918             elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
 919                  relname, blkno);
  920             PageClearAllVisible(page);
 921             MarkBufferDirty(buf);
 922             visibilitymap_clear(onerel, blkno, vmbuffer);
 923         }
Obviously, this is not a case that we expect to occur. But if it does occur for whatever reason, then even though we will correct the page flag on master, this would never be populated to the standby. So the standby may continue to have that tiny bit set on the page, at least until another insert/update/delete happens on the page. Not sure if there is anything to worry about, but looks a bit strange. I looked at the archive but can't see any report of the warning, so may be we are safe after all.
Another side issue: The way we set visibility map bits in the VACUUM, we do that in the first phase of the vacuum. Now the very fact that we have selected the page for vacuuming, it will have one or more dead tuples. As soon as we find a dead tuple in the page, we decide not to mark the page all-visible and rightly so. But this page would not get marked all-visible even if we remove all dead tuples in the second phase. Why don't we push that work to the second phase or at least retry that again ? Of course, we can't just do it that way without scanning the page again because new tuples may get added or updated/deleted in the meanwhile. But that can be addressed with some tricks - like tracking LSN of every such (which has only dead and live tuples, but not recently dead or insert-in-progress or delete-in-progress) and then comparing that LSN with the page LSN during the second phase. If an update had happened in between, we will know that and we won't deal with the visibility bits.
Another idea would be to track this intermediate state in the page header itself, something like PD_ALL_VISIBLE_OR_DEAD. If an intermediate update/delete/insert comes in, it will clear the bit. If we see the bit set during the second phase of the vacuum, we will know that it contains only dead and live tuples and the dead ones are being removed now. So we can mark the page PD_ALL_VISIBLE.
Thanks,
Pavan
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Tue, Dec 4, 2012 at 12:10 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Hmm. Yeah, I do not have guts to prove that either. I'll probably write up a
> comment for your consideration to explain why we don't trust PD_ALL_VISIBLE
> in Hot standby for seq scans, but still trust VM for index-only scans.
Sure.
> Another piece of code that caught my attention is this (sorry for digging up
> topics that probably have been discussed to death before and I might not
> have paid attention to then):
>
> Whenever we clear PD_ALL_VISIBLE flag, we do that while holding the heap
> page lock and also WAL log that action (in fact, piggy-back with
> insert/update/delete). The only exception that I saw is in lazy_scan_heap()
>
>  916         else if (PageIsAllVisible(page) && has_dead_tuples)
>  917         {
>  918             elog(WARNING, "page containing dead tuples is marked as
> all-visible in relation \"%s\" page %u",
>  919                  relname, blkno);
>  920             PageClearAllVisible(page);
>  921             MarkBufferDirty(buf);
>  922             visibilitymap_clear(onerel, blkno, vmbuffer);
>  923         }
>
> Obviously, this is not a case that we expect to occur. But if it does occur
> for whatever reason, then even though we will correct the page flag on
> master, this would never be populated to the standby. So the standby may
> continue to have that tiny bit set on the page, at least until another
> insert/update/delete happens on the page. Not sure if there is anything to
> worry about, but looks a bit strange. I looked at the archive but can't see
> any report of the warning, so may be we are safe after all.
The text of that warning message was different in previous releases,
and I have seen the older text come up.  I don't really want to invent
a new WAL record type just to cover that case, because that seems like
it's more likely to cause problems than to fix them.  We could
consider emitting an XLOG_HEAP_NEWPAGE record, perhaps.  I'm not sure
whether it's worth it, though.  This is only going to arise
(hopefully) if your master is corrupted - that little hunk of code is
really a $0.02 solution to a $1.00 problem.  I don't know that we want
to pretend that it's any more than a band-aid.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
			
		On Wed, Dec 12, 2012 at 1:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Dec 4, 2012 at 12:10 PM, Pavan Deolasee > <pavan.deolasee@gmail.com> wrote: >> Hmm. Yeah, I do not have guts to prove that either. I'll probably write up a >> comment for your consideration to explain why we don't trust PD_ALL_VISIBLE >> in Hot standby for seq scans, but still trust VM for index-only scans. > > Sure. > Here is a small patch that adds comments to heap scan code explaining why we don't trust the all-visible flag in the page, still continue to support index-only scans on hot standby. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
Вложения
On Wed, Dec 12, 2012 at 12:31 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > On Wed, Dec 12, 2012 at 1:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Dec 4, 2012 at 12:10 PM, Pavan Deolasee >> <pavan.deolasee@gmail.com> wrote: >>> Hmm. Yeah, I do not have guts to prove that either. I'll probably write up a >>> comment for your consideration to explain why we don't trust PD_ALL_VISIBLE >>> in Hot standby for seq scans, but still trust VM for index-only scans. >> >> Sure. >> > > Here is a small patch that adds comments to heap scan code explaining > why we don't trust the all-visible flag in the page, still continue to > support index-only scans on hot standby. Committed, with a few modifications to the last part. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company