Thread: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()


Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()

From: Andres Freund
Date:
Hi,

By happenstance I just had slru.h open after lunch and noticed the
following comment:

    /*
     * Optional array of WAL flush LSNs associated with entries in the SLRU
     * pages.  If not zero/NULL, we must flush WAL before writing pages (true
     * for pg_clog, false for multixact, pg_subtrans, pg_notify).  group_lsn[]
     * has lsn_groups_per_page entries per buffer slot, each containing the
     * highest LSN known for a contiguous group of SLRU entries on that slot's
     * page.
     */
    XLogRecPtr *group_lsn;
    int         lsn_groups_per_page;
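
For readers less familiar with slru.c, this array is what drives the flush
in SlruPhysicalWritePage(); in rough outline it works like the sketch below
(stand-in types and a stub XLogFlush(), not the actual source):

/*
 * Simplified sketch of the group_lsn flush decision in
 * SlruPhysicalWritePage(); stand-in types, not actual slru.c source.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;            /* stand-in for the real typedef */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

static void
XLogFlush(XLogRecPtr upto)              /* stub for the real xlog.c call */
{
    printf("would flush WAL through %llu\n", (unsigned long long) upto);
}

typedef struct SlruSharedSketch
{
    XLogRecPtr *group_lsn;              /* NULL for multixact, subtrans, notify */
    int         lsn_groups_per_page;
} SlruSharedSketch;

/*
 * The "write WAL before data" step: find the highest LSN recorded for the
 * page being written and flush WAL up to it.  SLRUs that pass a NULL
 * group_lsn (multixact among them) skip this, and the page goes to disk
 * without any WAL flush.
 */
static void
slru_write_page_sketch(SlruSharedSketch *shared, int slotno)
{
    if (shared->group_lsn != NULL)
    {
        int         base = slotno * shared->lsn_groups_per_page;
        XLogRecPtr  max_lsn = InvalidXLogRecPtr;

        for (int i = 0; i < shared->lsn_groups_per_page; i++)
        {
            if (shared->group_lsn[base + i] > max_lsn)
                max_lsn = shared->group_lsn[base + i];
        }

        if (max_lsn != InvalidXLogRecPtr)
            XLogFlush(max_lsn);
    }

    /* ... the real function then writes the 8 kB page to the segment file ... */
}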
 

Uhm. multixacts historically didn't need to follow the
write-WAL-before-data rule because it was zapped at restart. But it's
now persistent.

There are no comments about this choice anywhere in multixact.c, leading
me to believe that this was not an intentional decision.

Right now I think the "missing" flushes are fairly unlikely to cause
problems in practice, mostly because the multixact SLRUs are essentially
append-only.

But I'd like some more brains to think about potential danger. If we
decide it's ok, let's update the comments.

Greetings,

Andres Freund



Re: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()

From: Noah Misch
Date:
On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
>     /*
>      * Optional array of WAL flush LSNs associated with entries in the SLRU
>      * pages.  If not zero/NULL, we must flush WAL before writing pages (true
>      * for pg_clog, false for multixact, pg_subtrans, pg_notify).  group_lsn[]
>      * has lsn_groups_per_page entries per buffer slot, each containing the
>      * highest LSN known for a contiguous group of SLRU entries on that slot's
>      * page.
>      */
>     XLogRecPtr *group_lsn;
>     int            lsn_groups_per_page;
> 
> Uhm. multixacts historically didn't need to follow the
> write-WAL-before-data rule because it was zapped at restart. But it's
> now persistent.
> 
> There are no comments about this choice anywhere in multixact.c, leading
> me to believe that this was not an intentional decision.

Here's the multixact.c comment justifying it:
 * XLOG interactions: this module generates an XLOG record whenever a new
 * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
 * whenever a new MultiXactId is defined.  This allows us to completely
 * rebuild the data entered since the last checkpoint during XLOG replay.
 * Because this is possible, we need not follow the normal rule of
 * "write WAL before data"; the only correctness guarantee needed is that
 * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
 * checkpoint is considered complete.  If a page does make it to disk ahead
 * of corresponding WAL records, it will be forcibly zeroed before use anyway.
 * Therefore, we don't need to mark our pages with LSN information; we have
 * enough synchronization already.

The comment's justification is incomplete, though.  What of pages filled over
the course of multiple checkpoint cycles?
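
For context on the "forcibly zeroed before use anyway" part: replay of the
page-initialization records simply re-zeroes the referenced SLRU page, which
is what wipes any mxid data that reached disk ahead of its WAL.  Roughly
(a simplified sketch with a stand-in buffer, not actual multixact_redo()
code):

/*
 * Simplified sketch of page-initialization replay for the multixact SLRUs;
 * stand-in types and buffer, not actual multixact.c source.
 */
#include <string.h>

#define SLRU_PAGE_SIZE 8192                 /* stand-in for BLCKSZ */

static char slru_slot[SLRU_PAGE_SIZE];      /* stand-in for an SLRU buffer slot */

/*
 * Replaying XLOG_MULTIXACT_ZERO_OFF_PAGE / XLOG_MULTIXACT_ZERO_MEM_PAGE
 * amounts to zeroing the referenced page again, discarding whatever mxid
 * data may have reached disk ahead of its WAL.
 */
static void
multixact_zero_page_redo_sketch(int pageno)
{
    (void) pageno;                          /* selects the page in real code */
    memset(slru_slot, 0, SLRU_PAGE_SIZE);   /* page is reset to all zeroes */
    /* real code then marks the slot dirty / writes it out as appropriate */
}

Replay only covers records after the redo start point, though, so a page
whose initialization record predates the last checkpoint never gets
re-zeroed; that is the gap the question above is about.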



Re: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()

From: Noah Misch
Date:
On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote:
> On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
> >     /*
> >      * Optional array of WAL flush LSNs associated with entries in the SLRU
> >      * pages.  If not zero/NULL, we must flush WAL before writing pages (true
> >      * for pg_clog, false for multixact, pg_subtrans, pg_notify).  group_lsn[]
> >      * has lsn_groups_per_page entries per buffer slot, each containing the
> >      * highest LSN known for a contiguous group of SLRU entries on that slot's
> >      * page.
> >      */
> >     XLogRecPtr *group_lsn;
> >     int            lsn_groups_per_page;
> >

> Here's the multixact.c comment justifying it:
>
>  * XLOG interactions: this module generates an XLOG record whenever a new
>  * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
>  * whenever a new MultiXactId is defined.  This allows us to completely
>  * rebuild the data entered since the last checkpoint during XLOG replay.
>  * Because this is possible, we need not follow the normal rule of
>  * "write WAL before data"; the only correctness guarantee needed is that
>  * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
>  * checkpoint is considered complete.  If a page does make it to disk ahead
>  * of corresponding WAL records, it will be forcibly zeroed before use anyway.
>  * Therefore, we don't need to mark our pages with LSN information; we have
>  * enough synchronization already.
>
> The comment's justification is incomplete, though.  What of pages filled over
> the course of multiple checkpoint cycles?

Having studied this further, I think it is safe for a reason given elsewhere:

     * Note: we need not flush this XLOG entry to disk before proceeding. The
     * only way for the MXID to be referenced from any data page is for
     * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
     * an XLOG record that must follow ours.  The normal LSN interlock between
     * the data page and that XLOG record will ensure that our XLOG record
     * reaches disk first.  If the SLRU members/offsets data reaches disk
     * sooner than the XLOG record, we do not care because we'll overwrite it
     * with zeroes unless the XLOG record is there too; see notes at top of
     * this file.

I find no flaw in the first three sentences.  In most cases, one of
multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed
overwrite the widowed mxid data with zeroes before the system again writes
data in that vicinity.  We can fail to do that if a backend stalls just after
calling GetNewMultiXactId(), then recovers and updates SLRU pages long after
other backends filled the balance of those pages.  That's still okay; if we
never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the mxid
will survive recovery.  I drafted the attached comment update to consolidate
and slightly correct this justification.  (If we were designing the multixact
subsystem afresh today, I'd vote for following the write-WAL-before-data rule
despite believing it is not needed.  The simplicity would be worth it.)
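
For reference, the "normal LSN interlock" works the usual buffer-manager
way: before a dirty data page is written out, WAL is flushed through that
page's LSN.  Since heap_lock_tuple() stamps the heap page with the LSN of
its own record, which follows the XLOG_MULTIXACT_CREATE_ID record, that
flush covers the multixact record too.  A minimal sketch (stand-in types
and stubs, not actual bufmgr.c code):

/*
 * Simplified sketch of the data-page/WAL interlock; stand-in types and
 * stubs, not actual bufmgr.c source.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

static void
XLogFlush(XLogRecPtr upto)                  /* stub for the real xlog.c call */
{
    printf("flush WAL through %llu\n", (unsigned long long) upto);
}

static void
WritePageToDisk(const char *page)           /* stub for the actual storage write */
{
    (void) page;
}

/*
 * A dirty heap page carries the LSN of the last WAL record that touched it.
 * Flushing WAL through that LSN before writing the page guarantees every
 * earlier record -- including our XLOG_MULTIXACT_CREATE_ID -- is durable
 * before any xmax referencing the mxid can reach disk.
 */
static void
flush_buffer_sketch(const char *page, XLogRecPtr page_lsn)
{
    XLogFlush(page_lsn);                    /* WAL first ... */
    WritePageToDisk(page);                  /* ... then the data page */
}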

While contemplating the "stalls ... just after calling GetNewMultiXactId()"
scenario, I noticed a race condition not involving any write-WAL-before-data
violation.  MultiXactIdCreateFromMembers() and callees have this skeleton:

GetNewMultiXactId
  LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE)
  ExtendMultiXactOffset()
  ExtendMultiXactMember()
  START_CRIT_SECTION()
  (MultiXactState->nextMXact)++
  MultiXactState->nextOffset += nmembers
  LWLockRelease(MultiXactGenLock)
XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID)
RecordNewMultiXact
  LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE)
  *offptr = offset;  /* update pg_multixact/offsets SLRU page */
  LWLockRelease(MultiXactOffsetControlLock)
  LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE)
  *memberptr = members[i].xid;  /* update pg_multixact/members SLRU page */
  *flagsptr = flagsval;  /* update pg_multixact/members SLRU page */
  LWLockRelease(MultiXactMemberControlLock)
END_CRIT_SECTION

Between GetNewMultiXactId() and XLogInsert(), other backends will be free to
create higher-numbered mxids.  They will skip over SLRU space the current
backend reserved.  Future GetMultiXactIdMembers() calls rely critically on the
current backend eventually finishing:

     * 2. The next multixact may still be in process of being filled in: that
     * is, another process may have done GetNewMultiXactId but not yet written
     * the offset entry for that ID.  In that scenario, it is guaranteed that
     * the offset entry for that multixact exists (because GetNewMultiXactId
     * won't release MultiXactGenLock until it does) but contains zero
     * (because we are careful to pre-zero offset pages). Because
     * GetNewMultiXactId will never return zero as the starting offset for a
     * multixact, when we read zero as the next multixact's offset, we know we
     * have this case.  We sleep for a bit and try again.

A crash while paused this way makes permanent the zero offset.  After
recovery, though no xmax carries the offset-zero mxid, GetMultiXactIdMembers()
on the immediate previous mxid hangs indefinitely.  Also,
pg_get_multixact_members(zero_offset_mxid) gets an assertion failure.  I have
not thoroughly considered how best to fix these.  Test procedure:

-- backend 1
checkpoint;
create table victim0 (c) as select 1;
create table stall1 (c) as select 1;
create table last2 (c) as select 1;
begin;
select c from victim0 for key share;
select c from stall1 for key share;
select c from last2 for key share;

-- backend 2
begin; update victim0 set c = c + 1; rollback; -- burn one mxid
-- In a shell, attach GDB to backend 2.  GDB will stop the next SQL command at
-- XLogInsert() in MultiXactIdCreateFromMembers():
--   gdb --pid $BACKEND_PID -ex 'b XLogInsert' -ex c
select c from stall1 for key share; -- stall in gdb while making an mxid

-- backend 3
select c from last2 for key share; -- use one more mxid; flush WAL

-- in GDB session, issue command: kill

-- backend 1, after recovery
select c from victim0;        -- hangs, uncancelable

-- backend 2, after recovery: assertion failure
select pg_get_multixact_members((xmax::text::bigint - 1)::text::xid) from last2;

Attachments

Re: Re: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()

From: Robert Haas
Date:
On Sat, Nov 28, 2015 at 11:15 PM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote:
>> On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
>> >     /*
>> >      * Optional array of WAL flush LSNs associated with entries in the SLRU
>> >      * pages.  If not zero/NULL, we must flush WAL before writing pages (true
>> >      * for pg_clog, false for multixact, pg_subtrans, pg_notify).  group_lsn[]
>> >      * has lsn_groups_per_page entries per buffer slot, each containing the
>> >      * highest LSN known for a contiguous group of SLRU entries on that slot's
>> >      * page.
>> >      */
>> >     XLogRecPtr *group_lsn;
>> >     int                     lsn_groups_per_page;
>> >
>
>> Here's the multixact.c comment justifying it:
>>
>>  * XLOG interactions: this module generates an XLOG record whenever a new
>>  * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
>>  * whenever a new MultiXactId is defined.  This allows us to completely
>>  * rebuild the data entered since the last checkpoint during XLOG replay.
>>  * Because this is possible, we need not follow the normal rule of
>>  * "write WAL before data"; the only correctness guarantee needed is that
>>  * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
>>  * checkpoint is considered complete.  If a page does make it to disk ahead
>>  * of corresponding WAL records, it will be forcibly zeroed before use anyway.
>>  * Therefore, we don't need to mark our pages with LSN information; we have
>>  * enough synchronization already.
>>
>> The comment's justification is incomplete, though.  What of pages filled over
>> the course of multiple checkpoint cycles?
>
> Having studied this further, I think it is safe for a reason given elsewhere:
>
>          * Note: we need not flush this XLOG entry to disk before proceeding. The
>          * only way for the MXID to be referenced from any data page is for
>          * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
>          * an XLOG record that must follow ours.  The normal LSN interlock between
>          * the data page and that XLOG record will ensure that our XLOG record
>          * reaches disk first.  If the SLRU members/offsets data reaches disk
>          * sooner than the XLOG record, we do not care because we'll overwrite it
>          * with zeroes unless the XLOG record is there too; see notes at top of
>          * this file.
>
> I find no flaw in the first three sentences.  In most cases, one of
> multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed
> overwrite the widowed mxid data with zeroes before the system again writes
> data in that vicinity.  We can fail to do that if a backend stalls just after
> calling GetNewMultiXactId(), then recovers and updates SLRU pages long after
> other backends filled the balance of those pages.  That's still okay; if we
> never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the mxid
> will survive recovery.  I drafted the attached comment update to consolidate
> and slightly correct this justification.  (If we were designing the multixact
> subsystem afresh today, I'd vote for following the write-WAL-before-data rule
> despite believing it is not needed.  The simplicity would be worth it.)
>
> While contemplating the "stalls ... just after calling GetNewMultiXactId()"
> scenario, I noticed a race condition not involving any write-WAL-before-data
> violation.  MultiXactIdCreateFromMembers() and callees have this skeleton:
>
> GetNewMultiXactId
>   LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE)
>   ExtendMultiXactOffset()
>   ExtendMultiXactMember()
>   START_CRIT_SECTION()
>   (MultiXactState->nextMXact)++
>   MultiXactState->nextOffset += nmembers
>   LWLockRelease(MultiXactGenLock)
> XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID)
> RecordNewMultiXact
>   LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE)
>   *offptr = offset;  /* update pg_multixact/offsets SLRU page */
>   LWLockRelease(MultiXactOffsetControlLock)
>   LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE)
>   *memberptr = members[i].xid;  /* update pg_multixact/members SLRU page */
>   *flagsptr = flagsval;  /* update pg_multixact/members SLRU page */
>   LWLockRelease(MultiXactMemberControlLock)
> END_CRIT_SECTION
>
> Between GetNewMultiXactId() and XLogInsert(), other backends will be free to
> create higher-numbered mxids.  They will skip over SLRU space the current
> backend reserved.  Future GetMultiXactIdMembers() calls rely critically on the
> current backend eventually finishing:
>
>          * 2. The next multixact may still be in process of being filled in: that
>          * is, another process may have done GetNewMultiXactId but not yet written
>          * the offset entry for that ID.  In that scenario, it is guaranteed that
>          * the offset entry for that multixact exists (because GetNewMultiXactId
>          * won't release MultiXactGenLock until it does) but contains zero
>          * (because we are careful to pre-zero offset pages). Because
>          * GetNewMultiXactId will never return zero as the starting offset for a
>          * multixact, when we read zero as the next multixact's offset, we know we
>          * have this case.  We sleep for a bit and try again.
>
> A crash while paused this way makes permanent the zero offset.  After
> recovery, though no xmax carries the offset-zero mxid, GetMultiXactIdMembers()
> on the immediate previous mxid hangs indefinitely.  Also,
> pg_get_multixact_members(zero_offset_mxid) gets an assertion failure.  I have
> not thoroughly considered how best to fix these.  Test procedure:
>
> -- backend 1
> checkpoint;
> create table victim0 (c) as select 1;
> create table stall1 (c) as select 1;
> create table last2 (c) as select 1;
> begin;
> select c from victim0 for key share;
> select c from stall1 for key share;
> select c from last2 for key share;
>
> -- backend 2
> begin; update victim0 set c = c + 1; rollback; -- burn one mxid
> -- In a shell, attach GDB to backend 2.  GDB will stop the next SQL command at
> -- XLogInsert() in MultiXactIdCreateFromMembers():
> --   gdb --pid $BACKEND_PID -ex 'b XLogInsert' -ex c
> select c from stall1 for key share; -- stall in gdb while making an mxid
>
> -- backend 3
> select c from last2 for key share; -- use one more mxid; flush WAL
>
> -- in GDB session, issue command: kill
>
> -- backend 1, after recovery
> select c from victim0;          -- hangs, uncancelable
>
> -- backend 2, after recovery: assertion failure
> select pg_get_multixact_members((xmax::text::bigint - 1)::text::xid) from last2;

For tracking purposes, I updated
https://wiki.postgresql.org/wiki/MultiXact_Bugs and added this. We're
probably past due to spend some time cleaning up the older issues that
Andres's patch didn't fix (although it fixed a lot).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company