Re: 16-bit page checksums for 9.2

Поиск
Список
Период
Сортировка
От Simon Riggs
Тема Re: 16-bit page checksums for 9.2
Дата
Msg-id CA+U5nMLT4aO6NkSuPFqv9j+ycf94nhkyaqS+gu8tTu1O0hKgPQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: 16-bit page checksums for 9.2  (Noah Misch <noah@leadboat.com>)
Ответы Re: 16-bit page checksums for 9.2  (Noah Misch <noah@leadboat.com>)
Список pgsql-hackers
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
> On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
>> v7

Thanks very much for the review. Just realised I hadn't actually replied...

> This patch uses FPIs to guard against torn hint writes, even when the
> checksums are disabled.  One could not simply condition them on the
> page_checksums setting, though.  Suppose page_checksums=off and we're hinting
> a page previously written with page_checksums=on.  If that write tears,
> leaving the checksum intact, that block will now fail verification.  A couple
> of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
> old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
> Whenever the cluster starts with checksums disabled, set the field to
> InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
> field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
> failure occurs in a page with LSN older than the stored one, emit either a
> softer warning or no message at all.

We can only change page_checksums at restart (now) so the above seems moot.

If we crash with FPWs enabled we repair any torn pages.

> Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
> the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
> bumping the page LSN:
>        heap_xlog_visible()
>        visibilitymap_clear()
>        visibilitymap_truncate()
>        lazy_scan_heap()
>        XLogRecordPageWithFreeSpace()
>        FreeSpaceMapTruncateRel()
>        fsm_set_and_search()
>        fsm_vacuum_page()
>        fsm_search_avail()
> Though I have not investigated each of these in detail, I suspect most or all
> represent continued torn-page hazards.  Since the FSM is just a hint, we do
> not WAL-log it at all; it would be nicest to always skip checksums for it,
> too.  The visibility map shortcuts are more nuanced.

Still checking all, but not as bad as the list looks.

I'm removing chksum code from smgr and putting back into bufmgr and
other locations.

Patch footprint now looks like this.
doc/src/sgml/config.sgml                      |   40 +++src/backend/access/hash/hashpage.c            |
1src/backend/access/heap/rewriteheap.c        |    4src/backend/access/heap/visibilitymap.c       |
1src/backend/access/nbtree/nbtree.c           |    1src/backend/access/nbtree/nbtsort.c           |
3src/backend/access/spgist/spginsert.c        |    2src/backend/access/transam/twophase.c         |   16
-src/backend/access/transam/xact.c            |    8src/backend/access/transam/xlog.c             |  114
++++++++src/backend/commands/tablecmds.c             |    2src/backend/storage/buffer/bufmgr.c           |   75
+++++src/backend/storage/buffer/localbuf.c        |    2src/backend/storage/ipc/procarray.c           |   63
+++-src/backend/storage/lmgr/proc.c              |    4src/backend/storage/page/bufpage.c            |  331
+++++++++++++++++++++++++-src/backend/utils/misc/guc.c                 |   14
+src/backend/utils/misc/postgresql.conf.sample|   15 -src/include/access/xlog.h                     |
1src/include/catalog/pg_control.h             |    3src/include/catalog/storage.h                 |
1src/include/storage/bufpage.h                |  107 ++++++--src/include/storage/proc.h                    |
3src/include/storage/procarray.h              |    424 files changed, 743 insertions(+), 72 deletions(-) 

Patch is *not* attached here, yet. Still working on it.

> When page_checksums=off and we read a page last written by page_checksums=on,
> we still verify its checksum.  If a mostly-insert/read site uses checksums for
> some time and eventually wishes to shed the overhead, disabling the feature
> will not stop the costs for reads of old data.  They would need a dump/reload
> to get the performance of a never-checksummed database.  Let's either
> unconditionally skip checksum validation under page_checksums=off or add a new
> state like page_checksums=ignore for that even-weaker condition.

Agreed, changed.

>> --- a/doc/src/sgml/config.sgml
>> +++ b/doc/src/sgml/config.sgml
>
>> +       <para>
>> +        Turning this parameter off speeds normal operation, but
>> +        might allow data corruption to go unnoticed. The checksum uses
>> +        16-bit checksums, using the fast Fletcher 16 algorithm. With this
>> +        parameter enabled there is still a non-zero probability that an error
>> +        could go undetected, as well as a non-zero probability of false
>> +        positives.
>> +       </para>
>
> What sources of false positives do you intend to retain?

I see none. Will reword.

>> --- a/src/backend/catalog/storage.c
>> +++ b/src/backend/catalog/storage.c
>> @@ -20,6 +20,7 @@
>>  #include "postgres.h"
>>
>>  #include "access/visibilitymap.h"
>> +#include "access/transam.h"
>>  #include "access/xact.h"
>>  #include "access/xlogutils.h"
>>  #include "catalog/catalog.h"
>> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
>>  /* XLOG gives us high 4 bits */
>>  #define XLOG_SMGR_CREATE     0x10
>>  #define XLOG_SMGR_TRUNCATE   0x20
>> +#define XLOG_SMGR_HINT               0x40
>>
>>  typedef struct xl_smgr_create
>>  {
>> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
>>       smgrDoPendingDeletes(false);
>>  }
>>
>> +/*
>> + * Write a backup block if needed when we are setting a hint.
>> + *
>> + * Deciding the "if needed" bit is delicate and requires us to either
>> + * grab WALInsertLock or check the info_lck spinlock. If we check the
>> + * spinlock and it says Yes then we will need to get WALInsertLock as well,
>> + * so the design choice here is to just go straight for the WALInsertLock
>> + * and trust that calls to this function are minimised elsewhere.
>> + *
>> + * Callable while holding share lock on the buffer content.
>> + *
>> + * Possible that multiple concurrent backends could attempt to write
>> + * WAL records. In that case, more than one backup block may be recorded
>> + * though that isn't important to the outcome and the backup blocks are
>> + * likely to be identical anyway.
>> + */
>> +#define      SMGR_HINT_WATERMARK             13579
>> +void
>> +smgr_buffer_hint(Buffer buffer)
>> +{
>> +     /*
>> +      * Make an XLOG entry reporting the hint
>> +      */
>> +     XLogRecPtr      lsn;
>> +     XLogRecData rdata[2];
>> +     int                     watermark = SMGR_HINT_WATERMARK;
>> +
>> +     /*
>> +      * Not allowed to have zero-length records, so use a small watermark
>> +      */
>> +     rdata[0].data = (char *) (&watermark);
>> +     rdata[0].len = sizeof(int);
>> +     rdata[0].buffer = InvalidBuffer;
>> +     rdata[0].buffer_std = false;
>> +     rdata[0].next = &(rdata[1]);
>> +
>> +     rdata[1].data = NULL;
>> +     rdata[1].len = 0;
>> +     rdata[1].buffer = buffer;
>> +     rdata[1].buffer_std = true;
>> +     rdata[1].next = NULL;
>> +
>> +     lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
>> +
>> +     /*
>> +      * Set the page LSN if we wrote a backup block.
>> +      */
>> +     if (!XLByteEQ(InvalidXLogRecPtr, lsn))
>> +     {
>> +             Page    page = BufferGetPage(buffer);
>> +             PageSetLSN(page, lsn);
>
> PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
> is insufficient.

Am serialising this by only writing PageLSN while holding buf hdr lock.

> Need a PageSetTLI() here, too.

Agreed

>> +             elog(LOG, "inserted backup block for hint bit");
>> +     }
>> +}
>
>> --- a/src/backend/storage/buffer/bufmgr.c
>> +++ b/src/backend/storage/buffer/bufmgr.c
>
>> @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
>>       if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
>>               (BM_DIRTY | BM_JUST_DIRTIED))
>>       {
>> +             /*
>> +              * If we're writing checksums and we care about torn pages then we
>> +              * cannot dirty a page during recovery as a result of a hint.
>> +              * We can set the hint, just not dirty the page as a result.
>> +              *
>> +              * See long discussion in bufpage.c
>> +              */
>> +             if (HintsMustNotDirtyPage())
>> +                     return;
>
> This expands to
>        if (page_checksums && fullPageWrites && RecoveryInProgress())
> If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
> WAL record during recovery.

Yes, logic corrected.

>> +
>> +             /*
>> +              * Write a full page into WAL iff this is the first change on the
>> +              * block since the last checkpoint. That will never be the case
>> +              * if the block is already dirty because we either made a change
>> +              * or set a hint already. Note that aggressive cleaning of blocks
>> +              * dirtied by hint bit setting would increase the call rate.
>> +              * Bulk setting of hint bits would reduce the call rate...
>> +              *
>> +              * We must issue the WAL record before we mark the buffer dirty.
>> +              * Otherwise we might write the page before we write the WAL.
>> +              * That causes a race condition, since a checkpoint might
>> +              * occur between writing the WAL record and marking the buffer dirty.
>> +              * We solve that with a kluge, but one that is already in use
>> +              * during transaction commit to prevent race conditions.
>> +              * Basically, we simply prevent the checkpoint WAL record from
>> +              * being written until we have marked the buffer dirty. We don't
>> +              * start the checkpoint flush until we have marked dirty, so our
>> +              * checkpoint must flush the change to disk successfully or the
>> +              * checkpoint never gets written, so crash recovery will set us right.
>> +              *
>> +              * XXX rename PGPROC variable later; keep it same now for clarity
>> +              */
>> +             MyPgXact->inCommit = true;
>> +             smgr_buffer_hint(buffer);
>> +
>>               LockBufHdr(bufHdr);
>>               Assert(bufHdr->refcount > 0);
>>               if (!(bufHdr->flags & BM_DIRTY))
>> @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
>>               }
>>               bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
>>               UnlockBufHdr(bufHdr);
>> +             MyPgXact->inCommit = false;
>>       }
>>  }


Found another bug in that section also, relating to inCommit handling.

>> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
>> index 096d36a..a220310 100644
>> --- a/src/backend/storage/buffer/localbuf.c
>> +++ b/src/backend/storage/buffer/localbuf.c
>> @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>>               /* Find smgr relation for buffer */
>>               oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
>>
>> +             /* XXX do we want to write checksums for local buffers? An option? */
>> +
>
> Don't we have that, now that it happens in mdwrite()?
>
> I can see value in an option to exclude local buffers, since corruption there
> may be less exciting.  It doesn't seem important for an initial patch, though.

I'm continuing to exclude local buffers. Let me know if that should change.

>> --- a/src/backend/storage/page/bufpage.c
>> +++ b/src/backend/storage/page/bufpage.c
>
>> + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
>
> I get a 404 for that link.

Changed

>> --- a/src/include/storage/bufpage.h
>> +++ b/src/include/storage/bufpage.h
>
>> -#define PD_VALID_FLAG_BITS   0x0007          /* OR of all valid pd_flags bits */
>> +#define PD_VALID_FLAG_BITS   0x800F          /* OR of all non-checksum pd_flags bits */
>
> The comment is not consistent with the character of the value.

OK


--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: Re: Text-any concatenation volatility acting as optimization barrier
Следующее
От: Greg Smith
Дата:
Сообщение: Re: Checkpoint sync pause