Re: WAL format and API changes (9.5)

From: Heikki Linnakangas
Subject: Re: WAL format and API changes (9.5)
Msg-id: 53774DDF.2090600@vmware.com
In reply to: Re: WAL format and API changes (9.5)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: WAL format and API changes (9.5)
List: pgsql-hackers

On 04/03/2014 06:37 PM, Tom Lane wrote:
>> >Let's simplify that, and have one new function, XLogOpenBuffer, which
>> >returns a return code that indicates which of the four cases we're
>> >dealing with. A typical redo function looks like this:
>> >    if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
>> >    {
>> >        /* Modify the page */
>> >        ...
>> >        PageSetLSN(page, lsn);
>> >        MarkBufferDirty(buffer);
>> >    }
>> >    if (BufferIsValid(buffer))
>> >        UnlockReleaseBuffer(buffer);
>> >The '0' in the XLogOpenBuffer call is the ID of the block reference
>> >specified in the XLogRegisterBuffer call, when the WAL record was created.
> +1, but one important step here is finding the data to be replayed.
> That is, a large part of the complexity of replay routines has to do
> with figuring out which parts of the WAL record were elided due to
> full-page-images, and locating the remaining parts.  What can we do
> to make that simpler?
>
> Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
> also calculate and hand back the address/size of the logged data that
> had been pointed to by the associated XLogRecData chain item.  The
> trouble here is that there might've been multiple XLogRecData items
> pointing to the same buffer.  Perhaps the magic ID number you give to
> XLogOpenBuffer should be thought of as identifying an XLogRecData chain
> item, not so much a buffer?  It's fairly easy to see what to do when
> there's just one chain item per buffer, but I'm not sure what to do
> if there's more than one.

Ok, I wrote the first version of a patch for this. The bulk of the patch
is fairly mechanical changes to all the routines that generate WAL
records and their redo routines, to use the new facilities. I'm sure
we'll have to go through a few rounds of discussions and review on the
APIs and names, but I wanted to go and change all the rest of the code
now anyway, to make sure the APIs cover the needs of all the WAL records.

The interesting part is the new APIs for constructing a WAL record, and
replaying it. I haven't polished the implementation of the APIs, but I'm
fairly happy with how the WAL generation and redo routines now look.

Constructing a WAL record is now more convenient; no more constructing
of XLogRecData structs. There's a function, XLogRegisterData(int id,
char *data, int len), that takes the pointer and length directly as
arguments. You can call it multiple times with the same 'id', and the
data will be appended. (I kept an API function that still lets you pass
a chain of XLogRecData structs too, for the odd case where that's more
convenient.)

Internally, XLogRegisterData() still constructs a chain of XLogRecDatas,
using a small number of static XLogRecData structs. That's a tad messy,
and I hope there is a better way to do that, but I wanted to focus on
the API for now.


Here are some more details on the API (also included in the README in the
patch):


Constructing a WAL record
=========================

A WAL record consists of multiple chunks of data. Each chunk is
identified by an ID number. A WAL record also contains information about
the blocks that it applies to, and each block reference is also
identified by an ID number. If the same ID number is used for data and a
block reference, the data is left out of the WAL record if a full-page
image of the block is taken.

The API for constructing a WAL record consists of four functions:
XLogBeginInsert, XLogRegisterBuffer, XLogRegisterData, and XLogInsert.
First, call XLogBeginInsert(). Then register all the buffers modified,
and data needed to replay the changes, using XLogRegister* functions.
Finally, insert the constructed record to the WAL with XLogInsert(). For
example:

    XLogBeginInsert();

    /* register buffers modified as part of this action */
    XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD);
    XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD);

    /* register data that is always included in the WAL record */
    XLogRegisterData(-1, (char *) &xlrec, SizeOfFictionalAction);

    /*
     * register data associated with lbuffer. This will not be
     * included in the record if a full-page image is taken.
     */
    XLogRegisterData(0, footuple->data, footuple->len);

    /* data associated with rbuffer */
    XLogRegisterData(1, data2, len2);

    /*
     * Ok, all the data and buffers to include in the WAL record
     * have been registered. Insert the record.
     */
    recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DOSTUFF);

Details of the API functions:

void XLogRegisterBuffer(int id, Buffer buf, int flags);
-----

XLogRegisterBuffer adds information about a data block to the WAL
record. 'id' is an arbitrary number used to identify this page reference
in the redo routine. The information needed to re-find the page at redo
(the relfilenode, fork, and block number) is included in the WAL record.

XLogInsert will automatically include a full copy of the page contents,
if this is the first modification of the buffer since the last
checkpoint. It is important to register every buffer modified by the
action with XLogRegisterBuffer, to avoid torn-page hazards.

The flags control when and how the buffer contents are included in the
WAL record. REGBUF_STANDARD means that the page follows the standard
page layout, and causes the area between pd_lower and pd_upper to be
left out from the image, reducing the WAL volume. Normally, a full-page
image is taken only if the page has not been modified since the last
checkpoint, and only if full_page_writes=on or an online backup is in
progress. The REGBUF_FORCE_IMAGE flag can be used to force a full-page
image to always be included. If the REGBUF_WILL_INIT flag is given, a
full-page image is never taken. The redo routine must completely
re-generate the page contents, using the data included in the WAL record.
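
The rules in the paragraph above can be written down as a small decision
function. This is only a sketch of the described behaviour, not the
patch's code: the TOY_REGBUF_* values, the ToyInsertState struct, and
toy_needs_full_page_image() are all made up for illustration, and
rejecting FORCE_IMAGE together with WILL_INIT is a choice of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values are invented for this sketch; only the names echo the text. */
#define TOY_REGBUF_STANDARD     0x01
#define TOY_REGBUF_FORCE_IMAGE  0x02
#define TOY_REGBUF_WILL_INIT    0x04

/* Hypothetical description of the state at the time of XLogInsert. */
typedef struct ToyInsertState
{
    bool modified_since_checkpoint; /* page already changed since last checkpoint */
    bool full_page_writes;          /* full_page_writes=on */
    bool backup_in_progress;        /* an online backup is running */
} ToyInsertState;

/* Decide whether a full-page image goes into the record for one buffer. */
static bool
toy_needs_full_page_image(int flags, const ToyInsertState *st)
{
    /* The two flags contradict each other; this sketch simply forbids that. */
    assert(!((flags & TOY_REGBUF_FORCE_IMAGE) && (flags & TOY_REGBUF_WILL_INIT)));

    if (flags & TOY_REGBUF_WILL_INIT)
        return false;           /* redo re-generates the page from scratch */
    if (flags & TOY_REGBUF_FORCE_IMAGE)
        return true;            /* image is always included */
    if (st->modified_since_checkpoint)
        return false;           /* not the first change since the checkpoint */
    return st->full_page_writes || st->backup_in_progress;
}
```

REGBUF_STANDARD does not appear in the decision at all: per the text it
only affects how much of the page is copied once an image is taken.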

void XLogRegisterData(int id, char *data, int len);
-----

XLogRegisterData is used to include arbitrary data in the WAL record. If
XLogRegisterData() is called multiple times with the same 'id', the data
are appended, and will be made available to the redo routine as one
contiguous chunk.

If the same 'id' is used in an XLogRegisterBuffer and XLogRegisterData()
call, the data is not included in the WAL record if a full-page image of
the page is taken. Data registered with id -1 is not related to a
buffer, and is always included.
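
To make the append and leave-out semantics concrete, here is a toy
in-memory model. It is a sketch under my own assumptions, not how the
patch stores things: ToyRecord, ToyChunk, the fixed-size arrays, and the
toy_* function names are all hypothetical, and the has_image flag is set
by hand where the real code would decide it inside XLogInsert.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_BLOCKS 4    /* hypothetical: block IDs 0..3 */
#define TOY_CHUNK_CAP  128  /* hypothetical per-chunk capacity */

typedef struct ToyChunk
{
    char data[TOY_CHUNK_CAP];
    int  len;
} ToyChunk;

typedef struct ToyRecord
{
    ToyChunk main_data;                  /* id -1: always included */
    ToyChunk block_data[TOY_MAX_BLOCKS]; /* ids 0..3: tied to a block */
    bool     has_image[TOY_MAX_BLOCKS];  /* full-page image taken for block? */
} ToyRecord;

/* Start with an empty record, in the spirit of XLogBeginInsert(). */
static void
toy_begin_insert(ToyRecord *rec)
{
    memset(rec, 0, sizeof(*rec));
}

/* Append 'len' bytes to the chunk with the given id. */
static void
toy_register_data(ToyRecord *rec, int id, const char *data, int len)
{
    ToyChunk *chunk;

    assert(id >= -1 && id < TOY_MAX_BLOCKS);
    chunk = (id == -1) ? &rec->main_data : &rec->block_data[id];

    assert(len >= 0 && chunk->len + len <= TOY_CHUNK_CAP);
    memcpy(chunk->data + chunk->len, data, (size_t) len);
    chunk->len += len;
}

/*
 * Return the chunk as the redo side would see it: one contiguous piece,
 * or NULL if it was left out because the block has a full-page image.
 */
static const char *
toy_visible_chunk(const ToyRecord *rec, int id, int *len)
{
    const ToyChunk *chunk;

    assert(id >= -1 && id < TOY_MAX_BLOCKS);
    if (id >= 0 && rec->has_image[id])
    {
        *len = 0;
        return NULL;
    }
    chunk = (id == -1) ? &rec->main_data : &rec->block_data[id];
    *len = chunk->len;
    return chunk->data;
}
```

Registering "abc" and then "de" under id 0 yields a single 5-byte chunk;
once has_image[0] is set that chunk disappears, while the id -1 chunk
stays visible.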


Writing a REDO routine
======================

A REDO routine uses the data and page references included in the WAL
record to reconstruct the new state of the page. To access the data and
pages included in the WAL record, the following functions are available:

char *XLogRecGetData(XLogRecord *record)
-----

Returns the "main" chunk of data included in the WAL record. That is,
the data included in the record with XLogRegisterData(-1, ...).


XLogReplayResult XLogReplayBuffer(int id, Buffer *buf)
-----

Reads a block associated with the WAL record currently being replayed,
with the given 'id'. The shared buffer is returned in *buf, or
InvalidBuffer if the page could not be found. The block is read into a
shared buffer and locked in exclusive mode. Returns one of the following
result codes:

    typedef enum
    {
        BLK_NEEDS_REDO,     /* block needs to be replayed */
        BLK_DONE,           /* block was already replayed */
        BLK_RESTORED,       /* block was restored from a full-page image */
        BLK_NOTFOUND        /* block was not found (and hence does not need
                             * to be replayed) */
    } XLogReplayResult;

The REDO routine must redo the actions to the page if XLogReplayBuffer
returns BLK_NEEDS_REDO. In other cases, no further action is
required, although the result code can be used to distinguish the reason.

After modifying the page (if it was necessary), the REDO routine must
unlock and release the buffer. Note that the buffer must be unlocked and
released even if no action was required.
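
The contract above can be exercised with a toy model. The enum is copied
from the text; everything else (ToyPage, toy_replay_buffer, toy_redo_add)
is hypothetical. Deciding BLK_DONE by comparing the record's LSN with the
page's LSN is my assumption, suggested by the PageSetLSN call in the redo
pattern, and the order of the checks is a guess as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum
{
    BLK_NEEDS_REDO,     /* block needs to be replayed */
    BLK_DONE,           /* block was already replayed */
    BLK_RESTORED,       /* block was restored from a full-page image */
    BLK_NOTFOUND        /* block was not found */
} XLogReplayResult;

/* Hypothetical stand-in for a shared buffer holding one page. */
typedef struct ToyPage
{
    bool     exists;        /* false models a block that cannot be found */
    uint64_t lsn;           /* LSN of the last change applied to the page */
    int      lock_count;    /* > 0 while the page is "locked" */
    int      value;         /* stand-in for the page contents */
} ToyPage;

/* Toy counterpart of XLogReplayBuffer(): lock the page and classify it. */
static XLogReplayResult
toy_replay_buffer(ToyPage *page, uint64_t record_lsn,
                  bool has_image, int image_value, ToyPage **buf)
{
    if (!page->exists)
    {
        *buf = NULL;                /* plays the role of InvalidBuffer */
        return BLK_NOTFOUND;
    }
    page->lock_count++;
    *buf = page;
    if (has_image)
    {
        page->value = image_value;  /* restore from the full-page image */
        page->lsn = record_lsn;
        return BLK_RESTORED;
    }
    if (record_lsn <= page->lsn)
        return BLK_DONE;            /* change is already on the page */
    return BLK_NEEDS_REDO;
}

/* Toy redo routine: add 'delta' to the page, following the usual pattern. */
static void
toy_redo_add(ToyPage *page, uint64_t record_lsn,
             bool has_image, int image_value, int delta)
{
    ToyPage *buf;

    if (toy_replay_buffer(page, record_lsn, has_image, image_value,
                          &buf) == BLK_NEEDS_REDO)
    {
        buf->value += delta;
        buf->lsn = record_lsn;      /* like PageSetLSN() */
    }
    if (buf != NULL)                /* like BufferIsValid() */
        buf->lock_count--;          /* like UnlockReleaseBuffer() */
}
```

Whatever the result code, the lock count is back to zero after
toy_redo_add() returns, which is the "release even if no action was
required" rule in miniature.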


XLogReplayResult XLogReplayBufferExtended(int id, ReadBufferMode mode,
                  bool get_cleanup_lock, Buffer *buf)
-----

Like XLogReplayBuffer(), but with a few extra options.

'mode' can be passed to, e.g., force the page to be zeroed, instead of
reading it from disk. This RBM_ZERO mode should be used to re-initialize
pages registered in the REGBUF_WILL_INIT mode in XLogRegisterBuffer().

If 'get_cleanup_lock' is TRUE, a stronger "cleanup" lock on the page is
acquired, instead of a regular exclusive lock.


char *XLogGetPayload(XLogRecord *record, int id, int *len)
-----

Returns a chunk of data included in the WAL record (with
XLogRegisterData(id, ...)). The length of the data is returned in *len.
This is typically used after XLogReplayBuffer() returned BLK_NEEDS_REDO,
with the same 'id', to get the information required to redo the actions
on the page. If no data with the given id is included, perhaps because a
full-page image of the associated buffer was taken instead, an error is
thrown.


void XLogBlockRefGetTag(XLogRecord *record, int id, RelFileNode *rnode,
           ForkNumber *forknum, BlockNumber *blknum);
-----

Returns the relfilenode, fork number, and block number of the page
registered with the given ID.


Finally, here's the usual pattern for how a REDO routine works (taken
from btree_xlog_insert):

if (XLogReplayBuffer(0, &buffer) == BLK_NEEDS_REDO)
{
     int           datalen;
     char         *datapos = XLogGetPayload(record, 0, &datalen);

     page = (Page) BufferGetPage(buffer);

     if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
                     false, false) == InvalidOffsetNumber)
         elog(PANIC, "btree_insert_redo: failed to add item");

     PageSetLSN(page, lsn);
     MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
     UnlockReleaseBuffer(buffer);


- Heikki
