WAL format and API changes (9.5)

From Heikki Linnakangas
Subject WAL format and API changes (9.5)
Msg-id 533D6CBF.6080203@vmware.com
Replies Re: WAL format and API changes (9.5)  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: WAL format and API changes (9.5)  (Robert Haas <robertmhaas@gmail.com>)
Re: WAL format and API changes (9.5)  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
I'd like to make some changes to the WAL format in 9.5. I want to annotate 
each WAL record with the blocks that it modifies. Every WAL record 
already includes that information, but it's done in an ad hoc way, 
differently in every rmgr. The RelFileNode and block number are 
currently part of the WAL payload, and it's the REDO routine's 
responsibility to extract them. I want to include that information in a 
common format for every WAL record type.

That makes life a lot easier for tools that are interested in knowing 
which blocks a WAL record modifies. One such tool is pg_rewind; it 
currently has to understand every WAL record the backend writes. There's 
also a tool out there called pg_readahead, which does prefetching of 
blocks accessed by WAL records, to speed up PITR. I don't think that 
tool has been actively maintained, but at least part of the reason for 
that is probably that it's a pain to maintain when it has to understand 
the details of every WAL record type.

It'd also be nice for contrib/pg_xlogdump and the backend code itself. The 
boilerplate code in all WAL redo routines, and in the code that writes 
WAL records, could be simplified.

So, here's my proposal:

Insertion
---------

The big change in creating WAL records is that the buffers involved in 
the WAL-logged operation are explicitly registered, by calling a new 
XLogRegisterBuffer function. Currently, buffers that need full-page 
images are registered by including them in the XLogRecData chain, but 
with the new system, you call the XLogRegisterBuffer() function instead. 
And you call that function for every buffer involved, even if no 
full-page image needs to be taken, e.g. because the page is going to be 
recreated from scratch at replay.

It is no longer necessary to include the RelFileNode and BlockNumber of 
the modified pages in the WAL payload. That information is automatically 
included in the WAL record, when XLogRegisterBuffer is called.

Currently, the backup blocks are implicitly numbered, in the order the 
buffers appear in XLogRecData entries. With the new API, the blocks are 
numbered explicitly. This is more convenient when a WAL record sometimes 
modifies a buffer and sometimes not. For example, a B-tree split needs 
to modify four pages: the original page, the new page, the right sibling 
(unless it's the rightmost page) and if it's an internal page, the page 
at the lower level whose split the insertion completes. So there are two 
pages that are sometimes missing from the record. With the new API, you 
can nevertheless always register e.g. original page as buffer 0, new 
page as 1, right sibling as 2, even if some of them are actually 
missing. SP-GiST contains even more complicated examples of that.
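
To illustrate the explicit numbering, here is a minimal, self-contained sketch. It is not PostgreSQL code: the slot array, the SPLIT_* constants, and the register_block/has_block helpers are all hypothetical names invented for this example, and a plain block number stands in for a real Buffer. The point is only that each page keeps a fixed ID whether or not the other pages are present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BLOCK_REFS 4        /* hypothetical limit, for this sketch only */

/* Stand-in for a registered block reference (a real one would hold a Buffer). */
typedef struct
{
    bool        in_use;
    uint32_t    block_no;
} BlockRefSlot;

typedef struct
{
    BlockRefSlot slots[MAX_BLOCK_REFS];
} RecordUnderConstruction;

/* Fixed IDs for a B-tree split, following the example in the text. */
enum
{
    SPLIT_ORIG_PAGE = 0,
    SPLIT_NEW_PAGE = 1,
    SPLIT_RIGHT_SIBLING = 2,    /* absent when splitting the rightmost page */
    SPLIT_CHILD_PAGE = 3        /* absent when splitting a leaf page */
};

/* Register a block under an explicit ID; each ID may be used only once. */
static void
register_block(RecordUnderConstruction *rec, int blockref_id, uint32_t block_no)
{
    assert(blockref_id >= 0 && blockref_id < MAX_BLOCK_REFS);
    assert(!rec->slots[blockref_id].in_use);
    rec->slots[blockref_id].in_use = true;
    rec->slots[blockref_id].block_no = block_no;
}

/* Redo-side check: was this ID registered when the record was built? */
static bool
has_block(const RecordUnderConstruction *rec, int blockref_id)
{
    assert(blockref_id >= 0 && blockref_id < MAX_BLOCK_REFS);
    return rec->slots[blockref_id].in_use;
}

/* Split of a rightmost leaf page: only IDs 0 and 1 are registered. */
static RecordUnderConstruction
build_rightmost_leaf_split(uint32_t orig, uint32_t newpage)
{
    RecordUnderConstruction rec = {0};

    register_block(&rec, SPLIT_ORIG_PAGE, orig);
    register_block(&rec, SPLIT_NEW_PAGE, newpage);
    return rec;
}
```

With implicit numbering, the redo routine would have to work out which position each page ended up in; here it can simply ask has_block(&rec, SPLIT_RIGHT_SIBLING).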

The new XLogRegisterBuffer would look like this:

void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std)

blockref_id: An arbitrary ID given to this block reference. It is used 
in the redo routine to open/restore the same block.
buffer: the buffer involved
buffer_std: is the page in "standard" page layout?

That's for the normal cases. We'll need a couple of variants for also 
registering buffers that don't need full-page images, and perhaps also a 
function for registering a page that *always* needs a full-page image, 
regardless of the LSN. A few existing WAL record types just WAL-log the 
whole page, so those ad-hoc full-page images could be replaced with this.
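
One way to picture the variants is as a mode that controls the full-page-image decision. The sketch below is only an illustration: the RegisterMode enum and its values are invented here, Lsn is a stand-in for XLogRecPtr, and the "normal" rule assumes the usual condition that a page needs an image when its LSN is at or before the redo pointer of the latest checkpoint and full-page writes are enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */

/* Hypothetical registration modes; the names are invented for this sketch. */
typedef enum
{
    REG_NORMAL,                 /* image only if needed according to the page LSN */
    REG_NO_IMAGE,               /* never take an image, e.g. page is rebuilt at replay */
    REG_FORCE_IMAGE             /* always take an image, regardless of the LSN */
} RegisterMode;

/* Decide whether a registered page gets a full-page image in the record. */
static bool
needs_full_page_image(RegisterMode mode, Lsn page_lsn, Lsn redo_ptr,
                      bool full_page_writes)
{
    switch (mode)
    {
        case REG_NO_IMAGE:
            return false;
        case REG_FORCE_IMAGE:
            return true;
        case REG_NORMAL:
            /* Assumed rule: first modification since the last checkpoint. */
            return full_page_writes && page_lsn <= redo_ptr;
    }
    assert(!"unknown RegisterMode");
    return false;
}
```

In all three modes the block reference itself would still be recorded; only the presence of the image differs.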

With these changes, a typical WAL insertion would look like this:

    /* register the buffer with the WAL record, with ID 0 */
    XLogRegisterBuffer(0, buf, true);

    rdata[0].data = (char *) &xlrec;
    rdata[0].len = sizeof(BlahRecord);
    rdata[0].buffer_id = -1; /* -1 means the data is always included */
    rdata[0].next = &(rdata[1]);

    rdata[1].data = (char *) mydata;
    rdata[1].len = mydatalen;
    rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered above */
    rdata[1].next = NULL;

    ...
    recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata);

    PageSetLSN(BufferGetPage(buf), recptr);


(While we're at it, perhaps we should let XLogInsert set the LSN of all 
the registered buffers, to reduce the amount of boilerplate code).

(Instead of using a new XLogRegisterBuffer() function to register the 
buffers, perhaps they should be passed to XLogInsert as a separate list 
or array. I'm not wedded to the details...)

Redo
----

There are four different states a block referenced by a typical WAL 
record can be in:

1. The old page does not exist at all (because the relation was 
truncated later).
2. The old page exists, but its LSN is equal to or higher than that of 
the current WAL record, so the record doesn't need replaying.
3. The page's LSN is lower than that of the current WAL record, so the 
record needs to be replayed.
4. The WAL record contains a full-page image, which needs to be restored.

With the current API, that leads to a long boilerplate:

    /* If we have a full-page image, restore it and we're done */
    if (HasBackupBlock(record, 0))
    {
        (void) RestoreBackupBlock(lsn, record, 0, false, false);
        return;
    }

    buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
    /* If the page was truncated away, we're done */
    if (!BufferIsValid(buffer))
        return;

    page = (Page) BufferGetPage(buffer);

    /* Has this record already been replayed? */
    if (lsn <= PageGetLSN(page))
    {
        UnlockReleaseBuffer(buffer);
        return;
    }

    /* Modify the page */
    ...

    PageSetLSN(page, lsn);
    MarkBufferDirty(buffer);
    UnlockReleaseBuffer(buffer);

Let's simplify that, and have one new function, XLogOpenBuffer, which 
returns a code that indicates which of the four cases we're dealing 
with. A typical redo function would then look like this:

    if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
    {
        /* Modify the page */
        ...

        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
    }
    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);

The '0' in the XLogOpenBuffer call is the ID of the block reference 
specified in the XLogRegisterBuffer call, when the WAL record was created.
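
The decision XLogOpenBuffer would have to make can be sketched as a small pure function. This is a model, not the proposed implementation: only BLK_REPLAY is named in the proposal, so the other three return codes, the BlockState struct, and classify_block are hypothetical names, and Lsn stands in for XLogRecPtr. The checks are made in the same order as in the existing boilerplate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */

/* BLK_REPLAY comes from the text; the other three names are invented here. */
typedef enum
{
    BLK_NOTFOUND,               /* case 1: the page no longer exists */
    BLK_DONE,                   /* case 2: the change is already on the page */
    BLK_REPLAY,                 /* case 3: the caller must apply the change */
    BLK_RESTORED                /* case 4: a full-page image was restored */
} BlockAction;

/* What the redo machinery knows about one referenced block. */
typedef struct
{
    bool        record_has_image;   /* record carries a full-page image */
    bool        page_exists;        /* block is still present in the relation */
    Lsn         page_lsn;           /* only meaningful if page_exists */
} BlockState;

/* Map a block's state to one of the four cases. */
static BlockAction
classify_block(const BlockState *st, Lsn record_lsn)
{
    assert(st != NULL);

    if (st->record_has_image)
        return BLK_RESTORED;
    if (!st->page_exists)
        return BLK_NOTFOUND;
    if (record_lsn <= st->page_lsn)
        return BLK_DONE;
    return BLK_REPLAY;
}
```

A redo routine then only needs to act on BLK_REPLAY; the other three cases require no page modification by the caller.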

WAL format
----------

The registered block references need to be included in the WAL record. 
We already do that for backup blocks, so a naive implementation would be 
to just include a BkpBlock struct for all the block references, even 
those that don't need a full-page image. That would be rather bulky, 
though, so that needs some optimization. Shouldn't be difficult to omit 
duplicated/unnecessary information, and add a flags field indicating 
which fields are present. Overall, I don't expect there to be any big 
difference in the amount of WAL generated by a typical application.
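
As a rough illustration of the "flags field" idea, the sketch below encodes a list of block references and omits the relation identifier when it repeats the previous reference's. Everything here is an assumption made for the example: the byte layout, the REF_SAME_REL flag, and the RelId/BlockRef structs are invented and do not describe an actual on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for RelFileNode: tablespace, database, relation. */
typedef struct
{
    uint32_t    spc;
    uint32_t    db;
    uint32_t    rel;
} RelId;

typedef struct
{
    uint8_t     id;             /* the blockref_id given at registration */
    RelId       rel;
    uint32_t    block_no;
} BlockRef;

#define REF_SAME_REL 0x01       /* hypothetical flag: relation omitted, same as previous */
#define MAX_ENCODED_REF_SIZE 18 /* id + flags + RelId + block number */

static int
same_rel(const RelId *a, const RelId *b)
{
    return a->spc == b->spc && a->db == b->db && a->rel == b->rel;
}

/*
 * Encode n block references into out and return the number of bytes written.
 * out must have room for n * MAX_ENCODED_REF_SIZE bytes.
 */
static size_t
encode_block_refs(const BlockRef *refs, int n, uint8_t *out)
{
    size_t      off = 0;
    int         i;

    assert(refs != NULL && out != NULL);

    for (i = 0; i < n; i++)
    {
        uint8_t     flags = 0;

        if (i > 0 && same_rel(&refs[i].rel, &refs[i - 1].rel))
            flags |= REF_SAME_REL;

        out[off++] = refs[i].id;
        out[off++] = flags;
        if (!(flags & REF_SAME_REL))
        {
            memcpy(out + off, &refs[i].rel, sizeof(RelId));
            off += sizeof(RelId);
        }
        memcpy(out + off, &refs[i].block_no, sizeof(uint32_t));
        off += sizeof(uint32_t);
    }
    return off;
}
```

In this toy layout, two references to the same relation take 18 + 6 bytes instead of 2 × 18, which is the kind of saving that matters for records touching several pages of one index.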

- Heikki


