Re: WAL format and API changes (9.5)
| От | Andres Freund |
|---|---|
| Тема | Re: WAL format and API changes (9.5) |
| Дата | |
| Msg-id | 20141030191909.GB12193@alap3.anarazel.de обсуждение исходный текст |
| Ответ на | Re: WAL format and API changes (9.5) (Heikki Linnakangas <hlinnakangas@vmware.com>) |
| Ответы |
Re: WAL format and API changes (9.5)
Re: WAL format and API changes (9.5) |
| Список | pgsql-hackers |
Hi,
On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote:
> The second patch contains the interesting changes.
Heikki's pushed the newest version of this to the git tree.
Some things I noticed while reading the patch:
* potential mismerge:
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -408,7 +408,7 @@ main(int argc, char **argv) } }
- while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nF:wWv",
+ while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:nF:wWv", long_options, &option_index))
!=-1) { switch (c)
* Can't you just have REGBUF_WILL_INIT include REGBUF_NO_IMAGE? Then you don't need to test both.
* Is it really a good idea to separate XLogRegisterBufData() from XLogRegisterBuffer() the way you've done it ? If we
everactually get a record with a large numbers of blocks touched this issentially is O(touched_buffers*data_entries).
* At least in Assert mode XLogRegisterBuffer() should ensure we're not registering the same buffer twice. It's a pretty
easyto make mistake to do it twice for some kind of actions (think UPDATE), and the consequences will be far from
obviousafaics.
* There's lots of functions like XLogRecHasBlockRef() that dig through the wal record. A common pattern is something
like:
if (XLogRecHasBlockRef(record, 1)) XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else oldblk = newblk;
I think doing that repeatedly is quite a bad idea. We should parse the record once and then use it in a sensible
format.Not do it in pieces, over and over again. It's not like we ignore backup blocks - so doing this lazily doesn't
makesense to me. Especially as ValidXLogRecord() *already* has parsed the whole damn thing once.
* Maybe XLogRegisterData() and XLogRegisterBufData() should accept void* instead of char*? Right now there's lots of
pointlesscasts in callers needed that don't add any safety. The one additional line that's then needed in these
functionsseems well worth it.
* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only refer to the relation, but not to the block number.
Thesestill log their rnode manually. Shouldn't we somehow deal with those in a similar way explicit block references
aredealt with?
* You've added an assert in RestoreBackupBlockContents() that ensures the page isn't new. That wasn't there before,
insteadthere was a special case for it. I can't immediately think how that previously could happen, but I do wonder why
itwas there.
* The following comment in +RestoreBackupBlock probably is two lines to early
+ /* Found it, apply the update */
+ if (!(bkpb->fork_flags & BKPBLOCK_HAS_IMAGE))
+ return InvalidBuffer; This new InvalidBuffer case probably could use a comment in either
XLogReadBufferForRedoExtendedor here.
* Most of the commentary above RestoreBackupBlock() probably doesn't belong there anymore.
* Maybe XLogRecordBlockData.fork_flags should be renamed? Maybe just into flags_fork? Right now it makes at least me
thinkthat it's fork specific flags, which really isn't the actual meaning.
* I wonder if it wouldn't be worthwile, for the benefit of the FPI compression patch, to keep the bkp block data after
*all*the "headers". That'd make it easier to just compress the data.
* +InitXLogRecordConstruct() seems like a remainder of the xlogconstruct.c naming. I'd just go for InitXLogInsert()
* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the other slru stuff) doesn't include a reference
tothe page. Isn't that bad? Shouldn't we make XLogRegisterBlock() usable for that case? Otherwise I fail to see how
pg_rewindlike tools can sanely deal with this?
* Why did you duplicate the identical cases in btree_desc()? I guess due to rebasing ontop the xlogdump stats series?
* I haven't actually looked at the xlogdump output so I might be completely wrong, but won't the output of e.g. updates
beharder to read now because it's not clear which of the references old/new buffers are? Not that it matters that
much.
* s/recdataend/recdata_end/? The former is somewhat hard to parse for me.
* I think heap_xlog_update is buggy for wal_level=logical because it computes the length of the tuple using tuplen =
recdataend- recdata; But the old primary key/old tuple value might be stored there as well. Afaics that code has to
continueto use xl_heap_header_len.
* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA to get rid of:
+ /*
+ * The new tuple is normally stored as buffer 0's data. But if
+ * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
+ * data, after the xl_heap_insert struct.
+ */
+ if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
+ {
+ data = XLogRecGetData(record) + SizeOfHeapInsert;
+ datalen = record->xl_len - SizeOfHeapInsert;
+ }
+ else
+ data = XLogRecGetBlockData(record, 0, &datalen);or have I misunderstood how that works?
* spurious reindent
@@ -7306,9 +7120,9 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record) * XLOG record's LSN, we mustn't mark
thepage all-visible, because * the subsequent update won't be replayed to clear the flag. */
- page = BufferGetPage(buffer);
- PageSetAllVisible(page);
- MarkBufferDirty(buffer);
+ page = BufferGetPage(buffer);
+ PageSetAllVisible(page);
+ MarkBufferDirty(buffer); }
* Somewhat long line:XLogRegisterBuffer(1, heap_buffer, XLogHintBitIsNeeded() ? REGBUF_STANDARD : (REGBUF_STANDARD |
REGBUF_NO_IMAGE));
Man. This page is friggin huge. Now I'm tired ;)
This patch needs at the very least:
* Careful benchmarking of both primary and standbys
* Very careful review of the many changed XLogInsert/replay sites. I'd be very surprised if there weren't bugs hidden
somewhere.I've lost the energy to read them all now.
* A good amount of consideration about the increase in WAL volume. I think there's some cases where that's not
inconsiderable.
Greetings,
Andres Freund
-- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services
В списке pgsql-hackers по дате отправления: