RecoveryInProgress() has critical side effects

Поиск
Список
Период
Сортировка
От Robert Haas
Тема RecoveryInProgress() has critical side effects
Дата
Msg-id CA+TgmoY7b65qRjzHN_tWUk8B4sJqk1vj1d31uepVzmgPnZKeLg@mail.gmail.com
обсуждение исходный текст
Ответы Re: RecoveryInProgress() has critical side effects  ("Bossart, Nathan" <bossartn@amazon.com>)
Re: RecoveryInProgress() has critical side effects  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
Hi,

I have noticed that $SUBJECT, and I think it's bad, and I'd like to
propose some patches to fix it. 0001 is hopefully uncontroversial, but
0002 might need some discussion as I'm not quite sure what approach to
take.

The reason why RecoveryInProgress() has critical side effects is that
it calls InitXLOGAccess(). It turns out that the only critical thing
that InitXLOGAccess() does is call InitXLogInsert().[1] If a backend
doesn't call InitXLogInsert(), max_rdatas won't be initialized, and
the first attempt to insert WAL will probably fail with something like
"ERROR: too much WAL data". But there's no real reason why
InitXLogInsert() needs to happen only after recovery is finished. The
work that it does is just setting up data structures in backend-local
memory, and not much is lost if they are set up and not used. So, in
0001, I propose to move the call to InitXLogInsert() from
InitXLOGAccess() to BaseInit(). That means every backend will do it
the first time it starts up. It also means that CreateCheckPoint()
will no longer need a special case call to InitXLogInsert(), which
seems like a good thing.

Now, what about the other stuff that InitXLOGAccess() does? As far as
I can tell, "wal_segment_size = ControlFile->xlog_seg_size" is always
redundant, because LocalProcessControlFile() has always been called,
and it does the same thing. The postmaster does that call, and in
non-EXEC_BACKEND builds, the global variable setting will be inherited
by all child processes. In EXEC_BACKEND builds, every child process
will go through SubPostmasterMain() which also calls
LocalProcessControlFile() very early in the startup sequence.
InitXLOGAccess() also sets RedoRecPtr and doPageWrites ... but these
are non-critical values. If you set them to 0 and false respectively
and then call XLogInsert(), GetFullPagesWritesInfo() will return 0 and
false, XLogRecordAssemble() will use those values to make wrong
decisions, and then XLogInsertRecord() will notice that the wrong
values were used and will update them and cause XLogInsert() to loop
around. Similarly, XLogCheckpointNeeded() relies on the caller to set
RedoRecPtr appropriately, but all callers recheck after updating
RedoRecPtr, so it's fine if the initial value is zero. The only other
interesting case is XLogCheckBufferNeedsBackup(), which calls
GetFullPageWriteInfo(), but if somehow it gets the wrong answer, the
only thing that can happen is we might come the wrong conclusion about
whether a heap-update WAL record should attempt to compress by looking
for a common prefix and suffix between the old and new tuples. That's
non-critical, and if it happens or doesn't happen at most once per
backend, fine.

Nevertheless, what I did in 0002 is remove InitXLOGAccess() but move
the initialization of RedoRecPtr and doPageWrites into
GetFullPageWritesInfo(), only in the case where RedoRecPtr hasn't been
set yet. This has one property that I really really like, which is
that makes the code more understandable. There is no suggestion that
the correctness of WAL insertion somehow depends on a
RecoveryInProgress() call which may or may not have occurred at some
arbitrarily distant time in the past. Initializing the value at the
point of first use is just way clearer than initializing it as a
side-effect of some other function that's not even that tightly
connected. However, it does have the effect of introducing a branch
into GetFullPageWritesInfo(), which I'm not 100% sure is cheap enough
not to matter. I think the obvious alternative is to just not
initialize RedoRecPtr and doPageWrites at all and document in the
comments that the first time each backend does something with them
it's going to end up retrying once; perhaps that is preferable. Going
the other way, we could get rid of the global variables altogether and
have GetFullPageWrites() read the data from shared memory. However,
unless 8-byte atomicity is guaranteed, that requires a spinlock, so it
seems likely unappealing.

(In case the explanation above is unclear to casual readers, here's a
bit more detail: RedoRecPtr is the redo pointer from what we believe
to be the most recent checkpoint record, and doFullPageWrites is
whether full_page_writes are required at the moment. The latter can be
true either because full_page_writes=on or because a base backup is in
progress. When assembling an XLOG record, we have to decide whether to
include full-page images in that or not, and we do that based on the
value of these variables: if full page writes are required in general
and if the page LSN precedes RedoRecPtr, we include a full page image.
If it turns out that our RedoRecPtr was out of date - i.e. there's a
newer checkpoint we didn't know about - then we go back and
re-assemble the record to make sure all required full-page images are
included.)

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

[1] Isn't it great that these two similarly-named functions capitalize
"xlog" differently? Wa-hoo.

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Should AT TIME ZONE be volatile?
Следующее
От: Tom Lane
Дата:
Сообщение: Re: unexpected plan with id = any('{}') condition