Re: Online checksums patch - once again

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: Online checksums patch - once again
Дата
Msg-id b832a496-f0aa-c7e4-cca4-6bd0af4d4a90@iki.fi
обсуждение исходный текст
Ответ на Re: Online checksums patch - once again  (Daniel Gustafsson <daniel@yesql.se>)
Ответы Re: Online checksums patch - once again  (Daniel Gustafsson <daniel@yesql.se>)
Список pgsql-hackers
I looked at patch v22, and I can see two main issues:

1. The one that Robert talked about earlier: A backend checks the local 
"checksums" state. If it's 'off', it writes a page without checksums. 
How do you guarantee that the local state doesn't change in between? The 
implicit assumption seems to be that there MUST NOT be any 
CHECK_FOR_INTERRUPTS() calls between DataChecksumsNeedWrite() and the 
write (or read and DataChecksumsNeedVerify()).

In most code, the DataChecksumsNeedWrite() call is very close to writing 
out the page, often in the same critical section. But this is an 
undocumented assumption.

The code in sendFile() in basebackup.c seems suspicious in that regard. 
It calls DataChecksumsNeedVerify() once before starting to read the 
file. Isn't it possible for the checksums flag to change while it's 
reading the file and sending it to the client? I hope there are 
CHECK_FOR_INTERRUPTS() calls buried somewhere in the loop, because it 
could take minutes to send the whole file.

I would feel better if the state transition of the "checksums" flag 
could only happen in a few safe places, or there were some other 
safeguards for this. I think that's what Andres was trying to say 
earlier in the thread on ProcSignalBarriers. I'm not sure what the 
interface to that should be. It could be something like 
HOLD/RESUME_INTERRUPTS(), where normally all procsignals are handled on 
CHECK_FOR_INTERRUPTS(), but you could "hold off" some if needed. Or 
something else. Or maybe we can just use HOLD/RESUME_INTERRUPTS() for 
this. It's more coarse-grained than necessary, but probably doesn't 
matter in practice.

At minimum, there needs to be comments in DataChecksumsNeedWrite() and 
DataChecksumsNeedVerify(), instructing how to use them safely. Namely, 
you must ensure that there are no interrupts between the 
DataChecksumsNeedWrite() and writing out the page, or between reading 
the page and the DataChecksumsNeedVerify() call. You can achieve that 
with HOLD_INTERRUPTS() or a critical section, or simply ensuring that 
there is no substantial code in between that could call 
CHECK_FOR_INTERRUPTS(). And sendFile() in basebackup.c needs to be fixed.

Perhaps you could have "Assert(InteruptHoldOffCount > 0)" in 
DataChecksumsNeedWrite() and DataChecksumsNeedVerify()? There could be 
other ways that callers could avoid the TOCTOU issue, but it would 
probably catch most of the unsafe call patterns, and you could always 
wrap the DataChecksumsNeedWrite/verify() call in a dummy 
HOLD_INTERRUPTS() block to work around the assertion if you know what 
you're doing.


2. The signaling between enable_data_checksums() and the launcher 
process looks funny to me. The general idea seems to be that 
enable_data_checksums() just starts the launcher process, and the 
launcher process figures out what it need to do and makes all the 
changes to the global state. But then there's this violation of the 
idea: enable_data_checksums() checks DataChecksumsOnInProgress(), and 
tells the launcher process whether it should continue a previously 
crashed operation or start from scratch. I think it would be much 
cleaner if the launcher process figured that out itself, and 
enable_data_checksums() would just tell the launcher what the target 
state is.

enable_data_checksums() and disable_data_checksums() seem prone to race 
conditions. If you call enable_data_checksums() in two backends 
concurrently, depending on the timing, there are two possible outcomes:

a) One call returns true, and launches the background process. The other 
call returns false.

b) Both calls return true, but one of them emits a "NOTICE: data 
checksums worker is already running".

In disable_data_checksum() imagine what happens if another backend calls 
enable_data_checksums() in between the 
ShutdownDatachecksumsWorkerIfRunning() and SetDataChecksumsOff() calls.


>         /*
>          * Mark the buffer as dirty and force a full page write.  We have to
>          * re-write the page to WAL even if the checksum hasn't changed,
>          * because if there is a replica it might have a slightly different
>          * version of the page with an invalid checksum, caused by unlogged
>          * changes (e.g. hintbits) on the master happening while checksums
>          * were off. This can happen if there was a valid checksum on the page
>          * at one point in the past, so only when checksums are first on, then
>          * off, and then turned on again. Iff wal_level is set to "minimal",
>          * this could be avoided iff the checksum is calculated to be correct.
>          */
>         START_CRIT_SECTION();
>         MarkBufferDirty(buf);
>         log_newpage_buffer(buf, false);
>         END_CRIT_SECTION();

It's really unfortunate that we have to dirty the page even if the 
checksum already happens to match. Could we only do the 
log_newpage_buffer() call and skip MarkBufferDirty() in that case?

Could we get away with a more lightweight WAL record that doesn't 
contain the full-page image, but just the block number? On replay, the 
redo routine would read the page from disk.

- Heikki



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: deferred primary key and logical replication
Следующее
От: Pavel Stehule
Дата:
Сообщение: Re: Support for OUT parameters in procedures