Re: Changing the state of data checksums in a running cluster
| From | Andres Freund |
|---|---|
| Subject | Re: Changing the state of data checksums in a running cluster |
| Date | |
| Msg-id | 5mv5vtbqj2xm2wmevsi22smyn32jtznva3ya3daih3ixh3onex@s5fyab63xt3c |
| In reply to | Re: Changing the state of data checksums in a running cluster (Daniel Gustafsson <daniel@yesql.se>) |
| List | pgsql-hackers |
Hi,
One high level issue first: I don't think the way this uses checkpoints and
restartpoints is likely to work out.
Having to synchronously wait for restartpoints during recovery seems like a
huge operational issue in general, but it could also easily lead to undetected
deadlocks, particularly with syncrep.
Using the checksum state from the control file seems very fraught,
particularly with PITR, as the control file can be "from the future". Which
can be a problem e.g. if checksums were disabled, but we start recovery with a
control file with checksums enabled.
Forcing synchronous checkpoints in a bunch of places also will make this a
good bit slower than necessary, particularly for testing.
My suggestion for how to do this instead is to put the checksum state into the
XLOG_CHECKPOINT_* records. When starting recovery from an online checkpoint,
I think we should use the ChecksumType from the XLOG_CHECKPOINT_REDO record,
that way the standby/recovery environment's assumption about whether checksums
were enabled is the same as it was at the time the WAL was generated. For
shutdown checkpoints, we could either start to emit a XLOG_CHECKPOINT_REDO, or
we can use the information from the checkpoint record itself.
I think if we only ever use the checksum state from the point where we start
recovery, we might not need to force *any* checkpoints.
Daniel and I chatted about that proposal, and couldn't immediately come up
with scenarios where that would be wrong. For a while I thought there would
be problems when doing PITR from a base backup that had checksums initially
enabled, but where checksums were disabled before the base backup was
completed. My worry was that a later (once checksums were disabled) hint bit
write (which would not necessarily be WAL logged) would corrupt the checksum,
but I don't think that's a problem, because the startup process will only read
data pages in the process of processing WAL records, and if there's a WAL
record, there would also have to be an FPW, which would "cure" the
unchecksummed page.
More comments below, inline - I wrote these first, so it's possible that I
missed updating some of them in light of what I now wrote above.
> +/*
> + * This must match largest number of sets in barrier_eq and barrier_ne in the
> + * below checksum_barriers definition.
> + */
> +#define MAX_BARRIER_CONDITIONS 2
> +
> +/*
> + * Configuration of conditions which must match when absorbing a procsignal
> + * barrier during data checksum enable/disable operations. A single function
> + * is used for absorbing all barriers, and the set of conditions to use is
> + * looked up in the checksum_barriers struct. The struct member for the target
> + * state defines which state the backend must currently be in, and which it
> + * must not be in.
> + */
> +typedef struct ChecksumBarrierCondition
> +{
> + /* The target state of the barrier */
> + int target;
> + /* A set of states in which at least one MUST match the current state */
> + int barrier_eq[MAX_BARRIER_CONDITIONS];
> + /* The number of elements in the barrier_eq set */
> + int barrier_eq_sz;
> + /* A set of states which all MUST NOT match the current state */
> + int barrier_ne[MAX_BARRIER_CONDITIONS];
> + /* The number of elements in the barrier_ne set */
> + int barrier_ne_sz;
> +} ChecksumBarrierCondition;
> +
> +static const ChecksumBarrierCondition checksum_barriers[] =
> +{
> + {PG_DATA_CHECKSUM_OFF, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION}, 2, {PG_DATA_CHECKSUM_VERSION}, 1},
> + {PG_DATA_CHECKSUM_VERSION, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION}, 1, {0}, 0},
> + {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, {PG_DATA_CHECKSUM_ANY_VERSION}, 1, {PG_DATA_CHECKSUM_VERSION}, 1},
> + {PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION, {PG_DATA_CHECKSUM_VERSION}, 1, {0}, 0},
> + {-1}
> +};
The explanation for this doesn't really explain what the purpose of this thing
is... Perhaps worth referencing datachecksumsworker.c or such?
For a local, constantly sized, array, you shouldn't need a -1 terminator, as
you can instead use lengthof() or such to detect invalid accesses.
> +/*
> + * Flag to remember if the procsignalbarrier being absorbed for checksums is
> + * the first one. The first procsignalbarrier can in rare cases be for the
> + * state we've initialized, i.e. a duplicate. This may happen for any
> + * data_checksum_version value when the process is spawned between the update
> + * of XLogCtl->data_checksum_version and the barrier being emitted. This can
> + * only happen on the very first barrier so mark that with this flag.
> + */
> +static bool InitialDataChecksumTransition = true;
This is pretty hard to understand right now, at the very least it needs an
updated comment. But perhaps we can just get rid of this and accept barriers
that are redundant.
> @@ -830,9 +905,10 @@ XLogInsertRecord(XLogRecData *rdata,
> * only happen just after a checkpoint, so it's better to be slow in
> * this case and fast otherwise.
> *
> - * Also check to see if fullPageWrites was just turned on or there's a
> - * running backup (which forces full-page writes); if we weren't
> - * already doing full-page writes then go back and recompute.
> + * Also check to see if fullPageWrites was just turned on, there's a
> + * running backup or if checksums are enabled (all of which forces
> + * full-page writes); if we weren't already doing full-page writes
> + * then go back and recompute.
> *
> * If we aren't doing full-page writes then RedoRecPtr doesn't
> * actually affect the contents of the XLOG record, so we'll update
> @@ -845,7 +921,9 @@ XLogInsertRecord(XLogRecData *rdata,
> Assert(RedoRecPtr < Insert->RedoRecPtr);
> RedoRecPtr = Insert->RedoRecPtr;
> }
> - doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
> + doPageWrites = (Insert->fullPageWrites ||
> + Insert->runningBackups > 0 ||
> + DataChecksumsNeedWrite());
>
> if (doPageWrites &&
> (!prevDoPageWrites ||
Why do we need to separately check for DataChecksumsNeedWrite() if turning on
checksums also forces Insert->fullPageWrites to on?
> +/*
> + * SetDataChecksumsOnInProgress
> + * Sets the data checksum state to "inprogress-on" to enable checksums
> + *
> + * To start the process of enabling data checksums in a running cluster the
> + * data_checksum_version state must be changed to "inprogress-on". See
> + * SetDataChecksumsOn below for a description on how this state change works.
> + * This function blocks until all backends in the cluster have acknowledged the
> + * state transition.
> + */
> +void
> +SetDataChecksumsOnInProgress(bool immediate_checkpoint)
> +{
> + uint64 barrier;
> + int flags;
> +
> + Assert(ControlFile != NULL);
> +
> + /*
> + * The state transition is performed in a critical section with
> + * checkpoints held off to provide crash safety.
> + */
> + MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> + START_CRIT_SECTION();
ISTM that delayChkptFlags ought to only be set once we're within the critical
section. Obviously we can't fail in between as the code stands, but that's not
guaranteed to stay this way.
> + XLogChecksums(PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
> +
> + SpinLockAcquire(&XLogCtl->info_lck);
> + XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
> + SpinLockRelease(&XLogCtl->info_lck);
Maybe worth adding an assertion checking that we are currently in an expected
state (off or inprogress, I think?).
> + barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);
> +
> + END_CRIT_SECTION();
> + MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
Swap as well.
Think it might be worth mentioning that we rely on the memory ordering implied
by XLogChecksums() and WaitForProcSignalBarrier() for the changes to
delayChkptFlags. Unless we have a different logic around that?
> + /*
> + * Await state change in all backends to ensure that all backends are in
> + * "inprogress-on". Once done we know that all backends are writing data
> + * checksums.
> + */
> + WaitForProcSignalBarrier(barrier);
> +
> + /*
> + * force checkpoint to persist the current checksum state in control file
> + * etc.
> + *
> + * XXX is this needed? There's already a checkpoint at the end of
> + * ProcessAllDatabases, maybe this is redundant?
> + */
> + flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT;
> + if (immediate_checkpoint)
> + flags = flags | CHECKPOINT_FAST;
> + RequestCheckpoint(flags);
Why do we need a checkpoint at all?
> +}
> +
> +/*
> + * SetDataChecksumsOn
> + * Enables data checksums cluster-wide
> + *
> + * Enabling data checksums is performed using two barriers, the first one to
> + * set the state to "inprogress-on" (done by SetDataChecksumsOnInProgress())
> + * and the second one to set the state to "on" (done here). Below is a short
> + * description of the processing, a more detailed write-up can be found in
> + * datachecksumsworker.c.
> + *
> + * To start the process of enabling data checksums in a running cluster the
> + * data_checksum_version state must be changed to "inprogress-on". This state
> + * requires data checksums to be written but not verified. This ensures that
> + * all data pages can be checksummed without the risk of false negatives in
> + * validation during the process. When all existing pages are guaranteed to
> + * have checksums, and all new pages will be initiated with checksums, the
> + * state can be changed to "on". Once the state is "on" checksums will be both
> + * written and verified. See datachecksumsworker.c for a longer discussion on
> + * how data checksums can be enabled in a running cluster.
> + *
> + * This function blocks until all backends in the cluster have acknowledged the
> + * state transition.
> + */
> +void
> +SetDataChecksumsOn(bool immediate_checkpoint)
> {
> + uint64 barrier;
> + int flags;
> +
> Assert(ControlFile != NULL);
> - return (ControlFile->data_checksum_version > 0);
> +
> + SpinLockAcquire(&XLogCtl->info_lck);
> +
> + /*
> + * The only allowed state transition to "on" is from "inprogress-on" since
> + * that state ensures that all pages will have data checksums written.
> + */
> + if (XLogCtl->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
> + {
> + SpinLockRelease(&XLogCtl->info_lck);
> + elog(ERROR, "checksums not in \"inprogress-on\" mode");
Seems like a PANIC condition to me...
> + }
> +
> + SpinLockRelease(&XLogCtl->info_lck);
> +
> + MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> + INJECTION_POINT("datachecksums-enable-checksums-delay", NULL);
> + START_CRIT_SECTION();
I think it's a really really bad idea to do something fallible, like
INJECTION_POINT, after setting delayChkptFlags, but before entering the crit
section. Any error in the injection point will lead to a corrupted
delayChkptFlags, no?
> + XLogChecksums(PG_DATA_CHECKSUM_VERSION);
> +
> + SpinLockAcquire(&XLogCtl->info_lck);
> + XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
> + SpinLockRelease(&XLogCtl->info_lck);
> +
> + barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
> +
> + END_CRIT_SECTION();
> + MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> +
> + /*
> + * Await state transition of "on" in all backends. When done we know that
> + * data checksums are enabled in all backends and data checksums are both
> + * written and verified.
> + */
> + WaitForProcSignalBarrier(barrier);
> +
> + INJECTION_POINT("datachecksums-enable-checksums-pre-checkpoint", NULL);
> +
> + /* XXX is this needed? */
> + flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT;
> + if (immediate_checkpoint)
> + flags = flags | CHECKPOINT_FAST;
> + RequestCheckpoint(flags);
> +}
> +
> +/*
> + * SetDataChecksumsOff
> + * Disables data checksums cluster-wide
> + *
> + * Disabling data checksums must be performed with two sets of barriers, each
> + * carrying a different state. The state is first set to "inprogress-off"
> + * during which checksums are still written but not verified. This ensures that
> + * backends which have yet to observe the state change from "on" won't get
> + * validation errors on concurrently modified pages. Once all backends have
> + * changed to "inprogress-off", the barrier for moving to "off" can be emitted.
> + * This function blocks until all backends in the cluster have acknowledged the
> + * state transition.
> + */
> +void
> +SetDataChecksumsOff(bool immediate_checkpoint)
> +{
> + [...]
> + /*
> + * Ensure that we don't incur a checkpoint during disabling checksums.
> + */
> + MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> + START_CRIT_SECTION();
> +
> + XLogChecksums(0);
Why no symbolic name here?
> + SpinLockAcquire(&XLogCtl->info_lck);
> + XLogCtl->data_checksum_version = 0;
> + SpinLockRelease(&XLogCtl->info_lck);
> +
> + barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF);
> +
> + END_CRIT_SECTION();
> + MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> +
> + WaitForProcSignalBarrier(barrier);
> +
> + flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT;
> + if (immediate_checkpoint)
> + flags = flags | CHECKPOINT_FAST;
> + RequestCheckpoint(flags);
> +}
> +
> +/*
> + * AbsorbDataChecksumsBarrier
> + * Generic function for absorbing data checksum state changes
> + *
> + * All procsignalbarriers regarding data checksum state changes are absorbed
> + * with this function. The set of conditions required for the state change to
> + * be accepted are listed in the checksum_barriers struct, target_state is
> + * used to look up the relevant entry.
> + */
> +bool
> +AbsorbDataChecksumsBarrier(int target_state)
> +{
> + const ChecksumBarrierCondition *condition = checksum_barriers;
> + int current = LocalDataChecksumVersion;
> + bool found = false;
> +
> + /*
> + * Find the barrier condition definition for the target state. Not finding
> + * a condition would be a grave programmer error as the states are a
> + * discrete set.
> + */
> + while (condition->target != target_state && condition->target != -1)
> + condition++;
> + if (unlikely(condition->target == -1))
> + ereport(ERROR,
> + errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid target state %i for data checksum barrier",
> + target_state));
FWIW, you don't need unlikely() when the branch does an ereport(ERROR), as
ereports >=ERROR are marked "cold" automatically.
> + /*
> + * The current state MUST be equal to one of the EQ states defined in this
> + * barrier condition, or equal to the target_state if - and only if -
> + * InitialDataChecksumTransition is true.
> + */
> + for (int i = 0; i < condition->barrier_eq_sz; i++)
> + {
> + if (current == condition->barrier_eq[i] ||
> + condition->barrier_eq[i] == PG_DATA_CHECKSUM_ANY_VERSION)
> + found = true;
> + }
> + if (InitialDataChecksumTransition && current == target_state)
> + found = true;
> +
> + /*
> + * The current state MUST NOT be equal to any of the NE states defined in
> + * this barrier condition.
> + */
> + for (int i = 0; i < condition->barrier_ne_sz; i++)
> + {
> + if (current == condition->barrier_ne[i])
> + found = false;
> + }
> +
> + /*
> + * If the relevant state criteria aren't satisfied, throw an error which
> + * will be caught by the procsignal machinery for a later retry.
> + */
> + if (!found)
> + ereport(ERROR,
> + errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("incorrect data checksum state %i for target state %i",
> + current, target_state));
> +
> + SetLocalDataChecksumVersion(target_state);
> + InitialDataChecksumTransition = false;
> + return true;
> +}
> +/*
> + * Log the new state of checksums
> + */
> +static void
> +XLogChecksums(uint32 new_type)
> +{
> + xl_checksum_state xlrec;
> + XLogRecPtr recptr;
> +
> + xlrec.new_checksumtype = new_type;
> +
> + XLogBeginInsert();
> + XLogRegisterData((char *) &xlrec, sizeof(xl_checksum_state));
> +
> + INJECTION_POINT("datachecksums-xlogchecksums-pre-xloginsert", &new_type);
> +
> + recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKSUMS);
> + XLogFlush(recptr);
> +}
Why an injection point between XLogBeginInsert() and XLogInsert(), rather than
have the injection point before the XLogBeginInsert()?
> + else if (info == XLOG_CHECKSUMS)
> + {
> + xl_checksum_state state;
> + uint64 barrier;
> +
> + memcpy(&state, XLogRecGetData(record), sizeof(xl_checksum_state));
> +
> + /*
> + * XXX Could this end up written to the control file prematurely? IIRC
> + * that happens during checkpoint, so what if that gets triggered e.g.
> + * because someone runs CHECKPOINT? If we then crash (or something
> + * like that), could that confuse the instance?
> + */
> + SpinLockAcquire(&XLogCtl->info_lck);
> + XLogCtl->data_checksum_version = state.new_checksumtype;
> + SpinLockRelease(&XLogCtl->info_lck);
> +
> + /*
> + * Block on a procsignalbarrier to await all processes having seen the
> + * change to checksum status. Once the barrier has been passed we can
> + * initiate the corresponding processing.
> + */
> + switch (state.new_checksumtype)
> + {
> + case PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION:
> + barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);
> + WaitForProcSignalBarrier(barrier);
> + break;
> +
> + case PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION:
> + barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_OFF);
> + WaitForProcSignalBarrier(barrier);
> + break;
> +
> + case PG_DATA_CHECKSUM_VERSION:
> + barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
> + WaitForProcSignalBarrier(barrier);
> + break;
> +
> + default:
> + Assert(state.new_checksumtype == 0);
> + barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF);
> + WaitForProcSignalBarrier(barrier);
> + break;
I'd not add a default clause, but add a case for each value of the enum. That
way we'll get warnings if the set of states changes.
These WaitForProcSignalBarrier() are one of the scariest bits of the
patchset. If the startup process were to hold any lock that backends need, and
the backends waited for that lock without processing interrupts, we'd have an
undetected deadlock. This is much more likely to be a problem for the startup
process, as it does the work of many backends on the primary.
We do process interrupts while waiting for heavyweight locks, so that at least
is not an issue. Seems worth calling out rather explicitly.
> + if (checksumRestartPoint &&
> + (info == XLOG_CHECKPOINT_ONLINE ||
> + info == XLOG_CHECKPOINT_REDO ||
> + info == XLOG_CHECKPOINT_SHUTDOWN))
> + {
> + int flags;
> +
> + elog(LOG, "forcing creation of a restartpoint after XLOG_CHECKSUMS");
> +
> + /* We explicitly want an immediate checkpoint here */
> + flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_FAST;
> + RequestCheckpoint(flags);
> +
> + checksumRestartPoint = false;
> + }
As noted above, I don't think we should rely on starting restartpoints.
> diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
> index 2be4e069816..baf6c8cc2cc 100644
> --- a/src/backend/backup/basebackup.c
> +++ b/src/backend/backup/basebackup.c
> @@ -1613,7 +1613,8 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
> * enabled for this cluster, and if this is a relation file, then verify
> * the checksum.
> */
> - if (!noverify_checksums && DataChecksumsEnabled() &&
> + if (!noverify_checksums &&
> + DataChecksumsNeedWrite() &&
> RelFileNumberIsValid(relfilenumber))
> verify_checksum = true;
>
Why is DataChecksumsNeedWrite() being tested here?
> --
> -- We also set up some things as accessible to standard roles.
> --
> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> index 086c4c8fb6f..6d452b10bce 100644
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1374,6 +1374,26 @@ CREATE VIEW pg_stat_progress_copy AS
> FROM pg_stat_get_progress_info('COPY') AS S
> LEFT JOIN pg_database D ON S.datid = D.oid;
>
> +CREATE VIEW pg_stat_progress_data_checksums AS
> + SELECT
> + S.pid AS pid, S.datid, D.datname AS datname,
> + CASE S.param1 WHEN 0 THEN 'enabling'
> + WHEN 1 THEN 'disabling'
> + WHEN 2 THEN 'waiting on temporary tables'
> + WHEN 3 THEN 'waiting on checkpoint'
> + WHEN 4 THEN 'waiting on barrier'
> + WHEN 5 THEN 'done'
> + END AS phase,
> + CASE S.param2 WHEN -1 THEN NULL ELSE S.param2 END AS databases_total,
> + S.param3 AS databases_done,
> + CASE S.param4 WHEN -1 THEN NULL ELSE S.param4 END AS relations_total,
> + CASE S.param5 WHEN -1 THEN NULL ELSE S.param5 END AS relations_done,
> + CASE S.param6 WHEN -1 THEN NULL ELSE S.param6 END AS blocks_total,
> + CASE S.param7 WHEN -1 THEN NULL ELSE S.param7 END AS blocks_done
> + FROM pg_stat_get_progress_info('DATACHECKSUMS') AS S
> + LEFT JOIN pg_database D ON S.datid = D.oid
> + ORDER BY S.datid; -- return the launcher process first
> +
Not this patch's fault, but I strongly dislike that we do this in SQL. Every
postgres database in the world has ~110kB of pg_stat_progress view definitions
in it. We should just do this in a C function.
> diff --git a/src/backend/postmaster/datachecksumsworker.c b/src/backend/postmaster/datachecksumsworker.c
> new file mode 100644
> index 00000000000..57311760b2b
> --- /dev/null
> +++ b/src/backend/postmaster/datachecksumsworker.c
> @@ -0,0 +1,1491 @@
> +/*-------------------------------------------------------------------------
> + *
> + * datachecksumsworker.c
> + * Background worker for enabling or disabling data checksums online
> + *
> + * When enabling data checksums on a database at initdb time or when shut down
> + * with pg_checksums, no extra process is required as each page is checksummed,
> + * and verified, when accessed. When enabling checksums on an already running
> + * cluster, this worker will ensure that all pages are checksummed before
> + * verification of the checksums is turned on. In the case of disabling
> + * checksums, the state transition is performed only in the control file, no
> + * changes are performed on the data pages.
> + *
> + * Checksums can be either enabled or disabled cluster-wide, with on/off being
> + * the end state for data_checksums.
> + *
> + * Enabling checksums
> + * ------------------
> + * When enabling checksums in an online cluster, data_checksums will be set to
> + * "inprogress-on" which signals that write operations MUST compute and write
> + * the checksum on the data page, but during reading the checksum SHALL NOT be
> + * verified. This ensures that all objects created during checksumming will
> + * have checksums set, but no reads will fail due to incorrect checksum.
Maybe "... due to not yet set checksums."? Incorrect checksums sounds like
it's about checksums that are actively wrong, rather than just not set. Except
for the corner case of a torn page in the process of an hint bit write, after
having disabled checksums, there shouldn't be incorrect ones, right?
> The
> + * DataChecksumsWorker will compile a list of databases which exist at the
> + * start of checksumming, and all of these which haven't been dropped during
> + * the processing MUST have been processed successfully in order for checksums
> + * to be enabled. Any new relation created during processing will see the
> + * in-progress state and will automatically be checksummed.
What about new databases created while checksums are being enabled? They could
be copied before the worker has processed them. At least for the file_copy
strategy, the copy will be verbatim and thus will not necessarily have
checksums set.
> + * Synchronization and Correctness
> + * -------------------------------
> + * The processes involved in enabling, or disabling, data checksums in an
> + * online cluster must be properly synchronized with the normal backends
> + * serving concurrent queries to ensure correctness. Correctness is defined
> + * as the following:
> + *
> + * - Backends SHALL NOT violate the data_checksums state they have agreed to
> + * by acknowledging the procsignalbarrier: This means that all backends
> + * MUST calculate and write data checksums during all states except off;
> + * MUST validate checksums only in the 'on' state.
> + * - Data checksums SHALL NOT be considered enabled cluster-wide until all
> + * currently connected backends have state "on": This means that all
> + * backends must wait on the procsignalbarrier to be acknowledged by all
> + * before proceeding to validate data checksums.
> + *
> + * There are two levels of synchronization required for changing data_checksums
Maybe s/levels/steps/?
> + * in an online cluster: (i) changing state in the active backends ("on",
> + * "off", "inprogress-on" and "inprogress-off"), and (ii) ensuring no
> + * incompatible objects and processes are left in a database when workers end.
> + * The former deals with cluster-wide agreement on data checksum state and the
> + * latter with ensuring that any concurrent activity cannot break the data
> + * checksum contract during processing.
> + *
> + * Synchronizing the state change is done with procsignal barriers, where the
> + * WAL logging backend updating the global state in the controlfile will wait
It's not entirely obvious what "the WAL logging backend" means.
> + * for all other backends to absorb the barrier. Barrier absorption will happen
> + * during interrupt processing, which means that connected backends will change
> + * state at different times. To prevent data checksum state changes when
> + * writing and verifying checksums, interrupts shall be held off before
> + * interrogating state and resumed when the IO operation has been performed.
> + *
> + * When Enabling Data Checksums
> + * ----------------------------
Odd change in indentation here.
> + * A process which fails to observe data checksums being enabled can induce
> + * two types of errors: failing to write the checksum when modifying the page
> + * and failing to validate the data checksum on the page when reading it.
> + *
> + * When processing starts all backends belong to one of the below sets, with
> + * one set being empty:
> + *
> + * Bd: Backends in "off" state
> + * Bi: Backends in "inprogress-on" state
> + *
> + * If processing is started in an online cluster then all backends are in Bd.
> + * If processing was halted by the cluster shutting down, the controlfile
> + * state "inprogress-on" will be observed on system startup and all backends
> + * will be placed in Bd.
Why not in Bi? Just for simplicity's sake? ISTM we already need to be sure
that new backends start in Bi, as they might never observe the barrier...
> Backends transition Bd -> Bi via a procsignalbarrier
> + * which is emitted by the DataChecksumsLauncher. When all backends have
> + * acknowledged the barrier then Bd will be empty and the next phase can
> + * begin: calculating and writing data checksums with DataChecksumsWorkers.
> + * When the DataChecksumsWorker processes have finished writing checksums on
> + * all pages and enables data checksums cluster-wide via another
s/enables/enabled/?
> + * procsignalbarrier, there are four sets of backends where Bd shall be an
> + * empty set:
> + *
> + * Bg: Backend updating the global state and emitting the procsignalbarrier
> + * Bd: Backends in "off" state
> + * Be: Backends in "on" state
> + * Bi: Backends in "inprogress-on" state
> + *
> + * Backends in Bi and Be will write checksums when modifying a page, but only
> + * backends in Be will verify the checksum during reading. The Bg backend is
> + * blocked waiting for all backends in Bi to process interrupts and move to
> + * Be. Any backend starting while Bg is waiting on the procsignalbarrier will
> + * observe the global state being "on" and will thus automatically belong to
> + * Be. Checksums are enabled cluster-wide when Bi is an empty set. Bi and Be
> + * are compatible sets while still operating based on their local state as
> + * both write data checksums.
> + *
> + * When Disabling Data Checksums
> + * -----------------------------
> + * A process which fails to observe that data checksums have been disabled
> + * can induce two types of errors: writing the checksum when modifying the
> + * page and validating a data checksum which is no longer correct due to
> + * modifications to the page.
Hm. I wonder if we ought to zero out old checksums when loading a page into
s_b with checksums disabled... But that's really independent of this patchset.
> + * Potential optimizations
> + * -----------------------
> + * Below are some potential optimizations and improvements which were brought
> + * up during reviews of this feature, but which weren't implemented in the
> + * initial version. These are ideas listed without any validation on their
> + * feasibility or potential payoff. More discussion on these can be found on
> + * the -hackers threads linked to in the commit message of this feature.
> + *
> + * * Launching datachecksumsworker for resuming operation from the startup
> + * process: Currently users have to restart processing manually after a
> + * restart since dynamic background worker cannot be started from the
> + * postmaster. Changing the startup process could make restarting the
> + * processing automatic on cluster restart.
> + * * Avoid dirtying the page when checksums already match: Iff the checksum
> + * on the page happens to already match we still dirty the page. It should
> + * be enough to only do the log_newpage_buffer() call in that case.
> + * * Invent a lightweight WAL record that doesn't contain the full-page
> + * image but just the block number: On replay, the redo routine would read
> + * the page from disk.
The last sentence might be truncated?
> + * * Teach pg_checksums to avoid checksummed pages when pg_checksums is used
> + * to enable checksums on a cluster which is in inprogress-on state and
> + * may have checksummed pages (make pg_checksums be able to resume an
> + * online operation).
> + * * Restartability (not necessarily with page granularity).
> + *
> + *
> + * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + * src/backend/postmaster/datachecksumsworker.c
Hm. The whole set of interactions with checkpoints/restartpoints isn't
explored here?
> +/*
> + * ProcessSingleRelationFork
> + * Enable data checksums in a single relation/fork.
> + *
> + * Returns true if successful, and false if *aborted*. On error, an actual
> + * error is raised in the lower levels.
> + */
> +static bool
> +ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
> +{
> + BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);
> + char activity[NAMEDATALEN * 2 + 128];
> + char *relns;
> +
> + relns = get_namespace_name(RelationGetNamespace(reln));
> +
> + if (!relns)
> + return false;
> +
> + /* Report the current relation to pgstat_activity */
> + snprintf(activity, sizeof(activity) - 1, "processing: %s.%s (%s, %dblocks)",
> + relns, RelationGetRelationName(reln), forkNames[forkNum], numblocks);
> + pgstat_report_activity(STATE_RUNNING, activity);
> +
> + /*
> + * As of now we only update the block counter for main forks in order to
> + * not cause too frequent calls. TODO: investigate whether we should do it
> + * more frequent?
> + */
> + if (forkNum == MAIN_FORKNUM)
> + pgstat_progress_update_param(PROGRESS_DATACHECKSUMS_BLOCKS_TOTAL,
> + numblocks);
That doesn't make much sense to me. Presumably the reason to skip it for the
other forks is that they're small-ish. But if so, there's no point in skipping
the reporting either, as presumably there wouldn't be a lot of reporting?
> + /*
> + * We are looping over the blocks which existed at the time of process
> + * start, which is safe since new blocks are created with checksums set
> + * already due to the state being "inprogress-on".
> + */
> + for (BlockNumber blknum = 0; blknum < numblocks; blknum++)
> + {
> + Buffer buf = ReadBufferExtended(reln, forkNum, blknum, RBM_NORMAL, strategy);
> +
> + /* Need to get an exclusive lock before we can flag as dirty */
> + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> +
> + /*
> + * Mark the buffer as dirty and force a full page write. We have to
> + * re-write the page to WAL even if the checksum hasn't changed,
> + * because if there is a replica it might have a slightly different
> + * version of the page with an invalid checksum, caused by unlogged
> + * changes (e.g. hintbits) on the master happening while checksums
> + * were off. This can happen if there was a valid checksum on the page
> + * at one point in the past, so only when checksums are first on, then
> + * off, and then turned on again. TODO: investigate if this could be
> + * avoided if the checksum is calculated to be correct and wal_level
> + * is set to "minimal",
> + */
> + START_CRIT_SECTION();
> + MarkBufferDirty(buf);
> + log_newpage_buffer(buf, false);
> + END_CRIT_SECTION();
Hm. It's pretty annoying to have to pass page_std = false here, that could
increase the write volume noticeably. But there's not a great way to know
what the right value would be :(
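For the record, the obvious heuristic would be to sanity-check the page header
offsets and only fall back to page_std = false when they look bogus - but
that's not reliable in general, a non-standard page can have plausible-looking
offsets, so the hole would wrongly be elided. Untested sketch, just to
illustrate what I mean:

```c
/*
 * Unsafe-in-general heuristic: treat the page as having the standard
 * layout iff the header offsets are internally consistent, otherwise
 * log the whole page.
 */
Page		page = BufferGetPage(buf);
PageHeader	hdr = (PageHeader) page;
bool		page_std;

page_std = (hdr->pd_lower >= SizeOfPageHeaderData &&
			hdr->pd_lower <= hdr->pd_upper &&
			hdr->pd_upper <= hdr->pd_special &&
			hdr->pd_special <= BLCKSZ);

log_newpage_buffer(buf, page_std);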
> +/*
> + * launcher_exit
> + *
> + * Internal routine for cleaning up state when the launcher process exits. We
> + * need to clean up the abort flag to ensure that processing can be restarted
> + * again after it was previously aborted.
> + */
> +static void
> +launcher_exit(int code, Datum arg)
> +{
> + if (launcher_running)
> + {
> + LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
> + launcher_running = false;
> + DataChecksumsWorkerShmem->launcher_running = false;
> + LWLockRelease(DataChecksumsWorkerLock);
> + }
> +}
Could we end up exiting this with the worker still running?
> +/*
> + * WaitForAllTransactionsToFinish
> + * Blocks awaiting all current transactions to finish
> + *
> + * Returns when all transactions which are active at the call of the function
> + * have ended, or if the postmaster dies while waiting. If the postmaster dies
> + * the abort flag will be set to indicate that the caller of this shouldn't
> + * proceed.
> + *
> + * NB: this will return early, if aborted by SIGINT or if the target state
> + * is changed while we're running.
I think either here, or at its callsites, the patch needs to explain *why* we
are waiting for all transactions to finish. Presumably this is to ensure that
other sessions haven't created relations that we can't see yet?
It actually doesn't seem to wait for all transactions, just for ones with an
xid?
> + */
> +static void
> +WaitForAllTransactionsToFinish(void)
> +{
> + TransactionId waitforxid;
> +
> + LWLockAcquire(XidGenLock, LW_SHARED);
> + waitforxid = XidFromFullTransactionId(TransamVariables->nextXid);
> + LWLockRelease(XidGenLock);
> +
> + while (TransactionIdPrecedes(GetOldestActiveTransactionId(false, true), waitforxid))
> + {
> + char activity[64];
> + int rc;
> +
> + /* Oldest running xid is older than us, so wait */
> + snprintf(activity,
> + sizeof(activity),
> + "Waiting for current transactions to finish (waiting for %u)",
> + waitforxid);
> + pgstat_report_activity(STATE_RUNNING, activity);
> +
> + /* Retry every 3 seconds */
> + ResetLatch(MyLatch);
> + rc = WaitLatch(MyLatch,
> + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
> + 3000,
> + WAIT_EVENT_CHECKSUM_ENABLE_STARTCONDITION);
> +
> + /*
> + * If the postmaster died we won't be able to enable checksums
> + * cluster-wide so abort and hope to continue when restarted.
> + */
> + if (rc & WL_POSTMASTER_DEATH)
> + ereport(FATAL,
> + errcode(ERRCODE_ADMIN_SHUTDOWN),
> + errmsg("postmaster exited during data checksum processing"),
> + errhint("Restart the database and restart data checksum processing by calling pg_enable_data_checksums()."));
> +
> + LWLockAcquire(DataChecksumsWorkerLock, LW_SHARED);
> + if (DataChecksumsWorkerShmem->launch_operation != operation)
> + abort_requested = true;
> + LWLockRelease(DataChecksumsWorkerLock);
> + if (abort_requested)
> + break;
I don't like this much - loops with a timeout are generally a really bad idea
and we shouldn't add more instances. Presumably this also makes the tests
slower...
How about collecting the to-be-waited-for virtualxids and then wait for those?
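I.e. something like the below (untested, modeled on WaitForOlderSnapshots() in
indexcmds.c; signatures from memory):

```c
/* collect everything running right now, across all databases */
VirtualTransactionId *vxids;
int			nvxids;

vxids = GetCurrentVirtualXIDs(InvalidTransactionId,
							  false,	/* excludeXmin0 */
							  true,		/* allDbs */
							  0,		/* excludeVacuum */
							  &nvxids);

/* then sleep on each vxid's lock instead of polling with a timeout */
for (int i = 0; i < nvxids; i++)
	VirtualXactLock(vxids[i], true);
```

That also sidesteps the xid-only issue above, since vxids cover read-only
transactions too.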
> + if (operation == ENABLE_DATACHECKSUMS)
> + {
> + /*
> + * If we are asked to enable checksums in a cluster which already has
> + * checksums enabled, exit immediately as there is nothing more to do.
> + * Hold interrupts to make sure state doesn't change during checking.
> + */
> + HOLD_INTERRUPTS();
> + if (DataChecksumsNeedVerify())
> + {
> + RESUME_INTERRUPTS();
> + goto done;
> + }
> + RESUME_INTERRUPTS();
I don't understand what this interrupt stuff achieves here?
> diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
> index 25d13a798d1..0faaac14b1b 100644
> --- a/src/include/storage/checksum.h
> +++ b/src/include/storage/checksum.h
> @@ -15,6 +15,21 @@
>
> #include "storage/block.h"
>
> +/*
> + * Checksum version 0 is used for when data checksums are disabled (OFF).
> + * PG_DATA_CHECKSUM_VERSION defines that data checksums are enabled in the
> + * cluster and PG_DATA_CHECKSUM_INPROGRESS_{ON|OFF}_VERSION defines that data
> + * checksums are either currently being enabled or disabled.
> + */
> +typedef enum ChecksumType
> +{
> + PG_DATA_CHECKSUM_OFF = 0,
> + PG_DATA_CHECKSUM_VERSION,
> + PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION,
> + PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION,
> + PG_DATA_CHECKSUM_ANY_VERSION
> +} ChecksumType;
Why is there "VERSION" in the name of these? Feels like that's basically just
vestigial at this point.
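I.e. rename to something like this, keeping the numeric values so the
pg_control representation stays compatible:

```c
typedef enum ChecksumState
{
	PG_DATA_CHECKSUM_OFF = 0,
	PG_DATA_CHECKSUM_ON = 1,
	PG_DATA_CHECKSUM_INPROGRESS_ON = 2,
	PG_DATA_CHECKSUM_INPROGRESS_OFF = 3,
	PG_DATA_CHECKSUM_ANY = 4,
} ChecksumState;
```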
> /*
> * There also exist several built-in LWLock tranches. As with the predefined
> diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
> index c6f5ebceefd..d90d35b1d6f 100644
> --- a/src/include/storage/proc.h
> +++ b/src/include/storage/proc.h
> @@ -463,11 +463,11 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
> * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
> * run during normal operation. Startup process and WAL receiver also consume
> * 2 slots, but WAL writer is launched only after startup has exited, so we
> - * only need 6 slots.
> + * only need 6 slots to cover these. The DataChecksums worker and launcher
> + * can consume 2 slots when data checksums are enabled or disabled.
> */
> #define MAX_IO_WORKERS 32
> -#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
> -
> +#define NUM_AUXILIARY_PROCS (8 + MAX_IO_WORKERS)
Aren't they bgworkers now?
> diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
> index afeeb1ca019..c54c61e2cd8 100644
> --- a/src/include/storage/procsignal.h
> +++ b/src/include/storage/procsignal.h
> @@ -54,6 +54,11 @@ typedef enum
> typedef enum
> {
> PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
> +
> + PROCSIGNAL_BARRIER_CHECKSUM_OFF,
> + PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON,
> + PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_OFF,
> + PROCSIGNAL_BARRIER_CHECKSUM_ON,
> } ProcSignalBarrierType;
I wonder if these really should be different barriers. What if we just made it
one, and instead drove the transition on the current shmem content?
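Roughly like this (SetLocalDataChecksumState() and the checksum_state field
are made-up names, obviously):

```c
/* one barrier kind ... */
PROCSIGNAL_BARRIER_CHECKSUM,	/* re-read checksum state from shmem */

/* ... and the handler just adopts whatever the current shared state is */
static bool
ProcessBarrierChecksum(void)
{
	SetLocalDataChecksumState(
		pg_atomic_read_u32(&DataChecksumsWorkerShmem->checksum_state));
	return true;
}
```

That way a barrier that's absorbed late can't apply a stale transition, the
handler always picks up the latest state.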
Other stuff:
- what protects against multiple backends enabling checksums at the same time?
Afaict there isn't anything, and we just ignore the second request. Which
seems ok-ish if it's the same request as before, but not great if it's a
different one.
Should also have tests for that.
- I think this adds a bit too much of the logic to xlog.c, already an unwieldy
file. A fair bit of all of this doesn't seem like it needs to be in there.
- the code seems somewhat split-brained about bgworkers and aux processes
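For the first point, the entry path could at least error out on a conflicting
concurrent request instead of silently ignoring it, roughly (untested, reusing
the shmem fields from the patch):

```c
/* reject a *different* concurrent request instead of ignoring it */
LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
if (DataChecksumsWorkerShmem->launcher_running)
{
	bool		conflicting =
		(DataChecksumsWorkerShmem->launch_operation != operation);

	LWLockRelease(DataChecksumsWorkerLock);
	if (conflicting)
		ereport(ERROR,
				errmsg("data checksum state is already being changed in the opposite direction"));
	return;		/* same request already in progress, nothing to do */
}
DataChecksumsWorkerShmem->launcher_running = true;
DataChecksumsWorkerShmem->launch_operation = operation;
LWLockRelease(DataChecksumsWorkerLock);
```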
Greetings,
Andres Freund