Thread: Incorrect checksum in control file with pg_rewind test

Incorrect checksum in control file with pg_rewind test

From: "Maksim.Melnikov"
Date:
Hi, hackers!

I got a failure in the pg_rewind tests, and it looks like we have a
read/write race on the pg_control file. The test error is "incorrect
checksum in control file". The build was compiled with the
-DEXEC_BACKEND flag.

# +++ tap check in src/bin/pg_rewind +++
Bailout called.  Further testing stopped:  pg_ctl start failed
t/001_basic.pl ...............
Dubious, test returned 255 (wstat 65280, 0xff00)
All 20 subtests passed

2025-05-07 15:00:39.353 MSK [2002308] LOG:  starting backup recovery with redo LSN 0/2000028, checkpoint LSN 0/2000070, on timeline ID 1
2025-05-07 15:00:39.354 MSK [2002307] FATAL:  incorrect checksum in control file
2025-05-07 15:00:39.354 MSK [2002308] LOG:  redo starts at 0/2000028
2025-05-07 15:00:39.354 MSK [2002308] LOG:  completed backup recovery with redo LSN 0/2000028 and end LSN 0/2000138
2025-05-07 15:00:39.354 MSK [2002301] LOG:  background writer process (PID 2002307) exited with exit code 1
2025-05-07 15:00:39.354 MSK [2002301] LOG:  terminating any other active server processes
2025-05-07 15:00:39.355 MSK [2002301] LOG:  shutting down because restart_after_crash is off
2025-05-07 15:00:39.356 MSK [2002301] LOG:  database system is shut down
# No postmaster PID for node "primary_remote"
[15:00:39.438](0.238s) Bail out!  pg_ctl start failed

The failure occurred while restarting the primary node to check that the
rewind went correctly. The error is very rare and difficult to reproduce.

It seems we have a race between the process that replays WAL at startup
and updates the control file, and other sub-processes that were started
with exec and read the control file. As a result, a sub-process can read
a partially updated file with an incorrect CRC. The reason is that
LocalProcessControlFile doesn't acquire ControlFileLock, and it can't
do so.

I found the thread
https://www.postgresql.org/message-id/flat/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
where a similar issue was discussed for frontend programs. The decision
there was to retry the control file read in case of CRC failures.
Details can be found in commit
5725e4ebe7a936f724f21e7ee1e84e54a70bfd83. My suggestion is to use the
same approach here. A patch is attached.
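
To illustrate the retry idea in isolation, here is a minimal sketch in
Python rather than PostgreSQL C. Everything in it is an assumption made
for illustration: the 12-byte record layout, the use of zlib's CRC-32
(PostgreSQL uses CRC-32C), the names encode, decode and read_with_retry,
and the attempt limit. It is not the code from the commit or from the
attached patch; it only shows why re-reading after a checksum mismatch
can get past a torn read.

```python
import struct
import zlib

# Hypothetical layout: 8-byte payload followed by a 4-byte CRC-32 of the payload.
# The real ControlFileData is larger and is protected by CRC-32C.
RECORD = struct.Struct("<QI")


def encode(value: int) -> bytes:
    """Serialize a value together with its checksum."""
    payload = struct.pack("<Q", value)
    return payload + struct.pack("<I", zlib.crc32(payload))


def decode(raw: bytes) -> int:
    """Parse a record, raising ValueError if the checksum does not match."""
    value, stored_crc = RECORD.unpack(raw)
    if zlib.crc32(raw[:8]) != stored_crc:
        raise ValueError("incorrect checksum in control file")
    return value


def read_with_retry(read_once, max_attempts: int = 10) -> int:
    """Call read_once() again after a checksum failure, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return decode(read_once())
        except ValueError as exc:
            last_error = exc  # possibly a torn read; try again
    raise last_error


# Simulate a reader that first sees a half-updated file (new payload, old CRC)
# and then sees the fully written record.
old, new = encode(1), encode(2)
torn = new[:8] + old[8:]
reads = iter([torn, new])
print(read_with_retry(lambda: next(reads)))
```

The first simulated read fails the checksum because the payload and CRC
come from different versions of the record; the second read sees a
consistent record and succeeds.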

Best regards,
Maksim Melnikov

Attachments

Re: Incorrect checksum in control file with pg_rewind test

From: "Maksim.Melnikov"
Date:
Hi,
just to clarify, this isn't a pg_rewind-related issue and it can occur
spontaneously.
I don't have a reliable scenario to reproduce it. The tests sometimes
failed on our local CI, but as you can see in thread [1], where the same
issue for frontends was discussed, it is very hard to reproduce and
there was no reproduction scenario there either.

Some dirty hacks to reproduce it were described in [2], and I've tried
them on the master branch.
First I applied the patch
0001-XXX-Dirty-hack-to-clobber-control-file-for-testing.patch from [2],
then compiled with -DEXEC_BACKEND and executed this command in psql:
do $$ begin loop perform pg_update_control_file(); end loop; end; $$;
I also ran this pgbench command:
for run in {1..5000}; do pgbench -c50 -t100 -j6 -S postgres ; done
And eventually got the error:

2025-11-07 17:58:33.139 MSK [2472504] FATAL:  incorrect checksum in control file
2025-11-07 17:58:33.141 MSK [2472501] LOG:  could not receive data from client: Connection reset by peer
2025-11-07 17:58:33.143 MSK [2472505] LOG:  could not send data to client: Broken pipe
2025-11-07 17:58:33.143 MSK [2472505] FATAL:  connection to client lost

Best regards,
Maksim Melnikov

[1] https://www.postgresql.org/message-id/flat/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
[2] https://www.postgresql.org/message-id/CA%2BhUKGK-BEe38aKNqHJDQ86LUW-CMwF5F9bo1JtJVg71FoDv_w%40mail.gmail.com
[3] https://www.postgresql.org/message-id/f59335a4-83ff-438a-a30e-7cf2200276b6%40postgrespro.ru




Re: Incorrect checksum in control file with pg_rewind test

From: Ivan Kovmir
Date:
I can reproduce the bug on the master branch with the following steps:

1. Apply 0001-XXX-Dirty-hack-to-clobber-control-file-for-testing.patch [1]
2. Compile PostgreSQL with the `-DEXEC_BACKEND` C compiler flag.
3. Run `initdb`
4. Run `postgres`
5. Run `pgbench -i`
6. Run `psql -c 'do $$ begin loop perform pg_update_control_file(); end loop; end; $$;'`
7. Run `for run in {1..5000}; do pgbench -c50 -t100 -j6 -S postgres; done` in parallel with the previous command.
8. Wait a while.

[1] https://www.postgresql.org/message-id/CA%2BhUKGK-BEe38aKNqHJDQ86LUW-CMwF5F9bo1JtJVg71FoDv_w%40mail.gmail.com

Re: Incorrect checksum in control file with pg_rewind test

From: Alexander Korotkov
Date:
Hi, Maksim!

On Fri, Nov 7, 2025 at 5:19 PM Maksim.Melnikov
<m.melnikov@postgrespro.ru> wrote:
> just to clarify, it isn't pg_rewind related issue and can fire
> spontaneously.
> I don't have any strong scenario how to reproduce it, tests sometimes
> fired on our local CI, but as you can see on thread [1],
> where the same issue for frontends was discussed, it is very hard to
> reproduce and there wasn't scenario how to do it too.
>
> Some dirty hacks to reproduce it was described here [2], and I've tried
> it on master branch:
> First of all I applied patch
> 0001-XXX-Dirty-hack-to-clobber-control-file-for-testing.patch from [2],
> then compile app with
> -DEXEC_BACKEND and exec command in psql
> do $$ begin loop perform pg_update_control_file(); end loop; end; $$;
> Also I've run pgbench command
> for run in {1..5000}; do pgbench -c50 -t100 -j6 -S postgres ; done
> And eventually got error
>
> 2025-11-07 17:58:33.139 MSK [2472504] FATAL:  incorrect checksum in
> control file
> 2025-11-07 17:58:33.141 MSK [2472501] LOG:  could not receive data from
> client: Connection reset by peer
> 2025-11-07 17:58:33.143 MSK [2472505] LOG:  could not send data to
> client: Broken pipe
> 2025-11-07 17:58:33.143 MSK [2472505] FATAL:  connection to client lost

Thank you for spotting this issue and proposing a patch.  Fork builds
don't have this problem, because fork replicates the contents of
LocalControlFile to the new process, and the postmaster has a
consistent snapshot of the control file, as there is no concurrent
process that could write it at that moment.  But with EXEC_BACKEND,
even with your patch, different processes may end up with different
contents of LocalControlFile.  I don't see how it could cause a
material bug right now, but I see this as an undesirable divergence
between fork and EXEC_BACKEND behaviors.  I propose an alternative
approach: copy the contents of the control file to the new process via
BackendParameters.  This approach solves two problems at once: no torn
reads, and no divergence between fork and EXEC_BACKEND.
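
As a rough illustration of the idea (in Python, not PostgreSQL C), the
sketch below has a parent hand its own in-memory snapshot to a freshly
executed child instead of letting the child re-read the file. The
snapshot field checkpoint_lsn, the JSON encoding, and the names CHILD
and launch_child are assumptions made for this example; the real
BackendParameters mechanism and the attached patch may look quite
different.

```python
import json
import subprocess
import sys

# Program run by the child. It takes the snapshot from stdin and never opens
# the control file itself, so a concurrent writer cannot give it a torn read.
CHILD = "import json, sys; snap = json.load(sys.stdin); print(snap['checkpoint_lsn'])"


def launch_child(snapshot: dict) -> str:
    """Start a new interpreter (an exec-style launch) and pass it the parent's snapshot."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=json.dumps(snapshot),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The parent read and validated the control data once; every child sees that same copy.
parent_snapshot = {"checkpoint_lsn": "0/2000070"}
print(launch_child(parent_snapshot))
```

Because each child receives the parent's copy, all children start from
the same contents, which mirrors what fork provides implicitly.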

------
Regards,
Alexander Korotkov
Supabase

Attachments

Re: Incorrect checksum in control file with pg_rewind test

From: Alexander Lakhin
Date:
Hello Alexander and Maksim,

21.04.2026 15:12, Alexander Korotkov wrote:
> Thank you for spotting this issue and proposing a patch. ...

FWIW, this issue was also discussed at [1] (you can find the fix proposed
and even the commitfest entry there).

[1] https://www.postgresql.org/message-id/flat/202602021426.ztjk2wp6sgiy%40alvherre.pgsql

Best regards,
Alexander