Thread: Check invalid pages at the end of recovery to alarm lost data

From: "王伟(学弈)"

Date: July 10, 2023, 07:53:13

Hello, all.

Recently I found a very strange situation in which a primary node loses data.
The details are in the first patch, 0001-Add-test-case-data-lost-after-restart.patch.
It shows that data can be lost if someone else truncates a physical relation
file before the primary node starts up. The primary node nevertheless starts up
normally without raising any alarm, even though it finds invalid pages during
crash recovery.

I then found another situation, involving an aborted transaction. The details are
in the second patch, 0002-Add-test-case-for-abort-transaction-across-checkpoin.patch.
It shows that a transaction abort that spans a checkpoint can also cause invalid
pages and leave some relation files undeleted forever during crash recovery.
Again the primary node starts up normally without any alarm, just as in the
first situation.

By the way, both experiments above were run with the following parameters set:

$node_primary->append_conf('postgresql.conf', 'synchronous_commit=on');

$node_primary->append_conf('postgresql.conf', 'full_page_writes=off');

$node_primary->append_conf('postgresql.conf', 'log_min_messages=debug2');

In my opinion, the primary node should raise an alarm about invalid pages found
during crash recovery, just as a standby node does after it has reached a
consistent recovery state. So I am submitting a third patch, a bug fix,
0003-Check-invalid-pages-at-the-end-of-recovery.patch, which does the following
two things:

(1) The primary node checks for invalid pages at the end of recovery;

(2) The abort WAL record is flushed before any relation files are truncated or deleted.
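
To make the idea behind (1) concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions and is not PostgreSQL code: the class InvalidPageTable, the functions replay and check_at_end_of_recovery, the record tuples, and the relation names are all hypothetical names invented for illustration. Replay remembers every WAL reference to a block that does not exist on disk, a replayed drop of the relation clears those entries, and whatever is still left at the end of recovery is reported instead of being silently ignored.

```python
"""Toy model of invalid-page tracking during WAL replay (not PostgreSQL code)."""


class InvalidPageTable:
    """Hypothetical bookkeeping for pages that replay could not find."""

    def __init__(self):
        # Set of (relation name, block number) pairs missing on disk.
        self.entries = set()

    def log_invalid_page(self, rel, blkno):
        self.entries.add((rel, blkno))

    def forget_relation(self, rel):
        # A replayed drop explains the missing pages, so forget them.
        self.entries = {e for e in self.entries if e[0] != rel}


def replay(records, file_blocks, table):
    """Replay toy WAL records.

    records: list of ("update", rel, blkno) or ("drop", rel) tuples.
    file_blocks: dict mapping relation name -> number of blocks on disk.
    """
    for rec in records:
        if rec[0] == "update":
            _, rel, blkno = rec
            # A block at or beyond the end of the file is treated as invalid.
            if blkno >= file_blocks.get(rel, 0):
                table.log_invalid_page(rel, blkno)
        elif rec[0] == "drop":
            _, rel = rec
            table.forget_relation(rel)
            file_blocks.pop(rel, None)


def check_at_end_of_recovery(table):
    """Raise if any invalid-page reference is still unexplained."""
    if table.entries:
        pages = ", ".join(f"{rel}/{blk}" for rel, blk in sorted(table.entries))
        raise RuntimeError(f"WAL contains references to invalid pages: {pages}")


if __name__ == "__main__":
    # The file for "t1" was truncated to one block before startup,
    # but the WAL still updates block 3.
    table = InvalidPageTable()
    replay([("update", "t1", 0), ("update", "t1", 3)], {"t1": 1}, table)
    try:
        check_at_end_of_recovery(table)
    except RuntimeError as err:
        print(err)
```

In this model the first situation corresponds to an update of a block beyond the truncated file, and the second to a missing "drop" record: if the record that would explain the missing pages never reaches the WAL on disk, the entries are never cleared, which is the motivation for (2).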

Best wishes,

rogers.ww.