Обсуждение: Re: The same 2PC data maybe recovered twice

Поиск

Список

Период

Сортировка

Re: The same 2PC data maybe recovered twice

От

"suyu.cmj"

Дата:

12 июля 2023 г., 07:20:57

Yes, this bug can also be reproduced on the master branch, and the corresponding reproduction patch is attached.

I also considered comparing the 2pc.prepare_start_lsn and checkpoint.redo in ProcessTwoPhaseBuffer before, but this method requires modifying the format of the 2pc checkpoint file, which will bring compatibility issues. Especially for released branches, assuming that a node has encountered this bug, it will not be able to start successfully due to FATAL during crash recovery, and therefore cannot manually commit previous two-phase transactions. Using magic number to distinguish 2pc checkpoint file versions can't solve the problem in the above scenario either.

For unreleased branches, writing 2pc.prepare_start_lsn into the checkpoint file will be a good solution, but for released branches, I personally think using WAL record to overwrite checkpoint data would be a more reasonable approach, What do you think?

Best Regards

suyu.cmj

Вложения

0001-Reproduce-the-error_master.patch

Re: The same 2PC data maybe recovered twice

От

Michael Paquier

Дата:

13 июля 2023 г., 06:58:56

On Wed, Jul 12, 2023 at 03:20:57PM +0800, suyu.cmj wrote:
> Yes, this bug can also be reproduced on the master branch, and the
> corresponding reproduction patch is attached.

That's an interesting reproducer with injection points.  It looks like
you've spent a lot of time investigating that.  So, basically, a
checkpoint fails after writing a 2PC file to disk, but before the redo
LSN has been updated.

> I also considered comparing the 2pc.prepare_start_lsn and
> checkpoint.redo in ProcessTwoPhaseBuffer before, but this method
> requires modifying the format of the 2pc checkpoint file, which will
> bring compatibility issues. Especially for released branches,
> assuming that a node has encountered this bug, it will not be able
> to start successfully due to FATAL during crash recovery, and
> therefore cannot manually commit previous two-phase
> transactions. Using magic number to distinguish 2pc checkpoint file
> versions can't solve the problem in the above scenario either.
> For unreleased branches, writing 2pc.prepare_start_lsn into the
> checkpoint file will be a good solution, but for released branches,

Yes, changing anything in this format is a no-go.  Now, things could
be written so as the recovery code is able to handle multiple formats,
meaning that it would be able to feed from the a new format that
includes a LSN or something else for the comparison, but that would
not save from the case where 2PC files with the old format are still
around and a 2PC WAL record is replayed.

> I personally think using WAL record to overwrite checkpoint data
> would be a more reasonable approach, What do you think?

The O(2) loop added in PrepareRedoAdd() to scan the set of 2PC
transactions stored in TwoPhaseState for the purpose of checking for a
duplicate sucks from a performance point of view, particularly for
deployments with many 2PC transactions allowed.  It could delay
recovery a lot.  And actually, this is not completely correct, no?
It is OK to bypass the recovery of the same transaction if the server
has not reached a consistent state, but getting a duplicate when
consistency has been reached should lead to a hard failure.

One approach to avoid this O(2) would be to use a hash table to store
the 2PC entries, for example, rather than an array.  That would be
simple enough but such refactoring is rather scary from the point of
view of recovery.

And, actually, we could do something much more simpler than what's
been proposed on this thread..  PrepareRedoAdd() would be called when
scanning pg_twophase at the beginning of recovery, or when replaying a
PREPARE record, both aiming at adding an entry in shmem for the 2PC
transaction tracked.  Here is a simpler idea: why don't we just check
in PrepareRedoAdd() if the 2PC file of the transaction being recovery
is in pg_twophase/ when adding an entry from a WAL record?  If a
consistent point has *not* been reached by recovery and we find a file
on disk, then do nothing because we *know* thanks to
restoreTwoPhaseData() done at the beginning of recover that there is
an entry for this file.  If a consistent point has been reached in
recovery and we find a file on disk while replaying a WAL record for
the same 2PC file, then fail.  If there is no file in pg_twophase for
the record replayed, then add it to the array TwoPhaseState.

Adding a O(2) loop that checks for duplicates may be a good idea as a
cross-check if replaying a record, but I'd rather put that under an
USE_ASSERT_CHECKING so as there is no impact on production systems,
still we'd have some sanity checks for test setups.
--
Michael

Вложения

signature.asc

Re: The same 2PC data maybe recovered twice

От

"suyu.cmj"

Дата:

17 июля 2023 г., 06:26:56

Yes, the method you proposed is simpler and more efficient. Following your idea, I have modified the corresponding patch, hope you can review it when you have time.

Best Regards

suyu.cmj

Вложения

v2-0001-Fix-a-2PC-transaction-maybe-recovered-twice_master.patch

Re: The same 2PC data maybe recovered twice

От

Michael Paquier

Дата:

17 июля 2023 г., 07:59:53

On Mon, Jul 17, 2023 at 02:26:56PM +0800, suyu.cmj wrote:
> Yes, the method you proposed is simpler and more
> efficient. Following your idea, I have modified the corresponding
> patch, hope you can review it when you have time.

I'll double-check that tomorrow, but yes, that's basically what I had
in mind.  Thanks for the patch!

+   char        path[MAXPGPATH];
+   struct stat stat_buf;
These two variables can be declared in the code block added by the
patch where start_lsn is valid.

+    ereport(FATAL,
+            (errmsg("found unexpected duplicate two-phase
transaction:%u in pg_twophase, check for data correctness.",
+                    hdr->xid)));

The last part of this sentence has no need to be IMO, because it is
misleading when building without assertions.  How about a single
FATAL/WARNING like that:
- errmsg: "could not recover two-phase state file for transaction %u"
- errdetail: "Two-phase state file has been found in WAL record %X/%X
but this transaction has already been restored from disk."

Then a WARNING simply means that we've skipped the record entirely.
--
Michael

Вложения

signature.asc

Re: The same 2PC data maybe recovered twice

От

"suyu.cmj"

Дата:

17 июля 2023 г., 09:20:00

Thanks for the feedback! I have updated the patch, hope you can check it.

Best Regards

suyu.cmj

Вложения

v3-0001-Fix-a-2PC-transaction-maybe-recovered-twice_master.patch

Re: The same 2PC data maybe recovered twice

От

Michael Paquier

Дата:

18 июля 2023 г., 05:13:52

On Mon, Jul 17, 2023 at 05:20:00PM +0800, suyu.cmj wrote:
> Thanks for the feedback! I have updated the patch, hope you can check it.

I have looked at v3, and noticed that the stat() call is actually a
bit incorrect in its error handling because it would miss any errors
that happen when checking for the existence of the file.  The only
error that we should safely expect is ENOENT, for a missing entry.
All the other had better fail like the other code paths restoring 2PC
entries from the shared state.  At the end, I have made the choice of
relying only on access() to check the existence of the file as this is
an internal place, simplified a bit the comment.  Finally, I have made
the choice to remove the assert-only check.  After sleeping on it, it
would have value in very limited cases while a bunch of recovery cases
would take a hit, including developers with large 2PC setups, so there
are a lot of downsides with limited upsides.
--
Michael

Вложения

signature.asc

Re: The same 2PC data maybe recovered twice

От

"suyu.cmj"

Дата:

09 июля, 04:57:08

Hi, all. I would like to sync up on an issue I've recently encountered. Previously, the following logic was added to the PrepareRedoAdd() when fixing this 2PC recovery bug:

------------------------------------------------------------------
From:Michael Paquier <michael@paquier.xyz>
Send Time:2023 Jul. 18 (Tue.) 13:14
To:"蔡梦娟(玊于)"<mengjuan.cmj@alibaba-inc.com>
CC:"zhihui.fan1213"<zhihui.fan1213@gmail.com>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>
Subject:Re: The same 2PC data maybe recovered twice

On Mon, Jul 17, 2023 at 05:20:00PM +0800, suyu.cmj wrote:
> Thanks for the feedback! I have updated the patch, hope you can check it.

I have looked at v3, and noticed that the stat() call is actually a
bit incorrect in its error handling because it would miss any errors
that happen when checking for the existence of the file. The only
error that we should safely expect is ENOENT, for a missing entry.
All the other had better fail like the other code paths restoring 2PC
entries from the shared state. At the end, I have made the choice of
relying only on access() to check the existence of the file as this is
an internal place, simplified a bit the comment. Finally, I have made
the choice to remove the assert-only check. After sleeping on it, it
would have value in very limited cases while a bunch of recovery cases
would take a hit, including developers with large 2PC setups, so there
are a lot of downsides with limited upsides.
--
Michael

Re: The same 2PC data maybe recovered twice

От

"suyu.cmj"

Дата:

09 июля, 08:51:03

I'm sorry that the content of the previous email was incomplete. Let me continue with the issue I've recently encountered.

Previously, the following logic was added to the PrepareRedoAdd() function when fixing this 2PC recovery bug:

+ if (!XLogRecPtrIsInvalid(start_lsn))

+ {

+ char path[MAXPGPATH];

+ TwoPhaseFilePath(path, hdr->xid);

+ if (access(path, F_OK) == 0)

+ {

+ ereport(reachedConsistency ? ERROR : WARNING,

+ (errmsg("could not recover two-phase state file for transaction %u",

+ hdr->xid),

+ errdetail("Two-phase state file has been found in WAL record %X/%X, but thi

s transaction has already been restored from disk.",

+ LSN_FORMAT_ARGS(start_lsn))));

+ return;

+ }

Every time we add a 2PC from a WAL record, we check whether the corresponding 2PC file exists on the storage. When a consistent state is reached, it is highly likely that there will be no corresponding files on the storage, so it is probable that an empty file will be accessed. Each time a file is accessed, the OS creates a dentry for it and increases the reference count of the parent directory's dentry. The type of the dentry reference count is int32. When there are many 2PC transactions, this logic may lead to an overflow of the parent dentry's reference count. When reclaiming a dentry, such as during memory reclamation or when deleting the directory, the overflow of the reference count could ultimately cause the OS to crash.

I am considering whether the logic in this part can be modified as follows:

+ if (!XLogRecPtrIsInvalid(start_lsn) && !reachedConsistency)

+ {

+ char path[MAXPGPATH];

+ TwoPhaseFilePath(path, hdr->xid);

+ if (access(path, F_OK) == 0)

+ {

+ ereport(WARNING,

+ (errmsg("could not recover two-phase state file for transaction %u",

+ hdr->xid),

+ errdetail("Two-phase state file has been found in WAL record %X/%X, but thi

s transaction has already been restored from disk.",

+ LSN_FORMAT_ARGS(start_lsn))));

+ return;

+ }

Currently, since restoreTwoPhaseData() is the only function that restores 2PC transactions from disk before the recovery starts, after reaching a consistent state, the 2PC transactions are only added from the WAL. Under normal circumstances, there should not be any corresponding 2PC files on the storage at that point. Therefore, I prefer to perform the file access checks only when the server has not yet reached a consistent state. Once consistency has been reached, if a duplicate 2PC transaction is added, it will directly result in an error in the subsequent replay logic.

Do you think this modification is feasible? I look forward to your feedback

Best Regards

suyu.cmj

------------------------------------------------------------------
From:CAI, Mengjuan <mengjuan.cmj@alibaba-inc.com>
Send Time:2025 Jul. 9 (Wed.) 09:57
To:Michael Paquier<michael@paquier.xyz>
CC:"pgsql-bugs"<pgsql-bugs@lists.postgresql.org>
Subject:Re: The same 2PC data maybe recovered twice

Hi, all. I would like to sync up on an issue I've recently encountered. Previously, the following logic was added to the PrepareRedoAdd() when fixing this 2PC recovery bug:

------------------------------------------------------------------
From:Michael Paquier <michael@paquier.xyz>
Send Time:2023 Jul. 18 (Tue.) 13:14
To:"蔡梦娟(玊于)"<mengjuan.cmj@alibaba-inc.com>
CC:"zhihui.fan1213"<zhihui.fan1213@gmail.com>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>
Subject:Re: The same 2PC data maybe recovered twice

On Mon, Jul 17, 2023 at 05:20:00PM +0800, suyu.cmj wrote:
> Thanks for the feedback! I have updated the patch, hope you can check it.

I have looked at v3, and noticed that the stat() call is actually a
bit incorrect in its error handling because it would miss any errors
that happen when checking for the existence of the file. The only
error that we should safely expect is ENOENT, for a missing entry.
All the other had better fail like the other code paths restoring 2PC
entries from the shared state. At the end, I have made the choice of
relying only on access() to check the existence of the file as this is
an internal place, simplified a bit the comment. Finally, I have made
the choice to remove the assert-only check. After sleeping on it, it
would have value in very limited cases while a bunch of recovery cases
would take a hit, including developers with large 2PC setups, so there
are a lot of downsides with limited upsides.
--
Michael

Re: The same 2PC data maybe recovered twice

От

Michael Paquier

Дата:

09 июля, 10:35:23

On Wed, Jul 09, 2025 at 01:51:03PM +0800, suyu.cmj wrote:
> Currently, since restoreTwoPhaseData() is the only function that
> restores 2PC transactions from disk before the recovery starts,
> after reaching a consistent state, the 2PC transactions are only
> added from the WAL. Under normal circumstances, there should not be
> any corresponding 2PC files on the storage at that point. Therefore,
> I prefer to perform the file access checks only when the server has
> not yet reached a consistent state. Once consistency has been
> reached, if a duplicate 2PC transaction is added, it will directly
> result in an error in the subsequent replay logic.

There is a bit more going on in this code that needs to be fixed.

Please see the following thread, that also relates to the state of the
2PC recovery code when we read them from disk at the beginning of
recovery, with the bottom part being the most relevant:
https://www.postgresql.org/message-id/Z5sd5O9JO7NYNK-C%40paquier.xyz
--
Michael

Вложения

signature.asc

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Re: The same 2PC data maybe recovered twice

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения