Thread: datfrozenxid > relfrozenxid w/ crash before XLOG_HEAP_INPLACE
https://postgr.es/m/20240512232923.aa.nmisch@google.com wrote:
> Separable, nontrivial things not fixed in the attached patch stack:
>
> - Trouble is possible, I bet, if the system crashes between the
>   inplace-update memcpy() and XLogInsert().  See the new XXX comment below
>   the memcpy().

That comment:

	/*----------
	 * XXX A crash here can allow datfrozenxid to get ahead of relfrozenxid:
	 *
	 * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
	 * ["R" is a VACUUM tbl]
	 * D: vac_update_datfrozenxid() -> systable_beginscan(pg_class)
	 * D: systable_getnext() returns pg_class tuple of tbl
	 * R: memcpy() into pg_class tuple of tbl
	 * D: raise pg_database.datfrozenxid, XLogInsert(), finish
	 * [crash]
	 * [recovery restores datfrozenxid w/o relfrozenxid]
	 */

> Might solve this by inplace update setting DELAY_CHKPT, writing WAL, and
> finally issuing memcpy() into the buffer.

That fix worked.  Along with that, I'm attaching a not-for-commit patch with
a test case and one with the fix rebased on that test case.  Apply on top of
the v2 patch stack from https://postgr.es/m/20240617235854.f8.nmisch@google.com.
This gets key testing from 027_stream_regress.pl; when I commented out some
memcpy lines of the heapam.c change, that test caught it.

This resolves the last inplace update defect known to me.

Thanks,
nm
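[Editor's sketch of the reordered sequence described above. This is a
reconstruction, not the committed patch; it assumes the local variables of
heap_inplace_update() in heapam.c (relation, buffer, page, tuple, htup,
newlen), assumes the buffer is exclusively locked throughout, and omits
error handling.]

    /* Write WAL before the buffer change, with checkpoints delayed. */
    Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;

    START_CRIT_SECTION();

    if (RelationNeedsWAL(relation))
    {
        xl_heap_inplace xlrec;
        XLogRecPtr      recptr;

        xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, SizeOfHeapInplace);
        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);

        /*
         * Register the new contents from the caller's tuple: unlike the
         * old ordering, the buffer still holds the old version here.
         */
        XLogRegisterBufData(0, (char *) tuple->t_data + tuple->t_data->t_hoff,
                            newlen);

        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
        PageSetLSN(page, recptr);
    }

    /* Only now overwrite the tuple that concurrent scans can observe. */
    memcpy((char *) htup + htup->t_hoff,
           (char *) tuple->t_data + tuple->t_data->t_hoff,
           newlen);
    MarkBufferDirty(buffer);

    END_CRIT_SECTION();

    /*
     * The change is now both WAL-logged and present in the dirty buffer,
     * so a checkpoint's redo pointer can no longer land between the two
     * and lose it.
     */
    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;

[The point of WAL-first is that any scan that can observe the new value,
like session D's systable_getnext() in the comment's scenario, is then
guaranteed the value survives a crash; DELAY_CHKPT_START closes the
remaining window in which a checkpoint between XLogInsert() and memcpy()
would let recovery skip the record while the page still lacks the change.]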
> On 20 Jun 2024, at 06:29, Noah Misch <noah@leadboat.com> wrote:
>
> This resolves the last inplace update defect known to me.

That’s a huge amount of work, thank you!

Do I get it right that inplace updates are catalog-specific, and that some
other OOM corruptions [0] and standby corruptions [1] are not related to this
fix?  Both cases we observed on regular tables.  Or might that be an effect
of vacuum deepening corruption after observing a wrong datfrozenxid?

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE%40yandex-team.ru#1266dd8b898ba02686c2911e0a50ab47
[1] https://www.postgresql.org/message-id/flat/CAFj8pRBEFMxxFSCVOSi-4n0jHzSaxh6Ze_cZid5eG%3Dtsnn49-A%40mail.gmail.com
On Thu, Jun 20, 2024 at 12:17:44PM +0500, Andrey M. Borodin wrote:
> On 20 Jun 2024, at 06:29, Noah Misch <noah@leadboat.com> wrote:
> > This resolves the last inplace update defect known to me.
>
> That’s a huge amount of work, thank you!
>
> Do I get it right that inplace updates are catalog-specific, and that some
> other OOM corruptions [0] and standby corruptions [1] are not related to
> this fix?  Both cases we observed on regular tables.

In core code, inplace updates are specific to pg_class and pg_database.
Adding PGXN modules, only the citus extension uses them on some other table.
[0] definitely looks unrelated.

> Or might that be an effect of vacuum deepening corruption after observing a
> wrong datfrozenxid?

Wrong datfrozenxid can cause premature clog truncation, which can cause
"could not access status of transaction".  While $SUBJECT could cause that,
I think it would happen on both primary and standby.  [1] seems to be about
a standby lacking clog present on the primary, which is unrelated.

> [0] https://www.postgresql.org/message-id/flat/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE%40yandex-team.ru#1266dd8b898ba02686c2911e0a50ab47
> [1] https://www.postgresql.org/message-id/flat/CAFj8pRBEFMxxFSCVOSi-4n0jHzSaxh6Ze_cZid5eG%3Dtsnn49-A%40mail.gmail.com
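[Editor's note: a toy illustration of that failure chain, standalone C and
not PostgreSQL source. It deliberately ignores XID wraparound and the real
TransactionIdPrecedes() arithmetic. The idea: the clog truncation cutoff
follows the minimum datfrozenxid over all databases, so a datfrozenxid that
got ahead of a table's true relfrozenxid removes commit-status bits that the
table's surviving tuples still need.]

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    /*
     * Clog truncation cutoff: minimum datfrozenxid across databases
     * (simplified; no wraparound-aware comparison).
     */
    static TransactionId
    clog_cutoff(const TransactionId *datfrozenxids, int ndb)
    {
        TransactionId cutoff = datfrozenxids[0];

        for (int i = 1; i < ndb; i++)
            if (datfrozenxids[i] < cutoff)
                cutoff = datfrozenxids[i];
        return cutoff;
    }

    int
    main(void)
    {
        /* tbl's true relfrozenxid: its inplace update's WAL was lost */
        TransactionId relfrozenxid = 1000;

        /* datfrozenxid raised past it, and that change *was* WAL-logged */
        TransactionId datfrozenxid[] = {1500};

        /* everything before the cutoff gets truncated out of clog */
        TransactionId cutoff = clog_cutoff(datfrozenxid, 1);

        /* an unfrozen xmin in tbl, legal per relfrozenxid, has no clog */
        TransactionId xmin = 1200;

        if (xmin >= relfrozenxid && xmin < cutoff)
            printf("ERROR: could not access status of transaction %u\n",
                   xmin);
        return 0;
    }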