Re: Snapshot related assert failure on skink
From | Tomas Vondra |
---|---|
Subject | Re: Snapshot related assert failure on skink |
Date | |
Msg-id | c72e360d-b363-4cd7-a299-4ee41b193d94@vondra.me |
In reply to | Re: Snapshot related assert failure on skink (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses | Re: Snapshot related assert failure on skink |
List | pgsql-hackers |
On 3/19/25 08:17, Heikki Linnakangas wrote:
> On 19/03/2025 04:22, Tomas Vondra wrote:
>> I kept stress-testing this, and while the frequency massively increased
>> on PG18, I managed to reproduce this all the way back to PG14. I see
>> ~100x more corefiles on PG18.
>>
>> That is not a proof the issue was introduced in PG14, maybe it's just
>> the assert that was added there or something. Or maybe there's another
>> bug in PG18, making the impact worse.
>>
>> But I'd suspect this is a bug in
>>
>> commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
>> Author: Andres Freund <andres@anarazel.de>
>> Date: Mon Aug 17 21:07:10 2020 -0700
>>
>>     snapshot scalability: cache snapshots using a xact completion
>>     counter.
>>
>>     Previous commits made it faster/more scalable to compute snapshots.
>>     But not building a snapshot is still faster. Now that
>>     GetSnapshotData() does not maintain RecentGlobal* anymore, that is
>>     actually not too hard:
>>
>>     ...
>
> Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds()
> and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount?
> This can happen during hot standby:
>
> 1. Backend acquires snapshot A with xmin 1000
> 2. Startup process calls ExpireOldKnownAssignedTransactionIds(),
> 3. Backend acquires snapshot B with xmin 1050
> 4. Backend releases snapshot A, updating TransactionXmin to 1050
> 5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
>    snapshot A's data.
>
> Because xactCompletionCount is not updated in step 2, the
> GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A
> has a lower xmin.
>

Could be. As an experiment I added xactCompletionCount advance to the
two functions you mentioned, and I ran the stress test again. I haven't
seen any failures so far, after ~1000 runs. Without the patch this
produced ~200 failures/core files.

regards

--
Tomas Vondra