Re: Snapshot related assert failure on skink
От | Heikki Linnakangas |
---|---|
Тема | Re: Snapshot related assert failure on skink |
Дата | |
Msg-id | 27a69a5e-7672-4ee0-ae06-62cd520697a8@iki.fi обсуждение исходный текст |
Ответ на | Re: Snapshot related assert failure on skink (Tomas Vondra <tomas@vondra.me>) |
Список | pgsql-hackers |
On 21/03/2025 12:28, Tomas Vondra wrote: > But it seems it changed in 952365cded6, which is: > > commit 952365cded635e54c4177399c0280cb7a5e34c11 > Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Mon Dec 23 12:42:39 2024 +0200 > > Remove unnecessary GetTransactionSnapshot() calls > > In get_database_list() and get_subscription_list(), the > GetTransactionSnapshot() call is not required because the catalog > table scans use the catalog snapshot, which is held until the end of > the scan. See table_beginscan_catalog(), which calls > RegisterSnapshot(GetCatalogSnapshot(relid)). > > ... > > I'm not claiming the commit is buggy - it might be, but I think it's > more likely it just made the preexisting bug easier to hit. I base this > on the observation that incrementing the xactCompletionCount made the > assert failures go away. That commit removed three GetTransactionSnapshot() calls: one in autovacuum, one in logical replication launcher, and one in InitPostgres(). It must be the one in InitPostgres() that makes the difference, because the other two are not executed in a standby. I was able to reproduce this and to catch it in action with 'rr'. After commit 952365cded6, the sequence of events looks like this: 1. At backend startup, GetCatalogSnapshot() sets CatalogSnapshotData.xmin=1000, xactCompletionCount=10 2. On first query, GetTransactionSnapshot() sets CurrentSnapshotData.xmin=1002, xactCompletionCount=10 3. GetCatalogSnapshot() is called again. It tries to reuse the snapshot with xmin=1000 and hits the assertion. Before commit 952365cded6, which removed the GetTransactionSnapshot() call from InitPostgres, this usually happens instead: 0. At backend startup, GetTransactionSnapshot() sets CurrentSnapshotData.xmin=1000, xactCompletionCount=10 1. At backend startup, GetCatalogSnapshot() sets CatalogSnapshotData.xmin=1000, xactCompletionCount=10 2. On first query, GetTransactionSnapshot() reuses the snapshot with CurrentSnapshotData.xmin=1002, xactCompletionCount=10 3. GetCatalogSnapshot() is called again. It successfully reuses the snapshot with xmin=1000 In other words, the GetTransactionSnapshot() call initializes the CurrentSnapshotData with the snapshot with an earlier xmin. The first GetCatalogSnapshot() call happens almost immediately after that, so they're very likely - but not guaranteed - to get the same snapshot. When I removed the GetTransactionSnapshot() call, the gap between the first GetTransactionSnapshot() and GetCatalogSnapshot() calls in the backend became much longer, making this more likely to happen. In any case, the missing "xactCompletionCount++" in ExpireAllKnownAssignedTransactionIds() and ExpireOldKnownAssignedTransactionIds() is a clear bug and explains this. I will prepare and commit patches to fix that. Thanks for the testing! -- Heikki Linnakangas Neon (https://neon.tech)
В списке pgsql-hackers по дате отправления: