Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
От | Andres Freund |
---|---|
Тема | Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum |
Дата | |
Msg-id | 20220204041935.gf4uwbdxddq6rffh@alap3.anarazel.de обсуждение исходный текст |
Ответ на | Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum (Tom Lane <tgl@sss.pgh.pa.us>) |
Ответы |
Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
|
Список | pgsql-bugs |
Hi, On 2022-02-03 15:54:28 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > >>> I'm just struggling with / procrastinating on the commit message, tbh. The > >>> whole issue is kinda complicated to explain... :/ > > > After struggling some more, I *finally* pushed the fix and the new assertions. > > I'm writing release notes and wondering what I can tell users about > how to detect or recover from this bug. Is a REINDEX sufficient, > or is the presence of the bogus redirect item going to cause > persistent problems? Good questions. It's hard to answer whether there's any danger after a REINDEX. Afaics the build scan would just pick the "lower offset" version of the root pointer. Which should be fine. It's possible there could be trouble down the line, e.g. heap pruning doing something weird once starting in a corrupted state, that then leads REINDEX to do something bogus. The simple cases look OK, because a second visit/action by heap_prune_chain for one tid from two different root pointers would see ->marked[offnum] as true. It gets more complicated once multiple intermediary row versions are involved, because the intermediary row versions won't be in ->marked if an entire chain is pruned. But afaict that should still end up looking like a hot chain ending in an aborted tuple or such. > If the latter, can we provide some amelioration? > "Your data is broken and there is nothing you can do about it" > is not nice to be writing in release notes. We've had quite a few bugs around broken HOT chains (which this is an instance of), I don't think we've provided a useful recipe for any of them :(. We now have heap amcheck, but it unfortunately doesn't detect most HOT corruption (redirect pointing to unused is the only case I think). One ameliorating factor is that, as far as I can tell, the window for this bug is pretty narrow (see sequence in [1] for details). Without adding sleeps it's hard to cause a snapshot computation to happen between vacuum_set_xid_limits() and lazy_scan_heap(). Cache invalidation processing needs to trigger GetSnapshotData() below vac_open_indexes() (which in turn requires a catalog invalidation to be received, followed by use of the catalog snapshot), and the computed RecentXmin needs to change. Except that it's not trivial to get right, I could see it being worthwhile to add verification of hot chains to amcheck, and backpatch that to 14. Greetings, Andres Freund [1] https://www.postgresql.org/message-id/20211118071941.e4far2w5vufnahun%40alap3.anarazel.de
В списке pgsql-bugs по дате отправления: