Обсуждение: Concurrency bug with vacuum full (cluster) and toast
Hi all, I've discovered bug, when vacuum full fails with error, because it couldn't find toast chunks deleted by itself. That happens because cluster_rel() sets OldestXmin, but toast accesses gets snapshot later and independently. That causes heap_page_prune_opt() to clean chunks, which rebuild_relation() expects to exist. This bug very rarely happens on busy systems which actively update toast values. But I found way to reliably reproduce it using debugger. *Setup* CREATE FUNCTION random_string(seed integer, length integer) RETURNS text AS $$ SELECT substr( string_agg( substr( encode( decode( md5(seed::text || '-' || i::text), 'hex'), 'base64'), 1, 21), ''), 1, length) FROM generate_series(1, (length + 20) / 21) i; $$ LANGUAGE SQL; CREATE TABLE test (val text); INSERT INTO test (random_string(1,100000)); *Reproduction steps* s1-s3 are three parallel PostgreSQL sessions s3lldb is lldb connected to s1 At first s1 acquires snapshot and holds it. s1# begin transaction isolation level repeatable read; s1# select 1; Then s2 makes multiple updates of our toasted value. s2# update test set val = random_string(2,100000); s2# update test set val = random_string(3,100000); s2# update test set val = random_string(4,100000); s2# update test set val = random_string(5,100000); s2# update test set val = random_string(6,100000); s2# update test set val = random_string(7,100000); Then s3 starting vacuum full stopping on vacuum_set_xid_limits(). s3lldb# b vacuum_set_xid_limits s3# vacuum full test; We pass vacuum_set_xid_limits() making sure old tuple versions made by s2 would be recently dead for vacuum full. s3lldb# finish Then s1 releases snapshot. Then heap_page_prune_opt() called from toast accessed would cleanup toast chunks, which vacuum full expects to be recently dead. s1# commit; Finally, we continue our vacuum full and get error! s3lldb# continue s3# ERROR: unexpected chunk number 50 (expected 2) for toast value 16429 in pg_toast_16387 Attached patch contains dirty fix of this bug, which just prevents heap_page_prune_opt() from clean tuples, when it's called from rebuild_relation(). Actually, it's not something I'm proposing to commit or even review, it might be just some start point for thoughts. Any ideas? ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Вложения
On Mon, Mar 18, 2019 at 12:53 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > I've discovered bug, when vacuum full fails with error, because it > couldn't find toast chunks deleted by itself. That happens because > cluster_rel() sets OldestXmin, but toast accesses gets snapshot later > and independently. That causes heap_page_prune_opt() to clean chunks, > which rebuild_relation() expects to exist. This bug very rarely > happens on busy systems which actively update toast values. But I > found way to reliably reproduce it using debugger. Boy, I really feel like we've talked about this before. These are somewhat-related discussions, but none of them are exactly the same thing: http://postgr.es/m/1335.1304187758@sss.pgh.pa.us http://postgr.es/m/20362.1359747327@sss.pgh.pa.us http://postgr.es/m/87in8nec96.fsf@news-spur.riddles.org.uk I don't know whether we've actually talked about this precise problem before and I just can't find the thread, or whether I'm confusing what you've found here with some closely-related issue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 19, 2019 at 6:48 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 18, 2019 at 12:53 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > I've discovered bug, when vacuum full fails with error, because it > > couldn't find toast chunks deleted by itself. That happens because > > cluster_rel() sets OldestXmin, but toast accesses gets snapshot later > > and independently. That causes heap_page_prune_opt() to clean chunks, > > which rebuild_relation() expects to exist. This bug very rarely > > happens on busy systems which actively update toast values. But I > > found way to reliably reproduce it using debugger. > > Boy, I really feel like we've talked about this before. These are > somewhat-related discussions, but none of them are exactly the same > thing: > > http://postgr.es/m/1335.1304187758@sss.pgh.pa.us > http://postgr.es/m/20362.1359747327@sss.pgh.pa.us > http://postgr.es/m/87in8nec96.fsf@news-spur.riddles.org.uk > > I don't know whether we've actually talked about this precise problem > before and I just can't find the thread, or whether I'm confusing what > you've found here with some closely-related issue. Thank you for pointing, but none of the threads you pointed describe this exact problem. Now I see this bug have a set of cute siblings :) ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Tue, Mar 19, 2019 at 1:37 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Thank you for pointing, but none of the threads you pointed describe > this exact problem. Now I see this bug have a set of cute siblings :) Yeah. I really thought this precise issue -- the interlocking between the VACUUM of the main table and the VACUUM of the TOAST table -- had been discussed somewhere before. But I couldn't find that discussion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On Fri, Mar 22, 2019 at 02:27:07PM -0400, Robert Haas wrote: > On Tue, Mar 19, 2019 at 1:37 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > Thank you for pointing, but none of the threads you pointed describe > > this exact problem. Now I see this bug have a set of cute siblings :) > > Yeah. I really thought this precise issue -- the interlocking between > the VACUUM of the main table and the VACUUM of the TOAST table -- had > been discussed somewhere before. But I couldn't find that discussion. That also describes the longstanding issue with pg_statistic / pg_toast_2619, no ? I think that's maybe what Robert is remembering, and searching for pg_toast_2619 gives a good number of results (including my own problem report). Is this an "Opened Item" ?
Greetings, * Justin Pryzby (pryzby@telsasoft.com) wrote: > On Fri, Mar 22, 2019 at 02:27:07PM -0400, Robert Haas wrote: > > On Tue, Mar 19, 2019 at 1:37 PM Alexander Korotkov > > <a.korotkov@postgrespro.ru> wrote: > > > Thank you for pointing, but none of the threads you pointed describe > > > this exact problem. Now I see this bug have a set of cute siblings :) > > > > Yeah. I really thought this precise issue -- the interlocking between > > the VACUUM of the main table and the VACUUM of the TOAST table -- had > > been discussed somewhere before. But I couldn't find that discussion. > > That also describes the longstanding issue with pg_statistic / pg_toast_2619, > no ? > > I think that's maybe what Robert is remembering, and searching for > pg_toast_2619 gives a good number of results (including my own problem report). > > Is this an "Opened Item" ? If you're referring to the v12 open items list, then, no, I wouldn't think it would be as it's not a new issue (unless I've misunderstood). Only regressions from prior versions are appropriate for the v12 open items list, long-standing bugs/issues should be addressed and fixed, of course, but those would be fixed and then back-patched. Thanks! Stephen
Вложения
On Wed, Apr 03, 2019 at 11:26:20AM -0400, Stephen Frost wrote: > If you're referring to the v12 open items list, then, no, I wouldn't > think it would be as it's not a new issue (unless I've misunderstood). > Only regressions from prior versions are appropriate for the v12 open > items list, long-standing bugs/issues should be addressed and fixed, of > course, but those would be fixed and then back-patched. Please no open items which do not apply directly and only to v12. There is a section on the page for older bugs however, which could prove to be useful for this case (items listed in this section do not have any impact on the release normally): https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Older_Bugs -- Michael