Обсуждение: Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions
Il 26/06/15 15:43, marco.nenciarini@2ndquadrant.it ha scritto: > The following bug has been logged on the website: > > Bug reference: 13473 > Logged by: Marco Nenciarini > Email address: marco.nenciarini@2ndquadrant.it > PostgreSQL version: 9.4.4 > Operating system: all > Description: > > = Symptoms > > Let's have a simple master -> standby setup, with hot_standby_feedback > activated, > if a backend on standby is holding the cluster xmin and the master runs a > VACUUM FREEZE > on the same database of the standby's backend, it will generate a conflict > and the query > running on standby will be canceled. > > = How to reproduce it > > Run the following operation on an idle cluster. > > 1) connect to the standby and simulate a long running query: > > select pg_sleep(3600); > > 2) connect to the master and run the following script > > create table t(id int primary key); > insert into t select generate_series(1, 10000); > vacuum freeze verbose t; > drop table t; > > 3) after 30 seconds the pg_sleep query on standby will be canceled. > > = Expected output > > The hot standby feedback should have prevented the query cancellation > > = Analysis > > Ive run postgres at DEBUG2 logging level, and I can confirm that the vacuum > correctly see the OldestXmin propagated by the standby through the hot > standby feedback. > The issue is in heap_xlog_freeze function, which calls > ResolveRecoveryConflictWithSnapshot as first thing, passing the cutoff_xid > value as first argument. > The cutoff_xid is the OldestXmin active when the vacuum, so it represents a > running xid. > The issue is that the function ResolveRecoveryConflictWithSnapshot expects > as first argument of is latestRemovedXid, which represent the higher xid > that has been actually removed, so there is an off-by-one error. > > I've been able to reproduce this issue for every version of postgres since > 9.0 (9.0, 9.1, 9.2, 9.3, 9.4 and current master) > > = Proposed solution > > In the heap_xlog_freeze we need to subtract one to the value of cutoff_xid > before passing it to ResolveRecoveryConflictWithSnapshot. > > > Attached a proposed patch that solves the issue. Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it
Вложения
On 6/26/15 8:50 AM, Marco Nenciarini wrote: >> >In the heap_xlog_freeze we need to subtract one to the value of cutoff_xid >> >before passing it to ResolveRecoveryConflictWithSnapshot. >> > >> > >> > > Attached a proposed patch that solves the issue. FWIW, I've reviewed the usage and the new logic looks correct. It'd actually be nice if we also logged the most recent XID that actually got frozen. That's the XID that Hot Standby would actually care about (not the freeze limit). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
On 27/06/15 01:13, Jim Nasby wrote: > On 6/26/15 8:50 AM, Marco Nenciarini wrote: >>> >In the heap_xlog_freeze we need to subtract one to the value of >>> cutoff_xid >>> >before passing it to ResolveRecoveryConflictWithSnapshot. >>> > >>> > >>> > >> Attached a proposed patch that solves the issue. > I have hit the bug again, as it has been fixed only from 9.5+ The procedure to reproduce it sent in the original post is not fully accurate, below there is one that always works: Run the following operation on an idle cluster. 1) connect to the master and run the following script create table t(id int primary key); insert into t select generate_series(1, 10000); 2) connect to the standby and simulate a long running query: select pg_sleep(3600); 3) on the master and run the following commands: vacuum freeze verbose t; drop table t; 4) after 30 seconds the pg_sleep query on standby will be canceled. Attached there is a patch that apply on every version that misses the fix (9.0, 9.1, 9.2, 9.3, 9.4) Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it
Вложения
Marco Nenciarini wrote: > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > index eb8eada..434880a 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -4764,7 +4764,13 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record) > * consider the frozen xids as running. > */ > if (InHotStandby) > - ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node); > + { > + TransactionId latestRemovedXid = cutoff_xid; > + > + TransactionIdRetreat(latestRemovedXid); > + > + ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node); > + } > > /* If we have a full-page image, restore it and we're done */ > if (record->xl_info & XLR_BKP_BLOCK(0)) Actually, in 9.3 there's a heap_xlog_freeze routine and a separate heap_xlog_freeze_page routine; see commit 8e9a16ab8f7f. Applying this patch to that branch leaves heap_xlog_freeze itself unmodified, and only heap_xlog_freeze_page is updated. I think this is pretty much harmless, as the older routine would only be called by a standby running a minor version earlier than that commit, which has its own set of much more serious bugs. But I think the safest route is to patch them both, so I'm going to do that. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services