Обсуждение: Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions

Поиск
Список
Период
Сортировка

Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions

От
Marco Nenciarini
Дата:
Il 26/06/15 15:43, marco.nenciarini@2ndquadrant.it ha scritto:
> The following bug has been logged on the website:
>
> Bug reference:      13473
> Logged by:          Marco Nenciarini
> Email address:      marco.nenciarini@2ndquadrant.it
> PostgreSQL version: 9.4.4
> Operating system:   all
> Description:
>
> = Symptoms
>
> Let's have a simple master -> standby setup, with hot_standby_feedback
> activated,
> if a backend on standby is holding the cluster xmin and the master runs a
> VACUUM FREEZE
> on the same database of the standby's backend, it will generate a conflict
> and the query
> running on standby will be canceled.
>
> = How to reproduce it
>
> Run the following operation on an idle cluster.
>
> 1) connect to the standby and simulate a long running query:
>
>    select pg_sleep(3600);
>
> 2) connect to the master and run the following script
>
>    create table t(id int primary key);
>    insert into t select generate_series(1, 10000);
>    vacuum freeze verbose t;
>    drop table t;
>
> 3) after 30 seconds the pg_sleep query on standby will be canceled.
>
> = Expected output
>
> The hot standby feedback should have prevented the query cancellation
>
> = Analysis
>
> Ive run postgres at DEBUG2 logging level, and I can confirm that the vacuum
> correctly see the OldestXmin propagated by the standby through the hot
> standby feedback.
> The issue is in heap_xlog_freeze function, which calls
> ResolveRecoveryConflictWithSnapshot as first thing, passing the cutoff_xid
> value as first argument.
> The cutoff_xid is the OldestXmin active when the vacuum, so it represents a
> running xid.
> The issue is that the function ResolveRecoveryConflictWithSnapshot expects
> as first argument of is latestRemovedXid, which represent the higher xid
> that has been actually removed, so there is an off-by-one error.
>
> I've been able to reproduce this issue for every version of postgres since
> 9.0 (9.0, 9.1, 9.2, 9.3, 9.4 and current master)
>
> = Proposed solution
>
> In the heap_xlog_freeze we need to subtract one to the value of cutoff_xid
> before passing it to ResolveRecoveryConflictWithSnapshot.
>
>
>

Attached a proposed patch that solves the issue.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it

Вложения

Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions

От
Jim Nasby
Дата:
On 6/26/15 8:50 AM, Marco Nenciarini wrote:
>> >In the heap_xlog_freeze we need to subtract one to the value of cutoff_xid
>> >before passing it to ResolveRecoveryConflictWithSnapshot.
>> >
>> >
>> >
> Attached a proposed patch that solves the issue.

FWIW, I've reviewed the usage and the new logic looks correct.

It'd actually be nice if we also logged the most recent XID that 
actually got frozen. That's the XID that Hot Standby would actually care 
about (not the freeze limit).
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions

От
Marco Nenciarini
Дата:
On 27/06/15 01:13, Jim Nasby wrote:
> On 6/26/15 8:50 AM, Marco Nenciarini wrote:
>>> >In the heap_xlog_freeze we need to subtract one to the value of
>>> cutoff_xid
>>> >before passing it to ResolveRecoveryConflictWithSnapshot.
>>> >
>>> >
>>> >
>> Attached a proposed patch that solves the issue.
>

I have hit the bug again, as it has been fixed only from 9.5+

The procedure to reproduce it sent in the original post is not fully
accurate, below there is one that always works:

Run the following operation on an idle cluster.

1) connect to the master and run the following script

   create table t(id int primary key);
   insert into t select generate_series(1, 10000);

2) connect to the standby and simulate a long running query:

   select pg_sleep(3600);

3) on the master and run the following commands:
   vacuum freeze verbose t;
   drop table t;

4) after 30 seconds the pg_sleep query on standby will be canceled.

Attached there is a patch that apply on every version that misses the
fix (9.0, 9.1, 9.2, 9.3, 9.4)

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it

Вложения

Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions

От
Alvaro Herrera
Дата:
Marco Nenciarini wrote:

> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index eb8eada..434880a 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -4764,7 +4764,13 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
>       * consider the frozen xids as running.
>       */
>      if (InHotStandby)
> -        ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node);
> +    {
> +        TransactionId latestRemovedXid = cutoff_xid;
> +
> +        TransactionIdRetreat(latestRemovedXid);
> +
> +        ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
> +    }
>  
>      /* If we have a full-page image, restore it and we're done */
>      if (record->xl_info & XLR_BKP_BLOCK(0))

Actually, in 9.3 there's a heap_xlog_freeze routine and a separate
heap_xlog_freeze_page routine; see commit 8e9a16ab8f7f.  Applying this
patch to that branch leaves heap_xlog_freeze itself unmodified, and only
heap_xlog_freeze_page is updated.  I think this is pretty much harmless,
as the older routine would only be called by a standby running a minor
version earlier than that commit, which has its own set of much more
serious bugs.  But I think the safest route is to patch them both, so
I'm going to do that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services