Обсуждение: [HACKERS] Potential hot-standby bug around xacts committed but inxl_running_xacts

Поиск
Список
Период
Сортировка

[HACKERS] Potential hot-standby bug around xacts committed but inxl_running_xacts

От
Andres Freund
Дата:
Hi,

The thread below http://archives.postgresql.org/message-id/f37e975c-908f-858e-707f-058d3b1eb214%402ndquadrant.com
describes an issue in logical decoding that arises because
xl_running_xacts' contents aren't necessarily coherent with the contents
of the WAL, because RecordTransactionCommit() / RecordTransactionAbort()
don't have any interlock against the procarray.  That means
xl_running_xacts can contain transactions assumed to be running, that
already have their commit/abort records WAL logged.

I think that's not just problematic in logical decoding, but also
Hot-Standby.  Consider the following:

ProcArrayApplyRecoveryInfo() gets an xl_running_xacts record that's not
suboverflowed, and thus will change to STANDBY_SNAPSHOT_READY.  In that
case it'll populate the KnownAssignedXids machinery using
KnownAssignedXidsAdd().

Once STANDBY_SNAPSHOT_READY, CheckRecoveryConsistency() will signal
postmaster to allow connections.

For HS, a snapshot will be built by GetSnapshotData() using
KnownAssignedXidsGetAndSetXmin().  That in turn uses the transactions
currently known to be running, to populate the snapshot.

Now, if transactions have committed before (in the "earlier LSN" sense)
the xl_running_xacts record, ExpireTreeKnownAssignedTransactionIds() in
xact_redo_commit() will already have run.  Which means we'll assume
already committed transactions are still running.  In other words, the
snapshot is corrupted.

Luckily this'll self-correct over time, fixed by
ExpireOldKnownAssignedTransactionIds().


Am I missing something that protects against the above scenario?


Greetings,

Andres Freund



Re: [HACKERS] Potential hot-standby bug around xacts committed but in xl_running_xacts

От
Simon Riggs
Дата:
On 1 May 2017 at 22:38, Andres Freund <andres@anarazel.de> wrote:

> The thread below http://archives.postgresql.org/message-id/f37e975c-908f-858e-707f-058d3b1eb214%402ndquadrant.com
> describes an issue in logical decoding that arises because
> xl_running_xacts' contents aren't necessarily coherent with the contents
> of the WAL, because RecordTransactionCommit() / RecordTransactionAbort()
> don't have any interlock against the procarray.  That means
> xl_running_xacts can contain transactions assumed to be running, that
> already have their commit/abort records WAL logged.

That is known/by design.

> I think that's not just problematic in logical decoding, but also
> Hot-Standby.  Consider the following:
>
> ProcArrayApplyRecoveryInfo() gets an xl_running_xacts record that's not
> suboverflowed, and thus will change to STANDBY_SNAPSHOT_READY.

Yes

> In that
> case it'll populate the KnownAssignedXids machinery using
> KnownAssignedXidsAdd().

No, because of this section of code in ProcArrayApplyRecoveryInfo()

/** The running-xacts snapshot can contain xids that were still visible* in the procarray when the snapshot was taken,
butwere already* WAL-logged as completed. They're not running anymore, so ignore* them.*/if
(TransactionIdDidCommit(xid)|| TransactionIdDidAbort(xid))        continue;
 

> Am I missing something that protects against the above scenario?

It does seem so, yes. The locking issues in Hot Standby are complex,
but they do seem to be correct, thanks for reviewing them.

This is documented in multiple locations, including what I thought was
your own comment in LogStandbySnapshot().

What I suggest is that with logical decoding in mind we do this
1. Inject a new record XLOG_SNAPSHOT_START at the start of
LogStandbySnapshot(). We start logical decoding from there.
2. Record any transactions that end
3. Now the full XLOG_RUNNING_XACTS record arrives. We apply all xacts
that are seen as running, minus any ended between 1 and 3

This avoids the problems for the race but without holding locks while
we log XLOG_RUNNING_XACTS, something that was considered painful for
Hot Standby.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Potential hot-standby bug around xacts committed but in xl_running_xacts

От
Craig Ringer
Дата:
On 2 May 2017 at 13:12, Simon Riggs <simon@2ndquadrant.com> wrote:

> What I suggest is that with logical decoding in mind we do this
> 1. Inject a new record XLOG_SNAPSHOT_START at the start of
> LogStandbySnapshot(). We start logical decoding from there.
> 2. Record any transactions that end
> 3. Now the full XLOG_RUNNING_XACTS record arrives. We apply all xacts
> that are seen as running, minus any ended between 1 and 3
>
> This avoids the problems for the race but without holding locks while
> we log XLOG_RUNNING_XACTS, something that was considered painful for
> Hot Standby.

Sounds like a sensible solution to me. It avoids the need for a rather
undesirable interlock between xlog and shmem commit.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] Re: Potential hot-standby bug around xacts committed but inxl_running_xacts

От
Andres Freund
Дата:
On 2017-05-02 07:12:41 +0200, Simon Riggs wrote:
> /*
>  * The running-xacts snapshot can contain xids that were still visible
>  * in the procarray when the snapshot was taken, but were already
>  * WAL-logged as completed. They're not running anymore, so ignore
>  * them.
>  */
>  if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
>          continue;

Ah, right.   Phew ;)


> What I suggest is that with logical decoding in mind we do this
> 1. Inject a new record XLOG_SNAPSHOT_START at the start of
> LogStandbySnapshot(). We start logical decoding from there.
> 2. Record any transactions that end
> 3. Now the full XLOG_RUNNING_XACTS record arrives. We apply all xacts
> that are seen as running, minus any ended between 1 and 3

> This avoids the problems for the race but without holding locks while
> we log XLOG_RUNNING_XACTS, something that was considered painful for
> Hot Standby.

I don't think that really solves it, because other transactions could
just be stuck just after the XLogInsert() forever.  And it'd have the
issue of having to backpatch a new record.  I'm working on an
alternative approach, let's hope that that works out.

- Andres



Re: [HACKERS] Potential hot-standby bug around xacts committed but in xl_running_xacts

От
Simon Riggs
Дата:
On 2 May 2017 at 18:06, Andres Freund <andres@anarazel.de> wrote:

>> What I suggest is that with logical decoding in mind we do this
>> 1. Inject a new record XLOG_SNAPSHOT_START at the start of
>> LogStandbySnapshot(). We start logical decoding from there.
>> 2. Record any transactions that end
>> 3. Now the full XLOG_RUNNING_XACTS record arrives. We apply all xacts
>> that are seen as running, minus any ended between 1 and 3
>
>> This avoids the problems for the race but without holding locks while
>> we log XLOG_RUNNING_XACTS, something that was considered painful for
>> Hot Standby.
>
> I don't think that really solves it, because other transactions could
> just be stuck just after the XLogInsert() forever.  And it'd have the
> issue of having to backpatch a new record.  I'm working on an
> alternative approach, let's hope that that works out.

Backpatchable approach.

1. Record CurrentInsertPosition()
2. LogStandbySnapshot()
3. Insert custom logical wal message containing currentinsertposition
and LSN of (2)

When we decode
1. Read standby snapshot
2. Scan forwards until we see message written by step (3) above, which
is identifiable because it contains LSN of snapshot.
3. Read initial LSN from message then re-scan from LSN until
xl_running_xacts message collecting any commits/aborts and removing
them from snapshot.

No new WAL messages, no locking problem, race condition handled.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services