Re: Prepared transaction releasing locks before deregistering its GID

Поиск
Список
Период
Сортировка
От Oleksii Kliukin
Тема Re: Prepared transaction releasing locks before deregistering its GID
Дата
Msg-id 1ECFC537-A8B5-4087-9469-7190630A3693@hintbits.com
обсуждение исходный текст
Ответ на Re: Prepared transaction releasing locks before deregistering its GID  (Michael Paquier <michael@paquier.xyz>)
Ответы Re: Prepared transaction releasing locks before deregistering its GID  (Michael Paquier <michael@paquier.xyz>)
Re: Prepared transaction releasing locks before deregistering its GID  (Oleksii Kliukin <alexk@hintbits.com>)
Список pgsql-hackers
Hi,

Michael Paquier <michael@paquier.xyz> wrote:

> On Mon, Feb 18, 2019 at 05:05:13PM +0100, Oleksii Kliukin wrote:
>> That looks like a race condition to me. What happens is that another
>> transaction with the name identical to the running one can start and proceed
>> to the prepare phase while the original one commits, failing at last instead
>> of waiting for the original one to finish.
>
> It took me 50 clients and a bit more than 20 seconds, but I have been
> able to reproduce the problem with one error.  Thanks for the
> reproducer.  This is indeed a race condition with 2PC.

Yes, as a race condition I could consistently reproduce it on one system,
but not on another. On my macbook it never manifested in an error; however,
on one occasion I saw sessions waiting on a lock and an orphaned record in
pg_prepared_xacts. I couldn’t find any evidences in the logs or elsewhere
what has happened to that session, so I just assumed it was, perhaps, killed
by the test cleanup.

>> By looking at the source code of FinishPreparedTransaction() I can see the
>> RemoveGXact() call, which removes the prepared transaction from
>> TwoPhaseState->prepXacts. It is placed at the very end of the function,
>> after the post-commit callbacks that clear out the locks held by the
>> transaction. Those callbacks are not guarded by the TwoPhaseStateLock,
>> resulting in a possibility for a concurrent session to proceed will
>> MarkAsPreparing after acquiring the locks released by them.
>
> Hm, I see.  Taking a breakpoint just after ProcessRecords() or putting
> a sleep there makes the issue plain.  The same issue can happen with
> predicate locks.

I did verify this on my part by putting a TwoPhaseStateLock right before the
line saying "We assume it's safe to do this without taking
TwoPhaseStateLock” and observing that the errors has stopped over multiple
runs.

>> I couldn’t find any documentation on the expected outcome in cases like
>> this, so I assume it might not be a bug, but an undocumented behavior.
>
> If you run two transactions in parallel using your script, the second
> transaction would wait at LOCK time until the first transaction
> releases its locks with the COMMIT PREPARED.

That is the desired outcome, right?

>
>> Should I go about and produce a patch to put a note in the description of
>> commit prepared, or is there any interest in changing the behavior to avoid
>> such conflicts altogether (perhaps by holding the lock throughout the
>> cleanup phase)?
>
> That's a bug, let's fix it.  I agree with your suggestion that we had
> better not release the 2PC lock using the callbacks for COMMIT
> PREPARED or ROLLBACK PREPARED until the shared memory state is
> cleared.  At the same time, I think that we should be smarter in the
> ordering of the actions: we care about predicate locks here, but not
> about the on-disk file removal and the stat counters.  One trick is
> that the post-commit callback calls TwoPhaseGetDummyProc() which would
> try to take TwoPhaseStateLock which needs to be applied so we need
> some extra logic to not take a lock in this case.  From what I can see
> this is older than 9.4 as the removal of the GXACT entry in shared
> memory and the post-commit hooks are out of sync for a long time :(
>
> Attached is a patch that fixes the problem for me.  Please note the
> safety net in TwoPhaseGetGXact().

The approach looks good to me. Surprisingly, I saw no stalled backends
because of the double acquisition of lock at TwoPhaseGetGXact once I put a
simple TwoPhaseStateLock right before the "gxact->valid = false” line; I
will test your patch and post the outcome.

Regards,
Oleksii Kliukin



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: Prepared transaction releasing locks before deregistering its GID
Следующее
От: Amit Langote
Дата:
Сообщение: Re: speeding up planning with partitions