Re: lastOverflowedXid does not handle transaction ID wraparound

From Nikolay Samokhvalov
Subject Re: lastOverflowedXid does not handle transaction ID wraparound
Msg-id CANNMO+Lu0_pW1D1gdz4qRB0Sr7q-R_ZRjFsQ89Ti8EXD2FopQg@mail.gmail.com
In reply to Re: lastOverflowedXid does not handle transaction ID wraparound  (Nikolay Samokhvalov <samokhvalov@gmail.com>)
Replies Re: lastOverflowedXid does not handle transaction ID wraparound  (Nikolay Samokhvalov <nikolay@samokhvalov.com>)
List pgsql-hackers


On Mon, Oct 25, 2021 at 11:41 AM Nikolay Samokhvalov <samokhvalov@gmail.com> wrote:
On Thu, Oct 21, 2021 at 07:21 Stan Hu <stanhu@gmail.com> wrote:
On Wed, Oct 20, 2021 at 9:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> lastOverflowedXid is the smallest subxid that possibly exists but
> possibly not known to the standby. So if all top-level transactions
> older than lastOverflowedXid end, that means that all the
> subtransactions in doubt are known to have been ended.

Thanks for the patch! I verified that it appears to reset
lastOverflowedXid properly.
... 
Any ideas in the direction of observability?

Perhaps, anything additional should be considered separately.

The behavior discussed here looks like a bug.

I have also tested the patch. It works fully as expected; details of the testing are below.

I think this is a serious bug that hits heavily loaded Postgres setups with hot standbys,
and I propose fixing it in all supported major versions ASAP, since the fix looks simple.

In heavily loaded systems (10k+ TPS) where subtransactions are used, standbys
may experience huge performance degradation [1]. This is what happened
recently with GitLab [2]. While a full solution to that problem is something more complex, probably
requiring changes in SLRU [3], the problem discussed here definitely feels like a serious bug:
even if we fully get rid of subtransactions, the 32-bit lastOverflowedXid is not reset, so in a new
XID epoch standbys start experiencing SubtransControlLock/SubtransSLRU waits again,
without any subtransactions in use. This problem is extremely difficult to diagnose,
and it may make standbys completely unresponsive while a long-running transaction lasts on the primary
("long" here may be a matter of minutes or even dozens of seconds, depending on the
TPS level). It is especially hard to diagnose in PG 12 or older, because those versions don't have
pg_stat_slru yet, so one cannot easily notice Subtrans reads.

The only current solution to this problem is to restart standby Postgres.

How I tested the patch. First, I reproduced the problem:
- current 15devel Postgres, installed on 2 x c5ad.2xlarge on AWS (8 vCPUs, 16 GiB), working as
primary + standby
- follow the steps described in [3] to initiate SubtransSLRU on the standby
- at some point, stop using SAVEPOINTs on the primary - use regular UPDATEs instead, wait.

Using the following, observe procArray->lastOverflowedXid:

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bd3c7a47fe21949ba63da26f0d692b2ee618f885..ccf3274344d7ba52a6f28a10b08dbfc310cf97e9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2428,6 +2428,9 @@ GetSnapshotData(Snapshot snapshot)
  subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
   xmax);
 
+        if (random() % 100000 == 0)
+                elog(WARNING, "procArray->lastOverflowedXid: %u", procArray->lastOverflowedXid);
+
  if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
  suboverflowed = true;
  }

Once we stop using SAVEPOINTs on the primary, the value of procArray->lastOverflowedXid stops
changing, as expected.

Without the patch applied, lastOverflowedXid remains constant forever, until the server is restarted.
And as I mentioned, we start experiencing SubtransSLRU waits and pg_subtrans reads.
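
To illustrate why a stale value matters, here is a minimal, self-contained C model of the check shown in the diff above. It is a sketch under stated assumptions, not PostgreSQL source: xid_precedes_or_equals is a simplified stand-in for TransactionIdPrecedesOrEquals, and snapshot_suboverflowed is a hypothetical helper name introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId      ((TransactionId) 0)
#define FirstNormalTransactionId  ((TransactionId) 3)

/*
 * Simplified model of a circular (modulo-2^32) "id1 <= id2" comparison.
 * Special XIDs (below FirstNormalTransactionId) are compared as plain
 * unsigned integers; normal XIDs are compared via the signed difference.
 */
static bool
xid_precedes_or_equals(TransactionId id1, TransactionId id2)
{
    if (id1 < FirstNormalTransactionId || id2 < FirstNormalTransactionId)
        return id1 <= id2;

    /* The signed 32-bit difference makes the comparison wrap around. */
    return (int32_t) (id1 - id2) <= 0;
}

/*
 * Hypothetical helper modelling the standby-side check: a snapshot is
 * treated as suboverflowed when its xmin is at or before the last
 * overflowed XID that the standby remembers.
 */
static bool
snapshot_suboverflowed(TransactionId xmin, TransactionId last_overflowed_xid)
{
    assert(xmin >= FirstNormalTransactionId);   /* xmin is a normal XID */
    return xid_precedes_or_equals(xmin, last_overflowed_xid);
}
```

In this model, with last_overflowed_xid stuck at 5000, an xmin of 6000 is not flagged. Once xmin has advanced by more than 2^31 XIDs, the circular comparison again places the stale value ahead of xmin and the snapshot is marked suboverflowed, even though no subtransactions are in use. With the value reset to InvalidTransactionId (0), a normal xmin is never flagged.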

With the patch, lastOverflowedXid is reset to 0, as expected, shortly after the ongoing "long"
transaction ends on the primary.

This solves the bug – we don't have SubtransSLRU on standby without actual use of subtransactions
on the primary.
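
The reset behaviour observed with the patch can be modelled roughly as follows. This is a sketch based on the rule quoted at the top of this message (once all top-level transactions older than lastOverflowedXid have ended, the value carries no information), not the patch itself. StandbyXidState and expire_old_known_xids are hypothetical names, and oldest_running_xid is assumed to be the oldest XID still running on the primary as known to the standby.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId      ((TransactionId) 0)
#define FirstNormalTransactionId  ((TransactionId) 3)

/* Hypothetical stand-in for the relevant part of the standby's shared state. */
typedef struct StandbyXidState
{
    TransactionId lastOverflowedXid;
} StandbyXidState;

/* Simplified circular "id1 < id2" comparison for 32-bit XIDs. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    if (id1 < FirstNormalTransactionId || id2 < FirstNormalTransactionId)
        return id1 < id2;

    return (int32_t) (id1 - id2) < 0;
}

/*
 * Assumed to be called when the standby learns that every transaction
 * older than oldest_running_xid has ended. If the remembered overflowed
 * XID is older than that boundary, forget it so that it cannot be
 * misinterpreted later, after the XID counter has wrapped around.
 */
static void
expire_old_known_xids(StandbyXidState *state, TransactionId oldest_running_xid)
{
    assert(state != NULL);

    if (state->lastOverflowedXid != InvalidTransactionId &&
        xid_precedes(state->lastOverflowedXid, oldest_running_xid))
        state->lastOverflowedXid = InvalidTransactionId;
}
```

While a transaction older than the remembered value is still running, the value is kept; once the oldest running XID moves past it, the value becomes 0, which matches what the WARNING output showed in the test above.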
