RE: Fix 035_standby_logical_decoding.pl race conditions

From	Hayato Kuroda (Fujitsu)
Subject	RE: Fix 035_standby_logical_decoding.pl race conditions
Date	March 21, 15:28:10
Msg-id	OSCPR01MB14966852B0E4CF07D42774695F5DB2@OSCPR01MB14966.jpnprd01.prod.outlook.com
In reply to	Re: Fix 035_standby_logical_decoding.pl race conditions (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
Responses	Re: Fix 035_standby_logical_decoding.pl race conditions
List	pgsql-hackers

Dear Bertrand,

I'm also working on the thread to resolve the random failure.

> Yes, that's also my understanding. It's also easy to "simulate" by adding
> a checkpoint on the primary and a long enough sleep after we launched our sql in
> wait_until_vacuum_can_remove().

Thanks for letting me know. For me, it could be reproduced with only the sleep().

> > So, if the above is correct, the reason for generating extra
> > xl_running_xacts on primary is Vacuum followed by Insert on primary
> > via below part of test:
> > $node_primary->safe_psql(
> > 'testdb', qq[VACUUM $vac_option verbose $to_vac;
> >   INSERT INTO flush_wal DEFAULT VALUES;]);
> 
> I'm not sure, I think a xl_running_xacts could also be generated (for example by
> the checkpointer) before the vacuum (should the system be slow enough).

I think you are right. When I added a `CHECKPOINT` and a sleep after the user SQLs,
I got the ordering below. This means that a RUNNING_XACTS record can be generated
before the prune triggered by the vacuum.
```
...
lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 latestCompletedXid 765 oldestRunningXid 766
...
lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS snapshotConflictHorizon: 765,...
...
```
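
To check for this ordering mechanically, something like the following could be
run over the pg_waldump output. It is only a rough sketch in Python, not part of
the test: the helper names (`parse_records`, `running_xacts_before_prune`) are
made up for illustration, and the sample input is the two records from the
excerpt above with the elided trailing fields left out.

```python
import re

# The two records from the excerpt above (elided trailing fields omitted).
WAL_EXCERPT = """\
lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 latestCompletedXid 765 oldestRunningXid 766
lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS snapshotConflictHorizon: 765
"""

# Captures the two halves of the LSN, the record type, and the rest of the line.
RECORD_RE = re.compile(r"lsn: ([0-9A-Fa-f]+)/([0-9A-Fa-f]+), .*desc: (\w+)(.*)")


def parse_lsn(hi, lo):
    """Convert the 'X/Y' hexadecimal halves of an LSN into one integer."""
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_records(text):
    """Return (lsn, record_type, details) for every line that looks like a record."""
    records = []
    for line in text.splitlines():
        m = RECORD_RE.search(line)
        if m:
            lsn = parse_lsn(m.group(1), m.group(2))
            records.append((lsn, m.group(3), m.group(4).strip()))
    return records


def running_xacts_before_prune(records):
    """True if a RUNNING_XACTS record has a lower LSN than the first PRUNE_* record."""
    prune_lsns = [lsn for lsn, kind, _ in records if kind.startswith("PRUNE")]
    if not prune_lsns:
        return False
    first_prune = min(prune_lsns)
    return any(kind == "RUNNING_XACTS" and lsn < first_prune
               for lsn, kind, _ in records)


records = parse_records(WAL_EXCERPT)
print(running_xacts_before_prune(records))
```

The check only compares LSNs; it does not try to decide whether the slot would
actually be invalidated.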

> I'm not sure, as I think a xl_running_xacts could still be generated after
> we execute "our sql" meaning:
> 
> "
>     $node_primary->safe_psql('testdb', qq[$sql]);
> "
> 
> and before we launch the new DML. In that case I guess the issue could still
> happen.
> 
> OTOH If we create the new DML "before" we launch "our sql" then the test
> would also fail for both active and inactive slots because that would not
> invalidate any slots.
> 
> I did observe the above with the attached changes (just changing the PREPARE
> TRANSACTION location).

I've also tried the idea of keeping a transaction open via background_psql(),
but I got the same result. The test can still fail when a RUNNING_XACTS record is
generated before the transaction starts.

> I agree, but I'm not sure it's doable as it looks to me that we should prevent
> the catalog xmin to advance past the conflict point while still
> generating a conflict point. Will try to give it another thought.

One primitive idea I had was to stop the walsender/pg_recvlogical process for a while.
Sending SIGSTOP to pg_recvlogical may do the trick, but ISTM it cannot be used on Windows.
See 019_replslot_limit.pl.
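
Just to illustrate the mechanism, here is a minimal sketch in Python (not the
Perl used by the TAP tests, and not taken from 019_replslot_limit.pl). The child
process is a hypothetical stand-in for pg_recvlogical, and `pause`/`resume` are
names made up for this example. SIGSTOP freezes the process until SIGCONT is
sent; the sketch skips everything on Windows, where these signals are not
available.

```python
import os
import signal
import subprocess
import sys
import time


def pause(pid):
    """Freeze the process; it stays alive but makes no progress."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let a previously stopped process continue running."""
    os.kill(pid, signal.SIGCONT)


if sys.platform != "win32":
    # Hypothetical stand-in for pg_recvlogical: a child that would exit after 0.5s.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(0.5)"])

    pause(child.pid)
    time.sleep(1.0)  # longer than the child's own sleep
    # The child has not exited, because it was frozen the whole time.
    still_running_while_paused = child.poll() is None

    resume(child.pid)
    exit_code = child.wait(timeout=10)
    print(still_running_while_paused, exit_code)
```

In a test, the window between pause and resume would be where the primary
generates the conflicting record, so that the receiver cannot consume changes
in the meantime.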

Best regards,
Hayato Kuroda
FUJITSU LIMITED
