RE: Fix 035_standby_logical_decoding.pl race conditions
From | Hayato Kuroda (Fujitsu) |
---|---|
Subject | RE: Fix 035_standby_logical_decoding.pl race conditions |
Date | |
Msg-id | OSCPR01MB14966852B0E4CF07D42774695F5DB2@OSCPR01MB14966.jpnprd01.prod.outlook.com |
In reply to | Re: Fix 035_standby_logical_decoding.pl race conditions (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>) |
Responses | Re: Fix 035_standby_logical_decoding.pl race conditions |
List | pgsql-hackers |
Dear Bertrand,

I'm also working on the thread to resolve the random failure.

> Yes, that's also my understanding. It's also easy to "simulate" by adding
> a checkpoint on the primary and a long enough sleep after we launched our sql in
> wait_until_vacuum_can_remove().

Thanks for letting me know. For me, it could be reproduced with only the sleep().

> > So, if the above is correct, the reason for generating extra
> > xl_running_xacts on primary is Vacuum followed by Insert on primary
> > via below part of test:
> > $node_primary->safe_psql(
> > 'testdb', qq[VACUUM $vac_option verbose $to_vac;
> > INSERT INTO flush_wal DEFAULT VALUES;]);
>
> I'm not sure, I think a xl_running_xacts could also be generated (for example by
> the checkpointer) before the vacuum (should the system be slow enough).

I think you are right. When I added `CHECKPOINT` and a sleep after the user SQL, I got the ordering below. This means that the RUNNING_XACTS record was generated before the prune triggered by the vacuum.

```
...
lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 latestCompletedXid 765 oldestRunningXid 766
...
lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS snapshotConflictHorizon: 765,...
...
```

> I'm not sure, as I think a xl_running_xacts could still be generated after
> we execute "our sql" meaning:
>
> "
> $node_primary->safe_psql('testdb', qq[$sql]);
> "
>
> and before we launch the new DML. In that case I guess the issue could still
> happen.
>
> OTOH If we create the new DML "before" we launch "our sql" then the test
> would also fail for both active and inactive slots because that would not
> invalidate any slots.
>
> I did observe the above with the attached changes (just changing the PREPARE
> TRANSACTION location).

I've also tried the idea of keeping a transaction alive via background_psql(), but I got the same result. The test could fail when a RUNNING_XACTS record was generated before the transaction started.

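
For illustration only, the ordering check on the WAL excerpt in this message can be sketched in a few lines of Python. This is not part of the test suite. The helper names (`parse_lsn`, `first_lsn`, `running_xacts_precedes_prune`) are hypothetical, and the sample lines are the two records from the excerpt with the elided fields left out. It assumes each line carries `lsn: X/Y` and `desc: NAME` fields as shown there.

```python
import re

# Matches the "lsn: X/Y" and "desc: NAME" fields of a pg_waldump-style line.
RECORD_RE = re.compile(r"lsn: ([0-9A-Fa-f]+)/([0-9A-Fa-f]+),.*?desc: (\w+)")


def parse_lsn(hi, lo):
    """Combine the two hex halves of an LSN into one comparable integer."""
    return (int(hi, 16) << 32) | int(lo, 16)


def first_lsn(lines, prefix):
    """Return the LSN of the first record whose description starts with
    `prefix`, or None if there is no such record."""
    for line in lines:
        match = RECORD_RE.search(line)
        if match and match.group(3).startswith(prefix):
            return parse_lsn(match.group(1), match.group(2))
    return None


def running_xacts_precedes_prune(lines):
    """True if the first RUNNING_XACTS record has a lower LSN than the
    first PRUNE_* record, which is the problematic ordering."""
    running = first_lsn(lines, "RUNNING_XACTS")
    prune = first_lsn(lines, "PRUNE")
    if running is None or prune is None:
        return False
    return running < prune


# The two records from the excerpt (elided fields omitted).
wal_lines = [
    "lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS "
    "nextXid 766 latestCompletedXid 765 oldestRunningXid 766",
    "lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS "
    "snapshotConflictHorizon: 765",
]

print(running_xacts_precedes_prune(wal_lines))  # prints True
```

Here the RUNNING_XACTS record at 0/04025218 already reports oldestRunningXid 766, which is past the later conflict horizon of 765, so the check flags the ordering described above.
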
> I agree, but I'm not sure it's doable as it looks to me that we should prevent
> the catalog xmin to advance to advance past the conflict point while still
> generating a conflict point. Will try to give it another thought.

One primitive idea of mine is to stop the walsender/pg_recvlogical process for a while. Sending SIGSTOP to pg_recvlogical might do the trick, but ISTM it would not work on Windows. See 019_replslot_limit.pl.

Best regards,
Hayato Kuroda
FUJITSU LIMITED