Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
От | Bertrand Drouvot |
---|---|
Тема | Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed |
Дата | |
Msg-id | ZaACEPtVovKlRVUf@ip-10-97-1-34.eu-west-3.compute.internal обсуждение исходный текст |
Ответ на | Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed (Alexander Lakhin <exclusion@gmail.com>) |
Ответы |
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
|
Список | pgsql-hackers |
Hi, On Thu, Jan 11, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote: > Hi Bertrand, > > 10.01.2024 19:32, Bertrand Drouvot wrote: > > > > > is an absent message 'obsolete replication slot "row_removal_activeslot"' > > Looking at recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here: > > > > Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot(). > > > > In case of an active slot we first call ReportSlotInvalidation with the second > > parameter set to true (to emit the "terminating" message), then SIGTERM the active > > process and then (later) we should call the other ReportSlotInvalidation() > > call with the second parameter set to false (to emit the message that we don't > > see here). > > > > So it means InvalidatePossiblyObsoleteSlot() did not trigger the second ReportSlotInvalidation() > > call. > > I've found a way to reproduce the issue without slowing down a machine or > running multiple tests in parallel. It's enough for this to add a delay to > allow BackgroundWriterMain() to execute LogStandbySnapshot(): > @@ -692,2 +690,3 @@ > $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]); > +$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]); > $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]); > > With this delay, I get the failure immediately: > $ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C src/test/recovery > # +++ tap check in src/test/recovery +++ > t/035_standby_logical_decoding.pl .. 47/? > # Failed test 'activeslot slot invalidation is logged with on-access pruning' > # at t/035_standby_logical_decoding.pl line 227. Thanks a lot for the testing! So I think we have 2 issues here: 1) The one you're mentioning above related to the on-access pruning test: I think the engine behavior is right here and that the test is racy. I'm proposing to bypass the active slot invalidation check for this particular test ( as I don't see any "easy" way to take care of this race condition). The active slot invalidation is already well covered in the other tests anyway. I'm proposing the attached v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch for it. 2) The fact that sometime we're getting a termination message which is not followed by an obsolete one (like as discussed in [1]). For this one, I think that InvalidatePossiblyObsoleteSlot() is racy: In case of an active slot we proceed in 2 steps: - terminate the backend holding the slot - report the slot as obsolete This is racy because between the two we release the mutex on the slot, which means the effective_xmin and effective_catalog_xmin could advance during that time. I'm proposing the attached v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch for it. Would it be possible to re-launch your repro (the slow one, not the pg_sleep() one) with bot patch applied and see how it goes? (Please note that v4 replaces v3 that you're already using in your tests). If it helps, I'll propose v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch into a dedicated hackers thread. [1]: https://www.postgresql.org/message-id/ZZ7GpII4bAYN%2BjT5%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
В списке pgsql-hackers по дате отправления: