Re: pgsql: Improve runtime and output of tests for replication slots checkp
От | Alexander Korotkov |
---|---|
Тема | Re: pgsql: Improve runtime and output of tests for replication slots checkp |
Дата | |
Msg-id | CAPpHfdtnkGPMSrTaWs3vGP-FYKOXhUZq3OHDeaZwHXsAQ2g=WA@mail.gmail.com обсуждение исходный текст |
Ответ на | pgsql: Improve runtime and output of tests for replication slots checkp (Alexander Korotkov <akorotkov@postgresql.org>) |
Список | pgsql-committers |
On Mon, Jun 23, 2025 at 3:00 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > On Mon, 23 Jun 2025 at 12:24, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > Yeah, that's what I think too. The unintentional omission of a > > > > pre-shutdown delay in the 046 test has exposed some pre-existing > > > > fragility in pg_logical_slot_get_changes(). So I'm not in favor > > > > of changing 046 till we understand this better. > > > > > > Yes, better to understand what's going on before changing the test, > > > and perhaps change 046 so as it provide stable coverage for this case, > > > even if discovered accidentally. > > > > > > So, is it only pg_logical_slot_get_changes_guts() that's messed up in > > > this context, because XLogDecodeNextRecord() is trying to read pages > > > as it has a logic too permissive? How did any of you reproduce the > > > failure? Just by running the test in a loop? It is one of these > > > patterns where a hardcoded sleep should do the trick and help with a > > > bisect. > > > > Personally I just run the test in the loop. It takes about ~30 times to reproduce. Works both with -O0 and -O2. > > > > cd src/test/recovery > > while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/047_checkpoint_physical_slot.pl" make check; do :; done > > > > > By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if > > > you expect the checkpoint to complete before sending the immediate > > > shutdown request, I would suggest to use a wait_for_log() based on > > > "checkpoint complete" or an equivalent loop. What you are doing in > > > the test is unstable as written. > > > > Exactly. I've proposed the fix with wait_for_log() in [1]. Nevertheless, both cases (immediate stop before checkpointcompletion, and immediate stop after checkpoint completion) must work without hang. > > CFBot has been much more red than usual since this change afaict. Many > more windows builds are failing than usual with an error like this: > > [08:28:52.683] 338/338 postgresql:recovery / > recovery/046_checkpoint_logical_slot TIMEOUT 1000.09s exit status 1 > > How about we revert the commit for now to get CI and the buildfarm green again? What about removing the 046_checkpoint_logical_slot which currently causes all the buzz? I'm not yet convinced we need to revert ca307d5cec90. Opinions? ------ Regards, Alexander Korotkov Supabase
В списке pgsql-committers по дате отправления: