Re: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages
Дата
Msg-id CAPpHfduYUMn9zdfR+hWZ0ohiDJsa8ayC8JycNqRoNo7-6chyPA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages  (Michael Paquier <michael@paquier.xyz>)
Ответы Re: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages
Список pgsql-hackers
Hi, Michael!

On Tue, Jul 15, 2025 at 6:31 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Jun 30, 2025 at 05:51:51PM +0530, vignesh C wrote:
> > Thanks, Kuroda-san. I’ve prepared a similar test that doesn’t rely on
> > injection points. The issue reproduced consistently across all
> > branches up to PG13. You can use the attached
> > 049_slot_get_changes_wait_continously_pg17.pl script (found in the
> > 049_slot_get_changes_wait_continously_pg17.zip file) to verify this.
> > Just copy the script to src/test/recovery and run the test to observe
> > the problem.
>
> +# Abruptly stop the server (1 second should be enough for the checkpoint
> +# to finish; it would be better).
> +$node->stop('immediate');
>
> While v3-0001 is good enough to reproduce the original issue after a
> few dozen runs here, we cannot reintroduce the test without the proper
> wait_for_log() logic making sure that the checkpoint is completed
> before we stop the server.
>
> I like the addition of an extra pg_logical_emit_message() in test 046
> anyway, down to v17, in the test 046 for all the branches.  Even if
> the reproduction is sporadic, we have seen it pretty quickly in the CI
> and in the buildfarm so it would not go unnoticed for a long time if
> we mess up with this stuff again.
>
> We are lucky enough that the existing test does not fail in v17, TBH,
> so let's tackle all the issues on this thread step by step:
> 1) Fix let's add the header check.
> 2) Let's reintroduce a fixed version of 046 on HEAD and on v18, with
> an extra pg_logical_emit_message() that travels across WAL pages
> forcing a reconstruction of the record and the extra header check.
> 3) Let's fix recovery test 046 currently in v17, for the checkpoint
> wait logic and the extra pg_logical_emit_message().
> 4) Let's improve the stability of 047 everywhere for the checkpoint
> wait logic, as already done by Alexander.
>
> I have been doing some tests with the patch vs HEAD (thanks for the
> script, it has saved some time), and I can summarize my results by
> using a message size of 819200000, hence worth 100k pages of WAL or
> so.  Then, taking 20 runs, and eliminating the three highest and
> lowest numbers to eliminate some of the variance, I am getting an
> average of runtime when re-assembling the record of:
> - 695473.09us for HEAD
> - 695679.18us for the patch.
> So I am not going to stress more on this point.
>
> Attached is the result of all that for HEAD.  There is also one extra
> patch dedicated to v17, where I have checked that the extra
> pg_logical_emit_message() is enough to reproduce the problem without
> the patch, and that the problem is fixed with the patch.  The patch
> set for v17 is simpler of course, as test 046 still exists on
> REL_17_STABLE.  Note that I have moved this extra contrecord to be
> generated before the checkpoint is completed, leading to the same
> result.
>
> With all that said, I'll move on with this stuff once the embargo for
> v18 beta2 is lifted and the tag is pushed.  That should happen in 24h
> or so, I guess.

Thank you for your efforts on this subject.  Actually, I was planning
to work on pushing this after the release freeze ends for v18.  I'd
like to do this at least for tests as they were initially committed by
me.

Anyway, please, hold on pushing this for ~1 day to let me do final
review of this.

------
Regards,
Alexander Korotkov
Supabase



В списке pgsql-hackers по дате отправления: