Re: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages
| From | Michael Paquier |
|---|---|
| Subject | Re: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages |
| Date | |
| Msg-id | aHXLep3OaX_vRTNQ@paquier.xyz |
| In reply to | pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages (vignesh C <vignesh21@gmail.com>) |
| Responses | RE: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages<br>Re: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages |
| List | pgsql-hackers |
On Mon, Jun 30, 2025 at 05:51:51PM +0530, vignesh C wrote:
> Thanks, Kuroda-san. I’ve prepared a similar test that doesn’t rely on
> injection points. The issue reproduced consistently across all
> branches up to PG13. You can use the attached
> 049_slot_get_changes_wait_continously_pg17.pl script (found in the
> 049_slot_get_changes_wait_continously_pg17.zip file) to verify this.
> Just copy the script to src/test/recovery and run the test to observe
> the problem.

+# Abruptly stop the server (1 second should be enough for the checkpoint
+# to finish; it would be better).
+$node->stop('immediate');

While v3-0001 is good enough to reproduce the original issue after a few dozen runs here, we cannot reintroduce the test without the proper wait_for_log() logic making sure that the checkpoint is completed before we stop the server.

I like the addition of an extra pg_logical_emit_message() in test 046 anyway, down to v17, for all the branches. Even if the reproduction is sporadic, we have seen it pretty quickly in the CI and in the buildfarm, so it would not go unnoticed for long if we mess up with this stuff again. We are lucky enough that the existing test does not fail in v17, TBH, so let's tackle all the issues on this thread step by step:
1) Fix the bug itself: let's add the header check.
2) Let's reintroduce a fixed version of 046 on HEAD and on v18, with an extra pg_logical_emit_message() that travels across WAL pages, forcing a reconstruction of the record, and the extra header check.
3) Let's fix recovery test 046 currently in v17, for the checkpoint wait logic and the extra pg_logical_emit_message().
4) Let's improve the stability of 047 everywhere for the checkpoint wait logic, as already done by Alexander.

I have been doing some tests with the patch vs HEAD (thanks for the script, it has saved some time), and I can summarize my results by using a message size of 819200000 bytes, hence worth 100k pages of WAL or so. Then, taking 20 runs and eliminating the three highest and three lowest numbers to reduce the variance, I am getting an average runtime when re-assembling the record of:
- 695473.09us for HEAD
- 695679.18us for the patch
So I am not going to stress more on this point.

Attached is the result of all that for HEAD. There is also one extra patch dedicated to v17, where I have checked that the extra pg_logical_emit_message() is enough to reproduce the problem without the patch, and that the problem is fixed with the patch. The patch set for v17 is simpler, of course, as test 046 still exists on REL_17_STABLE. Note that I have moved this extra contrecord to be generated before the checkpoint is completed, leading to the same result.

With all that said, I'll move on with this stuff once the embargo for v18 beta2 is lifted and the tag is pushed. That should happen in 24h or so, I guess.
--
Michael
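As a rough illustration of the wait_for_log() approach described above, here is a minimal TAP-style sketch, not taken from the patches on this thread. It assumes the standard PostgreSQL::Test::Cluster API and that log_checkpoints is enabled (the default on recent branches), so that "checkpoint complete" shows up in the server log; the node and pattern names are illustrative.

```perl
use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

my $node = PostgreSQL::Test::Cluster->new('node');
$node->init(allows_streaming => 'logical');
$node->start;

# Remember the current end of the server log, so wait_for_log() only scans
# messages emitted after this point.
my $logstart = -s $node->logfile;

# Run a checkpoint in a background session so the test can keep going while
# the checkpoint is in progress.
my $bg = $node->background_psql('postgres');
$bg->query_until(
	qr/starting_checkpoint/, q(
	\echo starting_checkpoint
	CHECKPOINT;
));

# Wait until the checkpoint is reported as complete rather than relying on a
# fixed sleep, then stop the server abruptly.
$node->wait_for_log(qr/checkpoint complete/, $logstart);
$node->stop('immediate');

ok(1, 'node stopped abruptly after the checkpoint completed');
done_testing();
```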
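Similarly, a hypothetical sketch of how an extra pg_logical_emit_message() can be made to span WAL pages: with the default 8kB WAL page size, a payload larger than one page forces a continuation record that the reader has to reassemble. The prefix and sizes below are arbitrary, and the larger value matches the 819200000-byte message used for the timing comparison above; $node is assumed to be a running cluster as in the previous sketch.

```perl
# A payload larger than one 8kB WAL page guarantees that the logical message
# record continues onto the next page (a contrecord).
$node->safe_psql('postgres',
	q{SELECT pg_logical_emit_message(false, 'test', repeat('a', 16384))});

# Roughly the payload used for the timing comparison: about 100k WAL pages
# worth of data (819200000 / 8192).
$node->safe_psql('postgres',
	q{SELECT pg_logical_emit_message(false, 'test', repeat('a', 819200000))});
```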
Attachments