On Mon, Jul 15, 2024 at 6:02 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jul 8, 2024 at 2:25 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I could still use another pair of eyes on the test (looking out for
> > stability enhancing measures I could take).
>
> First, the basics: I found that your test failed reliably without your
> fix, and passed reliably with your fix.
Thanks for the review.
> Minor nitpicking about the comments in your TAP test:
>
> * It is necessary but not sufficient for your test to "skewer"
> maybe_needed, relative to OldestXmin. Obviously, it is not sufficient
> because the test can only fail when VACUUM prunes a heap page after
> the backend's horizons have been "skewered" in this sense.
>
> Pruning is when we get stuck, and if there's no more pruning then
> there's no opportunity for VACUUM to get stuck. Perhaps this point
> should be noted directly in the comments. You could add a sentence
> immediately after the existing sentence "Then vacuum's first pass will
> continue and pruning...". This new sentence would then add commentary
> such as "Finally, vacuum's second pass over the heap...".
I've added a description to the top of the test of the scenario
required and then reworked the comment you are describing to try and
make this more clear.
> * Perhaps you should point out that you're using VACUUM FREEZE for
> this because it'll force the backend to always get a cleanup lock.
> This is something you rely on to make the repro reliable, but that's
> it.
>
> In other words, point out to the reader that this bug has nothing to
> do with freezing; it just so happens to be convenient to use VACUUM
> FREEZE this way, due to implementation details.
I've mentioned this in a comment.
> * The sentence "VACUUM proceeds with pruning and does a visibility
> check on each tuple..." describes the bug in terms of the current
> state of things on Postgres 17, but Postgres 17 hasn't been released
> just yet. Does that really make sense?
In the patch targeted at master, I think it makes sense to describe
the code as it is. In the backpatch versions, I reworked this comment
to be correct for those versions.
> If you want to describe the invariant that caused
> heap_pre_freeze_checks() to error-out on HEAD/Postgres 17, then the
> commit message of your fix seems like the right place for that. You
> could reference these errors in passing. The errors seem fairly
> incidental to the real problem, at least to me.
The errors are mentioned in the fix commit message.
> I think that there is some chance that this test will break the build
> farm in whatever way, since there is a long history of VACUUM not
> quite behaving as expected with these sorts of tests. I think that you
> should commit the test case separately, first thing in the morning,
> and then keep an eye on the build farm for the rest of the day. I
> don't think that it's sensible to bend over backwards, just to avoid
> breaking the build farm in this way.
Sounds good.
- Melanie