Thanks for looking/testing, Sergei. Thanks for the changes, Michael;
these all look good. I've squashed them and added you as co-author.
A couple more small comment/text changes:
1. In the place where we fail to allocate memory for an oversized
record, I copied the comment about treating that as a "bogus data"
condition. I suspect that we will soon be converting that to a FATAL
error[1], and that'll need to be done in both places.
2. In this version of the commit message I said we'd only back-patch
to 15 for now. After sleeping on this for a week, I realised that the
reason I keep vacillating on that point is that I am not sure what
your plan is for the malloc-failure-means-end-of-wal policy ([1],
ancient code from 0ffe11abd3a). If we're going to fix that in master
only but let sleeping dogs lie in the back-branches, then it becomes
less important to go back further than 15 with THIS patch. But if you
want to be able to distinguish garbage from out-of-memory, and thereby
end-of-wal from a FATAL please-insert-more-RAM condition, I think
you'd really need this industrial strength validation in all affected
branches, and I'd have more work to do, right? The weak validation we
are fixing here is the *real* underlying problem going back many
years, right?
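To make the distinction in point 2 concrete, here is a minimal sketch in C. It is a toy model, not the actual xlogreader code: DemoHeader, demo_check(), demo_prepare_buffer() and demo_failing_alloc() are all hypothetical names invented for illustration, and the XOR "checksum" merely stands in for real header validation. The idea it shows is that an allocation failure can only be reported as a genuine out-of-memory condition if the header was validated before trusting its length.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified record header; NOT PostgreSQL's XLogRecord. */
typedef struct DemoHeader
{
    uint32_t total_len;     /* claimed total record length in bytes */
    uint32_t check;         /* toy checksum over total_len (assumption) */
} DemoHeader;

typedef enum DemoResult
{
    DEMO_OK,
    DEMO_END_OF_WAL,        /* header failed validation: treat as garbage */
    DEMO_OUT_OF_MEMORY      /* header validated, but allocation failed */
} DemoResult;

/* Toy stand-in for real header validation (e.g. a CRC). */
static uint32_t
demo_check(uint32_t total_len)
{
    return total_len ^ 0xA5A5A5A5u;
}

/* Allocator that always fails, used only to simulate out-of-memory. */
static void *
demo_failing_alloc(size_t n)
{
    (void) n;
    return NULL;
}

/*
 * Validate the header first, and only then allocate total_len bytes.
 * Because the length has been checked before the allocation, a NULL
 * from alloc() can be reported as out-of-memory rather than being
 * confused with a garbage length read from recycled WAL.
 */
static DemoResult
demo_prepare_buffer(const DemoHeader *hdr, void *(*alloc) (size_t), void **buf)
{
    *buf = NULL;

    if (hdr->total_len < sizeof(DemoHeader) ||
        hdr->check != demo_check(hdr->total_len))
        return DEMO_END_OF_WAL;

    *buf = alloc(hdr->total_len);
    if (*buf == NULL)
        return DEMO_OUT_OF_MEMORY;

    return DEMO_OK;
}
```

Without the validation step, the two failure cases collapse into one, which is why weak validation forces the "bogus data" interpretation of a failed allocation.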
I also wondered about strengthening the validation of various things,
like redo begin/end LSNs, in these tests. But we can always
continue to improve all this later...
Here also is a version for 15 (and a CI run[2]), since we tweaked many
of the error messages between 15 and 16.
[1] https://www.postgresql.org/message-id/flat/ZMh/WV%2BCuknqePQQ%40paquier.xyz
[2] https://cirrus-ci.com/task/4533280897236992 (failed on some
unrelated pgbench test, reported in another thread)