Hello
Firstly -- this is top-notch detective work, kudos and thanks for the
patch and test cases. (I verified that both fail before the code fix.)
Here's a v3. I applied a lot of makeup in order to try to understand
what's going on. I *think* I have a grasp on the original code and your
bugfix, not terribly firm I admit.
Some comments
* you don't need to Assert that things are not NULL if you're
immediately going to dereference them. The assert is there to make
the code crash in case it's a NULL pointer, but the subsequent
dereference is going to have the same effect, so the assert is
redundant.
* I think setting snapshot_base to NULL in ReorderBufferCleanupTXN is
pointless, since the struct is gonna be freed shortly afterwards.
* I rewrote many comments (both existing and some of the ones your patch
adds), and added lots of comments where there were none.
* I'm a bit unsure about the comment atop ReorderBufferSetBaseSnapshot.
Obviously, the bit within the #if 0/#endif I'm going to remove before
push. I don't understand why it says "Needs to be called before any
changes are added with ReorderBufferQueueChange"; but if you edit that
function and add an assert that the base snapshot is set, it crashes
pretty quickly in the test_decoding tests. (The supposedly bogus
comment was there before your patch -- I'm not saying your comment
addition is faulty.)
* I also noticed that we're doing subtxn cleanup one by one in both
ReorderBufferAssignChild and ReorderBufferCommitChild, which means the
top-level txn is sought in the hash table over and over, which seems a
bit silly. Not this patch's problem to fix ...
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services