Обсуждение: [PATCH] Full support for index LP_DEAD hint bits on standby
Hello, hackers.
[ABSTRACT]
Execution of queries to hot standby is one of the most popular ways to
scale application workload. Most of the modern Postgres installations
have two standby nodes for high-availability support. So, utilization
of replica's CPU seems to be a reasonable idea.
At the same time, some queries (index scans) could be much slower on
hot standby rather than on the primary one. It happens because the
LP_DEAD index hint bits mechanics is ignored in index scans during
recovery. It is done for reasons, of course [1]:
     * We do this because the xmin on the primary node could easily be
     * later than the xmin on the standby node, so that what the primary
     * thinks is killed is supposed to be visible on standby. So for correct
     * MVCC for queries during recovery we must ignore these hints and check
     * all tuples.
Also, according to [2] and cases like [3], it seems to be a good idea
to support "ignore_killed_tuples" on standby.
The goal of this patch is to provide full support for index hint bits
on hot standby. The mechanism should be based on well-tested
functionality and not cause a lot of recovery conflicts.
This thread is the continuation (and party copy-paste) of the old
previous one [4].
[PROBLEM]
The standby itself can set and read hint bits during recovery. Such
bits are even correct according to standby visibility rules. But the
problem here - is full-page-write WAL records coming from the primary.
Such WAL records could bring invalid (according to standby xmin) hint
bits.
So, if we could be sure the scan doesn’t see any invalid hint bit from
primary - the problem is solved. And we will even be able to allow
standby to set its LP_DEAD bits itself.
The idea is simple: let WAL log hint bits before FPW somehow. It could
cause a lot of additional logs, however...
But there are ways to avoid it:
1) Send only one `latestRemovedXid` of all tuples marked as dead
during page scan.
2) Remember the latest sent `latestRemovedXid` in shared memory. And
optimistically skip WAL records with older xid values [5].
Such WAL records would cause a lot of recovery conflicts on standbys.
But we could be tricky here - let use hint bits only if
hot_standby_feedback is enabled and effective on standby. If HSF is
effective - then conflicts are not possible. If HSF is off - then
standby ignores both hint bits and additional conflict resolution. The
major thing here is that HSF is just optimization and has nothing with
MVCC correctness.
[DETAILS]
The patch introduces a new WAL record (named
XLOG_INDEX_HINT_BITS_HORIZON) to define a horizon of xmin required for
standbys snapshot to use LP_DEAD bits for an index scan.
`table_index_fetch_tuple` now returns `latest_removed_xid` value
additionally to `all_dead`. This value is used to advance
`killedLatestRemovedXid` at time of updating `killedItems` (see
`IndexHintBitAdvanceLatestRemovedXid`).
Primary sends the value of `killedLatestRemovedXid` in
XLOG_INDEX_HINT_BITS_HORIZON before it marks page dirty after setting
LP_DEAD bits on the index page (by calling
`MarkBufferDirtyIndexHint`).
New WAL is always sent before possible FPW. It is required to send
such a record only if its `latestRemovedXid` is newer than the one was
sent before for the current database (see
`LogIndexHintBitsHorizonIfNeeded`).
There is a new flag in the PGPROC structure -
`indexIgnoreKilledTuples`. If the flag is set to true – standby
queries are going to use LP_DEAD bits in index scans. In such a case
snapshot is required to satisfice the new horizon pushed by
XLOG_INDEX_HINT_BITS_HORIZON records.
It is safe to set `indexIgnoreKilledTuples` to any value from the
perspective of correctness. But `true` value could cause recovery
conflict. It is just some kind of compromise – use LP_DEAD bits but be
aware of XLOG_INDEX_HINT_BITS_HORIZON or vice versa.
What is the way to make the right decision about this compromise? It
is pretty simple – if `hot_standby_feedback` is on and primary
confirmed feedback is received – then set
`indexIgnoreKilledTuples`(see `GetSnapshotIndexIgnoreKilledTuples`).
While feedback is working as expected – the query will never be
canceled by XLOG_INDEX_HINT_BITS_HORIZON.
To support cascading standby setups (with a possible break of feedback
chain in the middle) – an additional byte was added to the keep-alive
message of the feedback protocol. This byte is used to make sure our
xmin is honored by primary (see
`sender_propagates_feedback_to_primary`). Also, the WAL sender now
always sends a keep-alive after receiving a feedback message.
So, this way, it is safe to use LP_DEAD bits received from the primary
when we want to.
And, as a result, it is safe to set LP_DEAD bits on standby.
Even if:
* the primary changes vacuum_defer_cleanup_age
* standby restarted
* standby promoted to the primary
* base backup taken from standby
* standby is serving queries during recovery
– nothing could go wrong here.
Because `HeapTupleIsSurelyDead` (and index LP_DEAD as result) needs
*heap* hint bits to be already set at standby. So, the same code
decides to set hint bits on the heap (it is done already on standby
for a long time) and in the index.
[EVALUATION]
It is not possible to find an ideal performance test for such kind of
optimization.
But there is a possible example in the attachment. It is a standard
pgbench schema with an additional index on balance and random balance
values.
On primary test do next:
1) transfer some money from one random of the top 100 rich accounts to
one random of the top 100 poor accounts.
2) calculate the amount of money in the top 10 rich and top 10 poor
accounts (and include an additional field to avoid index-only-scan).
In the case of standby only step 2 is used.
The patched version is about 9x faster for standby queries - like 455
TPS versus 4192 TPS on my system. There is no visible difference for
primary.
To estimate the additional amount of WAL logs, I have checked records
in WAL-segments during different conditions:
(pg_waldump pgdata/pg_wal/XXX | grep INDEX_HINT_BITS_HORIZON | wc -l)
- hot_standby_feedback=off - 5181 of 226274 records ~2%
- hot_standby_feedback=on (without load on standby) - 70 of 202594
records ~ 0.03%
- hot_standby_feedback=on (with load on standby) - 17 of 70504 records ~ 0.02%
So, with HSF=on (which is the default value) WAL increase is not
significant. Also, for HSF=off it should be possible to radically
reduce the number of additional WAL logs by using `latestRemovedXid`
from other records (like Heap2/CLEAN) in "send only newer xid"
optimization (I have skipped it for now for simplicity).
[CONCLUSION]
The only thing we pay – a few additional WAL records and some
additional moderate code complexity. But the support of hint-bits on
standby is a huge advantage for many workloads. I was able to get more
than a 900% performance boost (and it is not surprising – index hint
bits are just great optimization). And it works for almost all index
types out of the box.
Another major thing here – everything is based on old, well-tested
mechanics: query cancelation because of snapshot conflicts, setting
heap hint bits on standby, hot standby feedback.
[REFERENCES]
[1] - https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] - https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] - https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org
[4] - https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com
[5] -
https://www.postgresql.org/message-id/flat/CANtu0oigC0%2BH0UkxktyovdLLU67ikM0%2BDw3J4EQqiDDeGhcwsQ%40mail.gmail.com
			
		Вложения
Hello, everyone.
Oh, I just realized that it seems like I was too naive to allow
standby to set LP_DEAD bits this way.
There is a possible consistency problem in the case of low
minRecoveryPoint value (because hint bits do not move PageLSN
forward).
Something like this:
LSN=10  STANDBY INSERTS NEW ROW IN INDEX (index_lsn=10)
<-----------minRecoveryPoint will go here
LSN=20  STANDBY DELETES ROW FROM HEAP, INDEX UNTACHED (index_lsn=10)
               REPLICA SCANS INDEX AND SET hint bits (index_lsn=10)
               INDEX IS FLUSHED (minRecoveryPoint=index_lsn=10)
               CRASH
On crash recovery, a standby will be able to handle queries after
LSN=10. But the index page contains hints bits from the future
(LSN=20).
So, need to think here.
Thanks,
Michail.
			
		Hello, hackers.
I think I was able to fix the issue related to minRecoveryPoint and crash recovery. To make sure standby will be consistent after crash recovery, we need to take the current value of minRecoveryPoint into account while setting LP_DEAD hints (almost the same way as it is done for *heap* hint bits already).
I have introduced new structure IndexHintBitsData:
			
		I think I was able to fix the issue related to minRecoveryPoint and crash recovery. To make sure standby will be consistent after crash recovery, we need to take the current value of minRecoveryPoint into account while setting LP_DEAD hints (almost the same way as it is done for *heap* hint bits already).
I have introduced new structure IndexHintBitsData:
-------
/* guaranteed not visible for all backends */
bool all_dead;
/* latest removed xid if known */
TransactionId latest_removed_xid;
/* lsn of page where dead tuple located */
XLogRecPtr page_lsn;
-------
This structure is filled by the `heap_hot_search_buffer` function. After, we decide to set or not `kill_prior_tuple` depending on its content (calling `IsMarkBufferDirtyIndexHintAllowed`).
For primary - it is always safe to set LP_DEAD in index if `all_dead` == true.
In the case of standby, we need to check `latest_removed_xid` (if available) first. If commit LSN of the latest removed xid is already lower than minRecoveryPoint (`XLogNeedsFlush`) - it is safe to set `kill_prior_tuple`.
Sometimes we are not sure about the latest removed xid - heap record could be marked dead by the XLOG_HEAP2_CLEAN record, for example. In such a case we check the LSN of the *heap* page containing the tuple (LSN could be updated by other transactions already - but it does not matter in that situation). If page LSN is lower than minRecoveryPoint - it is safe to set LP_DEAD in the index too. Otherwise - just leave the index tuple alive.
So, to bring it all together:
* Normal operation, proc->indexIgnoreKilledTuples is true:
It is safe for standby to use hint bits from the primary FPI because of XLOG_INDEX_HINT_BITS_HORIZON conflict resolution.
It is safe for standby to set its index hint bits because `ComputeXidHorizons` honors other read-only procs xmin and lowest xid on primary (`KnownAssignedXidsGetOldestXmin`).
* Normal operation, proc->indexIgnoreKilledTuples is false:
Index hint bits are never set or taken into account.
* Crash recovery, proc->indexIgnoreKilledTuples is true:
It is safe for standby to use hint bits from the primary FPW because XLOG_INDEX_HINT_BITS_HORIZON is always logged before FPI, and commit record of transaction removed the tuple is logged before XLOG_INDEX_HINT_BITS_HORIZON. So, if FPI with hints was flushed (and taken into account by minRecoveryPoint) - both transaction-remover and horizon records are replayed before reading queries.
It is safe for standby to use its hint bits because they can be set only if the commit record of transaction-remover is lower than minRecoveryPoint or LSN of heap page with removed tuples is lower than minRecoveryPoint.
* Crash recovery, proc->indexIgnoreKilledTuples is false:
Index hint bits are never set or taken into account.
So, now it seems correct to me.
Another interesting point here - now position of minRecoveryPoint affects performance a lot. It is happening already (because of *heap* hint bits) but after the patch, it is noticeable even more. Is there any sense to keep minRecoveryPoint at a low value?
Rebased and updated patch in attachment.
Will be happy if someone could recheck my ideas or even the code :)
Thanks a lot,
Michail.
/* guaranteed not visible for all backends */
bool all_dead;
/* latest removed xid if known */
TransactionId latest_removed_xid;
/* lsn of page where dead tuple located */
XLogRecPtr page_lsn;
-------
This structure is filled by the `heap_hot_search_buffer` function. After, we decide to set or not `kill_prior_tuple` depending on its content (calling `IsMarkBufferDirtyIndexHintAllowed`).
For primary - it is always safe to set LP_DEAD in index if `all_dead` == true.
In the case of standby, we need to check `latest_removed_xid` (if available) first. If commit LSN of the latest removed xid is already lower than minRecoveryPoint (`XLogNeedsFlush`) - it is safe to set `kill_prior_tuple`.
Sometimes we are not sure about the latest removed xid - heap record could be marked dead by the XLOG_HEAP2_CLEAN record, for example. In such a case we check the LSN of the *heap* page containing the tuple (LSN could be updated by other transactions already - but it does not matter in that situation). If page LSN is lower than minRecoveryPoint - it is safe to set LP_DEAD in the index too. Otherwise - just leave the index tuple alive.
So, to bring it all together:
* Normal operation, proc->indexIgnoreKilledTuples is true:
It is safe for standby to use hint bits from the primary FPI because of XLOG_INDEX_HINT_BITS_HORIZON conflict resolution.
It is safe for standby to set its index hint bits because `ComputeXidHorizons` honors other read-only procs xmin and lowest xid on primary (`KnownAssignedXidsGetOldestXmin`).
* Normal operation, proc->indexIgnoreKilledTuples is false:
Index hint bits are never set or taken into account.
* Crash recovery, proc->indexIgnoreKilledTuples is true:
It is safe for standby to use hint bits from the primary FPW because XLOG_INDEX_HINT_BITS_HORIZON is always logged before FPI, and commit record of transaction removed the tuple is logged before XLOG_INDEX_HINT_BITS_HORIZON. So, if FPI with hints was flushed (and taken into account by minRecoveryPoint) - both transaction-remover and horizon records are replayed before reading queries.
It is safe for standby to use its hint bits because they can be set only if the commit record of transaction-remover is lower than minRecoveryPoint or LSN of heap page with removed tuples is lower than minRecoveryPoint.
* Crash recovery, proc->indexIgnoreKilledTuples is false:
Index hint bits are never set or taken into account.
So, now it seems correct to me.
Another interesting point here - now position of minRecoveryPoint affects performance a lot. It is happening already (because of *heap* hint bits) but after the patch, it is noticeable even more. Is there any sense to keep minRecoveryPoint at a low value?
Rebased and updated patch in attachment.
Will be happy if someone could recheck my ideas or even the code :)
Thanks a lot,
Michail.
Вложения
Hello, everyone. After some correspondence with Peter Geoghegan (1) and his ideas, I have reworked the patch a lot and now it is much more simple with even better performance (no new WAL or conflict resolution, hot standby feedback is unrelated). The idea is pretty simple now - let’s mark the page with “standby-safe” LP_DEAD hints by the bit in btpo_flags (BTP_LP_SAFE_ON_STANDBY and similar for gist and hash). If standby wants to set LP_DEAD - it checks BTP_LP_SAFE_ON_STANDBY on the page first, if it is not set - all “primary” hints are removed first, and then the flag is set (with memory barrier to avoid memory ordering issues in concurrent scans). Also, standby checks BTP_LP_SAFE_ON_STANDBY to be sure about ignoring tuples marked by LP_DEAD during the scan. Of course, it is not so easy. If standby was promoted (or primary was restored from standby backup) - it is still possible to receive FPI with such flag set in WAL logs. So, the main problem is still there. But we could just clear this flag while applying FPI because the page remains dirty after that anyway! It should not cause any checksum, consistency, or pg_rewind issues as explained in (2). Semantically it is the same as set hint bit one milisecond after FPI was applied (while page still remains dirty after FPI replay) - and standby already does it with *heap* hint bits. Also, TAP-test attached to (2) shows how it is easy to flush a hint bit which was set by standby to achieve different checksum comparing to primary already. If standby was promoted (or restored from standby backup) it is safe to use LP_DEAD with or without BTP_LP_SAFE_ON_STANDBY on a page. But for accuracy BTP_LP_SAFE_ON_STANDBY is cleared by primary if found. Also, we should take into account minRecoveryPoint as described in (3) to avoid consistency issues during crash recovery (see IsIndexLpDeadAllowed). Also, as far as I know - there is no practical sense to keep minRecoveryPoint at a low value. So, there is an optional patch that moves minRecoveryPoint forward at each xl_running_data (to allow standby to set hint bits and LP_DEADs more aggressively). It is about every 15s. There are some graphics showing performance testing results on my PC in the attachment (test is taken from (4)). Each test was running for 10 minutes. Additional primary performance is probably just measurement error. But standby performance gain is huge. Feel free to ask if you need more proof about correctness. Thanks, Michail. [1] - https://www.postgresql.org/message-id/flat/CAH2-Wz%3D-BoaKgkN-MnKj6hFwO1BOJSA%2ByLMMO%2BLRZK932fNUXA%40mail.gmail.com#6d7cdebd68069cc493c11b9732fd2040 [2] - https://www.postgresql.org/message-id/flat/CANtu0oiAtteJ%2BMpPonBg6WfEsJCKrxuLK15P6GsaGDcYGjefVQ%40mail.gmail.com#091fca433185504f2818d5364819f7a4 [3] - https://www.postgresql.org/message-id/flat/CANtu0oh28mX5gy5jburH%2Bn1mcczK5_dCQnhbBnCM%3DPfqh-A26Q%40mail.gmail.com#ecfe5a331a3058f895c0cba698fbc4d3 [4] - https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com
Вложения
I'm trying to review the patch, but not sure if I understand this problem, please see my comment below. Michail Nikolaev <michail.nikolaev@gmail.com> wrote: > Oh, I just realized that it seems like I was too naive to allow > standby to set LP_DEAD bits this way. > There is a possible consistency problem in the case of low > minRecoveryPoint value (because hint bits do not move PageLSN > forward). > > Something like this: > > LSN=10 STANDBY INSERTS NEW ROW IN INDEX (index_lsn=10) > <-----------minRecoveryPoint will go here > LSN=20 STANDBY DELETES ROW FROM HEAP, INDEX UNTACHED (index_lsn=10) Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by replaying the commit record. And if the standby happens to crash before the commit record could be replayed, no query should see the deletion and thus no hint bit should be set in the index. > REPLICA SCANS INDEX AND SET hint bits (index_lsn=10) > INDEX IS FLUSHED (minRecoveryPoint=index_lsn=10) > CRASH > > On crash recovery, a standby will be able to handle queries after > LSN=10. But the index page contains hints bits from the future > (LSN=20). > So, need to think here. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Hello, Antonin. > I'm trying to review the patch, but not sure if I understand this problem, > please see my comment below. Thanks a lot for your attention. It is strongly recommended to look at version N3 (1) because it is a much more elegant, easy, and reliable solution :) But the minRecoveryPoint-related issue affects it anyway. > Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by > replaying the commit record. And if the standby happens to crash before the > commit record could be replayed, no query should see the deletion and thus no > hint bit should be set in the index. minRecoveryPoint is not affected by replaying the commit record in most cases. It is updated in a lazy way, something like this: minRecoveryPoint = max LSN of flushed page. Version 3 of a patch contains a code_optional.patch to move minRecoveryPoint more aggressively to get additional performance on standby (based on Peter’s answer in (2). So, “minRecoveryPoint will go here” is not because of “STANDBY INSERTS NEW ROW IN INDEX” it is just a random event. Thanks, Michail. [1]: https://www.postgresql.org/message-id/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAH2-WzkSUcuFukhJdSxHFgtL6zEQgNhgOzNBiTbP_4u%3Dk6igAg%40mail.gmail.com (“Also, btw, do you know any reason to keep minRecoveryPoint at a low value?”)
Michail Nikolaev <michail.nikolaev@gmail.com> wrote: > Hello, Antonin. > > > I'm trying to review the patch, but not sure if I understand this problem, > > please see my comment below. > > Thanks a lot for your attention. It is strongly recommended to look at > version N3 (1) because it is a much more elegant, easy, and reliable > solution :) But the minRecoveryPoint-related issue affects it anyway. Indeed I'm reviewing (1), but I wanted to discuss this particular question in context, so I replied here. > > Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by > > replaying the commit record. And if the standby happens to crash before the > > commit record could be replayed, no query should see the deletion and thus no > > hint bit should be set in the index. > > minRecoveryPoint is not affected by replaying the commit record in > most cases. It is updated in a lazy way, something like this: > minRecoveryPoint = max LSN of flushed page. Version 3 of a patch > contains a code_optional.patch to move minRecoveryPoint more > aggressively to get additional performance on standby (based on > Peter’s answer in (2). > So, “minRecoveryPoint will go here” is not because of “STANDBY INSERTS > NEW ROW IN INDEX” it is just a random event. > Michail. Sorry, I missed the fact that your example can be executed inside BEGIN - END block, in which case minRecoveryPoint won't advance after each command. I'll continue my review by replying to (1) > [1]: https://www.postgresql.org/message-id/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com > [2]: https://www.postgresql.org/message-id/CAH2-WzkSUcuFukhJdSxHFgtL6zEQgNhgOzNBiTbP_4u%3Dk6igAg%40mail.gmail.com > (“Also, btw, do you know any reason to keep minRecoveryPoint at a low > value?”) I'm not an expert in this area (I'm reviewing this patch also to learn more about recovery and replication), but after a breif research I think that postgres tries not to update the control file too frequently, see comments in UpdateMinRecoveryPoint(). I don't know if what you do in code_optional.patch would be a problem. Actually I think that a commit record should be replayed more often than XLOG_RUNNING_XACTS, shouldn't it? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Hello, Antonin. > Sorry, I missed the fact that your example can be executed inside BEGIN - END > block, in which case minRecoveryPoint won't advance after each command. No, the block is not executed as a single transaction, all commands are separated transactions (see below) > Actually I think that a commit record should be replayed > more often than XLOG_RUNNING_XACTS, shouldn't it? Yes, but replaying commit records DOES NOT affect minRecoveryPoint in almost all cases. UpdateMinRecoveryPoint is called by XLogFlush, but xact_redo_commit calls XLogFlush only in two cases: * DropRelationFiles is called (some relation are dropped) * If ForceSyncCommit was used on primary - few “heavy” commands, like DropTableSpace, CreateTableSpace, movedb, etc. But “regular” commit record is replayed without XLogFlush and, as result, without UpdateMinRecoveryPoint. So, in practice, UpdateMinRecoveryPoint is updated in an “async” way by checkpoint job. This is why there is a sense to call it on XLOG_RUNNING_XACTS. Thanks, Michail.
Michail Nikolaev <michail.nikolaev@gmail.com> wrote: > > Sorry, I missed the fact that your example can be executed inside BEGIN - END > > block, in which case minRecoveryPoint won't advance after each command. > > No, the block is not executed as a single transaction, all commands > are separated transactions (see below) > > > Actually I think that a commit record should be replayed > > more often than XLOG_RUNNING_XACTS, shouldn't it? > > Yes, but replaying commit records DOES NOT affect minRecoveryPoint in > almost all cases. > > UpdateMinRecoveryPoint is called by XLogFlush, but xact_redo_commit > calls XLogFlush only in two cases: > * DropRelationFiles is called (some relation are dropped) > * If ForceSyncCommit was used on primary - few “heavy” commands, like > DropTableSpace, CreateTableSpace, movedb, etc. > > But “regular” commit record is replayed without XLogFlush and, as > result, without UpdateMinRecoveryPoint. ok, I missed this. Thanks for explanation. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
> After some correspondence with Peter Geoghegan (1) and his ideas, I
> have reworked the patch a lot and now it is much more simple with even
> better performance (no new WAL or conflict resolution, hot standby
> feedback is unrelated).
My review that started in [1] continues here.
(Please note that code.patch does not apply to the current master branch.)
I think I understand your approach now and couldn't find a problem by reading
the code. What I consider worth improving is documentation, both code comments
and nbtree/README. Especially for the problem discussed in [1] it should be
explained what would happen if kill_prior_tuple_min_lsn was not checked.
Also, in IsIndexLpDeadAllowed() you say that invalid
deadness->latest_removed_xid means the following:
    /*
     * Looks like it is tuple cleared by heap_page_prune_execute,
     * we must be sure if LSN of XLOG_HEAP2_CLEAN (or any subsequent
     * updates) less than minRecoveryPoint to avoid MVCC failure
     * after crash recovery.
     */
However I think there's one more case: if heap_hot_search_buffer() considers
all tuples in the chain to be "surely dead", but
HeapTupleHeaderAdvanceLatestRemovedXid() skips them all for this reason:
    /*
     * Ignore tuples inserted by an aborted transaction or if the tuple was
     * updated/deleted by the inserting transaction.
     *
     * Look for a committed hint bit, or if no xmin bit is set, check clog.
     */
I think that the dead tuples produced this way should never be visible on the
standby (and even if they were, they would change the page LSN so your
algorithm would treat them correctly) so I see no correctness problem. But it
might be worth explaining better the meaning of invalid "latest_removed_xid"
in comments.
In the nbtree/README, you say
    "... if the commit record of latestRemovedXid is more ..."
but it's not clear to me what "latestRemovedXid" is. If you mean the
scan->kill_prior_tuple_min_lsn field, you probably need more words to explain
it.
* IsIndexLpDeadAllowed()
    /* It all always allowed on primary if *all_dead. */
should probably be
    /* It is always allowed on primary if *all_dead. */
* gistkillitems()
As the function is only called if (so->numKilled > 0), I think both
"killedsomething" and "dirty" variables should always have the same value, so
one variable should be enough. Assert(so->numKilled) would be appropriate in
that case.
The situation is similar for btree and hash indexes.
doc.patch:
"+applying the fill page write."
[1] https://www.postgresql.org/message-id/61470.1620647290%40antos
-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com
			
		Hello, Antonin. > My review that started in [1] continues here. Thanks a lot for the review. > (Please note that code.patch does not apply to the current master branch.) Rebased. > Especially for the problem discussed in [1] it should be > explained what would happen if kill_prior_tuple_min_lsn was not checked. Updated README, hope it is better now. Also, added few details related to the flush of hint bits. > However I think there's one more case: if heap_hot_search_buffer() considers > all tuples in the chain to be "surely dead", but > HeapTupleHeaderAdvanceLatestRemovedXid() skips them all for this reason: Yes, good catch, missed it. > I think that the dead tuples produced this way should never be visible on the > standby (and even if they were, they would change the page LSN so your > algorithm would treat them correctly) so I see no correctness problem. But it > might be worth explaining better the meaning of invalid "latest_removed_xid" > in comments. Added additional comment. > but it's not clear to me what "latestRemovedXid" is. If you mean the > scan->kill_prior_tuple_min_lsn field, you probably need more words to explain > it. Hope it is better now. > should probably be > /* It is always allowed on primary if *all_dead. */ Fixed. > As the function is only called if (so->numKilled > 0), I think both > "killedsomething" and "dirty" variables should always have the same value, so > one variable should be enough. Assert(so->numKilled) would be appropriate in > that case. Fixed, but partly. It is because I have added additional checks for a long transaction in the case of promoted server. > "+applying the fill page write." Fixed. Updated version in attach. Thanks a lot, Michail.
Вложения
Hello. Added a check for standby promotion with the long transaction to the test (code and docs are unchanged). Thanks, Michail.
Вложения
Michail Nikolaev <michail.nikolaev@gmail.com> wrote: > Hello. > > Added a check for standby promotion with the long transaction to the > test (code and docs are unchanged). I'm trying to continue the review, sorry for the delay. Following are a few question about the code: * Does the masking need to happen in the AM code, e.g. _bt_killitems()? I'd expect that the RmgrData.rm_fpi_mask can do all the work. Maybe you're concerned about clearing the "LP-safe-on-standby" bits after promotion, but I wouldn't consider this a problem: once the standby is allowed to set the hint bits (i.e. minRecoveryPoint is high enough, see IsIndexLpDeadAllowed() -> XLogNeedsFlush()), promotion shouldn't break anything because it should not allow minRecoveryPoint to go backwards. * How about modifying rm_mask() instead of introducing rm_fpi_mask()? Perhaps a boolean argument can be added to distinguish the purpose of the masking. * Are you sure it's o.k. to use mask_lp_flags() here? It sets the item flags to LP_UNUSED unconditionally, which IMO should only be done by VACUUM. I think you only need to revert the effect of prior ItemIdMarkDead(), so you only need to change the status LP_DEAD to LP_NORMAL if the tuple still has storage. (And maybe add an assertion to ItemIdMarkDead() confirming that it's only used for LP_NORMAL items?) As far as I understand, the current code only uses mask_lp_flags() during WAL consistency check on copies of pages which don't eventually get written to disk. * IsIndexLpDeadAllowed() ** is bufmgr.c the best location for this function? ** the header comment should explain the minLsn argument. ** comment /* It is always allowed on primary if *all_dead. */ should probably be /* It is always allowed on primary if ->all_dead. */ * comment: XLOG_HEAP2_CLEAN has been renamed to XLOG_HEAP2_PRUNE in PG14. On regression tests: * Is the purpose of the repeatable read (RR) snapshot to test that heap_hot_search_buffer() does not set deadness->all_dead if some transaction can still see a tuple of the chain? If so, I think the RR snapshot does not have to be used in the tests because this patch does not really affect the logic: heap_hot_search_buffer() only sets deadness->all_dead to false, just like it sets *all_dead in the current code. Besides that, IsIndexLpDeadAllowed() too can avoid setting of the LP_DEAD flag on an index tuple (at least until the commit record of the deleting/updating transaction gets flushed to disk), so it can hide the behaviour of heap_hot_search_buffer(). * Unless I miss something, the tests check that the hint bits are not propagated from primary (or they are propagated but marked non-safe), however there's no test to check that standby does set the hint bits itself. * I'm also not sure if promotion needs to be tested. What's specific about the promoted cluster from the point of view of this feature? The only thing I can think of is clearing of the "LP-safe-on-standby" bits, but, as I said above, I'm not sure if the tests ever let standby to set those bits before the promotion. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Hello, Antonin.
> I'm trying to continue the review, sorry for the delay. Following are a few
> question about the code:
Thanks for the review :) And sorry for the delay too :)
> * Does the masking need to happen in the AM code, e.g. _bt_killitems()?
>   I'd expect that the RmgrData.rm_fpi_mask can do all the work.
RmgrData.rm_fpi_mask clears a single BTP_LP_SAFE_ON_STANDBY bit only
to indicate that hints bit are not safe to be used on standby.
Why do not clear LP_DEAD bits in rm_fpi_mask? There is no sense
because we could get such bits in multiple ways:
* the standby was created from the base backup of the primary
* some pages were changed by pg_rewind
* the standby was updated to the version having this feature (so, old
pages still contains LP_DEAD)
So, AM code needs to know when and why clear LP_DEAD bits if
BTP_LP_SAFE_ON_STANDBY is not set.
Also, the important moment here is pg_memory_barrier() usage.
> * How about modifying rm_mask() instead of introducing rm_fpi_mask()? Perhaps
>  a boolean argument can be added to distinguish the purpose of the masking.
I have tried this way but the code was looking dirty and complicated.
Also, the separated fpi_mask provides some semantics to the function.
> * Are you sure it's o.k. to use mask_lp_flags() here? It sets the item flags
> to LP_UNUSED unconditionally, which IMO should only be done by VACUUM.
Oh, good catch. I made mask_lp_dead for this. Also, added such a
situation to the test.
> ** is bufmgr.c the best location for this function?
Moved to indexam.c and made static (is_index_lp_dead_allowed).
 > should probably be
 > /* It is always allowed on primary if ->all_dead. */
Fixed.
> * comment: XLOG_HEAP2_CLEAN has been renamed to XLOG_HEAP2_PRUNE in PG14.
Fixed.
> * Is the purpose of the repeatable read (RR) snapshot to test that
>  heap_hot_search_buffer() does not set deadness->all_dead if some transaction
>  can still see a tuple of the chain?
The main purpose is to test xactStartedInRecovery logic after the promotion.
For example -
    > if (scan->xactStartedInRecovery && !RecoveryInProgress())`
> * Unless I miss something, the tests check that the hint bits are not
>  propagated from primary (or they are propagated but marked non-safe),
>  however there's no test to check that standby does set the hint bits itself.
It is tested on different standby, see
    > is(hints_num($node_standby_2), qq(10), 'index hint bits already
set on second standby 2');
Also, I added checks for BTP_LP_SAFE_ON_STANDBY to make sure
everything in the test goes by scenario.
Thanks a lot,
Michail.
			
		Вложения
Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
> > * Is the purpose of the repeatable read (RR) snapshot to test that
> >  heap_hot_search_buffer() does not set deadness->all_dead if some transaction
> >  can still see a tuple of the chain?
>
> The main purpose is to test xactStartedInRecovery logic after the promotion.
> For example -
>     > if (scan->xactStartedInRecovery && !RecoveryInProgress())`
I understand that the RR snapshot is used to check the MVCC behaviour, however
this comment seems to indicate that the RR snapshot should also prevent the
standb from setting the hint bits.
# Make sure previous queries not set the hints on standby because
# of RR snapshot
I can imagine that on the primary, but I don't think that the backend that
checks visibility on standby does checks other snapshots/backends. And it
didn't work when I ran the test manually, although I could have missed
something.
A few more notes regarding the tests:
* 026_standby_index_lp_dead.pl should probably be renamed to
  027_standby_index_lp_dead.pl (026_* was created in the master branch
  recently)
* The test fails, although I do have convigrured the build with
  --enable-tap-tests.
BEGIN failed--compilation aborted at t/026_standby_index_lp_dead.pl line 5.
t/026_standby_index_lp_dead.pl .. Dubious, test returned 2 (wstat 512, 0x200)
I suspect the testing infrastructure changed recently.
* The messages like this
is(hints_num($node_standby_1), qq(10),
    'hints are set on standby1 because FPI but marked as non-safe');
  say that the hints are "marked as non-safe", but the hints_num() function
  does not seem to check that.
* wording:
is(hints_num($node_standby_2), qq(10), 'index hint bits already set on second standby 2');
->
is(hints_num($node_standby_2), qq(10), 'index hint bits already set on standby 2');
And a few more notes on the code:
* There's an extra colon in mask_lp_dead():
bufmask.c:148:38: warning: for loop has empty body [-Wempty-body]
                 offnum = OffsetNumberNext(offnum));
                                                   ^
bufmask.c:148:38: note: put the semicolon on a separate line to silence this warning
* the header comment of heap_hot_search_buffer() still says "*all_dead"
  whereas I'd expect "->all_dead".
  The same for "*page_lsn".
* I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the
  IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values,
  e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK ? Or a boolean variable (in
  index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and
  rename the function is_index_lp_dead_allowed() to
  is_index_lp_dead_maybe_allowed()?
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
			
		Hello, Antonin. Thanks for pushing it forward. > I understand that the RR snapshot is used to check the MVCC behaviour, however > this comment seems to indicate that the RR snapshot should also prevent the > standb from setting the hint bits. > # Make sure previous queries not set the hints on standby because > # of RR snapshot > I can imagine that on the primary, but I don't think that the backend that > checks visibility on standby does checks other snapshots/backends. And it > didn't work when I ran the test manually, although I could have missed > something. Yes, it checks - you could see ComputeXidHorizons for details. It is the main part of the correctness of the whole feature. I added some details about it to the test. > * 026_standby_index_lp_dead.pl should probably be renamed to > 027_standby_index_lp_dead.pl (026_* was created in the master branch > recently) Done. > BEGIN failed--compilation aborted at t/026_standby_index_lp_dead.pl line 5. > t/026_standby_index_lp_dead.pl .. Dubious, test returned 2 (wstat 512, 0x200) Fixed. > * The messages like this Fixed. > * There's an extra colon in mask_lp_dead(): Oh, it is a huge error really (the loop was empty) :) Fixed. > * the header comment of heap_hot_search_buffer() still says "*all_dead" > whereas I'd expect "->all_dead". > The same for "*page_lsn". I was trying to mimic the style of comment (it says about “*tid” from 2007). So, I think it is better to keep it in the same style for the whole function comment. > * I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the > IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values, > e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK ? Or a boolean variable (in > index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and > rename the function is_index_lp_dead_allowed() to > is_index_lp_dead_maybe_allowed()? Yes, this way it is looks better. Done. Also, I have added some checks for “maybe” LSN-related logic to the test. Thanks a lot, Michail.
Вложения
Michail Nikolaev <michail.nikolaev@gmail.com> wrote: > > I understand that the RR snapshot is used to check the MVCC behaviour, however > > this comment seems to indicate that the RR snapshot should also prevent the > > standb from setting the hint bits. > > # Make sure previous queries not set the hints on standby because > > # of RR snapshot > > I can imagine that on the primary, but I don't think that the backend that > > checks visibility on standby does checks other snapshots/backends. And it > > didn't work when I ran the test manually, although I could have missed > > something. > > Yes, it checks - you could see ComputeXidHorizons for details. It is > the main part of the correctness of the whole feature. I added some > details about it to the test. Ah, ok. I thought that only KnownAssignedXids is used on standby, but that would ignore the RR snapshot. It wasn't clear to me when the xmin of the hot-standby backends is set, now I think it's done by GetSnapshotData(). > > * I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the > > IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values, > > e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK ? Or a boolean variable (in > > index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and > > rename the function is_index_lp_dead_allowed() to > > is_index_lp_dead_maybe_allowed()? > > Yes, this way it is looks better. Done. Also, I have added some checks > for “maybe” LSN-related logic to the test. Attached is a proposal for a minor addition that would make sense to me, add it if you think it's appropriate. I think I've said enough, changing the status to "ready for committer" :-) -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/src/test/recovery/t/027_standby_index_lp_dead.pl b/src/test/recovery/t/027_standby_index_lp_dead.pl index 5a23eea052..e65000b6bd 100644 --- a/src/test/recovery/t/027_standby_index_lp_dead.pl +++ b/src/test/recovery/t/027_standby_index_lp_dead.pl @@ -7,7 +7,7 @@ use PostgreSQL::Test::Utils; use Test::More; use Config; -plan tests => 29; +plan tests => 30; # Initialize primary node my $node_primary = PostgreSQL::Test::Cluster->new('primary'); @@ -246,6 +246,7 @@ is(btp_safe_on_stanby($node_new_standby), qq(0), 'hint from FPI'); # Make sure bits are set only if minRecoveryPoint > than index page LSN try_to_set_hint_bits($node_new_standby); +is(hints_num($node_new_standby), qq(11), 'no new index hint bits are set on new standby'); is(btp_safe_on_stanby($node_new_standby), qq(0), 'hint not marked as standby-safe'); # Issue checkpoint on primary to update minRecoveryPoint on standby
I have changed approach, so it is better to start from this email: https://www.postgresql.org/message-id/flat/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com#4c81a4d623d8152f5e8889e97e750eec
Woo-hoo :) > Attached is a proposal for a minor addition that would make sense to me, add > it if you think it's appropriate. Yes, I'll add to the patch. > I think I've said enough, changing the status to "ready for committer" :-) Thanks a lot for your help and attention! Best regards, Michail.
Hello. > Attached is a proposal for a minor addition that would make sense to me, add > it if you think it's appropriate. Added. Also, I updated the documentation a little. > I have changed approach, so it is better to start from this email: Oops, I was thinking the comments feature in the commitfest app works in a different way :) Best regards, Michail.
Вложения
Hi, On Wed, Nov 10, 2021 at 3:06 AM Michail Nikolaev <michail.nikolaev@gmail.com> wrote: > > > Attached is a proposal for a minor addition that would make sense to me, add > > it if you think it's appropriate. > > Added. Also, I updated the documentation a little. > > > I have changed approach, so it is better to start from this email: > > Oops, I was thinking the comments feature in the commitfest app works > in a different way :) The cfbot reports that this patch is currently failing at least on Linux and Windows, e.g. https://cirrus-ci.com/task/6532060239101952. I'm switching this patch on Waiting on Author.
Hello, Junien. Thanks for your attention. > The cfbot reports that this patch is currently failing at least on > Linux and Windows, e.g. https://cirrus-ci.com/task/6532060239101952. Fixed. It was the issue with the test - hangs on Windows because of psql + spurious vacuum sometimes. > I'm switching this patch on Waiting on Author. I have tested it multiple times on my Github repo, seems to be stable now. Switching back to Ready for committer. Best regards. Michail.
Вложения
Hello, Justin. Thanks for your attention. After some investigation, I think I have found the problem. It is caused by XLOG_RUNNING_XACTS at an undetermined moment (some test parts rely on it). Now test waits for XLOG_RUNNING_XACTS to happen (maximum is 15s) and proceed forward. I'll move entry back to "Ready for Committer" once it passes tests. Best regards, Michail.
Вложения
Hi, On Mon, Jan 24, 2022 at 10:33:43AM +0300, Michail Nikolaev wrote: > > Thanks for your attention. > After some investigation, I think I have found the problem. It is > caused by XLOG_RUNNING_XACTS at an undetermined moment (some test > parts rely on it). > > Now test waits for XLOG_RUNNING_XACTS to happen (maximum is 15s) and > proceed forward. > > I'll move entry back to "Ready for Committer" once it passes tests. It looks like you didn't fetch the latest upstream commits in a while as this version is still conflicting with 7a5f6b474 (Make logical decoding a part of the rmgr) from 6 days ago. I rebased the pathset in attached v9. Please double check that I didn't miss anything in the rebase.
Вложения
Hi, On Tue, Jan 25, 2022 at 07:21:01PM +0800, Julien Rouhaud wrote: > > > > I'll move entry back to "Ready for Committer" once it passes tests. > > It looks like you didn't fetch the latest upstream commits in a while as this > version is still conflicting with 7a5f6b474 (Make logical decoding a part of > the rmgr) from 6 days ago. > > I rebased the pathset in attached v9. Please double check that I didn't miss > anything in the rebase. FTR the cfbot is now happy with this version: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/2947. I will let you mark the patch as Ready for Committer once you validate that the rebase was ok.
Hello, Julien. > I rebased the pathset in attached v9. Please double check that I didn't miss > anything in the rebase. Thanks a lot for your help. > I will let you mark the patch as Ready for Committer once you validate that the > rebase was ok. Yes, rebase looks good. Best regards, Michail.
Hi, On 2022-01-25 19:21:01 +0800, Julien Rouhaud wrote: > I rebased the pathset in attached v9. Please double check that I didn't miss > anything in the rebase. Fails to apply at the moment: http://cfbot.cputube.org/patch_37_2947.log Marked as waiting for author. - Andres
Hello, Andres. > Fails to apply at the moment: http://cfbot.cputube.org/patch_37_2947.log Thanks for notifying me. BTW, some kind of automatic email in case of status change could be very helpful. > Marked as waiting for author. New version is attached, build is passing (https://cirrus-ci.com/build/5599876384817152), so, moving it back to "ready for committer" . Best regards, Michail.
Вложения
On Tue, 22 Mar 2022 at 09:52, Michail Nikolaev <michail.nikolaev@gmail.com> wrote: > > Thanks for notifying me. BTW, some kind of automatic email in case of > status change could be very helpful. I agree but realize the cfbot is quite new and I guess the priority is to work out any kinks before spamming people with false positives. > New version is attached, build is passing > (https://cirrus-ci.com/build/5599876384817152), so, moving it back to > "ready for committer" . I'm seeing a recovery test failure. Not sure if this represents an actual bug or just a test that needs to be adjusted for the new behaviour. https://cirrus-ci.com/task/5711008294502400 [14:42:46.885] # Failed test 'no new index hint bits are set on new standby' [14:42:46.885] # at t/027_standby_index_lp_dead.pl line 262. [14:42:46.885] # got: '12' [14:42:46.885] # expected: '11' [14:42:47.147] [14:42:47.147] # Failed test 'hint not marked as standby-safe' [14:42:47.147] # at t/027_standby_index_lp_dead.pl line 263. [14:42:47.147] # got: '1' [14:42:47.147] # expected: '0' [14:42:49.723] # Looks like you failed 2 tests of 30. [14:42:49.750] [14:42:49] t/027_standby_index_lp_dead.pl ....... [14:42:49.761] Dubious, test returned 2 (wstat 512, 0x200) [14:42:49.761] Failed 2/30 subtests -- greg
On Mon, Mar 28, 2022 at 12:40 PM Greg Stark <stark@mit.edu> wrote: > I'm seeing a recovery test failure. Not sure if this represents an > actual bug or just a test that needs to be adjusted for the new > behaviour. > > https://cirrus-ci.com/task/5711008294502400 I doubt that the patch's use of pg_memory_barrier() in places like _bt_killitems() is correct. There is no way to know for sure if this novel new lockless algorithm is correct or not, since it isn't explained anywhere. The existing use of memory barriers is pretty much limited to a handful of performance critical code paths, none of which are in access method code that reads from shared_buffers. So this is not a minor oversight. -- Peter Geoghegan
On Mon, Mar 28, 2022 at 1:23 PM Peter Geoghegan <pg@bowt.ie> wrote: > I doubt that the patch's use of pg_memory_barrier() in places like > _bt_killitems() is correct. I also doubt that posting list splits are handled correctly. If there is an LP_DEAD bit set on a posting list on the primary, and we need to do a posting list split against the posting tuple, we need to be careful -- we cannot allow our new TID to look like it's LP_DEAD immediately, before our transaction even commits/aborts. We cannot swap out our new TID with an old LP_DEAD TID, because we'll think that our new TID is LP_DEAD when we shouldn't. This is currently handled by having the inserted do an early round of simple/LP_DEAD index tuple deletion, using the "simpleonly" argument from _bt_delete_or_dedup_one_page(). Obviously the primary cannot be expected to know that one of its standbys has independently set a posting list's LP_DEAD bit, though. At the very least you need to teach the posting list split path in btree_xlog_insert() about all this -- it's not necessarily sufficient to clear LP_DEAD bits in the index AM's fpi_mask() routine. Overall, I think that this patch has serious design flaws, and that this issue is really just a symptom of a bigger problem. -- Peter Geoghegan
Hello, Greg. > I'm seeing a recovery test failure. Not sure if this represents an > actual bug or just a test that needs to be adjusted for the new > behaviour. Thanks for notifying me. It is a failure of a test added in the patch. It is a little hard to make it stable (because it depends on minRecoveryLSN which could be changed in asynchronous way without any control). I’ll check how to make it more stable. Thanks, Michail.
Hello, Peter. Thanks for your review! > I doubt that the patch's use of pg_memory_barrier() in places like > _bt_killitems() is correct. There is no way to know for sure if this > novel new lockless algorithm is correct or not, since it isn't > explained anywhere. The memory barrier is used only to ensure memory ordering in case of clearing LP_DEAD bits. Just to make sure the flag allowing the use LP_DEAD is seen AFTER bits are cleared. Yes, it should be described in more detail. The flapping test is one added in the patch and not related to memory ordering. I have already tried to make it stable once before, but it depends on minRecoveryLSN propagation. I’ll think about how to make it stable. > If there is an LP_DEAD bit set on a posting list on the primary, and > we need to do a posting list split against the posting tuple, we need > to be careful -- we cannot allow our new TID to look like it's LP_DEAD > immediately, before our transaction even commits/aborts. We cannot > swap out our new TID with an old LP_DEAD TID, because we'll think that > our new TID is LP_DEAD when we shouldn't. Oh, good catch! I was thinking it is safe to have additional hint bits on primary, but it seems like no. BTW I am wondering if it is possible to achieve the same situation by pg_rewind and standby promotion… > Overall, I think that this patch has serious design flaws, and that > this issue is really just a symptom of a bigger problem. Could you please advise me on something? The ways I see: * give up :) * try to fix this concept * go back to concept with LP_DEAD horizon WAL and optional cancellation * try to limit scope on “allow standby to use LP_DEAD set on primary in some cases” (by marking something in btree page probably) * look for some new way Best regards, Michail.
UPD: > I was thinking it is safe to have additional hint bits > on primary, but it seems like no. Oh, sorry for the mistake, it is about standby of course. > BTW I am wondering if it is possible > to achieve the same situation by pg_rewind and standby promotion… Looks like it is impossible, because wal_log_hints is required in order to use pg_rewind. It is possible to achieve a situation with some additional LP_DEAD on standby compared to the primary, but any change on primary would cause FPI, so LP_DEAD will be cleared. Thanks, Michail.
On Tue, Mar 29, 2022 at 4:55 AM Michail Nikolaev <michail.nikolaev@gmail.com> wrote: > > Overall, I think that this patch has serious design flaws, and that > > this issue is really just a symptom of a bigger problem. > > Could you please advise me on something? The ways I see: > * give up :) I would never tell anybody to give up on something like this, because I don't really have the right to do so. And because it really isn't my style. > * try to fix this concept > * go back to concept with LP_DEAD horizon WAL and optional cancellation > * try to limit scope on “allow standby to use LP_DEAD set on primary The simple answer is: I don't know. I could probably come up with a better answer than that, but it would take real effort, and time. > in some cases” (by marking something in btree page probably) > * look for some new way You seem like a smart guy, and I respect the effort that you have put in already -- I really do. But I think that you have unrealistic ideas about how to be successful with a project like this. The reality is that the Postgres development process gives authority to a relatively small number of committers. This is not a perfect system, at all, but it's the best system that we have. Only a minority of the committers are truly experienced with the areas of the code that your patch touches -- so the number of people that are ever likely to commit a patch like that is very small (even if the patch was perfect). You need to convince at least one of them to do so, or else your patch doesn't get into PostgreSQL, no matter what else may be true. I hope that my remarks don't seem disdainful or belittling -- that is not my intention. These are just facts. I think that you could do a better job of explaining and promoting the problem that you're trying to solve here. Emphasis on the problem, not so much the solution. Only a very small number of patches don't need to be promoted. Of course I can see that the general idea has merit, but that isn't enough. Why do *you* care about this problem so much? The answer isn't self-evident. You have to tell us why it matters so much. You must understand that this whole area is *scary*. The potential for serious data corruption bugs is very real. And because the whole area is so complicated (it is at the intersection of 2-3 complicated areas), we can expect those bugs to be hidden for a long time. We might never be 100% sure that we've fixed all of them if the initial design is not generally robust. Most patches are not like that. -- Peter Geoghegan
On Tue, Mar 29, 2022 at 5:20 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Mar 29, 2022 at 4:55 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
I think that you could do a better job of explaining and promoting the
problem that you're trying to solve here. Emphasis on the problem, not
so much the solution.
As a specific recommendation here - submit patches with a complete commit message.  Tweak it for each new version so that any prior discussion that informed the general design of the patch is reflected in the commit message.
This doesn't solve the "experience" issue by itself but does allow someone with interest to jump in without having to read an entire thread, including false-starts and half-ideas, to understand what the patch is doing, and why.  At the end of the day the patch should largely speak for itself, and depend minimally on the discussion thread, to be understood.
David J.
On Tue, Mar 22, 2022 at 6:52 AM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
Hello, Andres.
> Fails to apply at the moment: http://cfbot.cputube.org/patch_37_2947.log
Thanks for notifying me. BTW, some kind of automatic email in case of
status change could be very helpful.
> Marked as waiting for author.
New version is attached, build is passing
(https://cirrus-ci.com/build/5599876384817152), so, moving it back to
"ready for committer" .
This may be a naive comment but I'm curious: The entire new second paragraph of the README scares me:
+There are restrictions on settings LP_DEAD bits by the standby related to
+minRecoveryPoint value. In case of crash recovery standby will start to process
+queries after replaying WAL to minRecoveryPoint position (some kind of rewind to
+the previous state). A the same time setting of LP_DEAD bits are not protected
+by WAL in any way. So, to mark tuple as dead we must be sure it was "killed"
+before minRecoveryPoint (comparing the LSN of commit record). Another valid
+option is to compare "killer" LSN with index page LSN because minRecoveryPoint
+would be moved forward when the index page flushed. Also, in some cases xid of
+"killer" is unknown - for example, tuples were cleared by XLOG_HEAP2_PRUNE.
+In that case, we compare the LSN of the heap page to index page LSN.
+minRecoveryPoint value. In case of crash recovery standby will start to process
+queries after replaying WAL to minRecoveryPoint position (some kind of rewind to
+the previous state). A the same time setting of LP_DEAD bits are not protected
+by WAL in any way. So, to mark tuple as dead we must be sure it was "killed"
+before minRecoveryPoint (comparing the LSN of commit record). Another valid
+option is to compare "killer" LSN with index page LSN because minRecoveryPoint
+would be moved forward when the index page flushed. Also, in some cases xid of
+"killer" is unknown - for example, tuples were cleared by XLOG_HEAP2_PRUNE.
+In that case, we compare the LSN of the heap page to index page LSN.
In terms of having room for bugs this description seems like a lot of logic to have to get correct.
Could we just do this first pass as:
Enable recovery mode LP_DEAD hint bit updates after the first streamed CHECKPOINT record comes over from the primary.
?
Now, maybe there aren't any real concerns here but even then breaking up the patches into enabling the general feature in a limited way and then ensuring that it behaves sanely during the standby crash recovery window would likely increase the appeal and ease the burden on the potential committer.
The proposed theory here seems sound to my inexperienced ears.  I have no idea whether there are other bits, and/or assumptions, lurking around that interfere with this though.
David J.
Hello, Peter. > The simple answer is: I don't know. I could probably come up with a > better answer than that, but it would take real effort, and time. I remember you had an idea about using the LP_REDIRECT bit in btree indexes as some kind of “recently dead” flag (1). Is this idea still in progress? Maybe an additional bit could provide a space for a better solution. > I think that you could do a better job of explaining and promoting the > problem that you're trying to solve here. Emphasis on the problem, not > so much the solution. System I am working on highly depends on the performance of reading from standby. In our workloads queries on standby are sometimes 10-100x slower than on primary due to absence of LP_DEAD support. Other users have the same issues (2). I believe such functionality is great optimization for read replicas with both analytics and OLTP (read-only) workloads. > You must understand that this whole area is *scary*. The potential for > serious data corruption bugs is very real. And because the whole area > is so complicated (it is at the intersection of 2-3 complicated > areas), we can expect those bugs to be hidden for a long time. We > might never be 100% sure that we've fixed all of them if the initial > design is not generally robust. Most patches are not like that. Moved to “Waiting for Author” for now. [1]: https://www.postgresql.org/message-id/flat/CAH2-Wz%3D-BoaKgkN-MnKj6hFwO1BOJSA%2ByLMMO%2BLRZK932fNUXA%40mail.gmail.com#6d7cdebd68069cc493c11b9732fd2040 [2]: https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org Thanks, Michail.
Hello, David. Thanks for your review! > As a specific recommendation here - submit patches with a complete commit message. > Tweak it for each new version so that any prior discussion that informed the general design of > the patch is reflected in the commit message. Yes, agreed. Applied to my other patch (1). > In terms of having room for bugs this description seems like a lot of logic to have to get correct. Yes, it is the scary part. But it is contained in single is_index_lp_dead_maybe_allowed function for now. > Could we just do this first pass as: > Enable recovery mode LP_DEAD hint bit updates after the first streamed CHECKPOINT record comes over from the primary. > ? Not sure, but yes, it is better to split the patch into more detailed commits. Thanks, Michail. [1]: https://www.postgresql.org/message-id/flat/CANtu0ogzo4MsR7My9%2BNhu3to5%3Dy7G9zSzUbxfWYOn9W5FfHjTA%40mail.gmail.com#341a3c3b033f69b260120b3173a66382
On Thu, Mar 31, 2022 at 4:57 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote: > I remember you had an idea about using the LP_REDIRECT bit in btree > indexes as some kind of “recently dead” flag (1). > Is this idea still in progress? Maybe an additional bit could provide > a space for a better solution. I think that the best way to make the patch closer to being committable is to make the on-disk representation more explicit. Relying on an implicit or contextual definition for anything seems like something to avoid. This is probably the single biggest problem that I see with the patch. I suggest that you try to "work backwards". If the patch was already committed today, but had subtle bugs, then how would we be able to identify the bugs relatively easily? What would our strategy be then? -- Peter Geoghegan