Thread: Review for GetWALAvailability()
Hi,

The document explains that the "lost" value that
pg_replication_slots.wal_status reports means

    some WAL files are definitely lost and this slot cannot be used to
    resume replication anymore.

However, I observed the "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and probably I found
some issues in it.

    keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
                              wal_segment_size) + 1;

max_wal_size_mb is a number of megabytes. wal_keep_segments is a number
of WAL segment files. So it's strange to take the max of them. Should
the above be the following?

    Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
        wal_keep_segments) + 1

    if ((max_slot_wal_keep_size_mb <= 0 ||
         max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
        oldestSegMaxWalSize <= targetSeg)
        return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size. Why is
this condition necessary? The document explains that "normal" means the
claimed files are within max_wal_size. So whatever the
max_slot_wal_keep_size value is, IMO "normal" should be reported if the
WAL files claimed by the slot are within max_wal_size. Thoughts? Or, if
that condition is really necessary, the document should be updated so
that a note about the condition is added.

If the WAL files claimed by the slot exceed max_slot_wal_keep_size but
none of those WAL files have been removed yet, wal_status seems to
report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when some
claimed files have been removed, I think. Thoughts? Or is this behavior
expected and the document incorrect?

BTW, if we want to implement GetWALAvailability() as the document
advertises, we can simplify it like the attached POC patch.
Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachments
At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
> Hi,
>
> The document explains that the "lost" value that
> pg_replication_slots.wal_status reports means
>
>     some WAL files are definitely lost and this slot cannot be used to
>     resume replication anymore.
>
> However, I observed the "lost" value while inserting lots of records,
> but replication could continue normally. So I wonder if
> pg_replication_slots.wal_status may have a bug.
>
> wal_status is calculated in GetWALAvailability(), and probably I found
> some issues in it.
>
>     keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
>                               wal_segment_size) + 1;
>
> max_wal_size_mb is a number of megabytes. wal_keep_segments is a number
> of WAL segment files. So it's strange to take the max of them.

Oops! I don't want to believe I did that, but it's definitely wrong.

> Should the above be the following?
>
>     Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
>         wal_keep_segments) + 1

Looks reasonable.

>     if ((max_slot_wal_keep_size_mb <= 0 ||
>          max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
>         oldestSegMaxWalSize <= targetSeg)
>         return WALAVAIL_NORMAL;
>
> This code means that wal_status reports "normal" only when
> max_slot_wal_keep_size is negative or larger than max_wal_size. Why is
> this condition necessary? The document explains that "normal" means the
> claimed files are within max_wal_size. So whatever the
> max_slot_wal_keep_size value is, IMO "normal" should be reported if the
> WAL files claimed by the slot are within max_wal_size. Thoughts?

It was kind of hard to decide. Even when max_slot_wal_keep_size is
smaller than max_wal_size, the segments beyond max_slot_wal_keep_size
are not guaranteed to be kept. In that case the state transitions as
NORMAL->LOST, skipping the "RESERVED" state. Putting aside whether such
a setting is useful or not, I thought that state transition was somewhat
abrupt.
> Or, if that condition is really necessary, the document should be
> updated so that a note about the condition is added.

Does the following make sense?

https://www.postgresql.org/docs/13/view-pg-replication-slots.html

  normal means that the claimed files are within max_wal_size.
+ If max_slot_wal_keep_size is smaller than max_wal_size, this state
+ will not appear.

> If the WAL files claimed by the slot exceed max_slot_wal_keep_size but
> none of those WAL files have been removed yet, wal_status seems to
> report "lost". Is this expected behavior? Per the meaning of "lost"
> described in the document, "lost" should be reported only when some
> claimed files have been removed, I think. Thoughts?
>
> Or is this behavior expected and the document incorrect?

In short, it is known behavior, but it was judged as useless to prevent
it.

That can happen when the checkpointer removes up to the segment that is
being read by a walsender. I think that doesn't happen (or happens
within a narrow time window?) for physical replication but does happen
for logical replication.

During development, I once added code to walsender to exit for that
reason, but finally it was moved to InvalidateObsoleteReplicationSlots
as a somewhat different function. With the current mechanism, there is a
case where a once-invalidated slot comes back to life, but we decided to
allow that behavior and forgot to document it. Anyway, if you see
"lost", something bad is happening.

- lost means that some WAL files are definitely lost and this slot
- cannot be used to resume replication anymore.
+ lost means that some required WAL files have been removed and this
+ slot is no longer usable once disconnected during this state.

If it is crucial to document that the "lost" state may come back to the
reserved or normal state:

+ Note that there are cases where the state moves back to reserved or
+ normal state when all wal senders have left the just removed segment
+ before being terminated.
There is a case where the state moves back to the reserved or normal
state when wal senders leave the just-removed segment before being
terminated.

> BTW, if we want to implement GetWALAvailability() as the document
> advertises, we can simplify it like the attached POC patch.

I'm not sure it is right that the patch removes wal_keep_segments from
the function.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 700271fd40..199053dd4a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11240,18 +11240,25 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
       <itemizedlist>
        <listitem>
         <para><literal>normal</literal> means that the claimed files
-        are within <varname>max_wal_size</varname>.</para>
+        are within <varname>max_wal_size</varname>. If
+        <varname>max_slot_wal_keep_size</varname> is smaller than
+        <varname>max_wal_size</varname>, this state will not appear.</para>
        </listitem>
        <listitem>
         <para><literal>reserved</literal> means
         that <varname>max_wal_size</varname> is exceeded but the files are
         still held, either by some replication slot or
-        by <varname>wal_keep_segments</varname>.</para>
+        by <varname>wal_keep_segments</varname>.
+        </para>
        </listitem>
        <listitem>
-        <para><literal>lost</literal> means that some WAL files are
-        definitely lost and this slot cannot be used to resume replication
-        anymore.</para>
+        <para>
+        <literal>lost</literal> means that some required WAL files are
+        removed and this slot is no longer usable after once disconnected
+        during this state. Note that there are cases where the state moves
+        back to reserved or normal state when all wal senders have left
+        the just removed segment before being terminated.
+        </para>
        </listitem>
       </itemizedlist>
       The last two states are seen only when

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc..d1501d0cf7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9528,8 +9528,8 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	/* calculate oldest segment by max_wal_size and wal_keep_segments */
 	XLByteToSeg(currpos, currSeg, wal_segment_size);
-	keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
-							  wal_segment_size) + 1;
+	keepSegs = Max(ConvertToXSegs(max_wal_size_mb, wal_keep_segments),
+				   wal_segment_size) + 1;
 	if (currSeg > keepSegs)
 		oldestSegMaxWalSize = currSeg - keepSegs;
At Mon, 15 Jun 2020 13:42:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> Oops! I don't want to believe I did that but it's definitely wrong.

Hmm. Quite disappointing. That patch was just crap. This is the right
patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 700271fd40..199053dd4a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11240,18 +11240,25 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
       <itemizedlist>
        <listitem>
         <para><literal>normal</literal> means that the claimed files
-        are within <varname>max_wal_size</varname>.</para>
+        are within <varname>max_wal_size</varname>. If
+        <varname>max_slot_wal_keep_size</varname> is smaller than
+        <varname>max_wal_size</varname>, this state will not appear.</para>
        </listitem>
        <listitem>
         <para><literal>reserved</literal> means
         that <varname>max_wal_size</varname> is exceeded but the files are
         still held, either by some replication slot or
-        by <varname>wal_keep_segments</varname>.</para>
+        by <varname>wal_keep_segments</varname>.
+        </para>
        </listitem>
        <listitem>
-        <para><literal>lost</literal> means that some WAL files are
-        definitely lost and this slot cannot be used to resume replication
-        anymore.</para>
+        <para>
+        <literal>lost</literal> means that some required WAL files are
+        removed and this slot is no longer usable after once disconnected
+        during this state. Note that there are cases where the state moves
+        back to reserved or normal state when all wal senders have left
+        the just removed segment before being terminated.
+        </para>
        </listitem>
       </itemizedlist>
       The last two states are seen only when

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc..d6fe205eb4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9528,8 +9528,8 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	/* calculate oldest segment by max_wal_size and wal_keep_segments */
 	XLByteToSeg(currpos, currSeg, wal_segment_size);
-	keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
-							  wal_segment_size) + 1;
+	keepSegs = Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
+				   wal_keep_segments) + 1;
 	if (currSeg > keepSegs)
 		oldestSegMaxWalSize = currSeg - keepSegs;
On 2020/06/15 13:42, Kyotaro Horiguchi wrote: > At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in >> Hi, >> >> The document explains that "lost" value that >> pg_replication_slots.wal_status reports means >> >> some WAL files are definitely lost and this slot cannot be used to >> resume replication anymore. >> >> However, I observed "lost" value while inserting lots of records, >> but replication could continue normally. So I wonder if >> pg_replication_slots.wal_status may have a bug. >> >> wal_status is calculated in GetWALAvailability(), and probably I found >> some issues in it. >> >> >> keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments), >> wal_segment_size) + >> 1; >> >> max_wal_size_mb is the number of megabytes. wal_keep_segments is >> the number of WAL segment files. So it's strange to calculate max of >> them. > > Oops! I don't want to believe I did that but it's definitely wrong. > >> The above should be the following? >> >> Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size), >> wal_keep_segments) + 1 > > Looks reasonable. > >> if ((max_slot_wal_keep_size_mb <= 0 || >> max_slot_wal_keep_size_mb >= max_wal_size_mb) && >> oldestSegMaxWalSize <= targetSeg) >> return WALAVAIL_NORMAL; >> >> This code means that wal_status reports "normal" only when >> max_slot_wal_keep_size is negative or larger than max_wal_size. >> Why is this condition necessary? The document explains "normal >> means that the claimed files are within max_wal_size". So whatever >> max_slot_wal_keep_size value is, IMO that "normal" should be >> reported if the WAL files claimed by the slot are within max_wal_size. >> Thought? > > It was a kind of hard to decide. Even when max_slot_wal_keep_size is > smaller than max_wal_size, the segments more than > max_slot_wal_keep_size are not guaranteed to be kept. In that case > the state transits as NORMAL->LOST skipping the "RESERVED" state. 
> Putting aside whether the setting is useful or not, I thought that the
> state transition is somewhat abrupt.

IMO the direct transition of the state from normal to lost is OK to me
if each state is clearly defined.

>> Or, if that condition is really necessary, the document should be
>> updated so that a note about the condition is added.
>
> Does the following make sense?
>
> https://www.postgresql.org/docs/13/view-pg-replication-slots.html
>
>   normal means that the claimed files are within max_wal_size.
> + If max_slot_wal_keep_size is smaller than max_wal_size, this state
> + will not appear.

I don't think this change is enough. For example, when
max_slot_wal_keep_size is smaller than max_wal_size and the amount of
WAL files claimed by the slot is smaller than max_slot_wal_keep_size,
"reserved" is reported. But that is inconsistent with the meaning of
"reserved" in the docs.

To consider what should be reported in wal_status, could you tell me for
what purpose, and how, users are expected to use this information?

>> If the WAL files claimed by the slot exceed max_slot_wal_keep_size
>> but none of those WAL files have been removed yet, wal_status seems
>> to report "lost". Is this expected behavior? Per the meaning of
>> "lost" described in the document, "lost" should be reported only when
>> some claimed files have been removed, I think. Thoughts?
>>
>> Or is this behavior expected and the document incorrect?
>
> In short, it is known behavior, but it was judged as useless to
> prevent it.
>
> That can happen when the checkpointer removes up to the segment that
> is being read by a walsender. I think that doesn't happen (or happens
> within a narrow time window?) for physical replication but does happen
> for logical replication.
>
> During development, I once added code to walsender to exit for that
> reason, but finally it was moved to InvalidateObsoleteReplicationSlots
> as a somewhat different function.
> With the current mechanism, there is a case where a once-invalidated
> slot comes back to life, but we decided to allow that behavior and
> forgot to document it.
>
> Anyway, if you see "lost", something bad is happening.
>
> - lost means that some WAL files are definitely lost and this slot
> - cannot be used to resume replication anymore.
> + lost means that some required WAL files have been removed and this
> + slot is no longer usable once disconnected during this state.
>
> If it is crucial to document that the "lost" state may come back to
> the reserved or normal state:
>
> + Note that there are cases where the state moves back to reserved or
> + normal state when all wal senders have left the just removed segment
> + before being terminated.
>
> There is a case where the state moves back to the reserved or normal
> state when wal senders leave the just-removed segment before being
> terminated.

Even if a walsender is terminated during the "lost" state, unless the
checkpointer removes the required WAL files, the state can go back to
"reserved" after a new replication connection is established. Is this
the same as what you're explaining above?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2020/06/15 13:42, Kyotaro Horiguchi wrote: > At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in >> Hi, >> >> The document explains that "lost" value that >> pg_replication_slots.wal_status reports means >> >> some WAL files are definitely lost and this slot cannot be used to >> resume replication anymore. >> >> However, I observed "lost" value while inserting lots of records, >> but replication could continue normally. So I wonder if >> pg_replication_slots.wal_status may have a bug. >> >> wal_status is calculated in GetWALAvailability(), and probably I found >> some issues in it. >> >> >> keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments), >> wal_segment_size) + >> 1; >> >> max_wal_size_mb is the number of megabytes. wal_keep_segments is >> the number of WAL segment files. So it's strange to calculate max of >> them. > > Oops! I don't want to believe I did that but it's definitely wrong. > >> The above should be the following? >> >> Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size), >> wal_keep_segments) + 1 > > Looks reasonable. > >> if ((max_slot_wal_keep_size_mb <= 0 || >> max_slot_wal_keep_size_mb >= max_wal_size_mb) && >> oldestSegMaxWalSize <= targetSeg) >> return WALAVAIL_NORMAL; >> >> This code means that wal_status reports "normal" only when >> max_slot_wal_keep_size is negative or larger than max_wal_size. >> Why is this condition necessary? The document explains "normal >> means that the claimed files are within max_wal_size". So whatever >> max_slot_wal_keep_size value is, IMO that "normal" should be >> reported if the WAL files claimed by the slot are within max_wal_size. >> Thought? > > It was a kind of hard to decide. Even when max_slot_wal_keep_size is > smaller than max_wal_size, the segments more than > max_slot_wal_keep_size are not guaranteed to be kept. In that case > the state transits as NORMAL->LOST skipping the "RESERVED" state. 
> Putting aside whether the setting is useful or not, I thought that the > state transition is somewhat abrupt. > >> Or, if that condition is really necessary, the document should be >> updated so that the note about the condition is added. > > Does the following make sense? > > https://www.postgresql.org/docs/13/view-pg-replication-slots.html > > normal means that the claimed files are within max_wal_size. > + If max_slot_wal_keep_size is smaller than max_wal_size, this state > + will not appear. > >> If the WAL files claimed by the slot exceeds max_slot_wal_keep_size >> but any those WAL files have not been removed yet, wal_status seems >> to report "lost". Is this expected behavior? Per the meaning of "lost" >> described in the document, "lost" should be reported only when >> any claimed files are removed, I think. Thought? >> >> Or this behavior is expected and the document is incorrect? > > In short, it is known behavior but it was judged as useless to prevent > that. > > That can happen when checkpointer removes up to the segment that is > being read by walsender. I think that that doesn't happen (or > happenswithin a narrow time window?) for physical replication but > happenes for logical replication. > > While development, I once added walsender a code to exit for that > reason, but finally it is moved to InvalidateObsoleteReplicationSlots > as a bit defferent function. BTW, I read the code of InvalidateObsoleteReplicationSlots() and probably found some issues in it. 1. Each cycle of the "for" loop in InvalidateObsoleteReplicationSlots() emits the log message "terminating walsender ...". This means that if it takes more than 10ms for walsender to exit after it's signaled, the second and subsequent cycles would happen and output the same log message several times. IMO that log message should be output only once. 2. InvalidateObsoleteReplicationSlots() uses the loop to scan replication slots array and uses the "for" loop in each scan. 
Also it calls ReplicationSlotAcquire() in each "for" loop cycle, and
ReplicationSlotAcquire() uses yet another loop to scan the replication
slots array. I don't think this is good design.

ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
InvalidateObsoleteReplicationSlots() already knows the index of the slot
that we want to find. The attached patch does that. Thoughts?

3. There is a corner case where the termination of a walsender cleans up
the temporary replication slot while InvalidateObsoleteReplicationSlots()
is sleeping on ConditionVariableTimedSleep(). In this case,
ReplicationSlotAcquire() is called in the subsequent cycle of the "for"
loop, cannot find the slot, and then emits an ERROR message. This leads
to the failure of the checkpoint by the checkpointer.

To avoid this case, if SAB_Inquire is specified, ReplicationSlotAcquire()
should return a special value instead of emitting an ERROR even when it
cannot find the slot. Also, InvalidateObsoleteReplicationSlots() should
handle that special return value.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments
At Mon, 15 Jun 2020 18:59:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
> > It was kind of hard to decide. Even when max_slot_wal_keep_size is
> > smaller than max_wal_size, the segments beyond
> > max_slot_wal_keep_size are not guaranteed to be kept. In that case
> > the state transitions as NORMAL->LOST, skipping the "RESERVED" state.
> > Putting aside whether the setting is useful or not, I thought that
> > the state transition is somewhat abrupt.
>
> IMO the direct transition of the state from normal to lost is OK to me
> if each state is clearly defined.
>
> >> Or, if that condition is really necessary, the document should be
> >> updated so that a note about the condition is added.
> >
> > Does the following make sense?
> >
> > https://www.postgresql.org/docs/13/view-pg-replication-slots.html
> >
> >   normal means that the claimed files are within max_wal_size.
> > + If max_slot_wal_keep_size is smaller than max_wal_size, this state
> > + will not appear.
>
> I don't think this change is enough. For example, when
> max_slot_wal_keep_size is smaller than max_wal_size and the amount of
> WAL files claimed by the slot is smaller than max_slot_wal_keep_size,
> "reserved" is reported. But that is inconsistent with the meaning of
> "reserved" in the docs.

You're right.

> To consider what should be reported in wal_status, could you tell me
> for what purpose, and how, users are expected to use this information?

I saw "reserved" as the state where slots are working to retain
segments, and "normal" as the state indicating that "WAL segments are
within max_wal_size", which is orthogonal to the notion of "reserved".
So it seems useless to me when the retained WAL segments cannot exceed
max_wal_size. With longer descriptions they would be:

  "reserved under max_wal_size"
  "reserved over max_wal_size"
  "lost some segments"

Come to think of it, I realized that my trouble was just the wording.
Do the following wordings make sense to you?
"reserved" - retained within max_wal_size "extended" - retained over max_wal_size "lost" - lost some segments With these wordings I can live with "not extended"=>"lost". Of course more appropriate wording are welcome. > Even if walsender is terminated during the state "lost", unless > checkpointer > removes the required WAL files, the state can go back to "reserved" > after > new replication connection is established. This is the same as what > you're > explaining at the above? GetWALAvailability checks restart_lsn against lastRemovedSegNo, thus the "lost" cannot be seen unless checkpointer actually have removed the segment at restart_lsn (and restart_lsn has not been invalidated). However, walsenders are killed before that segments are actually removed so there're cases where physical walreceiver reconnects before RemoveOldXloFiles removes all segments, then removed after reconnection. "lost" can go back to "resrved" in that case. (Physical walreceiver can connect to invalid-restart_lsn slot) I noticed the another issue. If some required WALs are removed, the slot will be "invalidated", that is, restart_lsn is set to invalid value. As the result we hardly see the "lost" state. It can be "fixed" by remembering the validity of a slot separately from restart_lsn. Is that worth doing? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 16 Jun 2020 01:46:21 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > In short, it is known behavior but it was judged as useless to prevent > > that. > > That can happen when checkpointer removes up to the segment that is > > being read by walsender. I think that that doesn't happen (or > > happenswithin a narrow time window?) for physical replication but > > happenes for logical replication. > > While development, I once added walsender a code to exit for that > > reason, but finally it is moved to InvalidateObsoleteReplicationSlots > > as a bit defferent function. > > BTW, I read the code of InvalidateObsoleteReplicationSlots() and > probably > found some issues in it. > > 1. Each cycle of the "for" loop in > InvalidateObsoleteReplicationSlots() > emits the log message "terminating walsender ...". This means that > if it takes more than 10ms for walsender to exit after it's signaled, > the second and subsequent cycles would happen and output the same > log message several times. IMO that log message should be output > only once. Sounds reasonable. > 2. InvalidateObsoleteReplicationSlots() uses the loop to scan > replication > slots array and uses the "for" loop in each scan. Also it calls > ReplicationSlotAcquire() for each "for" loop cycle, and > ReplicationSlotAcquire() uses another loop to scan replication slots > array. I don't think this is good design. > > ISTM that we can get rid of ReplicationSlotAcquire()'s loop because > InvalidateObsoleteReplicationSlots() already know the index of the > slot > that we want to find. The attached patch does that. Thought? The inner loop is expected to run at most several times per checkpoint, which won't be a serious problem. However, it is better if we can get rid of that in a reasonable way. The attached patch changes the behavior for SAB_Block. Before the patch, it rescans from the first slot for the same name, but with the patch it just rechecks the same slot. 
The only caller of the function with SAB_Block is ReplicationSlotDrop,
and I can't come up with a case where another slot with the same name is
created at a different place before the condition variable fires. But
I'm not sure the change is completely safe. Maybe some assertion is
needed?

> 3. There is a corner case where the termination of a walsender cleans
> up the temporary replication slot while
> InvalidateObsoleteReplicationSlots() is sleeping on
> ConditionVariableTimedSleep(). In this case, ReplicationSlotAcquire()
> is called in the subsequent cycle of the "for" loop, cannot find the
> slot, and then emits an ERROR message. This leads to the failure of
> the checkpoint by the checkpointer.

Agreed.

> To avoid this case, if SAB_Inquire is specified,
> ReplicationSlotAcquire() should return a special value instead of
> emitting an ERROR even when it cannot find the slot. Also,
> InvalidateObsoleteReplicationSlots() should handle that special return
> value.

I thought the same thing on hearing that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020/06/16 14:00, Kyotaro Horiguchi wrote: > At Tue, 16 Jun 2020 01:46:21 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in >>> In short, it is known behavior but it was judged as useless to prevent >>> that. >>> That can happen when checkpointer removes up to the segment that is >>> being read by walsender. I think that that doesn't happen (or >>> happenswithin a narrow time window?) for physical replication but >>> happenes for logical replication. >>> While development, I once added walsender a code to exit for that >>> reason, but finally it is moved to InvalidateObsoleteReplicationSlots >>> as a bit defferent function. >> >> BTW, I read the code of InvalidateObsoleteReplicationSlots() and >> probably >> found some issues in it. >> >> 1. Each cycle of the "for" loop in >> InvalidateObsoleteReplicationSlots() >> emits the log message "terminating walsender ...". This means that >> if it takes more than 10ms for walsender to exit after it's signaled, >> the second and subsequent cycles would happen and output the same >> log message several times. IMO that log message should be output >> only once. > > Sounds reasonable. > >> 2. InvalidateObsoleteReplicationSlots() uses the loop to scan >> replication >> slots array and uses the "for" loop in each scan. Also it calls >> ReplicationSlotAcquire() for each "for" loop cycle, and >> ReplicationSlotAcquire() uses another loop to scan replication slots >> array. I don't think this is good design. >> >> ISTM that we can get rid of ReplicationSlotAcquire()'s loop because >> InvalidateObsoleteReplicationSlots() already know the index of the >> slot >> that we want to find. The attached patch does that. Thought? > > The inner loop is expected to run at most several times per > checkpoint, which won't be a serious problem. However, it is better if > we can get rid of that in a reasonable way. > > The attached patch changes the behavior for SAB_Block. 
> Before the patch, it rescans from the first slot for the same name,
> but with the patch it just rechecks the same slot. The only caller of
> the function with SAB_Block is ReplicationSlotDrop and I can't come up
> with a case where another slot with the same name is created at a
> different place before the condition variable fires. But I'm not sure
> the change is completely safe.

Yes, that change might not be safe. So I'm thinking of another approach
to fix the issues.

> Maybe some assertion is needed?
>
>> 3. There is a corner case where the termination of a walsender cleans
>> up the temporary replication slot while
>> InvalidateObsoleteReplicationSlots() is sleeping on
>> ConditionVariableTimedSleep(). In this case,
>> ReplicationSlotAcquire() is called in the subsequent cycle of the
>> "for" loop, cannot find the slot, and then emits an ERROR message.
>> This leads to the failure of the checkpoint by the checkpointer.
>
> Agreed.
>
>> To avoid this case, if SAB_Inquire is specified,
>> ReplicationSlotAcquire() should return a special value instead of
>> emitting an ERROR even when it cannot find the slot. Also,
>> InvalidateObsoleteReplicationSlots() should handle that special
>> return value.
>
> I thought the same thing on hearing that.

While reading the InvalidateObsoleteReplicationSlots() code, I found
another issue.

    ereport(LOG,
            (errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
                    wspid, NameStr(slotname))));
    (void) kill(wspid, SIGTERM);

wspid indicates the PID of the process using the slot. That process can
be a backend, for example, one executing pg_replication_slot_advance().
So "walsender" in the above log message is not always correct.

    int wspid = ReplicationSlotAcquire(NameStr(slotname), SAB_Inquire);

Why do we need to call ReplicationSlotAcquire() here and mark the slot
as used by the checkpointer? Isn't it enough to check the slot's
active_pid directly, instead?
Maybe ReplicationSlotAcquire() is necessary because
ReplicationSlotRelease() is called later? If so, why do we need to call
ReplicationSlotRelease()? ISTM that we don't need to do that if the
slot's active_pid is zero. No?

If my understanding is right, I'd like to propose the attached patch. It
introduces DeactivateReplicationSlot() and replaces the "for" loop in
InvalidateObsoleteReplicationSlots() with it. ReplicationSlotAcquire()
and ReplicationSlotRelease() are no longer called there.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments
On 2020-Jun-16, Kyotaro Horiguchi wrote: > I noticed the another issue. If some required WALs are removed, the > slot will be "invalidated", that is, restart_lsn is set to invalid > value. As the result we hardly see the "lost" state. > > It can be "fixed" by remembering the validity of a slot separately > from restart_lsn. Is that worth doing? We discussed this before. I agree it would be better to do this in some way, but I fear that if we do it naively, some code might exist that reads the LSN without realizing that it needs to check the validity flag first. On the other hand, maybe this is not a problem in practice, because if such a bug occurs, what will happen is that trying to read WAL from such a slot will return the error message that the WAL file cannot be found. Maybe this is acceptable? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Jun-17, Fujii Masao wrote: > While reading InvalidateObsoleteReplicationSlots() code, I found another issue. > > ereport(LOG, > (errmsg("terminating walsender %d because replication slot \"%s\" is too far behind", > wspid, NameStr(slotname)))); > (void) kill(wspid, SIGTERM); > > wspid indicates the PID of process using the slot. That process can be > a backend, for example, executing pg_replication_slot_advance(). > So "walsender" in the above log message is not always correct. Good point. > int wspid = ReplicationSlotAcquire(NameStr(slotname), > SAB_Inquire); > > Why do we need to call ReplicationSlotAcquire() here and mark the slot as > used by the checkpointer? Isn't it enough to check directly the slot's > active_pid, instead? > > Maybe ReplicationSlotAcquire() is necessary because > ReplicationSlotRelease() is called later? If so, why do we need to call > ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's > active_pid is zero. No? I think the point here was that in order to modify the slot you have to acquire it -- it's not valid to modify a slot you don't own. > + /* > + * Signal to terminate the process using the replication slot. > + * > + * Try to signal every 100ms until it succeeds. > + */ > + if (!killed && kill(active_pid, SIGTERM) == 0) > + killed = true; > + ConditionVariableTimedSleep(&slot->active_cv, 100, > + WAIT_EVENT_REPLICATION_SLOT_DROP); > + } while (ReplicationSlotIsActive(slot, NULL)); Note that here you're signalling only once and then sleeping many times in increments of 100ms -- you're not signalling every 100ms as the comment claims -- unless the signal fails, but you don't really expect that. On the contrary, I'd claim that the logic is reversed: if the signal fails, *then* you should stop signalling. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Wed, 17 Jun 2020 00:46:38 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > >> 2. InvalidateObsoleteReplicationSlots() uses the loop to scan > >> replication > >> slots array and uses the "for" loop in each scan. Also it calls > >> ReplicationSlotAcquire() for each "for" loop cycle, and > >> ReplicationSlotAcquire() uses another loop to scan replication slots > >> array. I don't think this is good design. > >> > >> ISTM that we can get rid of ReplicationSlotAcquire()'s loop because > >> InvalidateObsoleteReplicationSlots() already know the index of the > >> slot > >> that we want to find. The attached patch does that. Thought? > > The inner loop is expected to run at most several times per > > checkpoint, which won't be a serious problem. However, it is better if > > we can get rid of that in a reasonable way. > > The attached patch changes the behavior for SAB_Block. Before the > > patch, it rescans from the first slot for the same name, but with the > > patch it just rechecks the same slot. The only caller of the function > > with SAB_Block is ReplicationSlotDrop and I don't come up with a case > > where another slot with the same name is created at different place > > before the condition variable fires. But I'm not sure the change is > > completely safe. > > Yes, that change might not be safe. So I'm thinking another approach > to > fix the issues. > > > Maybe some assertion is needed? > > > >> 3. There is a corner case where the termination of walsender cleans up > >> the temporary replication slot while > >> InvalidateObsoleteReplicationSlots() > >> is sleeping on ConditionVariableTimedSleep(). In this case, > >> ReplicationSlotAcquire() is called in the subsequent cycle of the > >> "for" > >> loop, cannot find the slot and then emits ERROR message. This leads > >> to the failure of checkpoint by the checkpointer. > > Agreed. 
> > > >> To avoid this case, if SAB_Inquire is specified, > >> ReplicationSlotAcquire() > >> should return the special value instead of emitting ERROR even when > >> it cannot find the slot. Also InvalidateObsoleteReplicationSlots() > >> should > >> handle that special returned value. > > I thought the same thing hearing that. > > While reading InvalidateObsoleteReplicationSlots() code, I found > another issue. > > ereport(LOG, > (errmsg("terminating walsender %d > because replication slot \"%s\" is too > far behind", > wspid, > NameStr(slotname)))); > (void) kill(wspid, SIGTERM); > > wspid indicates the PID of process using the slot. That process can be > a backend, for example, executing pg_replication_slot_advance(). > So "walsender" in the above log message is not always correct. Agreed. > > int wspid = ReplicationSlotAcquire(NameStr(slotname), > SAB_Inquire); > > Why do we need to call ReplicationSlotAcquire() here and mark the slot > as > used by the checkpointer? Isn't it enough to check directly the slot's > active_pid, instead? > Maybe ReplicationSlotAcquire() is necessary because > ReplicationSlotRelease() is called later? If so, why do we need to > call > ReplicationSlotRelease()? ISTM that we don't need to do that if the > slot's > active_pid is zero. No? My understanding of the reason is that we update a slot value here. The restriction allows the owner of a slot to assume that all the slot values don't voluntarily change. slot.h:104 | * - Individual fields are protected by mutex where only the backend owning | * the slot is authorized to update the fields from its own slot. The | * backend owning the slot does not need to take this lock when reading its | * own fields, while concurrent backends not owning this slot should take the | * lock when reading this slot's data. > If my understanding is right, I'd like to propose the attached patch. 
> It introduces DeactivateReplicationSlot() and replace the "for" loop > in InvalidateObsoleteReplicationSlots() with > it. ReplicationSlotAcquire() > and ReplicationSlotRelease() are no longer called there. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 16 Jun 2020 14:31:43 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > On 2020-Jun-16, Kyotaro Horiguchi wrote: > > > I noticed another issue. If some required WALs are removed, the > > slot will be "invalidated", that is, restart_lsn is set to an invalid > > value. As a result we hardly see the "lost" state. > > > > It can be "fixed" by remembering the validity of a slot separately > > from restart_lsn. Is that worth doing? > > We discussed this before. I agree it would be better to do this > in some way, but I fear that if we do it naively, some code might exist > that reads the LSN without realizing that it needs to check the validity > flag first. Yes, that was my main concern about it. That's error-prone. How about remembering the LSN where invalidation happened? It's safe since nothing other than slot-monitoring functions would look at last_invalidated_lsn. It can be reset if active_pid is a valid pid.

InvalidateObsoleteReplicationSlots:
...
		SpinLockAcquire(&s->mutex);
+		s->data.last_invalidated_lsn = s->data.restart_lsn;
		s->data.restart_lsn = InvalidXLogRecPtr;
		SpinLockRelease(&s->mutex);

> On the other hand, maybe this is not a problem in practice, because if > such a bug occurs, what will happen is that trying to read WAL from such > a slot will return the error message that the WAL file cannot be found. > Maybe this is acceptable? I'm not sure. For my part, the problem with that is that we would need to look into the server logs to know what is actually going on. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2020/06/17 3:50, Alvaro Herrera wrote: > On 2020-Jun-17, Fujii Masao wrote: > >> While reading InvalidateObsoleteReplicationSlots() code, I found another issue. >> >> ereport(LOG, >> (errmsg("terminating walsender %d because replication slot \"%s\" is too far behind", >> wspid, NameStr(slotname)))); >> (void) kill(wspid, SIGTERM); >> >> wspid indicates the PID of process using the slot. That process can be >> a backend, for example, executing pg_replication_slot_advance(). >> So "walsender" in the above log message is not always correct. > > Good point. So InvalidateObsoleteReplicationSlots() can terminate normal backends. But do we want to do this? If we do, shouldn't we add a note about this case to the docs? Otherwise users would be surprised by the termination of backends caused by max_slot_wal_keep_size. I guess it rarely happens in practice, though. >> int wspid = ReplicationSlotAcquire(NameStr(slotname), >> SAB_Inquire); >> >> Why do we need to call ReplicationSlotAcquire() here and mark the slot as >> used by the checkpointer? Isn't it enough to check directly the slot's >> active_pid, instead? >> >> Maybe ReplicationSlotAcquire() is necessary because >> ReplicationSlotRelease() is called later? If so, why do we need to call >> ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's >> active_pid is zero. No? > > I think the point here was that in order to modify the slot you have to > acquire it -- it's not valid to modify a slot you don't own. Understood. Thanks! >> + /* >> + * Signal to terminate the process using the replication slot. >> + * >> + * Try to signal every 100ms until it succeeds. 
>> + */ >> + if (!killed && kill(active_pid, SIGTERM) == 0) >> + killed = true; >> + ConditionVariableTimedSleep(&slot->active_cv, 100, >> + WAIT_EVENT_REPLICATION_SLOT_DROP); >> + } while (ReplicationSlotIsActive(slot, NULL)); > > Note that here you're signalling only once and then sleeping many times > in increments of 100ms -- you're not signalling every 100ms as the > comment claims -- unless the signal fails, but you don't really expect > that. On the contrary, I'd claim that the logic is reversed: if the > signal fails, *then* you should stop signalling. You mean; in this code path, signaling fails only when the target process disappears just before signaling. So if it fails, slot->active_pid is expected to become 0 even without signaling more. Right? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020-Jun-17, Fujii Masao wrote: > On 2020/06/17 3:50, Alvaro Herrera wrote: > So InvalidateObsoleteReplicationSlots() can terminate normal backends. > But do we want to do this? If we want, we should add the note about this > case into the docs? Otherwise the users would be surprised at termination > of backends by max_slot_wal_keep_size. I guess that it's basically rarely > happen, though. Well, if we could distinguish a walsender from a non-walsender process, then maybe it would make sense to leave backends alive. But do we want that? I admit I don't know what would be the reason to have a non-walsender process with an active slot, so I don't have a good opinion on what to do in this case. > > > + /* > > > + * Signal to terminate the process using the replication slot. > > > + * > > > + * Try to signal every 100ms until it succeeds. > > > + */ > > > + if (!killed && kill(active_pid, SIGTERM) == 0) > > > + killed = true; > > > + ConditionVariableTimedSleep(&slot->active_cv, 100, > > > + WAIT_EVENT_REPLICATION_SLOT_DROP); > > > + } while (ReplicationSlotIsActive(slot, NULL)); > > > > Note that here you're signalling only once and then sleeping many times > > in increments of 100ms -- you're not signalling every 100ms as the > > comment claims -- unless the signal fails, but you don't really expect > > that. On the contrary, I'd claim that the logic is reversed: if the > > signal fails, *then* you should stop signalling. > > You mean; in this code path, signaling fails only when the target process > disappears just before signaling. So if it fails, slot->active_pid is > expected to become 0 even without signaling more. Right? I guess kill() can also fail if the PID now belongs to a process owned by a different user. I think we've disregarded very quick reuse of PIDs, so we needn't concern ourselves with it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > On 2020-Jun-17, Fujii Masao wrote: > > On 2020/06/17 3:50, Alvaro Herrera wrote: > > > So InvalidateObsoleteReplicationSlots() can terminate normal backends. > > But do we want to do this? If we want, we should add the note about this > > case into the docs? Otherwise the users would be surprised at termination > > of backends by max_slot_wal_keep_size. I guess that it's basically rarely > > happen, though. > > Well, if we could distinguish a walsender from a non-walsender process, > then maybe it would make sense to leave backends alive. But do we want > that? I admit I don't know what would be the reason to have a > non-walsender process with an active slot, so I don't have a good > opinion on what to do in this case. The non-walsender backend is actually doing replication work. It rather should be killed? > > > > + /* > > > > + * Signal to terminate the process using the replication slot. > > > > + * > > > > + * Try to signal every 100ms until it succeeds. > > > > + */ > > > > + if (!killed && kill(active_pid, SIGTERM) == 0) > > > > + killed = true; > > > > + ConditionVariableTimedSleep(&slot->active_cv, 100, > > > > + WAIT_EVENT_REPLICATION_SLOT_DROP); > > > > + } while (ReplicationSlotIsActive(slot, NULL)); > > > > > > Note that here you're signalling only once and then sleeping many times > > > in increments of 100ms -- you're not signalling every 100ms as the > > > comment claims -- unless the signal fails, but you don't really expect > > > that. On the contrary, I'd claim that the logic is reversed: if the > > > signal fails, *then* you should stop signalling. > > > > You mean; in this code path, signaling fails only when the target process > > disappears just before signaling. So if it fails, slot->active_pid is > > expected to become 0 even without signaling more. Right? > > I guess kill() can also fail if the PID now belongs to a process owned > by a different user. 
I think we've disregarded very quick reuse of > PIDs, so we needn't concern ourselves with it. The first call to ConditionVariableTimedSleep doesn't actually sleep, so the loop works as expected. But we may make an extra call to kill(2). Calling ConditionVariablePrepareToSleep before the loop would make it better. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Wed, 17 Jun 2020 10:17:07 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Tue, 16 Jun 2020 14:31:43 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > > On 2020-Jun-16, Kyotaro Horiguchi wrote: > > > > > I noticed the another issue. If some required WALs are removed, the > > > slot will be "invalidated", that is, restart_lsn is set to invalid > > > value. As the result we hardly see the "lost" state. > > > > > > It can be "fixed" by remembering the validity of a slot separately > > > from restart_lsn. Is that worth doing? > > > > We discussed this before. I agree it would be better to do this > > in some way, but I fear that if we do it naively, some code might exist > > that reads the LSN without realizing that it needs to check the validity > > flag first. > > Yes, that was my main concern on it. That's error-prone. How about > remembering the LSN where invalidation happened? It's safe since no > others than slot-monitoring functions would look > last_invalidated_lsn. It can be reset if active_pid is a valid pid. > > InvalidateObsoleteReplicationSlots: > ... > SpinLockAcquire(&s->mutex); > + s->data.last_invalidated_lsn = s->data.restart_lsn; > s->data.restart_lsn = InvalidXLogRecPtr; > SpinLockRelease(&s->mutex); The attached does that (Poc). No document fix included. regards. -- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d6fe205eb4..d3240d1e38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9485,20 +9485,25 @@ CreateRestartPoint(int flags)
  * (typically a slot's restart_lsn)
  *
  * Returns one of the following enum values:
- * * WALAVAIL_NORMAL means targetLSN is available because it is in the range
- *   of max_wal_size.
  *
- * * WALAVAIL_PRESERVED means it is still available by preserving extra
+ * * WALAVAIL_RESERVED means targetLSN is available and it is in the range of
+ *   max_wal_size.
+ *
+ * * WALAVAIL_EXTENDED means it is still available by preserving extra
  *   segments beyond max_wal_size. If max_slot_wal_keep_size is smaller
  *   than max_wal_size, this state is not returned.
  *
+ * * WALAVAIL_BEING_REMOVED means it is being lost. The walsender using this
+ *   slot may return to the above.
+ *
  * * WALAVAIL_REMOVED means it is definitely lost. A replication stream on
  *   a slot with this LSN cannot continue.
  *
  * * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL.
  */
 WALAvailability
-GetWALAvailability(XLogRecPtr targetLSN)
+GetWALAvailability(XLogRecPtr targetLSN, XLogSegNo last_removed_seg,
+				   bool slot_is_active)
 {
 	XLogRecPtr	currpos;		/* current write LSN */
 	XLogSegNo	currSeg;		/* segid of currpos */
@@ -9509,7 +9514,11 @@ GetWALAvailability(XLogRecPtr targetLSN)
 								 * slot */
 	uint64		keepSegs;
 
-	/* slot does not reserve WAL. Either deactivated, or has never been active */
+	/*
+	 * slot does not reserve WAL. Either deactivated, or has never been active
+	 * The caller should have passed last_invalidated_lsn as targetLSN if the
+	 * slot has been invalidated.
+	 */
 	if (XLogRecPtrIsInvalid(targetLSN))
 		return WALAVAIL_INVALID_LSN;
 
@@ -9524,7 +9533,7 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 * the first WAL segment file since startup, which causes the status being
 	 * wrong under certain abnormal conditions but that doesn't actually harm.
 	 */
-	oldestSeg = XLogGetLastRemovedSegno() + 1;
+	oldestSeg = last_removed_seg + 1;
 
 	/* calculate oldest segment by max_wal_size and wal_keep_segments */
 	XLByteToSeg(currpos, currSeg, wal_segment_size);
@@ -9544,20 +9553,21 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 */
 	if (targetSeg >= oldestSeg)
 	{
-		/*
-		 * show "normal" when targetSeg is within max_wal_size, even if
-		 * max_slot_wal_keep_size is smaller than max_wal_size.
-		 */
-		if ((max_slot_wal_keep_size_mb <= 0 ||
-			 max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
-			oldestSegMaxWalSize <= targetSeg)
-			return WALAVAIL_NORMAL;
-
-		/* being retained by slots */
-		if (oldestSlotSeg <= targetSeg)
+		/* show "reserved" when targetSeg is within max_wal_size */
+		if (oldestSegMaxWalSize <= targetSeg)
 			return WALAVAIL_RESERVED;
+
+		/* being retained by slots exceeding max_wal_size */
+		return WALAVAIL_EXTENDED;
 	}
 
+	/*
+	 * However segments required by the slot has been lost, if walsender is
+	 * active the walsender can read into the first reserved slot.
+	 */
+	if (slot_is_active)
+		return WALAVAIL_BEING_REMOVED;
+
 	/* Definitely lost */
 	return WALAVAIL_REMOVED;
 }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..f141b29d28 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_invalidated_lsn = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -1144,6 +1145,7 @@ restart:
 						(uint32) restart_lsn)));
 
 		SpinLockAcquire(&s->mutex);
+		s->last_invalidated_lsn = s->data.restart_lsn;
 		s->data.restart_lsn = InvalidXLogRecPtr;
 		SpinLockRelease(&s->mutex);
 		ReplicationSlotRelease();
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 1b929a603e..ed0abe0c39 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -243,6 +243,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 	MemoryContext per_query_ctx;
 	MemoryContext oldcontext;
 	int			slotno;
+	XLogSegNo	last_removed_seg;
 
 	/* check to see if caller supports us returning a tuplestore */
 	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -272,6 +273,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 	rsinfo->setResult = tupstore;
 	rsinfo->setDesc = tupdesc;
 
+	/*
+	 * Remember the last removed segment at this point for the consistency in
+	 * this table. Since there's no interlock between slot data and
+	 * checkpointer, the segment can be removed in-between, but that doesn't
+	 * make any practical difference.
+	 */
+	last_removed_seg = XLogGetLastRemovedSegno();
+
 	MemoryContextSwitchTo(oldcontext);
 
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
@@ -282,7 +291,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		Datum		values[PG_GET_REPLICATION_SLOTS_COLS];
 		bool		nulls[PG_GET_REPLICATION_SLOTS_COLS];
 		WALAvailability walstate;
-		XLogSegNo	last_removed_seg;
+		XLogRecPtr	targetLSN;
 		int			i;
 
 		if (!slot->in_use)
@@ -342,7 +351,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		else
 			nulls[i++] = true;
 
-		walstate = GetWALAvailability(slot_contents.data.restart_lsn);
+		/* use last_invalidated_lsn when the slot is invalidated */
+		if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
+			targetLSN = slot_contents.last_invalidated_lsn;
+		else
+			targetLSN = slot_contents.data.restart_lsn;
+
+		walstate = GetWALAvailability(targetLSN, last_removed_seg,
+									  slot_contents.active_pid != 0);
 
 		switch (walstate)
 		{
@@ -350,14 +366,18 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 				nulls[i++] = true;
 				break;
 
-			case WALAVAIL_NORMAL:
-				values[i++] = CStringGetTextDatum("normal");
-				break;
-
 			case WALAVAIL_RESERVED:
 				values[i++] = CStringGetTextDatum("reserved");
 				break;
 
+			case WALAVAIL_EXTENDED:
+				values[i++] = CStringGetTextDatum("extended");
+				break;
+
+			case WALAVAIL_BEING_REMOVED:
+				values[i++] = CStringGetTextDatum("being lost");
+				break;
+
 			case WALAVAIL_REMOVED:
 				values[i++] = CStringGetTextDatum("lost");
 				break;
@@ -367,8 +387,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		}
 
 		if (max_slot_wal_keep_size_mb >= 0 &&
-			(walstate == WALAVAIL_NORMAL || walstate == WALAVAIL_RESERVED) &&
-			((last_removed_seg = XLogGetLastRemovedSegno()) != 0))
+			(walstate == WALAVAIL_RESERVED || walstate == WALAVAIL_EXTENDED) &&
+			(last_removed_seg != 0))
 		{
 			XLogRecPtr	min_safe_lsn;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..49d9578bc5 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -270,8 +270,9 @@ extern CheckpointStatsData CheckpointStats;
 typedef enum WALAvailability
 {
 	WALAVAIL_INVALID_LSN,		/* parameter error */
-	WALAVAIL_NORMAL,			/* WAL segment is within max_wal_size */
-	WALAVAIL_RESERVED,			/* WAL segment is reserved by a slot */
+	WALAVAIL_RESERVED,			/* WAL segment is within max_wal_size */
+	WALAVAIL_EXTENDED,			/* WAL segment is reserved by a slot */
+	WALAVAIL_BEING_REMOVED,		/* WAL segment is being removed */
 	WALAVAIL_REMOVED			/* WAL segment has been removed */
 } WALAvailability;
 
@@ -326,7 +327,9 @@ extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
-extern WALAvailability GetWALAvailability(XLogRecPtr restart_lsn);
+extern WALAvailability GetWALAvailability(XLogRecPtr targetLSN,
+										  XLogSegNo last_removed_seg,
+										  bool slot_is_active);
 extern XLogRecPtr CalculateMaxmumSafeLSN(void);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr XLogRestorePoint(const char *rpName);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 917876010e..8090ca81fe 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -156,6 +156,9 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/* restart_lsn is copied here when the slot is invalidated */
+	XLogRecPtr	last_invalidated_lsn;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
On 2020/06/17 12:10, Kyotaro Horiguchi wrote: > At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in >> On 2020-Jun-17, Fujii Masao wrote: >>> On 2020/06/17 3:50, Alvaro Herrera wrote: >> >>> So InvalidateObsoleteReplicationSlots() can terminate normal backends. >>> But do we want to do this? If we want, we should add the note about this >>> case into the docs? Otherwise the users would be surprised at termination >>> of backends by max_slot_wal_keep_size. I guess that it's basically rarely >>> happen, though. >> >> Well, if we could distinguish a walsender from a non-walsender process, >> then maybe it would make sense to leave backends alive. But do we want >> that? I admit I don't know what would be the reason to have a >> non-walsender process with an active slot, so I don't have a good >> opinion on what to do in this case. > > The non-walsender backend is actually doing replication work. It > rather should be killed? I have no better opinion about this. So I agree to leave the logic as it is at least for now, i.e., we terminate the process owning the slot whatever the type of process is. > >>>>> + /* >>>>> + * Signal to terminate the process using the replication slot. >>>>> + * >>>>> + * Try to signal every 100ms until it succeeds. >>>>> + */ >>>>> + if (!killed && kill(active_pid, SIGTERM) == 0) >>>>> + killed = true; >>>>> + ConditionVariableTimedSleep(&slot->active_cv, 100, >>>>> + WAIT_EVENT_REPLICATION_SLOT_DROP); >>>>> + } while (ReplicationSlotIsActive(slot, NULL)); >>>> >>>> Note that here you're signalling only once and then sleeping many times >>>> in increments of 100ms -- you're not signalling every 100ms as the >>>> comment claims -- unless the signal fails, but you don't really expect >>>> that. On the contrary, I'd claim that the logic is reversed: if the >>>> signal fails, *then* you should stop signalling. 
>>> You mean; in this code path, signaling fails only when the target process >>> disappears just before signaling. So if it fails, slot->active_pid is >>> expected to become 0 even without signaling more. Right? >> >> I guess kill() can also fail if the PID now belongs to a process owned >> by a different user. Yes. This case means that the PostgreSQL process using the slot disappeared and the same PID was assigned to a non-PostgreSQL process. So if kill() fails for this reason, we don't need to kill() again. > I think we've disregarded very quick reuse of >> PIDs, so we needn't concern ourselves with it. > > The first time call to ConditionVariableTimedSleep doen't actually > sleep, so the loop works as expected. But we may make an extra call > to kill(2). Calling ConditionVariablePrepareToSleep beforehand of the > loop would make it better. Sorry, I failed to understand your point... Anyway, the attached is the updated version of the patch. This fixes all the issues in InvalidateObsoleteReplicationSlots() that I reported upthread. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachments
At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2020/06/17 12:10, Kyotaro Horiguchi wrote: > > At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera > > <alvherre@2ndquadrant.com> wrote in > >> On 2020-Jun-17, Fujii Masao wrote: > >>> On 2020/06/17 3:50, Alvaro Herrera wrote: > >> > >>> So InvalidateObsoleteReplicationSlots() can terminate normal backends. > >>> But do we want to do this? If we want, we should add the note about > >>> this > >>> case into the docs? Otherwise the users would be surprised at > >>> termination > >>> of backends by max_slot_wal_keep_size. I guess that it's basically > >>> rarely > >>> happen, though. > >> > >> Well, if we could distinguish a walsender from a non-walsender > >> process, > >> then maybe it would make sense to leave backends alive. But do we > >> want > >> that? I admit I don't know what would be the reason to have a > >> non-walsender process with an active slot, so I don't have a good > >> opinion on what to do in this case. > > The non-walsender backend is actually doing replication work. It > > rather should be killed? > > I have no better opinion about this. So I agree to leave the logic as > it is > at least for now, i.e., we terminate the process owning the slot > whatever > the type of process is. Agreed. > >>>>> + /* > >>>>> + * Signal to terminate the process using the replication slot. > >>>>> + * > >>>>> + * Try to signal every 100ms until it succeeds. > >>>>> + */ > >>>>> + if (!killed && kill(active_pid, SIGTERM) == 0) > >>>>> + killed = true; > >>>>> + ConditionVariableTimedSleep(&slot->active_cv, 100, > >>>>> + WAIT_EVENT_REPLICATION_SLOT_DROP); > >>>>> + } while (ReplicationSlotIsActive(slot, NULL)); > >>>> > >>>> Note that here you're signalling only once and then sleeping many > >>>> times > >>>> in increments of 100ms -- you're not signalling every 100ms as the > >>>> comment claims -- unless the signal fails, but you don't really expect > >>>> that. 
On the contrary, I'd claim that the logic is reversed: if the > >>>> signal fails, *then* you should stop signalling. > >>> > >>> You mean; in this code path, signaling fails only when the target > >>> process > >>> disappears just before signaling. So if it fails, slot->active_pid is > >>> expected to become 0 even without signaling more. Right? > >> > >> I guess kill() can also fail if the PID now belongs to a process owned > >> by a different user. > > Yes. This case means that the PostgreSQL process using the slot > disappeared > and the same PID was assigned to non-PostgreSQL process. So if kill() > fails > for this reason, we don't need to kill() again. > > > I think we've disregarded very quick reuse of > >> PIDs, so we needn't concern ourselves with it. > > The first time call to ConditionVariableTimedSleep doen't actually > > sleep, so the loop works as expected. But we may make an extra call > > to kill(2). Calling ConditionVariablePrepareToSleep beforehand of the > > loop would make it better. > > Sorry I failed to understand your point... My point is that ConditionVariableTimedSleep does *not* sleep on the CV the first time in this usage. The new version avoids the useless kill(2) call anyway, but may still make an extra call to ReplicationSlotAcquireInternal. I think we should call ConditionVariablePrepareToSleep before the surrounding for statement block. > Anyway, the attached is the updated version of the patch. This fixes > all the issues in InvalidateObsoleteReplicationSlots() that I reported > upthread. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2020/06/17 17:30, Kyotaro Horiguchi wrote: > At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in >> >> >> On 2020/06/17 12:10, Kyotaro Horiguchi wrote: >>> At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera >>> <alvherre@2ndquadrant.com> wrote in >>>> On 2020-Jun-17, Fujii Masao wrote: >>>>> On 2020/06/17 3:50, Alvaro Herrera wrote: >>>> >>>>> So InvalidateObsoleteReplicationSlots() can terminate normal backends. >>>>> But do we want to do this? If we want, we should add the note about >>>>> this >>>>> case into the docs? Otherwise the users would be surprised at >>>>> termination >>>>> of backends by max_slot_wal_keep_size. I guess that it's basically >>>>> rarely >>>>> happen, though. >>>> >>>> Well, if we could distinguish a walsender from a non-walsender >>>> process, >>>> then maybe it would make sense to leave backends alive. But do we >>>> want >>>> that? I admit I don't know what would be the reason to have a >>>> non-walsender process with an active slot, so I don't have a good >>>> opinion on what to do in this case. >>> The non-walsender backend is actually doing replication work. It >>> rather should be killed? >> >> I have no better opinion about this. So I agree to leave the logic as >> it is >> at least for now, i.e., we terminate the process owning the slot >> whatever >> the type of process is. > > Agreed. > >>>>>>> + /* >>>>>>> + * Signal to terminate the process using the replication slot. >>>>>>> + * >>>>>>> + * Try to signal every 100ms until it succeeds. 
>>>>>>> + */ >>>>>>> + if (!killed && kill(active_pid, SIGTERM) == 0) >>>>>>> + killed = true; >>>>>>> + ConditionVariableTimedSleep(&slot->active_cv, 100, >>>>>>> + WAIT_EVENT_REPLICATION_SLOT_DROP); >>>>>>> + } while (ReplicationSlotIsActive(slot, NULL)); >>>>>> >>>>>> Note that here you're signalling only once and then sleeping many >>>>>> times >>>>>> in increments of 100ms -- you're not signalling every 100ms as the >>>>>> comment claims -- unless the signal fails, but you don't really expect >>>>>> that. On the contrary, I'd claim that the logic is reversed: if the >>>>>> signal fails, *then* you should stop signalling. >>>>> >>>>> You mean; in this code path, signaling fails only when the target >>>>> process >>>>> disappears just before signaling. So if it fails, slot->active_pid is >>>>> expected to become 0 even without signaling more. Right? >>>> >>>> I guess kill() can also fail if the PID now belongs to a process owned >>>> by a different user. >> >> Yes. This case means that the PostgreSQL process using the slot >> disappeared >> and the same PID was assigned to non-PostgreSQL process. So if kill() >> fails >> for this reason, we don't need to kill() again. >> >>> I think we've disregarded very quick reuse of >>>> PIDs, so we needn't concern ourselves with it. >>> The first time call to ConditionVariableTimedSleep doen't actually >>> sleep, so the loop works as expected. But we may make an extra call >>> to kill(2). Calling ConditionVariablePrepareToSleep beforehand of the >>> loop would make it better. >> >> Sorry I failed to understand your point... > > My point is the ConditionVariableTimedSleep does *not* sleep on the CV > first time in this usage. The new version anyway avoids useless > kill(2) call, but still may make an extra call to > ReplicationSlotAcquireInternal. I think we should call > ConditionVariablePrepareToSleep before the sorrounding for statement > block. OK, so what about the attached patch? 
I added ConditionVariablePrepareToSleep() just before entering the "for" loop in InvalidateObsoleteReplicationSlots(). Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachments
I think passing the slot name when the slot is also passed is useless and wasteful; it'd be better to pass NULL for the name and ignore the strcmp() in that case -- in fact I suggest forbidding passing both name and slot. (Any failure there would risk raising an error during checkpoint, which is undesirable.) So I propose the following tweaks to your patch, and otherwise +1. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
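A minimal sketch of the calling convention being proposed -- exactly one of slot or name is given, so no strcmp() is needed when the slot is passed and no mismatch error can fire during checkpoint. The types and names here are hypothetical stand-ins, not the actual PostgreSQL structures:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for ReplicationSlot. */
typedef struct
{
    char name[64];
    int  in_use;
} DemoSlot;

/*
 * Resolve exactly one of (slot, name): callers pass either a slot pointer
 * or a slot name, never both.
 */
static DemoSlot *
resolve_slot(DemoSlot *slots, int nslots, DemoSlot *slot, const char *name)
{
    assert((slot == NULL) != (name == NULL));   /* exactly one must be given */

    if (slot != NULL)
        return slot;

    for (int i = 0; i < nslots; i++)
    {
        if (slots[i].in_use && strcmp(slots[i].name, name) == 0)
            return &slots[i];
    }
    return NULL;                                /* caller decides: -1 or error */
}
```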
Attachments
On 2020-Jun-17, Kyotaro Horiguchi wrote: > @@ -342,7 +351,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) > else > nulls[i++] = true; > > - walstate = GetWALAvailability(slot_contents.data.restart_lsn); > + /* use last_invalidated_lsn when the slot is invalidated */ > + if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn)) > + targetLSN = slot_contents.last_invalidated_lsn; > + else > + targetLSN = slot_contents.data.restart_lsn; > + > + walstate = GetWALAvailability(targetLSN, last_removed_seg, > + slot_contents.active_pid != 0); Yeah, this approach seems better overall. I'll see if I can get this done after lunch. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/06/18 3:04, Alvaro Herrera wrote: > I think passing the slot name when the slot is also passed is useless > and wasteful; it'd be better to pass NULL for the name and ignore the > strcmp() in that case -- in fact I suggest to forbid passing both name > and slot. (Any failure there would risk raising an error during > checkpoint, which is undesirable.) Sounds reasonable. > So I propose the following tweaks to your patch, and otherwise +1. Thanks for the patch! It looks good to me. Barring any objections, I will commit the patches in the master and v13 branches later. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
At Wed, 17 Jun 2020 20:13:01 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > ReplicationSlotAcquireInternal. I think we should call > > ConditionVariablePrepareToSleep before the sorrounding for statement > > block. > > OK, so what about the attached patch? I added > ConditionVariablePrepareToSleep() > just before entering the "for" loop in > InvalidateObsoleteReplicationSlots(). Thanks. ReplicationSlotAcquireInternal: + * If *slot == NULL, search for the slot with the given name. '*' seems needless here. The patch moves ConditionVariablePrepareToSleep. We need to call the function before looking into active_pid, as originally commented. Since it is not protected by ReplicationSlotControlLock, just before releasing the lock is not the correct place. The attached, on top of v3, fixes that. + s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot; + if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0) The conditions on the second line are needed for the case where slot is given, but the check is already done in SearchNamedReplicationSlot when slot is not given. I would like something like the following instead, but I don't insist on it. ReplicationSlot *s = NULL; ... if (!slot) s = SearchNamedReplicationSlot(name); else if (slot->in_use && strcmp(name, NameStr(slot->data.name)) == 0) s = slot; + ereport(ERROR, + (errcode(ERRCODE_UNDEFINED_OBJECT), + errmsg("replication slot \"%s\" does not exist", name))); The error message is not right when the given slot doesn't match the given name. It might be better to leave it to the caller. Currently no such caller exists, so I don't insist on this, but the message should be revised otherwise. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center From 8f1913ec7ff3809d11adf1611aba57e44cb42a82 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Thu, 18 Jun 2020 10:55:49 +0900 Subject: [PATCH 1/3] 001 --- src/backend/replication/slot.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index b94e11a8e7..dcc76c4783 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -412,6 +412,14 @@ retry: */ if (IsUnderPostmaster) { + /* + * Get ready to sleep on it in case it is active if needed. + * (We may end up not sleeping, but we don't want to do this while + * holding the spinlock.) + */ + if (behavior != SAB_Inquire) + ConditionVariablePrepareToSleep(&s->active_cv); + SpinLockAcquire(&s->mutex); if (s->active_pid == 0) s->active_pid = MyProcPid; @@ -438,12 +446,13 @@ retry: return active_pid; /* Wait here until we get signaled, and then restart */ - ConditionVariablePrepareToSleep(&s->active_cv); ConditionVariableSleep(&s->active_cv, WAIT_EVENT_REPLICATION_SLOT_DROP); ConditionVariableCancelSleep(); goto retry; } + else if (behavior != SAB_Inquire) + ConditionVariableCancelSleep(); /* Let everybody know we've modified this slot */ ConditionVariableBroadcast(&s->active_cv); -- 2.18.4 From 1b2f94d6bfb4508b2cbf3d552a5615ae2959e90c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Thu, 18 Jun 2020 11:25:25 +0900 Subject: [PATCH 2/3] 002 --- src/backend/replication/slot.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index dcc76c4783..8893516f00 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -380,7 +380,7 @@ static int ReplicationSlotAcquireInternal(ReplicationSlot *slot, const char *name, SlotAcquireBehavior behavior) { - ReplicationSlot *s; + ReplicationSlot 
*s = NULL; int active_pid; retry: @@ -393,8 +393,12 @@ retry: * acquire is not given. If the slot is not found, we either * return -1 or error out. */ - s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot; - if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0) + if (!slot) + s = SearchNamedReplicationSlot(name); + else if(s->in_use && strcmp(name, NameStr(s->data.name))) + s = slot; + + if (s == NULL) { if (behavior == SAB_Inquire) { -- 2.18.4 From 2d4a83abd2f278a9f9dc6e9329aed1acc191f2c8 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Thu, 18 Jun 2020 11:33:29 +0900 Subject: [PATCH 3/3] 003 --- src/backend/replication/slot.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 8893516f00..25ae334a29 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -373,8 +373,10 @@ ReplicationSlotAcquire(const char *name, SlotAcquireBehavior behavior) * Mark the specified slot as used by this process. * * If *slot == NULL, search for the slot with the given name. - * * See comments about the return value in ReplicationSlotAcquire(). + * + * If slot is not NULL, returns -1 if the slot is not in use or doesn't match + * the given name. */ static int ReplicationSlotAcquireInternal(ReplicationSlot *slot, const char *name, @@ -400,7 +402,7 @@ retry: if (s == NULL) { - if (behavior == SAB_Inquire) + if (behavior == SAB_Inquire || slot) { LWLockRelease(ReplicationSlotControlLock); return -1; -- 2.18.4
On 2020/06/18 11:44, Kyotaro Horiguchi wrote: > At Wed, 17 Jun 2020 20:13:01 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in >>> ReplicationSlotAcquireInternal. I think we should call >>> ConditionVariablePrepareToSleep before the sorrounding for statement >>> block. >> >> OK, so what about the attached patch? I added >> ConditionVariablePrepareToSleep() >> just before entering the "for" loop in >> InvalidateObsoleteReplicationSlots(). > > Thanks. Thanks for the review! > > ReplicationSlotAcquireInternal: > + * If *slot == NULL, search for the slot with the given name. > > '*' seems needless here. Fixed. Also I added "Only one of slot and name can be specified." into the comments of ReplicationSlotAcquireInternal(). > The patch moves ConditionVariablePrepareToSleep. We need to call the > function before looking into active_pid as originally commented. > Since it is not protected by ReplicationSlotControLock, just before > releasing the lock is not correct. > > The attached on top of the v3 fixes that. Yes, you're right. I merged your 0001.patch into mine. + if (behavior != SAB_Inquire) + ConditionVariablePrepareToSleep(&s->active_cv); + else if (behavior != SAB_Inquire) Isn't "behavior == SAB_Block" condition better here? I changed the patch that way. The attached is the updated version of the patch. I also merged Alvaro's patch into this. > + s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot; > + if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0) > > The conditions in the second line is needed for the case slot is > given, but it is already done in SearchNamedReplicationSlot if slot is > not given. I would like something like the following instead, but I > don't insist on it. Yes, I got rid of strcmp() check, but left is_use check as it is. I like that because it's simpler. > ReplicationSlot *s = NULL; > ... 
> if (!slot) > s = SearchNamedReplicationSlot(name); > else if(s->in_use && strcmp(name, NameStr(s->data.name))) > s = slot; > > > + ereport(ERROR, > + (errcode(ERRCODE_UNDEFINED_OBJECT), > + errmsg("replication slot \"%s\" does not exist", name))); > > The error message is not right when the given slot doesn't match the > given name. This doesn't happen after applying Alvaro's patch. BTW, using "name" here is not valid because it may be NULL. So I added the following code and used "slot_name" in log messages. + slot_name = name ? name : NameStr(slot->data.name); Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachments
On 2020/06/18 14:40, Fujii Masao wrote: > > > On 2020/06/18 11:44, Kyotaro Horiguchi wrote: >> At Wed, 17 Jun 2020 20:13:01 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in >>>> ReplicationSlotAcquireInternal. I think we should call >>>> ConditionVariablePrepareToSleep before the sorrounding for statement >>>> block. >>> >>> OK, so what about the attached patch? I added >>> ConditionVariablePrepareToSleep() >>> just before entering the "for" loop in >>> InvalidateObsoleteReplicationSlots(). >> >> Thanks. > > Thanks for the review! > >> >> ReplicationSlotAcquireInternal: >> + * If *slot == NULL, search for the slot with the given name. >> >> '*' seems needless here. > > Fixed. > > Also I added "Only one of slot and name can be specified." into > the comments of ReplicationSlotAcquireInternal(). > > >> The patch moves ConditionVariablePrepareToSleep. We need to call the >> function before looking into active_pid as originally commented. >> Since it is not protected by ReplicationSlotControLock, just before >> releasing the lock is not correct. >> >> The attached on top of the v3 fixes that. > > Yes, you're right. I merged your 0001.patch into mine. > > + if (behavior != SAB_Inquire) > + ConditionVariablePrepareToSleep(&s->active_cv); > + else if (behavior != SAB_Inquire) > > Isn't "behavior == SAB_Block" condition better here? > I changed the patch that way. > > The attached is the updated version of the patch. > I also merged Alvaro's patch into this. > > >> + s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot; >> + if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0) >> >> The conditions in the second line is needed for the case slot is >> given, but it is already done in SearchNamedReplicationSlot if slot is >> not given. I would like something like the following instead, but I >> don't insist on it. > > Yes, I got rid of strcmp() check, but left is_use check as it is. > I like that because it's simpler. 
> > >> ReplicationSlot *s = NULL; >> ... >> if (!slot) >> s = SearchNamedReplicationSlot(name); >> else if(s->in_use && strcmp(name, NameStr(s->data.name))) >> s = slot; >> >> >> + ereport(ERROR, >> + (errcode(ERRCODE_UNDEFINED_OBJECT), >> + errmsg("replication slot \"%s\" does not exist", name))); >> >> The error message is not right when the given slot doesn't match the >> given name. > > This doesn't happen after applying Alvaro's patch. > > BTW, using "name" here is not valid because it may be NULL. > So I added the following code and used "slot_name" in log messages. > > + slot_name = name ? name : NameStr(slot->data.name); Sorry, this caused compiler failure. So I fixed that and attached the updated version of the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachments
At Thu, 18 Jun 2020 14:54:47 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > Sorry, this caused compiler failure. So I fixed that and > attached the updated version of the patch. At Thu, 18 Jun 2020 14:40:55 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > + errmsg("replication slot \"%s\" does not exist", name))); > > The error message is not right when the given slot doesn't match the > > given name. > > This doesn't happen after applying Alvaro's patch. If name is specified (so slot is NULL) to ReplicationSlotAcquireInternal and the slot is not found, the ereport in the following code dereferences NULL.

====
if (s == NULL || !s->in_use)
{
    LWLockRelease(ReplicationSlotControlLock);

    if (behavior == SAB_Inquire)
        return -1;
    ereport(ERROR,
            (errcode(ERRCODE_UNDEFINED_OBJECT),
             errmsg("replication slot \"%s\" does not exist",
                    name ? name : NameStr(slot->data.name))));
}
====

regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Mmm. I hurried too much.. At Thu, 18 Jun 2020 16:32:59 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > If name is specified (so slot is NULL) to > ReplicationSlotAcquireInternal and the slot is not found, the ereport > in following code dereferences NULL. That's bogus. It is using name in that case. Sorry for the noise. I don't find a problem from a brief look at it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2020/06/18 16:36, Kyotaro Horiguchi wrote: > Mmm. I hurried too much.. > > At Thu, 18 Jun 2020 16:32:59 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in >> If name is specified (so slot is NULL) to >> ReplicationSlotAcquireInternal and the slot is not found, the ereport >> in following code dereferences NULL. > > That's bogus. It is using name in that case. Sorry for the noise. > > I don't find a problem by a brief look on it. Thanks for the review! I pushed the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020-Jun-17, Kyotaro Horiguchi wrote: > @@ -9524,7 +9533,7 @@ GetWALAvailability(XLogRecPtr targetLSN) > * the first WAL segment file since startup, which causes the status being > * wrong under certain abnormal conditions but that doesn't actually harm. > */ > - oldestSeg = XLogGetLastRemovedSegno() + 1; > + oldestSeg = last_removed_seg + 1; > > /* calculate oldest segment by max_wal_size and wal_keep_segments */ > XLByteToSeg(currpos, currSeg, wal_segment_size); This hunk should have updated the comment two lines above. However: > @@ -272,6 +273,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) > rsinfo->setResult = tupstore; > rsinfo->setDesc = tupdesc; > > + /* > + * Remember the last removed segment at this point for the consistency in > + * this table. Since there's no interlock between slot data and > + * checkpointer, the segment can be removed in-between, but that doesn't > + * make any practical difference. > + */ > + last_removed_seg = XLogGetLastRemovedSegno(); I am mystified as to why you added this change. I understand that your point here is to make all slots reported their state as compared to the same LSN, but why do it like that? If a segment is removed in between, it could mean that the view reports more lies than it would if we update the segno for each slot. I mean, suppose two slots are lagging behind and one is reported as 'extended' because when we compute it it's still in range; then a segment is removed. With your coding, we'll report both as extended, but with the original coding, we'll report the new one as lost. By the time the user reads the result, they'll read one incorrect report with the original code, and two incorrect reports with your code. So ... yes it might be more consistent, but what does that buy the user? OTOH it makes GetWALAvailability gain a new argument, which we have to document. 
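The trade-off argued above can be made concrete with a toy model (illustrative names only, not PostgreSQL code): compare snapshotting the last-removed segment once for the whole view against re-reading it per row, while a "checkpoint" advances it between the two rows.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ST_EXTENDED, ST_LOST } DemoState;

/* A slot's segment is "lost" once it falls below the oldest surviving one. */
static DemoState state_of(uint64_t slot_seg, uint64_t oldest_seg)
{
    return (slot_seg >= oldest_seg) ? ST_EXTENDED : ST_LOST;
}

/*
 * Report two slots, both sitting on segment 5, while a concurrent
 * checkpoint removes segment 5 (advancing the oldest surviving segment
 * from 5 to 6) between the two rows.  With use_snapshot, every row is
 * compared against the value read once at view start; otherwise each
 * row re-reads the current value.
 */
static void report(DemoState out[2], int use_snapshot)
{
    uint64_t slot_seg[2] = {5, 5};
    uint64_t oldest = 5;            /* current oldest surviving segment */
    uint64_t snapshot = oldest;     /* value captured at view start */

    out[0] = state_of(slot_seg[0], use_snapshot ? snapshot : oldest);
    oldest = 6;                     /* checkpoint removes segment 5 here */
    out[1] = state_of(slot_seg[1], use_snapshot ? snapshot : oldest);
}
```

The snapshot variant yields an internally consistent view in which both rows are stale; the per-row variant yields one stale row and one current row.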
> + /* > + * However segments required by the slot has been lost, if walsender is > + * active the walsender can read into the first reserved slot. > + */ > + if (slot_is_active) > + return WALAVAIL_BEING_REMOVED; I don't understand this comment; can you please clarify what you mean? I admit I don't like this slot_is_active argument you propose to add to GetWALAvailability either; previously the function can be called with an LSN coming from anywhere, not just a slot; the new argument implies that the LSN comes from a slot. (Your proposed patch doesn't document this one either.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks for looking at this. At Fri, 19 Jun 2020 18:23:59 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > On 2020-Jun-17, Kyotaro Horiguchi wrote: > > > @@ -9524,7 +9533,7 @@ GetWALAvailability(XLogRecPtr targetLSN) > > * the first WAL segment file since startup, which causes the status being > > * wrong under certain abnormal conditions but that doesn't actually harm. > > */ > > - oldestSeg = XLogGetLastRemovedSegno() + 1; > > + oldestSeg = last_removed_seg + 1; > > > > /* calculate oldest segment by max_wal_size and wal_keep_segments */ > > XLByteToSeg(currpos, currSeg, wal_segment_size); > > This hunk should have updated the comment two lines above. However: > > > @@ -272,6 +273,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) > > rsinfo->setResult = tupstore; > > rsinfo->setDesc = tupdesc; > > > > + /* > > + * Remember the last removed segment at this point for the consistency in > > + * this table. Since there's no interlock between slot data and > > + * checkpointer, the segment can be removed in-between, but that doesn't > > + * make any practical difference. > > + */ > > + last_removed_seg = XLogGetLastRemovedSegno(); > > I am mystified as to why you added this change. I understand that your > point here is to make all slots reported their state as compared to the > same LSN, but why do it like that? If a segment is removed in between, > it could mean that the view reports more lies than it would if we update > the segno for each slot. I mean, suppose two slots are lagging behind > and one is reported as 'extended' because when we compute it it's still > in range; then a segment is removed. With your coding, we'll report > both as extended, but with the original coding, we'll report the new one > as lost. By the time the user reads the result, they'll read one > incorrect report with the original code, and two incorrect reports with > your code. So ... yes it might be more consistent, but what does that > buy the user? I agree with you. 
Anyway, the view may show "wrong" statuses if concurrent WAL-file removal is running. But I can understand that it is better for the numbers in a view to be consistent. The change contributes only to that point, so I noted that it "doesn't make any practical difference". Since that part is going to be removed anyway, I dropped the changes for it. https://www.postgresql.org/message-id/9ddfbf8c-2f67-904d-44ed-cf8bc5916228@oss.nttdata.com > OTOH it makes GetWALAvailability gain a new argument, which we have to > document. > > > + /* > > + * However segments required by the slot has been lost, if walsender is > > + * active the walsender can read into the first reserved slot. > > + */ > > + if (slot_is_active) > > + return WALAVAIL_BEING_REMOVED; > > I don't understand this comment; can you please clarify what you mean? I have had comments that the "lost" state should be a definite state, that is, a state that mustn't go back to other states. I had the same comment from Fujii-san again. Suppose we are starting from the following situation:

State A:
  |---- seg n-1 ----|---- seg n ----|
        ^
        X (restart_lsn of slot S) - max_slot_wal_keep_size

If segment n-1 is removed, slot S's status becomes "lost". However, if the walsender that is using the slot has not been killed yet, the point X can move forward into segment n (State B).

State B:
  |XXXX seg n-1 XXXX|---- seg n ----|
                          ^
                          X (restart_lsn of slot S) - max_slot_wal_keep_size

This is the normal (or extended) state. If we want the state "lost" to be definitive, we cannot apply the label "lost" to State A while the slot is active. WALAVAIL_BEING_REMOVED (I noticed it had been removed for a wrong reason, so I revived it in this patch [1]) was used for the same state, that is, the segment at restart_lsn will be removed soon but has not been yet. 
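Condensed, the availability logic being proposed (before the caller folds in whether a walsender is active) amounts to an ordering of four segment numbers. The function and variable names below are simplified stand-ins for those in the patch:

```c
#include <assert.h>
#include <stdint.h>

typedef enum
{
    AVAIL_RESERVED,        /* within max_wal_size */
    AVAIL_EXTENDED,        /* beyond max_wal_size but still retained */
    AVAIL_BEING_REMOVED,   /* no longer retained; removal pending */
    AVAIL_REMOVED          /* required segment already removed */
} DemoAvail;

/*
 * targetSeg:           segment holding the slot's restart_lsn
 * oldestSlotSeg:       oldest segment retained for any slot (KeepLogSeg)
 * oldestSegMaxWalSize: oldest segment max_wal_size alone would keep
 * oldestSeg:           oldest segment still on disk (last removed + 1)
 */
static DemoAvail
classify(uint64_t targetSeg, uint64_t oldestSlotSeg,
         uint64_t oldestSegMaxWalSize, uint64_t oldestSeg)
{
    if (targetSeg >= oldestSlotSeg)
        return (targetSeg >= oldestSegMaxWalSize)
            ? AVAIL_RESERVED : AVAIL_EXTENDED;

    if (targetSeg >= oldestSeg)
        return AVAIL_BEING_REMOVED; /* State A: doomed but still on disk */

    return AVAIL_REMOVED;
}
```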
1: https://www.postgresql.org/message-id/20200406.185027.648866525989475817.horikyota.ntt@gmail.com > I admit I don't like this slot_is_active argument you propose to add to > GetWALAvailability either; previously the function can be called with > an LSN coming from anywhere, not just a slot; the new argument implies > that the LSN comes from a slot. (Your proposed patch doesn't document > this one either.) Agreed. I felt the same way at the time. I came up with another way after hearing that from you. In the attached, GetWALAvailability() returns the state assuming the walsender is not active, and the caller (pg_get_replication_slots()) considers the case where the walsender is active. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 74fb205fc87397671392f4f877b2466d1d19869c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Mon, 15 Jun 2020 16:18:23 +0900 Subject: [PATCH] Some fixes of pg_replication_slots.wal_status The column is shown as "lost" when the segment at the given LSN has been removed by a checkpoint. But in certain situations the state can come back to the "normal" state. To avoid that transition, a new state "being lost" is added, which means "the slot is nearly losing required WAL segments, but not definitely yet." Along with that change, the other status words are changed as "normal"->"reserved" and "reserved"->"extended" to clarify the meaning of each word. This also fixes the state recognition logic so that the "lost" state persists after slot invalidation happens. 
--- doc/src/sgml/catalogs.sgml | 24 +++++++--- src/backend/access/transam/xlog.c | 58 +++++++++++++---------- src/backend/replication/slot.c | 2 + src/backend/replication/slotfuncs.c | 38 ++++++++++++--- src/include/access/xlog.h | 7 +-- src/include/replication/slot.h | 3 ++ src/test/recovery/t/019_replslot_limit.pl | 25 +++++----- 7 files changed, 103 insertions(+), 54 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 5a66115df1..2c0b51bbd8 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -11239,19 +11239,29 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx Possible values are: <itemizedlist> <listitem> - <para><literal>normal</literal> means that the claimed files + <para><literal>reserved</literal> means that the claimed files are within <varname>max_wal_size</varname>.</para> </listitem> <listitem> - <para><literal>reserved</literal> means + <para><literal>extended</literal> means that <varname>max_wal_size</varname> is exceeded but the files are - still held, either by some replication slot or - by <varname>wal_keep_segments</varname>.</para> + still retained, either by some replication slot or + by <varname>wal_keep_segments</varname>. + </para> </listitem> <listitem> - <para><literal>lost</literal> means that some WAL files are - definitely lost and this slot cannot be used to resume replication - anymore.</para> + <para><literal>being lost</literal> means that the slot no longer + retains all required WAL files and some of them are to be removed at + the next checkpoint. This state can return + to <literal>reserved</literal> or <literal>extended</literal> + states. + </para> + </listitem> + <listitem> + <para> + <literal>lost</literal> means that some required WAL files have + been removed and this slot is no longer usable. 
+ </para> </listitem> </itemizedlist> The last two states are seen only when diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index a1256a103b..9d94d21d5e 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9485,15 +9485,20 @@ CreateRestartPoint(int flags) * (typically a slot's restart_lsn) * * Returns one of the following enum values: - * * WALAVAIL_NORMAL means targetLSN is available because it is in the range - * of max_wal_size. * - * * WALAVAIL_PRESERVED means it is still available by preserving extra + * * WALAVAIL_RESERVED means targetLSN is available and it is in the range of + * max_wal_size. + * + * * WALAVAIL_EXTENDED means it is still available by preserving extra * segments beyond max_wal_size. If max_slot_wal_keep_size is smaller * than max_wal_size, this state is not returned. * - * * WALAVAIL_REMOVED means it is definitely lost. A replication stream on - * a slot with this LSN cannot continue. + * * WALAVAIL_BEING_REMOVED means it is being lost and the next checkpoint will + * remove reserved segments. The walsender using this slot may return to the + * above. + * + * * WALAVAIL_REMOVED means it has been removed. A replication stream on + * a slot with this LSN cannot continue after a restart. * * * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL. */ @@ -9509,13 +9514,19 @@ GetWALAvailability(XLogRecPtr targetLSN) * slot */ uint64 keepSegs; - /* slot does not reserve WAL. Either deactivated, or has never been active */ + /* + * slot does not reserve WAL. 
Either deactivated, or has never been + * active + */ if (XLogRecPtrIsInvalid(targetLSN)) return WALAVAIL_INVALID_LSN; currpos = GetXLogWriteRecPtr(); - /* calculate oldest segment currently needed by slots */ + /* + * calculate the oldest segment currently reserved by all slots, + * considering wal_keep_segments and max_slot_wal_keep_size + */ XLByteToSeg(targetLSN, targetSeg, wal_segment_size); KeepLogSeg(currpos, &oldestSlotSeg); @@ -9526,10 +9537,9 @@ GetWALAvailability(XLogRecPtr targetLSN) */ oldestSeg = XLogGetLastRemovedSegno() + 1; - /* calculate oldest segment by max_wal_size and wal_keep_segments */ + /* calculate oldest segment by max_wal_size */ XLByteToSeg(currpos, currSeg, wal_segment_size); - keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments), - wal_segment_size) + 1; + keepSegs = ConvertToXSegs(max_wal_size_mb, wal_segment_size) + 1; if (currSeg > keepSegs) oldestSegMaxWalSize = currSeg - keepSegs; @@ -9537,27 +9547,23 @@ GetWALAvailability(XLogRecPtr targetLSN) oldestSegMaxWalSize = 1; /* - * If max_slot_wal_keep_size has changed after the last call, the segment - * that would been kept by the current setting might have been lost by the - * previous setting. No point in showing normal or keeping status values - * if the targetSeg is known to be lost. + * No point in returning reserved or extended status values if the + * targetSeg is known to be lost. */ - if (targetSeg >= oldestSeg) + if (targetSeg >= oldestSlotSeg) { - /* - * show "normal" when targetSeg is within max_wal_size, even if - * max_slot_wal_keep_size is smaller than max_wal_size. 
-	 */
-	if ((max_slot_wal_keep_size_mb <= 0 ||
-		 max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
-		oldestSegMaxWalSize <= targetSeg)
-		return WALAVAIL_NORMAL;
-
-	/* being retained by slots */
-	if (oldestSlotSeg <= targetSeg)
+		/* show "reserved" when targetSeg is within max_wal_size */
+		if (targetSeg >= oldestSegMaxWalSize)
 			return WALAVAIL_RESERVED;
+
+		/* being retained by slots exceeding max_wal_size */
+		return WALAVAIL_EXTENDED;
 	}
+
+	/* WAL segments are no longer retained but haven't been removed yet */
+	if (targetSeg >= oldestSeg)
+		return WALAVAIL_BEING_REMOVED;
+
 	/* Definitely lost */
 	return WALAVAIL_REMOVED;
 }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index a7bbcf3499..a1786489ad 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -288,6 +288,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_invalidated_lsn = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -1226,6 +1227,7 @@ restart:
 						(uint32) restart_lsn)));
 
 		SpinLockAcquire(&s->mutex);
+		s->last_invalidated_lsn = s->data.restart_lsn;
 		s->data.restart_lsn = InvalidXLogRecPtr;
 		SpinLockRelease(&s->mutex);
 		ReplicationSlotRelease();
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 06e4955de7..3deccf3448 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -283,6 +283,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		bool		nulls[PG_GET_REPLICATION_SLOTS_COLS];
 		WALAvailability walstate;
 		XLogSegNo	last_removed_seg;
+		XLogRecPtr	targetLSN;
 		int			i;
 
 		if (!slot->in_use)
@@ -342,7 +343,13 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		else
 			nulls[i++] = true;
 
-		walstate = GetWALAvailability(slot_contents.data.restart_lsn);
+		/* use last_invalidated_lsn when the slot is invalidated */
+		if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
+			targetLSN = slot_contents.last_invalidated_lsn;
+		else
+			targetLSN = slot_contents.data.restart_lsn;
+
+		walstate = GetWALAvailability(targetLSN);
 
 		switch (walstate)
 		{
@@ -350,16 +357,33 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 				nulls[i++] = true;
 				break;
 
-			case WALAVAIL_NORMAL:
-				values[i++] = CStringGetTextDatum("normal");
-				break;
-
 			case WALAVAIL_RESERVED:
 				values[i++] = CStringGetTextDatum("reserved");
 				break;
 
+			case WALAVAIL_EXTENDED:
+				values[i++] = CStringGetTextDatum("extended");
+				break;
+
+			case WALAVAIL_BEING_REMOVED:
+				values[i++] = CStringGetTextDatum("being lost");
+				break;
+
 			case WALAVAIL_REMOVED:
-				values[i++] = CStringGetTextDatum("lost");
+				/*
+				 * If the segment that walsender is currently reading has
+				 * just been removed, but the walsender moves into the next
+				 * segment just after, the state goes back to "reserved" or
+				 * "extended".  We regard that state as "being lost" rather
+				 * than "lost", since the slot is not definitely dead yet.
+				 * The same can happen when a walsender is restarted
+				 * immediately after invalidation.
+				 */
+				if (slot_contents.active_pid != 0)
+					values[i++] = CStringGetTextDatum("being lost");
+				else
+					values[i++] = CStringGetTextDatum("lost");
+				break;
 
 			default:
@@ -367,7 +391,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		}
 
 		if (max_slot_wal_keep_size_mb >= 0 &&
-			(walstate == WALAVAIL_NORMAL || walstate == WALAVAIL_RESERVED) &&
+			(walstate == WALAVAIL_RESERVED || walstate == WALAVAIL_EXTENDED) &&
 			((last_removed_seg = XLogGetLastRemovedSegno()) != 0))
 		{
 			XLogRecPtr	min_safe_lsn;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f57c..85c1f67e57 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -270,8 +270,9 @@ extern CheckpointStatsData CheckpointStats;
 typedef enum WALAvailability
 {
 	WALAVAIL_INVALID_LSN,		/* parameter error */
-	WALAVAIL_NORMAL,			/* WAL segment is within max_wal_size */
-	WALAVAIL_RESERVED,			/* WAL segment is reserved by a slot */
+	WALAVAIL_RESERVED,			/* WAL segment is within max_wal_size */
+	WALAVAIL_EXTENDED,			/* WAL segment is reserved by a slot */
+	WALAVAIL_BEING_REMOVED,		/* WAL segment is being removed */
 	WALAVAIL_REMOVED			/* WAL segment has been removed */
 } WALAvailability;
 
@@ -326,7 +327,7 @@ extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
-extern WALAvailability GetWALAvailability(XLogRecPtr restart_lsn);
+extern WALAvailability GetWALAvailability(XLogRecPtr targetLSN);
 extern XLogRecPtr CalculateMaxmumSafeLSN(void);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr XLogRestorePoint(const char *rpName);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 917876010e..8090ca81fe 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -156,6 +156,9 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/* restart_lsn is copied here when the slot is invalidated */
+	XLogRecPtr	last_invalidated_lsn;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index cba7df920c..ac0059bc7e 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -56,7 +56,7 @@ $node_standby->stop;
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status, min_safe_lsn is NULL FROM pg_replication_slots WHERE slot_name = 'rep1'"
 );
-is($result, "normal|t", 'check the catching-up state');
+is($result, "reserved|t", 'check the catching-up state');
 
 # Advance WAL by five segments (= 5MB) on master
 advance_wal($node_master, 1);
@@ -66,7 +66,7 @@ $node_master->safe_psql('postgres', "CHECKPOINT;");
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status, min_safe_lsn is NULL FROM pg_replication_slots WHERE slot_name = 'rep1'"
 );
-is($result, "normal|t", 'check that it is safe if WAL fits in max_wal_size');
+is($result, "reserved|t", 'check that it is safe if WAL fits in max_wal_size');
 
 advance_wal($node_master, 4);
 $node_master->safe_psql('postgres', "CHECKPOINT;");
@@ -75,7 +75,7 @@ $node_master->safe_psql('postgres', "CHECKPOINT;");
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status, min_safe_lsn is NULL FROM pg_replication_slots WHERE slot_name = 'rep1'"
 );
-is($result, "normal|t", 'check that slot is working');
+is($result, "reserved|t", 'check that slot is working');
 
 # The standby can reconnect to master
 $node_standby->start;
@@ -99,7 +99,7 @@ $node_master->reload;
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1'");
-is($result, "normal", 'check that max_slot_wal_keep_size is working');
+is($result, "reserved", 'check that max_slot_wal_keep_size is working');
 
 # Advance WAL again then checkpoint, reducing remain by 2 MB.
 advance_wal($node_master, 2);
@@ -108,7 +108,7 @@ $node_master->safe_psql('postgres', "CHECKPOINT;");
 # The slot is still working
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1'");
-is($result, "normal",
+is($result, "reserved",
 	'check that min_safe_lsn gets close to the current LSN');
 
 # The standby can reconnect to master
@@ -125,7 +125,7 @@ advance_wal($node_master, 6);
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status as remain FROM pg_replication_slots WHERE slot_name = 'rep1'"
 );
-is($result, "normal",
+is($result, "extended",
 	'check that wal_keep_segments overrides max_slot_wal_keep_size');
 
 # restore wal_keep_segments
 $result = $node_master->safe_psql('postgres',
@@ -138,12 +138,15 @@ $node_master->wait_for_catchup($node_standby, 'replay', $start_lsn);
 $node_standby->stop;
 
 # Advance WAL again without checkpoint, reducing remain by 6 MB.
+$result = $node_master->safe_psql('postgres',
+	"SELECT wal_status, restart_lsn, min_safe_lsn FROM pg_replication_slots WHERE slot_name= 'rep1'");
+print $result, "\n";
 advance_wal($node_master, 6);
 
 # Slot gets into 'reserved' state
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1'");
-is($result, "reserved", 'check that the slot state changes to "reserved"');
+is($result, "extended", 'check that the slot state changes to "extended"');
 
 # do checkpoint so that the next checkpoint runs too early
 $node_master->safe_psql('postgres', "CHECKPOINT;");
@@ -151,11 +154,11 @@ $node_master->safe_psql('postgres', "CHECKPOINT;");
 # Advance WAL again without checkpoint; remain goes to 0.
 advance_wal($node_master, 1);
 
-# Slot gets into 'lost' state
+# Slot gets into 'being lost' state
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status, min_safe_lsn is NULL FROM pg_replication_slots WHERE slot_name = 'rep1'"
 );
-is($result, "lost|t", 'check that the slot state changes to "lost"');
+is($result, "being lost|t", 'check that the slot state changes to "being lost"');
 
 # The standby still can connect to master before a checkpoint
 $node_standby->start;
@@ -186,7 +189,7 @@ ok( find_in_log(
 $result = $node_master->safe_psql('postgres',
 	"SELECT slot_name, active, restart_lsn IS NULL, wal_status, min_safe_lsn FROM pg_replication_slots WHERE slot_name ='rep1'"
 );
-is($result, "rep1|f|t||", 'check that the slot became inactive');
+is($result, "rep1|f|t|lost|", 'check that the slot became inactive and the state "lost" persists');
 
 # The standby no longer can connect to the master
 $logstart = get_log_size($node_standby);
@@ -259,7 +262,7 @@ sub advance_wal
 	for (my $i = 0; $i < $n; $i++)
 	{
 		$node->safe_psql('postgres',
-			"CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+			"CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
 	}
 	return;
 }
-- 
2.18.4
On 2020-Jun-16, Kyotaro Horiguchi wrote:

> I saw that the "reserved" is the state where slots are working to
> retain segments, and "normal" is the state to indicate that "WAL
> segments are within max_wal_size", which is orthogonal to the notion
> of "reserved".  So it seems to me useless when the retained WAL
> segments cannot exceed max_wal_size.
> 
> With longer descriptions they would be:
> 
> "reserved under max_wal_size"
> "reserved over max_wal_size"
> "lost some segments"

> Come to think of that, I realized that my trouble was just the
> wording.  Do the following wordings make sense to you?
> 
> "reserved" - retained within max_wal_size
> "extended" - retained over max_wal_size
> "lost"     - lost some segments

So let's add "unreserved" to denote the state where it's over the slot
size but no segments have been removed yet:

* Reserved	under max_wal_size
* Extended	past max_wal_size, but still within wal_keep_segments or
		the maximum slot size.
* Unreserved	past wal_keep_segments and the maximum slot size, but
		not yet removed.  Recoverable condition.
* Lost		lost segments.  Unrecoverable condition.

It seems better to me to save the invalidation LSN in the persistent
data rather than the in-memory data that's lost on restart.  As is, we
would lose the status in a restart, which doesn't seem good to me.  It's
just eight extra bytes to write ... should be pretty much free.

This version I propose is based on the one you posted earlier today and
is what I propose for commit.

Thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
At Tue, 23 Jun 2020 19:06:25 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2020-Jun-16, Kyotaro Horiguchi wrote:
> 
> > I saw that the "reserved" is the state where slots are working to
> > retain segments, and "normal" is the state to indicate that "WAL
> > segments are within max_wal_size", which is orthogonal to the notion
> > of "reserved".  So it seems to me useless when the retained WAL
> > segments cannot exceed max_wal_size.
> > 
> > With longer descriptions they would be:
> > 
> > "reserved under max_wal_size"
> > "reserved over max_wal_size"
> > "lost some segments"
> 
> > Come to think of that, I realized that my trouble was just the
> > wording.  Do the following wordings make sense to you?
> > 
> > "reserved" - retained within max_wal_size
> > "extended" - retained over max_wal_size
> > "lost"     - lost some segments
> 
> So let's add "unreserved" to denote the state where it's over the slot
> size but no segments have been removed yet:

Oh! Thanks for the more proper word.  It looks good to me.

> * Reserved	under max_wal_size
> * Extended	past max_wal_size, but still within wal_keep_segments or
> 		the maximum slot size.
> * Unreserved	past wal_keep_segments and the maximum slot size, but
> 		not yet removed.  Recoverable condition.
> * Lost		lost segments.  Unrecoverable condition.

Looks good, too.

> It seems better to me to save the invalidation LSN in the persistent
> data rather than the in-memory data that's lost on restart.  As is, we
> would lose the status in a restart, which doesn't seem good to me.  It's
> just eight extra bytes to write ... should be pretty much free.

Agreed.

> This version I propose is based on the one you posted earlier today and
> is what I propose for commit.

-	/* slot does not reserve WAL. Either deactivated, or has never been active */
+	/*
+	 * slot does not reserve WAL. Either deactivated, or has never been active
+	 */

Sorry, this is my fault.  The change is useless.

The code for WALAVAIL_REMOVED looks good.

 # Advance WAL again without checkpoint, reducing remain by 6 MB.
+$result = $node_master->safe_psql('postgres',
+	"SELECT wal_status, restart_lsn, min_safe_lsn FROM pg_replication_slots WHERE slot_name = 'rep1'"
+);
+print $result, "\n";

Sorry, this is my fault, too.  Removed in the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5a66115df1..49a881b262 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11239,19 +11239,29 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
       Possible values are:
        <itemizedlist>
         <listitem>
-         <para><literal>normal</literal> means that the claimed files
+         <para><literal>reserved</literal> means that the claimed files
          are within <varname>max_wal_size</varname>.</para>
         </listitem>
         <listitem>
-         <para><literal>reserved</literal> means
+         <para><literal>extended</literal> means
          that <varname>max_wal_size</varname> is exceeded but the files are
-         still held, either by some replication slot or
-         by <varname>wal_keep_segments</varname>.</para>
+         still retained, either by the replication slot or
+         by <varname>wal_keep_segments</varname>.
+         </para>
         </listitem>
         <listitem>
-         <para><literal>lost</literal> means that some WAL files are
-         definitely lost and this slot cannot be used to resume replication
-         anymore.</para>
+         <para>
+          <literal>unreserved</literal> means that the slot no longer
+          retains the required WAL files and some of them are to be removed at
+          the next checkpoint.  This state can return
+          to <literal>reserved</literal> or <literal>extended</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>lost</literal> means that some required WAL files have
+          been removed and this slot is no longer usable.
+         </para>
         </listitem>
        </itemizedlist>
       The last two states are seen only when
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1256a103b..4a4bb30418 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9485,15 +9485,20 @@ CreateRestartPoint(int flags)
  * (typically a slot's restart_lsn)
  *
  * Returns one of the following enum values:
- * * WALAVAIL_NORMAL means targetLSN is available because it is in the range
- *   of max_wal_size.
  *
- * * WALAVAIL_PRESERVED means it is still available by preserving extra
+ * * WALAVAIL_RESERVED means targetLSN is available and it is in the range of
+ *   max_wal_size.
+ *
+ * * WALAVAIL_EXTENDED means it is still available by preserving extra
  *   segments beyond max_wal_size. If max_slot_wal_keep_size is smaller
  *   than max_wal_size, this state is not returned.
  *
- * * WALAVAIL_REMOVED means it is definitely lost. A replication stream on
- *   a slot with this LSN cannot continue.
+ * * WALAVAIL_UNRESERVED means it is being lost and the next checkpoint will
+ *   remove reserved segments. The walsender using this slot may return to the
+ *   above.
+ *
+ * * WALAVAIL_REMOVED means it has been removed. A replication stream on
+ *   a slot with this LSN cannot continue after a restart.
  *
  * * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL.
  */
@@ -9509,13 +9514,18 @@ GetWALAvailability(XLogRecPtr targetLSN)
 								 * slot */
 	uint64		keepSegs;
 
-	/* slot does not reserve WAL. Either deactivated, or has never been active */
+	/*
+	 * slot does not reserve WAL. Either deactivated, or has never been active
+	 */
 	if (XLogRecPtrIsInvalid(targetLSN))
 		return WALAVAIL_INVALID_LSN;
 
 	currpos = GetXLogWriteRecPtr();
 
-	/* calculate oldest segment currently needed by slots */
+	/*
+	 * calculate the oldest segment currently reserved by all slots,
+	 * considering wal_keep_segments and max_slot_wal_keep_size
+	 */
 	XLByteToSeg(targetLSN, targetSeg, wal_segment_size);
 	KeepLogSeg(currpos, &oldestSlotSeg);
 
@@ -9526,10 +9536,9 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 */
 	oldestSeg = XLogGetLastRemovedSegno() + 1;
 
-	/* calculate oldest segment by max_wal_size and wal_keep_segments */
+	/* calculate oldest segment by max_wal_size */
 	XLByteToSeg(currpos, currSeg, wal_segment_size);
-	keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
-							  wal_segment_size) + 1;
+	keepSegs = ConvertToXSegs(max_wal_size_mb, wal_segment_size) + 1;
 
 	if (currSeg > keepSegs)
 		oldestSegMaxWalSize = currSeg - keepSegs;
@@ -9537,27 +9546,23 @@ GetWALAvailability(XLogRecPtr targetLSN)
 		oldestSegMaxWalSize = 1;
 
 	/*
-	 * If max_slot_wal_keep_size has changed after the last call, the segment
-	 * that would been kept by the current setting might have been lost by the
-	 * previous setting. No point in showing normal or keeping status values
-	 * if the targetSeg is known to be lost.
+	 * No point in returning reserved or extended status values if the
+	 * targetSeg is known to be lost.
 	 */
-	if (targetSeg >= oldestSeg)
+	if (targetSeg >= oldestSlotSeg)
 	{
-		/*
-		 * show "normal" when targetSeg is within max_wal_size, even if
-		 * max_slot_wal_keep_size is smaller than max_wal_size.
-		 */
-		if ((max_slot_wal_keep_size_mb <= 0 ||
-			 max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
-			oldestSegMaxWalSize <= targetSeg)
-			return WALAVAIL_NORMAL;
-
-		/* being retained by slots */
-		if (oldestSlotSeg <= targetSeg)
+		/* show "reserved" when targetSeg is within max_wal_size */
+		if (targetSeg >= oldestSegMaxWalSize)
 			return WALAVAIL_RESERVED;
+
+		/* being retained by slots exceeding max_wal_size */
+		return WALAVAIL_EXTENDED;
 	}
 
+	/* WAL segments are no longer retained but haven't been removed yet */
+	if (targetSeg >= oldestSeg)
+		return WALAVAIL_UNRESERVED;
+
 	/* Definitely lost */
 	return WALAVAIL_REMOVED;
 }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index a7bbcf3499..e8761f3a18 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1226,6 +1226,7 @@ restart:
 						(uint32) restart_lsn)));
 
 		SpinLockAcquire(&s->mutex);
+		s->data.invalidated_at = s->data.restart_lsn;
 		s->data.restart_lsn = InvalidXLogRecPtr;
 		SpinLockRelease(&s->mutex);
 		ReplicationSlotRelease();
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 06e4955de7..df854bc6e3 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -283,6 +283,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		bool		nulls[PG_GET_REPLICATION_SLOTS_COLS];
 		WALAvailability walstate;
 		XLogSegNo	last_removed_seg;
+		XLogRecPtr	targetLSN;
 		int			i;
 
 		if (!slot->in_use)
@@ -342,7 +343,15 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		else
 			nulls[i++] = true;
 
-		walstate = GetWALAvailability(slot_contents.data.restart_lsn);
+		/*
+		 * Report availability from invalidated_at when the slot has been
+		 * invalidated; otherwise slots would appear as invalid without any
+		 * more clues as to what happened.
+		 */
+		targetLSN = XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) ?
+			slot_contents.data.invalidated_at :
+			slot_contents.data.restart_lsn;
+		walstate = GetWALAvailability(targetLSN);
 
 		switch (walstate)
 		{
@@ -350,24 +359,47 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 				nulls[i++] = true;
 				break;
 
-			case WALAVAIL_NORMAL:
-				values[i++] = CStringGetTextDatum("normal");
-				break;
-
 			case WALAVAIL_RESERVED:
 				values[i++] = CStringGetTextDatum("reserved");
 				break;
 
+			case WALAVAIL_EXTENDED:
+				values[i++] = CStringGetTextDatum("extended");
+				break;
+
+			case WALAVAIL_UNRESERVED:
+				values[i++] = CStringGetTextDatum("unreserved");
+				break;
+
 			case WALAVAIL_REMOVED:
+
+				/*
+				 * If we read the restart_lsn long enough ago, maybe that file
+				 * has been removed by now.  However, the walsender could have
+				 * moved forward enough that it jumped to another file after
+				 * we looked.  If checkpointer signalled the process to
+				 * terminate, then it's definitely lost; but if a process is
+				 * still alive, then "unreserved" seems more appropriate.
+				 */
+				if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
+				{
+					int			pid;
+
+					SpinLockAcquire(&slot->mutex);
+					pid = slot->active_pid;
+					SpinLockRelease(&slot->mutex);
+					if (pid != 0)
+					{
+						values[i++] = CStringGetTextDatum("unreserved");
+						break;
+					}
+				}
 				values[i++] = CStringGetTextDatum("lost");
 				break;
-
-			default:
-				elog(ERROR, "invalid walstate: %d", (int) walstate);
 		}
 
 		if (max_slot_wal_keep_size_mb >= 0 &&
-			(walstate == WALAVAIL_NORMAL || walstate == WALAVAIL_RESERVED) &&
+			(walstate == WALAVAIL_RESERVED || walstate == WALAVAIL_EXTENDED) &&
 			((last_removed_seg = XLogGetLastRemovedSegno()) != 0))
 		{
 			XLogRecPtr	min_safe_lsn;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f57c..77ac4e785f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -270,8 +270,10 @@ extern CheckpointStatsData CheckpointStats;
 typedef enum WALAvailability
 {
 	WALAVAIL_INVALID_LSN,		/* parameter error */
-	WALAVAIL_NORMAL,			/* WAL segment is within max_wal_size */
-	WALAVAIL_RESERVED,			/* WAL segment is reserved by a slot */
+	WALAVAIL_RESERVED,			/* WAL segment is within max_wal_size */
+	WALAVAIL_EXTENDED,			/* WAL segment is reserved by a slot or
+								 * wal_keep_segments */
+	WALAVAIL_UNRESERVED,		/* no longer reserved, but not removed yet */
 	WALAVAIL_REMOVED			/* WAL segment has been removed */
 } WALAvailability;
 
@@ -326,7 +328,7 @@ extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
-extern WALAvailability GetWALAvailability(XLogRecPtr restart_lsn);
+extern WALAvailability GetWALAvailability(XLogRecPtr targetLSN);
 extern XLogRecPtr CalculateMaxmumSafeLSN(void);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr XLogRestorePoint(const char *rpName);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 917876010e..31362585ec 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -79,6 +79,9 @@ typedef struct ReplicationSlotPersistentData
 	/* oldest LSN that might be required by this replication slot */
 	XLogRecPtr	restart_lsn;
 
+	/* restart_lsn is copied here when the slot is invalidated */
+	XLogRecPtr	invalidated_at;
+
 	/*
 	 * Oldest LSN that the client has acked receipt for.  This is used as the
 	 * start_lsn point in case the client doesn't specify one, and also as a
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index cba7df920c..7d22ae5720 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -56,7 +56,7 @@ $node_standby->stop;
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status, min_safe_lsn is NULL FROM pg_replication_slots WHERE slot_name = 'rep1'"
 );
-is($result, "normal|t", 'check the catching-up state');
+is($result, "reserved|t", 'check the catching-up state');
 
 # Advance WAL by five segments (= 5MB) on master
 advance_wal($node_master, 1);
@@ -66,7 +66,8 @@ $node_master->safe_psql('postgres', "CHECKPOINT;");
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status, min_safe_lsn is NULL FROM pg_replication_slots WHERE slot_name = 'rep1'"
 );
-is($result, "normal|t", 'check that it is safe if WAL fits in max_wal_size');
+is($result, "reserved|t",
+	'check that it is safe if WAL fits in max_wal_size');
 
 advance_wal($node_master, 4);
 $node_master->safe_psql('postgres', "CHECKPOINT;");
@@ -75,7 +76,7 @@ $node_master->safe_psql('postgres', "CHECKPOINT;");
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status, min_safe_lsn is NULL FROM pg_replication_slots WHERE slot_name = 'rep1'"
 );
-is($result, "normal|t", 'check that slot is working');
+is($result, "reserved|t", 'check that slot is working');
 
 # The standby can reconnect to master
 $node_standby->start;
@@ -99,7 +100,7 @@ $node_master->reload;
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1'");
-is($result, "normal", 'check that max_slot_wal_keep_size is working');
+is($result, "reserved", 'check that max_slot_wal_keep_size is working');
 
 # Advance WAL again then checkpoint, reducing remain by 2 MB.
 advance_wal($node_master, 2);
@@ -108,7 +109,7 @@ $node_master->safe_psql('postgres', "CHECKPOINT;");
 # The slot is still working
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1'");
-is($result, "normal",
+is($result, "reserved",
 	'check that min_safe_lsn gets close to the current LSN');
 
 # The standby can reconnect to master
@@ -125,7 +126,7 @@ advance_wal($node_master, 6);
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status as remain FROM pg_replication_slots WHERE slot_name = 'rep1'"
 );
-is($result, "normal",
+is($result, "extended",
 	'check that wal_keep_segments overrides max_slot_wal_keep_size');
 
 # restore wal_keep_segments
 $result = $node_master->safe_psql('postgres',
@@ -143,7 +144,7 @@ advance_wal($node_master, 6);
 
 # Slot gets into 'reserved' state
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1'");
-is($result, "reserved", 'check that the slot state changes to "reserved"');
+is($result, "extended", 'check that the slot state changes to "extended"');
 
 # do checkpoint so that the next checkpoint runs too early
 $node_master->safe_psql('postgres', "CHECKPOINT;");
@@ -151,11 +152,12 @@ $node_master->safe_psql('postgres', "CHECKPOINT;");
 # Advance WAL again without checkpoint; remain goes to 0.
 advance_wal($node_master, 1);
 
-# Slot gets into 'lost' state
+# Slot gets into 'unreserved' state
 $result = $node_master->safe_psql('postgres',
 	"SELECT wal_status, min_safe_lsn is NULL FROM pg_replication_slots WHERE slot_name = 'rep1'"
 );
-is($result, "lost|t", 'check that the slot state changes to "lost"');
+is($result, "unreserved|t",
+	'check that the slot state changes to "unreserved"');
 
 # The standby still can connect to master before a checkpoint
 $node_standby->start;
@@ -186,7 +188,8 @@ ok( find_in_log(
 $result = $node_master->safe_psql('postgres',
 	"SELECT slot_name, active, restart_lsn IS NULL, wal_status, min_safe_lsn FROM pg_replication_slots WHERE slot_name ='rep1'"
 );
-is($result, "rep1|f|t||", 'check that the slot became inactive');
+is($result, "rep1|f|t|lost|",
+	'check that the slot became inactive and the state "lost" persists');
 
 # The standby no longer can connect to the master
 $logstart = get_log_size($node_standby);
Thanks for those corrections.

I have pushed this.  I think all problems Masao-san reported have been
dealt with, so we're done here.

Thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/06/25 3:27, Alvaro Herrera wrote:
> Thanks for those corrections.
> 
> I have pushed this.  I think all problems Masao-san reported have been
> dealt with, so we're done here.

Sorry for my late reply here...
Thanks for committing the patch and improving the feature!

	/*
	 * Find the oldest extant segment file. We get 1 until checkpoint removes
	 * the first WAL segment file since startup, which causes the status being
	 * wrong under certain abnormal conditions but that doesn't actually harm.
	 */
	oldestSeg = XLogGetLastRemovedSegno() + 1;

I see the point of the above comment, but this can cause wal_status to be
changed from "lost" to "unreserved" after a server restart.  Isn't this
really confusing?  At least it seems better to document that behavior.

Or if we *can ensure* that a slot with invalidated_at set always means a
"lost" slot, we can judge that wal_status is "lost" without using the
fragile XLogGetLastRemovedSegno().  Thoughts?

Or should XLogGetLastRemovedSegno() be fixed so that it returns a valid
value even after a restart?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2020-Jun-25, Fujii Masao wrote:

> 	/*
> 	 * Find the oldest extant segment file. We get 1 until checkpoint removes
> 	 * the first WAL segment file since startup, which causes the status being
> 	 * wrong under certain abnormal conditions but that doesn't actually harm.
> 	 */
> 	oldestSeg = XLogGetLastRemovedSegno() + 1;
> 
> I see the point of the above comment, but this can cause wal_status to be
> changed from "lost" to "unreserved" after the server restart. Isn't this
> really confusing? At least it seems better to document that behavior.

Hmm.

> Or if we *can ensure* that the slot with invalidated_at set always means
> "lost" slot, we can judge that wal_status is "lost" without using fragile
> XLogGetLastRemovedSegno(). Thought?

Hmm, this sounds compelling -- I think it just means we need to ensure
we reset invalidated_at to zero if the slot's restart_lsn is set to a
correct position afterwards.  I don't think we have any operation that
does that, so it should be safe -- hopefully I didn't overlook anything?
Neither copy nor advance seem to work with a slot that has invalid
restart_lsn.

> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
> value even after the restart?

This seems more work to implement.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/06/25 12:57, Alvaro Herrera wrote:
> On 2020-Jun-25, Fujii Masao wrote:
> 
>> 	/*
>> 	 * Find the oldest extant segment file. We get 1 until checkpoint removes
>> 	 * the first WAL segment file since startup, which causes the status being
>> 	 * wrong under certain abnormal conditions but that doesn't actually harm.
>> 	 */
>> 	oldestSeg = XLogGetLastRemovedSegno() + 1;
>> 
>> I see the point of the above comment, but this can cause wal_status to be
>> changed from "lost" to "unreserved" after the server restart. Isn't this
>> really confusing? At least it seems better to document that behavior.
> 
> Hmm.
> 
>> Or if we *can ensure* that the slot with invalidated_at set always means
>> "lost" slot, we can judge that wal_status is "lost" without using fragile
>> XLogGetLastRemovedSegno(). Thought?
> 
> Hmm, this sounds compelling -- I think it just means we need to ensure
> we reset invalidated_at to zero if the slot's restart_lsn is set to a
> correct position afterwards.

Yes.

> I don't think we have any operation that
> does that, so it should be safe -- hopefully I didn't overlook anything?

We need to call ReplicationSlotMarkDirty() and ReplicationSlotSave()
just after setting invalidated_at and restart_lsn in
InvalidateObsoleteReplicationSlots()?  Otherwise, restart_lsn can go
back to the previous value after a restart.

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8761f3a18..5584e5dd2c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1229,6 +1229,13 @@ restart:
 		s->data.invalidated_at = s->data.restart_lsn;
 		s->data.restart_lsn = InvalidXLogRecPtr;
 		SpinLockRelease(&s->mutex);
+
+		/*
+		 * Save this invalidated slot to disk, to ensure that the slot
+		 * is still invalid even after the server restart.
+		 */
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
 		ReplicationSlotRelease();
 
 		/* if we did anything, start from scratch */

Maybe we don't need to do this if the slot is temporary?

> Neither copy nor advance seem to work with a slot that has invalid
> restart_lsn.
> 
>> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
>> value even after the restart?
> 
> This seems more work to implement.

Yes.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
At Thu, 25 Jun 2020 14:35:34 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> On 2020/06/25 12:57, Alvaro Herrera wrote:
> > On 2020-Jun-25, Fujii Masao wrote:
> > 
> >> 	/*
> >> 	 * Find the oldest extant segment file. We get 1 until checkpoint removes
> >> 	 * the first WAL segment file since startup, which causes the status being
> >> 	 * wrong under certain abnormal conditions but that doesn't actually harm.
> >> 	 */
> >> 	oldestSeg = XLogGetLastRemovedSegno() + 1;
> >> 
> >> I see the point of the above comment, but this can cause wal_status to be
> >> changed from "lost" to "unreserved" after the server restart. Isn't this
> >> really confusing? At least it seems better to document that behavior.
> > 
> > Hmm.
> > 
> >> Or if we *can ensure* that the slot with invalidated_at set always means
> >> "lost" slot, we can judge that wal_status is "lost" without using fragile
> >> XLogGetLastRemovedSegno(). Thought?
> > 
> > Hmm, this sounds compelling -- I think it just means we need to ensure
> > we reset invalidated_at to zero if the slot's restart_lsn is set to a
> > correct position afterwards.
> 
> Yes.

That is an error-prone restriction, as discussed before.  Without
changing the updater side, invalid restart_lsn AND valid invalidated_at
can be regarded as the lost state.  With the following change suggested
by Fujii-san we can avoid the confusing status.  With the attached first
patch on top of the slot-dirtify fix below, we get "lost" for
invalidated slots after restart.

> > I don't think we have any operation that
> > does that, so it should be safe -- hopefully I didn't overlook
> > anything?
> 
> We need to call ReplicationSlotMarkDirty() and ReplicationSlotSave()
> just after setting invalidated_at and restart_lsn in
> InvalidateObsoleteReplicationSlots()?
> Otherwise, restart_lsn can go back to the previous value after the
> restart.
> 
> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index e8761f3a18..5584e5dd2c 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1229,6 +1229,13 @@ restart:
>  		s->data.invalidated_at = s->data.restart_lsn;
>  		s->data.restart_lsn = InvalidXLogRecPtr;
>  		SpinLockRelease(&s->mutex);
> +
> +		/*
> +		 * Save this invalidated slot to disk, to ensure that the slot
> +		 * is still invalid even after the server restart.
> +		 */
> +		ReplicationSlotMarkDirty();
> +		ReplicationSlotSave();
>  		ReplicationSlotRelease();
>  
>  		/* if we did anything, start from scratch */
> 
> Maybe we don't need to do this if the slot is temporary?

The only difference of temporary slots from persistent ones seems to be
the "persistency" attribute.  Actually, create_physical_replication_slot()
does the above for temporary slots.

> > Neither copy nor advance seem to work with a slot that has invalid
> > restart_lsn.
> > 
> >> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
> >> value even after the restart?
> > 
> > This seems more work to implement.
> 
> Yes.

The confusing status can be avoided without fixing it, but I prefer to
fix it.  As Fujii-san suggested upthread, couldn't we remember
lastRemovedSegNo in the control file?  (Yeah, that causes a bump of
PG_CONTROL_VERSION and CATALOG_VERSION_NO?)
-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 7c8803f2bd7267d166f8a6f1a4ca5b129aa5ae96 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 25 Jun 2020 17:03:24 +0900
Subject: [PATCH 1/2] Make slot invalidation persistent

---
 src/backend/replication/slot.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8761f3a18..26f874e3cb 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1229,7 +1229,15 @@ restart:
 		s->data.invalidated_at = s->data.restart_lsn;
 		s->data.restart_lsn = InvalidXLogRecPtr;
 		SpinLockRelease(&s->mutex);
+
+		/*
+		 * Save this invalidated slot to disk, to ensure that the slot
+		 * is still invalid even after the server restart.
+		 */
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
 		ReplicationSlotRelease();
+
 		/* if we did anything, start from scratch */
 
 		CHECK_FOR_INTERRUPTS();
-- 
2.18.4

From 0792c41c915b87474958689a2acd5c214b393015 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 25 Jun 2020 17:04:01 +0900
Subject: [PATCH 2/2] Show correct value in pg_replication_slots.wal_status
 after restart

The column may show a bogus status until checkpoint removes at least one
WAL segment. This patch changes the logic to detect the state so that
the column shows the correct status after restart.
---
 src/backend/replication/slotfuncs.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index fca18ffae5..889d6f53b6 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -283,7 +283,6 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		bool		nulls[PG_GET_REPLICATION_SLOTS_COLS];
 		WALAvailability walstate;
 		XLogSegNo	last_removed_seg;
-		XLogRecPtr	targetLSN;
 		int			i;
 
 		if (!slot->in_use)
@@ -348,10 +347,19 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		 * invalidated; otherwise slots would appear as invalid without any
 		 * more clues as to what happened.
 		 */
-		targetLSN = XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) ?
-			slot_contents.data.invalidated_at :
-			slot_contents.data.restart_lsn;
-		walstate = GetWALAvailability(targetLSN);
+
+		/*
+		 * A valid invalidated_at means the slot has been invalidated
+		 * before. If restart_lsn is invalid in addition to that, the slot
+		 * has lost required segments. We could just pass invalidated_at to
+		 * GetWALAvailability if lastRemovedSegNo had been initialized, but
+		 * the result is not reliable until checkpoint removes some segments.
+		 */
+		if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) &&
+			!XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
+			walstate = WALAVAIL_REMOVED;
+		else
+			walstate = GetWALAvailability(slot_contents.data.restart_lsn);
 
 		switch (walstate)
 		{
-- 
2.18.4
On 2020-Jun-25, Kyotaro Horiguchi wrote:

> It is an error-prone restriction, as discussed before.
> 
> Without changing the updater side, an invalid restart_lsn AND a valid
> invalidated_at can be regarded as the lost state. With the following
> change suggested by Fujii-san we can avoid the confusing status.
> 
> With the attached first patch on top of the slot-dirtying fix below, we
> get "lost" for invalidated slots after restart.

Makes sense.  I pushed this change, thanks.

> The confusing status can be avoided without fixing it, but I prefer to
> fix it. As Fujii-san suggested upthread, couldn't we remember
> lastRemovedSegNo in the control file? (Yeah, it causes a bump of
> PG_CONTROL_VERSION and CATALOG_VERSION_NO?)

I think that's a pg14 change.  Feel free to submit a patch to the
commitfest.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services