Discussion: pg_upgrade: optimize replication slot caught-up check
Hi all,

Commit 29d0a77fa6 improved pg_upgrade to allow migrating logical slots. Currently, to check if the slots are ready to be migrated, we call binary_upgrade_logical_slot_has_caught_up() for every single slot. This checks if there are any unconsumed WAL records. However, we noticed a performance issue. If there are many slots (e.g., 100 or more) or if there is a WAL backlog, checking all slots one by one takes a long time.

Here are some test results from my environment:

With an empty cluster: 1.55s
With 200 slots and 30MB backlog: 15.51s

Commit 6d3d2e8e5 introduced parallel checks per database, but a single job might still have to check too many slots, causing delays.

Since binary_upgrade_logical_slot_has_caught_up() essentially checks if any decodable record exists in the database, IIUC it is not necessary to check every slot. We can optimize this by checking only the slot with the minimum confirmed_flush_lsn. If that slot is caught up, we can assume others are too. The attached patch implements this optimization. With the patch, the test with 200 slots finished in 2.512s. The execution time is now stable regardless of the number of slots.

One thing to note is that DecodeTXNNeedSkip() also considers replication origin filters. Theoretically, a plugin could filter out specific origins, which might lead to different results. However, this is a very rare case. Even if it happens, it would just result in a false positive (the upgrade fails safely), so the impact is minimal. Therefore, the patch simplifies the check to be per-database instead of per-slot.

Feedback is very welcome.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
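To illustrate the idea, a minimal SQL sketch (not the patch's exact query) of picking the representative slot per database:

```sql
-- Illustrative only: the slot with the minimum confirmed_flush_lsn is
-- the furthest behind in this database; if it has no unconsumed
-- decodable WAL records, no other slot can have any either.
SELECT slot_name, confirmed_flush_lsn
FROM pg_catalog.pg_replication_slots
WHERE slot_type = 'logical'
  AND database = current_database()
  AND NOT temporary
ORDER BY confirmed_flush_lsn ASC
LIMIT 1;
```

pg_upgrade would then call binary_upgrade_logical_slot_has_caught_up() for just that one slot instead of for all of them.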
Attachments
> On Jan 6, 2026, at 02:02, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi all,
>
> [...]
>
> <v1-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch>

Hi Masahiko-san,

Basically I think the optimization idea makes sense to me. Reducing the check from O(slots × WAL) to O(databases × WAL) sounds like a win. Just a few small comments:

1
```
+static void process_old_cluter_logical_slot_catchup_infos(DbInfo *dbinfo, PGresult *res, void *arg);
```

Typo in the function name: cluter -> cluster

2
```
- logical_slot_infos_query = get_old_cluster_logical_slot_infos_query();
+ const char *slot_info_query = "SELECT slot_name, plugin, two_phase, failover, "
```

logical_slot_infos_query is no longer used and should be removed.

3
```
+ " ORDER BY confirmed_flush_lsn ASC "
```

I am wondering whether confirmed_flush_lsn can be NULL. If so, we may want to add "NULLS LAST".

4
```
+ Assert(num_slots == dbinfo->slot_arr.nslots);
+
+ for (int i = 0; i < num_slots; i++)
+ {
+     char *slotname;
+     bool  caught_up;
+
+     slotname = PQgetvalue(res, i, PQfnumber(res, "slot_name"));
+     caught_up = (strcmp(PQgetvalue(res, i, PQfnumber(res, "caught_up")), "t") == 0);
+
+     for (int slotnum = 0; slotnum < dbinfo->slot_arr.nslots; slotnum++)
+     {
```

Since num_slots == dbinfo->slot_arr.nslots, this is still an O(num_slots^2) check. Given that this patch's goal is to eliminate the burden of a large number of slots, maybe we can build a hash table to make the check faster (a sketch follows this message).

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
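A standalone C sketch of the hash-table idea from comment 4 (illustrative only, not actual pg_upgrade code; the table is assumed to be allocated zero-filled, e.g., with pg_malloc0):

```c
#include <stdint.h>
#include <string.h>

typedef struct SlotHashEntry
{
    const char *name;           /* NULL means an empty bucket */
    int         index;          /* position in slot_arr */
} SlotHashEntry;

/* FNV-1a string hash */
static uint32_t
hash_name(const char *s)
{
    uint32_t    h = 2166136261u;

    for (; *s != '\0'; s++)
        h = (h ^ (unsigned char) *s) * 16777619u;
    return h;
}

/* nbuckets must be a power of two strictly greater than the entry count */
static void
slot_hash_insert(SlotHashEntry *tab, uint32_t nbuckets,
                 const char *name, int index)
{
    uint32_t    i = hash_name(name) & (nbuckets - 1);

    while (tab[i].name != NULL)         /* linear probing */
        i = (i + 1) & (nbuckets - 1);
    tab[i].name = name;
    tab[i].index = index;
}

/* Returns the slot_arr index for name, or -1 if not present. */
static int
slot_hash_lookup(const SlotHashEntry *tab, uint32_t nbuckets,
                 const char *name)
{
    uint32_t    i = hash_name(name) & (nbuckets - 1);

    while (tab[i].name != NULL)
    {
        if (strcmp(tab[i].name, name) == 0)
            return tab[i].index;
        i = (i + 1) & (nbuckets - 1);
    }
    return -1;
}
```

With this, the inner loop over dbinfo->slot_arr becomes one slot_hash_lookup() per result row, making the matching O(num_slots) overall.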
On Mon, Jan 5, 2026 at 6:55 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> Hi Masahiko-san,
>
> Basically I think the optimization idea makes sense to me. Reducing the check from O(slots × WAL) to O(databases × WAL) sounds like a win. Just a few small comments:

Thank you for reviewing the patch!

> 1
> ```
> +static void process_old_cluter_logical_slot_catchup_infos(DbInfo *dbinfo, PGresult *res, void *arg);
> ```
>
> Typo in the function name: cluter -> cluster
>
> 2
> [...]
> logical_slot_infos_query is no longer used and should be removed.

Fixed.

> 3
> ```
> + " ORDER BY confirmed_flush_lsn ASC "
> ```
>
> I am wondering whether confirmed_flush_lsn can be NULL. If so, we may want to add "NULLS LAST".

Given that the query is not executed during a live check, I think confirmed_flush cannot be NULL. Likewise, temporary should never be true, but the filter is kept for consistency with slot_info_query.

> 4
> [...]
> Since num_slots == dbinfo->slot_arr.nslots, this is still an O(num_slots^2) check. Given that this patch's goal is to eliminate the burden of a large number of slots, maybe we can build a hash table to make the check faster.

Or, if we sort both queries in the same order, we can simply update the slots sequentially (see the sketch after this message).

One problem I can see in the per-database approach is that we could wrongly report invalid slots in invalid_logical_slots.txt. Even if the slot with the oldest confirmed_flush_lsn has unconsumed decodable records, other slots might have already caught up. I'll think about how to deal with this problem.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
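A minimal sketch of the sequential update, assuming both queries add the same ORDER BY slot_name so that the i-th result row corresponds to the i-th entry of slot_arr (field and column names follow the quoted patch snippet):

```c
int     i_slotname = PQfnumber(res, "slot_name");
int     i_caught_up = PQfnumber(res, "caught_up");

Assert(num_slots == dbinfo->slot_arr.nslots);

for (int i = 0; i < num_slots; i++)
{
    LogicalSlotInfo *slot = &dbinfo->slot_arr.slots[i];

    /* Both result sets are sorted identically, so rows must line up. */
    Assert(strcmp(slot->slotname, PQgetvalue(res, i, i_slotname)) == 0);

    slot->caught_up =
        (strcmp(PQgetvalue(res, i, i_caught_up), "t") == 0);
}
```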
On Tue, Jan 6, 2026 at 12:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> [...]
>
> Or, if we sort both queries in the same order, we can simply update the slots sequentially.
>
> One problem I can see in the per-database approach is that we could wrongly report invalid slots in invalid_logical_slots.txt. Even if the slot with the oldest confirmed_flush_lsn has unconsumed decodable records, other slots might have already caught up. I'll think about how to deal with this problem.

I came up with a simple idea: instead of binary_upgrade_logical_slot_has_caught_up() returning true/false, we can have it return the last decodable WAL record's LSN, and consider a logical slot caught up if its confirmed_flush_lsn has already passed that LSN. This way, we can keep scanning at most one logical slot per database while reporting the same contents as today. The function now needs to scan the entire WAL backlog, but it's still much better than the current method.

I've implemented this idea in the v2 patch with some updates:

- incorporated the review comments.
- added logic to support PG18 or older.
- added regression tests.

Feedback is very welcome.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
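At the SQL level, the v2 scheme might look roughly like this (a sketch only: the pg_lsn-returning variant of binary_upgrade_logical_slot_has_caught_up() exists only in the patch under discussion, the function is callable only in binary-upgrade mode, and details such as the no-pending-WAL return value and the exact comparison operator are glossed over):

```sql
-- Sketch: scan WAL once from the furthest-behind slot to get the last
-- decodable record's LSN, then compare every slot against that LSN.
WITH oldest AS (
    SELECT slot_name
    FROM pg_catalog.pg_replication_slots
    WHERE slot_type = 'logical' AND database = current_database()
    ORDER BY confirmed_flush_lsn ASC
    LIMIT 1
),
scan AS (
    SELECT binary_upgrade_logical_slot_has_caught_up(slot_name)
               AS last_decodable_lsn
    FROM oldest
)
SELECT s.slot_name,
       (s.confirmed_flush_lsn >= scan.last_decodable_lsn) AS caught_up
FROM pg_catalog.pg_replication_slots s, scan
WHERE s.slot_type = 'logical' AND s.database = current_database();
```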
Attachments
On Wed, Jan 7, 2026 at 12:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> [...]
>
> I came up with a simple idea: instead of binary_upgrade_logical_slot_has_caught_up() returning true/false, we can have it return the last decodable WAL record's LSN, and consider a logical slot caught up if its confirmed_flush_lsn has already passed that LSN. This way, we can keep scanning at most one logical slot per database while reporting the same contents as today. The function now needs to scan the entire WAL backlog, but it's still much better than the current method.
>
> I've implemented this idea in the v2 patch with some updates:
>
> - incorporated the review comments.
> - added logic to support PG18 or older.
> - added regression tests.
>
> Feedback is very welcome.

I've fixed the test failures reported by CI.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments
> On Jan 8, 2026, at 04:18, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> [...]
>
> I came up with a simple idea: instead of binary_upgrade_logical_slot_has_caught_up() returning true/false, we can have it return the last decodable WAL record's LSN, and consider a logical slot caught up if its confirmed_flush_lsn has already passed that LSN. This way, we can keep scanning at most one logical slot per database while reporting the same contents as today. The function now needs to scan the entire WAL backlog, but it's still much better than the current method.

Sounds better, but I need more time to look into the details. I will spend time on this tomorrow.

> I've implemented this idea in the v2 patch with some updates:
>
> - incorporated the review comments.
> - added logic to support PG18 or older.
> - added regression tests.
>
> Feedback is very welcome.
>
> <v2-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch>

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
> On Jan 8, 2026, at 05:08, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> [...]
>
> I've fixed the test failures reported by CI.
>
> <v3-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch>

I just looked into v3. Basically, it now does a shared WAL scan to find the newest decodable LSN and uses that to compare with all slots' confirmed_flush_lsn, which significantly reduces the WAL scan effort when there are many slots.

One thing I'm thinking about is that if all slots are far behind, the shared scan may still take a long time. Before this change, a scan could stop as soon as it found a pending WAL record. So after the change, when there are only a few slots and they are far behind, the scan might end up doing more work than before.

As a possible optimization, maybe we could also pass the newest confirmed_flush_lsn to the scan. Once it finds a decodable WAL record newer than that confirmed_flush_lsn, we already know all slots are behind, so the scan could stop at that point (see the sketch at the end of this message).

Regarding the code change, I have a few comments:

1
```
+ * otherwise false. If last_pending_wal_p is set by the caller, it continues
+ * scanning WAL even after detecting a decodable WAL record, in order to
+ * get the last decodable WAL record's LSN.
  */
 bool
-LogicalReplicationSlotHasPendingWal(XLogRecPtr end_of_wal)
+LogicalReplicationSlotHasPendingWal(XLogRecPtr end_of_wal,
+                                    XLogRecPtr *last_pending_wal_p)
 {
     bool        has_pending_wal = false;

     Assert(MyReplicationSlot);

+    if (last_pending_wal_p != NULL)
+        *last_pending_wal_p = InvalidXLogRecPtr;
```

The header comment seems to conflict with the code: *last_pending_wal_p is unconditionally set to InvalidXLogRecPtr, so whatever a caller set is overwritten. I think you meant to say "if last_pending_wal_p is not NULL".

2
```
@@ -286,9 +287,9 @@ binary_upgrade_logical_slot_has_caught_up(PG_FUNCTION_ARGS)
 {
     Name        slot_name;
     XLogRecPtr  end_of_wal;
-    bool        found_pending_wal;
+    XLogRecPtr  last_pending_wal;
```

The function header comment still says "returns true if ...", which should be updated.

Also, with the change, the function name becomes misleading: "has" implies a boolean result, but the function now returns the newest decodable WAL record's LSN when the slot has not caught up. The name no longer reflects the actual behavior. It looks like the function is only used by pg_upgrade, so maybe rename it.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
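A rough sketch of the suggested early exit, assuming the surrounding context of LogicalReplicationSlotHasPendingWal() (ctx, end_of_wal) plus a hypothetical newest_confirmed_flush parameter; is_decodable() is a stand-in for the real logical-decoding test:

```c
    XLogRecPtr  last_decodable = InvalidXLogRecPtr;
    XLogRecord *record;
    char       *errm = NULL;

    while ((record = XLogReadRecord(ctx->reader, &errm)) != NULL)
    {
        if (ctx->reader->EndRecPtr > end_of_wal)
            break;

        if (is_decodable(ctx, record))  /* hypothetical helper */
        {
            last_decodable = ctx->reader->ReadRecPtr;

            /*
             * This decodable record is newer than every slot's
             * confirmed_flush_lsn, so all slots are known to be
             * behind; no need to scan for an even newer record.
             */
            if (last_decodable > newest_confirmed_flush)
                break;
        }
    }
```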