Обсуждение: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bharath Rupireddy
Дата:
Hi, I came across a race condition in pg_get_publication_tables with concurrent DROP TABLE. pg_get_publication_tables collects table OIDs without locks on the first call, then opens each table on later calls. If a table is dropped in between, the function errors with "could not open relation with OID". This is common in environments where many tables are being created and dropped while pg_publication_tables is queried, such as with FOR ALL TABLES publications. Please find the attached patch that fixes this by skipping concurrently dropped tables instead of erroring out. Tables created after the list is built are simply not present in the result set, which is expected point-in-time behavior with no error. I've also added a TAP test for this issue. Thoughts? -- Bharath Rupireddy Amazon Web Services: https://aws.amazon.com
Вложения
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
shveta malik
Дата:
On Thu, Apr 23, 2026 at 1:01 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Hi, > > I came across a race condition in pg_get_publication_tables with > concurrent DROP TABLE. pg_get_publication_tables collects table OIDs > without locks on the first call, then opens each table on later calls. > If a table is dropped in between, the function errors with "could not > open relation with OID". > I agree with the problem statement, this is a weird error: postgres=# select * from pg_publication_tables; ERROR: could not open relation with OID 16390 > This is common in environments where many tables are being created and > dropped while pg_publication_tables is queried, such as with FOR ALL > TABLES publications. > Please find the attached patch that fixes this by skipping > concurrently dropped tables instead of erroring out. Tables created > after the list is built are simply not present in the result set, > which is expected point-in-time behavior with no error. I too think that this should be fixed by skipping the dropped table. Will reveiw patch soon. thanks Shveta
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
shveta malik
Дата:
On Thu, Apr 23, 2026 at 4:45 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Apr 23, 2026 at 1:01 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > Hi, > > > > I came across a race condition in pg_get_publication_tables with > > concurrent DROP TABLE. pg_get_publication_tables collects table OIDs > > without locks on the first call, then opens each table on later calls. > > If a table is dropped in between, the function errors with "could not > > open relation with OID". > > > > I agree with the problem statement, this is a weird error: > > postgres=# select * from pg_publication_tables; > ERROR: could not open relation with OID 16390 > > > This is common in environments where many tables are being created and > > dropped while pg_publication_tables is queried, such as with FOR ALL > > TABLES publications. > > Please find the attached patch that fixes this by skipping > > concurrently dropped tables instead of erroring out. Tables created > > after the list is built are simply not present in the result set, > > which is expected point-in-time behavior with no error. > > I too think that this should be fixed by skipping the dropped table. > Will reveiw patch soon. > Bharath, I reviewed the patch. I personally think that manually incrementing the call counter of SRF (funcctx->call_cntr++) in pg_get_publication_tables() is not a good idea. I think these are read-only for us and any changes to SRF fields must use SRF macros. I tried to find if any other code-part does that, found one refernce in hstore_svals(): /* ugly ugly ugly. why no macro for this? */ (funcctx)->call_cntr++; Having said that, I could not find any other way to implement the fix also. Did you try exploring 'SRF_RETURN_NEXT_NULL' in this case? I am not very sure about this as well, as it will end up retruning NULL and may impact output and its usage in tablesync too. Let's see what others have to say on this fix. thanks Shveta
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bertrand Drouvot
Дата:
Hi, On Fri, Apr 24, 2026 at 11:19:40AM +0530, shveta malik wrote: > On Thu, Apr 23, 2026 at 4:45 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Thu, Apr 23, 2026 at 1:01 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > I tried to find if any other code-part does that, found one refernce > in hstore_svals(): > > /* ugly ugly ugly. why no macro for this? */ > (funcctx)->call_cntr++; > > Having said that, I could not find any other way to implement the fix > also. What about introducing a publication_tables_state struct stored in user_fctx that carries both the list and a private position index? (kind of what pg_timezone_abbrevs_zone() is doing). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
shveta malik
Дата:
On Fri, Apr 24, 2026 at 11:51 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Fri, Apr 24, 2026 at 11:19:40AM +0530, shveta malik wrote: > > On Thu, Apr 23, 2026 at 4:45 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Thu, Apr 23, 2026 at 1:01 AM Bharath Rupireddy > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > I tried to find if any other code-part does that, found one refernce > > in hstore_svals(): > > > > /* ugly ugly ugly. why no macro for this? */ > > (funcctx)->call_cntr++; > > > > Having said that, I could not find any other way to implement the fix > > also. > > What about introducing a publication_tables_state struct stored in user_fctx > that carries both the list and a private position index? (kind of what > pg_timezone_abbrevs_zone() is doing). > Yeah, that is a good idea. Seems doable. thanks Shveta
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bharath Rupireddy
Дата:
Hi, On Fri, Apr 24, 2026 at 12:17 AM shveta malik <shveta.malik@gmail.com> wrote: > > > What about introducing a publication_tables_state struct stored in user_fctx > > that carries both the list and a private position index? (kind of what > > pg_timezone_abbrevs_zone() is doing). > > Yeah, that is a good idea. Seems doable. +1. Thanks for the pointer. Adding a new struct to carry both the table_infos and the current index into it seems simple with a smaller diff. If I were to think of another approach (I don't prefer this approach anyway), we could convert pg_get_publication_tables from the current value-per-call SRF function (SRF_IS_FIRSTCALL + SRF_RETURN_NEXT) to a materialized SRF function (InitMaterializedSRF + tuplestore_putvalues). With a materialized SRF function, there is no need to add a new structure or maintain per-call context - table_infos becomes a local variable, and we skip placing anything related to dropped tables into the tuplestore immediately in the second loop (the first loop remains the same, preparing the table_infos list). This approach seems more complex with a larger diff and requires use of InitMaterializedSRF for versions >= PG16, SetSingleFuncCall for PG15, and for PG14 requires inlining SetSingleFuncCall. I prefer adding the new struct to carry both table_infos and the current index into it with the current value-per-call SRF function, unless others have better ideas. If okay, I will send a new patch soon. Thank you! -- Bharath Rupireddy Amazon Web Services: https://aws.amazon.com
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
shveta malik
Дата:
On Sat, Apr 25, 2026 at 1:51 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Hi, > > On Fri, Apr 24, 2026 at 12:17 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > What about introducing a publication_tables_state struct stored in user_fctx > > > that carries both the list and a private position index? (kind of what > > > pg_timezone_abbrevs_zone() is doing). > > > > Yeah, that is a good idea. Seems doable. > > +1. Thanks for the pointer. Adding a new struct to carry both the > table_infos and the current index into it seems simple with a smaller > diff. > > If I were to think of another approach (I don't prefer this approach > anyway), we could convert pg_get_publication_tables from the current > value-per-call SRF function (SRF_IS_FIRSTCALL + SRF_RETURN_NEXT) to a > materialized SRF function (InitMaterializedSRF + > tuplestore_putvalues). With a materialized SRF function, there is no > need to add a new structure or maintain per-call context - table_infos > becomes a local variable, and we skip placing anything related to > dropped tables into the tuplestore immediately in the second loop (the > first loop remains the same, preparing the table_infos list). This > approach seems more complex with a larger diff and requires use of > InitMaterializedSRF for versions >= PG16, SetSingleFuncCall for PG15, > and for PG14 requires inlining SetSingleFuncCall. > > I prefer adding the new struct to carry both table_infos and the > current index into it with the current value-per-call SRF function, > unless others have better ideas. +1. It is simpler than the Materialization concept. > If okay, I will send a new patch > soon. Thank you! > Sure, Thanks! thanks Shveta
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bharath Rupireddy
Дата:
Hi, On Sun, Apr 26, 2026 at 8:45 PM shveta malik <shveta.malik@gmail.com> wrote: > > > I prefer adding the new struct to carry both table_infos and the > > current index into it with the current value-per-call SRF function, > > unless others have better ideas. > > +1. It is simpler than the Materialization concept. > > > If okay, I will send a new patch > > soon. Thank you! > > Sure, Thanks! Attached v2 patch. I also refactored the test a bit. Please review. Thank you! I added this to the current CF: https://commitfest.postgresql.org/patch/6715/. -- Bharath Rupireddy Amazon Web Services: https://aws.amazon.com
Вложения
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bertrand Drouvot
Дата:
Hi, On Mon, Apr 27, 2026 at 01:36:00AM -0700, Bharath Rupireddy wrote: > Hi, > > On Sun, Apr 26, 2026 at 8:45 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > I prefer adding the new struct to carry both table_infos and the > > > current index into it with the current value-per-call SRF function, > > > unless others have better ideas. > > > > +1. It is simpler than the Materialization concept. > > > > > If okay, I will send a new patch > > > soon. Thank you! > > > > Sure, Thanks! > > Attached v2 patch. I also refactored the test a bit. Please review. Thank you! Thanks! I've 2 comments: 1/ What about having just one curr_idx increment? (right after list_nth(), before the skip checks). I think that would be less error-prone if new skip conditions are added later. 2/ I think that the test is racy and could also succeed even without the fixes. Indeed, I think that the drops can complete before any concurrent polling happens (I can see it by adding a pg_sleep(2) before the first poll in the DO block). What about using an injection point to ensure a relation is removed during the polling? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bharath Rupireddy
Дата:
Hi, On Mon, Apr 27, 2026 at 3:01 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > I've 2 comments: Thank you for reviewing! > 1/ What about having just one curr_idx increment? (right after list_nth(), > before the skip checks). I think that would be less error-prone if new skip > conditions are added later. Done. > 2/ I think that the test is racy and could also succeed even without the fixes. > Indeed, I think that the drops can complete before any concurrent polling > happens (I can see it by adding a pg_sleep(2) before the first poll in the DO > block). What about using an injection point to ensure a relation is removed > during the polling? I initially considered an injection point but chose polling since the TAP test reproduced the bug consistently with hundreds of tables on my dev system. I've now added an injection point for predictability. I adjusted the commit message a bit. Please find the attached v3 patch for further review. Thank you! -- Bharath Rupireddy Amazon Web Services: https://aws.amazon.com
Вложения
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
shveta malik
Дата:
On Tue, Apr 28, 2026 at 7:11 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> On Mon, Apr 27, 2026 at 3:01 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > I've 2 comments:
>
> Thank you for reviewing!
>
> > 1/ What about having just one curr_idx increment? (right after list_nth(),
> > before the skip checks). I think that would be less error-prone if new skip
> > conditions are added later.
>
> Done.
>
> > 2/ I think that the test is racy and could also succeed even without the fixes.
> > Indeed, I think that the drops can complete before any concurrent polling
> > happens (I can see it by adding a pg_sleep(2) before the first poll in the DO
> > block). What about using an injection point to ensure a relation is removed
> > during the polling?
>
> I initially considered an injection point but chose polling since the
> TAP test reproduced the bug consistently with hundreds of tables on my
> dev system. I've now added an injection point for predictability.
>
> I adjusted the commit message a bit. Please find the attached v3 patch
> for further review. Thank you!
>
Thanks Bharath. I have just one minor comment:
+ INJECTION_POINT("pg-get-publication-tables-build-list", NULL);
Shall we name it as 'pg-get-publication-tables-list-built' to be more
meaningful, as we are pausing after list is built.
thanks
Shveta
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bharath Rupireddy
Дата:
Hi,
On Mon, Apr 27, 2026 at 8:59 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> > I initially considered an injection point but chose polling since the
> > TAP test reproduced the bug consistently with hundreds of tables on my
> > dev system. I've now added an injection point for predictability.
> >
> > I adjusted the commit message a bit. Please find the attached v3 patch
> > for further review. Thank you!
> >
>
> Thanks Bharath. I have just one minor comment:
>
> + INJECTION_POINT("pg-get-publication-tables-build-list", NULL);
>
> Shall we name it as 'pg-get-publication-tables-list-built' to be more
> meaningful, as we are pausing after list is built.
Sure. How about pg-get-publication-tables-after-list-built? It's 42
bytes, under the INJ_NAME_MAXLEN of 64 bytes, but meaningful.
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bertrand Drouvot
Дата:
Hi,
On Mon, Apr 27, 2026 at 09:21:54PM -0700, Bharath Rupireddy wrote:
> Hi,
>
> On Mon, Apr 27, 2026 at 8:59 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > > I initially considered an injection point but chose polling since the
> > > TAP test reproduced the bug consistently with hundreds of tables on my
> > > dev system. I've now added an injection point for predictability.
> > >
> > > I adjusted the commit message a bit. Please find the attached v3 patch
> > > for further review. Thank you!
Thanks for the new version!
> > Thanks Bharath. I have just one minor comment:
> >
> > + INJECTION_POINT("pg-get-publication-tables-build-list", NULL);
> >
> > Shall we name it as 'pg-get-publication-tables-list-built' to be more
> > meaningful, as we are pausing after list is built.
>
> Sure. How about pg-get-publication-tables-after-list-built? It's 42
> bytes, under the INJ_NAME_MAXLEN of 64 bytes, but meaningful.
That looks ok to me.
Just still 2 minor comments:
1/
+ # Drop the table while the SRF is paused.
s/the SRF/ pg_get_publication_tables()/ ?
2/
+# pg_get_publication_tables race with concurrent DROP TABLE.
What about describing a bit more the bug? The other tests in 100_bugs.pl describe
the bug they're testing.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bharath Rupireddy
Дата:
Hi, On Mon, Apr 27, 2026 at 10:54 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > Sure. How about pg-get-publication-tables-after-list-built? It's 42 > > bytes, under the INJ_NAME_MAXLEN of 64 bytes, but meaningful. > > That looks ok to me. > > Just still 2 minor comments: > > 1/ > + # Drop the table while the SRF is paused. > > s/the SRF/ pg_get_publication_tables()/ ? > > 2/ > > +# pg_get_publication_tables race with concurrent DROP TABLE. > > What about describing a bit more the bug? The other tests in 100_bugs.pl describe > the bug they're testing. Please find the attached v4 patch. Thank yoU! -- Bharath Rupireddy Amazon Web Services: https://aws.amazon.com
Вложения
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bertrand Drouvot
Дата:
Hi, On Tue, Apr 28, 2026 at 07:17:00AM -0700, Bharath Rupireddy wrote: > Hi, > > On Mon, Apr 27, 2026 at 10:54 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > > Sure. How about pg-get-publication-tables-after-list-built? It's 42 > > > bytes, under the INJ_NAME_MAXLEN of 64 bytes, but meaningful. > > > > That looks ok to me. > > > > Just still 2 minor comments: > > > > 1/ > > + # Drop the table while the SRF is paused. > > > > s/the SRF/ pg_get_publication_tables()/ ? > > > > 2/ > > > > +# pg_get_publication_tables race with concurrent DROP TABLE. > > > > What about describing a bit more the bug? The other tests in 100_bugs.pl describe > > the bug they're testing. > > Please find the attached v4 patch. Thank yoU! Thanks, LGTM! Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Ajin Cherian
Дата:
On Wed, Apr 29, 2026 at 12:17 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > Please find the attached v4 patch. Thank yoU! > One small comment. The includes need to be in alphabetical order. injection_point.h should come after fmgroids.h #include "utils/fmgroids.h" #include "utils/lsyscache.h" #include "utils/rel.h" +#include "utils/injection_point.h" #include "utils/syscache.h" regards, Ajin Cherian Fujitsu Australia
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
shveta malik
Дата:
On Wed, Apr 29, 2026 at 7:12 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Apr 29, 2026 at 12:17 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > Please find the attached v4 patch. Thank yoU! > > > > One small comment. The includes need to be in alphabetical order. > injection_point.h should come after fmgroids.h > > #include "utils/fmgroids.h" > #include "utils/lsyscache.h" > #include "utils/rel.h" > +#include "utils/injection_point.h" > #include "utils/syscache.h" > +1 Also there is a trailing whitespace issue while applying the patch. Other than these, the patch looks good. thanks Shveta
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bharath Rupireddy
Дата:
Hi, On Tue, Apr 28, 2026 at 8:28 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Apr 29, 2026 at 7:12 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > One small comment. The includes need to be in alphabetical order. > > injection_point.h should come after fmgroids.h > > > > +#include "utils/injection_point.h" > > #include "utils/syscache.h" > > Also there is a trailing whitespace issue while applying the patch. > Other than these, the patch looks good. Fixed. Please find the attached v5 patch. The fix is needed only for PG16 and later, not PG15 or PG14. The bug was introduced by b7ae03953690 [1] in PG16, which added a table_open() call in pg_get_publication_tables(). PG15 and earlier only use get_rel_namespace() and syscache lookups, both of which gracefully handle dropped relations (returning InvalidOid/false rather than erroring). I verified the bug and the fix on all affected branches. Please find the attached version-specific patches for backpatching. Thank you! [1] b7ae03953690 - Ignore dropped and generated columns from the column list -- Bharath Rupireddy Amazon Web Services: https://aws.amazon.com
Вложения
> On Apr 29, 2026, at 15:15, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Hi, > > On Tue, Apr 28, 2026 at 8:28 PM shveta malik <shveta.malik@gmail.com> wrote: >> >> On Wed, Apr 29, 2026 at 7:12 AM Ajin Cherian <itsajin@gmail.com> wrote: >> >>> One small comment. The includes need to be in alphabetical order. >>> injection_point.h should come after fmgroids.h >>> >>> +#include "utils/injection_point.h" >>> #include "utils/syscache.h" >> >> Also there is a trailing whitespace issue while applying the patch. >> Other than these, the patch looks good. > > Fixed. Please find the attached v5 patch. > > The fix is needed only for PG16 and later, not PG15 or PG14. The bug > was introduced by b7ae03953690 [1] in PG16, which added a table_open() > call in pg_get_publication_tables(). PG15 and earlier only use > get_rel_namespace() and syscache lookups, both of which gracefully > handle dropped relations (returning InvalidOid/false rather than > erroring). > > I verified the bug and the fix on all affected branches. Please find > the attached version-specific patches for backpatching. Thank you! > > [1] b7ae03953690 - Ignore dropped and generated columns from the column list > > -- > Bharath Rupireddy > Amazon Web Services: https://aws.amazon.com > <v5-0001-PG16-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-Fix-pg_get_publication_tables-race-with-concurren.patch><v5-0001-PG18-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-PG17-Fix-pg_get_publication_tables-race-with-conc.txt> I am afraid this is only a partial fix. ``` @@ -1599,12 +1621,18 @@ pg_get_publication_tables(FunctionCallInfo fcinfo, ArrayType *pubnames, /* Show all columns when the column list is not specified. */ if (nulls[2]) { - Relation rel = table_open(relid, AccessShareLock); + Relation rel = try_table_open(relid, AccessShareLock); int nattnums = 0; int16 *attnums; - TupleDesc desc = RelationGetDescr(rel); + TupleDesc desc; int i; + /* Skip if the relation has been concurrently dropped. */ + if (rel == NULL) + continue; ``` This change uses try_table_open() to detect whether a table has been dropped, but try_table_open() is only called when thecolumn list is not specified. If a table is included in the publication with an explicit column list, then even if itis dropped concurrently, it may still be returned by pg_get_publication_tables(). So this patch removes the “could not open relation with OID” error, but it does not fully ensure the accuracy of the returnedtable list. It also introduces inconsistent behavior between tables published with and without column lists. To resolve the race condition completely, I think we should try to open the table regardless of whether a column list isspecified. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bharath Rupireddy
Дата:
Hi,
On Wed, Apr 29, 2026 at 8:41 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> >
<v5-0001-PG16-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-Fix-pg_get_publication_tables-race-with-concurren.patch><v5-0001-PG18-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-PG17-Fix-pg_get_publication_tables-race-with-conc.txt>
>
> I am afraid this is only a partial fix.
Thanks for reviewing it. Please find my responses below.
> ```
> @@ -1599,12 +1621,18 @@ pg_get_publication_tables(FunctionCallInfo fcinfo, ArrayType *pubnames,
> /* Show all columns when the column list is not specified. */
> if (nulls[2])
> {
> - Relation rel = table_open(relid, AccessShareLock);
> + Relation rel = try_table_open(relid, AccessShareLock);
> int nattnums = 0;
> int16 *attnums;
> - TupleDesc desc = RelationGetDescr(rel);
> + TupleDesc desc;
> int i;
>
> + /* Skip if the relation has been concurrently dropped. */
> + if (rel == NULL)
> + continue;
> ```
>
> This change uses try_table_open() to detect whether a table has been dropped, but try_table_open() is only called
whenthe column list is not specified. If a table is included in the publication with an explicit column list, then even
ifit is dropped concurrently, it may still be returned by pg_get_publication_tables().
Right. The try_table_open() is only needed there because that's the
only code path that actually opens the relation (to enumerate its
columns). The column list path reads from the pg_publication_rel
catalog - it never calls table_open(), so it cannot hit the ERROR.
> So this patch removes the “could not open relation with OID” error, but it does not fully ensure the accuracy of the
returnedtable list.
IMO, no function returning table OIDs can guarantee they remain valid
- a drop can happen right after we return the OID, and accuracy is in
the caller's hands. All the callers of pg_get_publication_tables()
already handle this by JOINing with pg_class.
> It also introduces inconsistent behavior between tables published with and without column lists.
These two paths do different things - one needs the relation open, the
other doesn't. For callers, the outcome is the same: any stale OID
gets filtered out by their pg_class JOIN.
> To resolve the race condition completely, I think we should try to open the table regardless of whether a column list
isspecified.
IMO, that would add unnecessary locking overhead in a path that
doesn't need it. The bug is the ERROR, and this patch fixes it.
In short, when no column list is specified,
pg_get_publication_tables() needs to open the relation to enumerate
all publishable columns and return them as an int2vector in the attrs
output column. It currently uses table_open() for this, which errors
out with concurrent table drops. This patch fixes it by using
try_table_open() and skipping the relation if it's been dropped.
Thoughts?
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
Bharath Rupireddy
Дата:
Hi, On Wed, Apr 29, 2026 at 12:15 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Fixed. Please find the attached v5 patch. > > The fix is needed only for PG16 and later, not PG15 or PG14. The bug > was introduced by b7ae03953690 [1] in PG16, which added a table_open() > call in pg_get_publication_tables(). PG15 and earlier only use > get_rel_namespace() and syscache lookups, both of which gracefully > handle dropped relations (returning InvalidOid/false rather than > erroring). > > I verified the bug and the fix on all affected branches. Please find > the attached version-specific patches for backpatching. Thank you! > > [1] b7ae03953690 - Ignore dropped and generated columns from the column list Please find the attached v6 patch, which fixes a test failure on FreeBSD. This variant of the build forces parallel query via debug_parallel_query=regress, causing the pg_get_publication_tables injection point to fire in parallel workers as well. I disabled forced parallel query for this test case. Thank you! -- Bharath Rupireddy Amazon Web Services: https://aws.amazon.com
Вложения
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
shveta malik
Дата:
On Fri, May 1, 2026 at 7:00 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> On Wed, Apr 29, 2026 at 8:41 PM Chao Li <li.evan.chao@gmail.com> wrote:
> >
> > >
<v5-0001-PG16-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-Fix-pg_get_publication_tables-race-with-concurren.patch><v5-0001-PG18-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-PG17-Fix-pg_get_publication_tables-race-with-conc.txt>
> >
> > I am afraid this is only a partial fix.
>
> Thanks for reviewing it. Please find my responses below.
>
> > ```
> > @@ -1599,12 +1621,18 @@ pg_get_publication_tables(FunctionCallInfo fcinfo, ArrayType *pubnames,
> > /* Show all columns when the column list is not specified. */
> > if (nulls[2])
> > {
> > - Relation rel = table_open(relid, AccessShareLock);
> > + Relation rel = try_table_open(relid, AccessShareLock);
> > int nattnums = 0;
> > int16 *attnums;
> > - TupleDesc desc = RelationGetDescr(rel);
> > + TupleDesc desc;
> > int i;
> >
> > + /* Skip if the relation has been concurrently dropped. */
> > + if (rel == NULL)
> > + continue;
> > ```
> >
> > This change uses try_table_open() to detect whether a table has been dropped, but try_table_open() is only called
whenthe column list is not specified. If a table is included in the publication with an explicit column list, then even
ifit is dropped concurrently, it may still be returned by pg_get_publication_tables().
>
> Right. The try_table_open() is only needed there because that's the
> only code path that actually opens the relation (to enumerate its
> columns). The column list path reads from the pg_publication_rel
> catalog - it never calls table_open(), so it cannot hit the ERROR.
>
> > So this patch removes the “could not open relation with OID” error, but it does not fully ensure the accuracy of
thereturned table list.
>
> IMO, no function returning table OIDs can guarantee they remain valid
> - a drop can happen right after we return the OID, and accuracy is in
> the caller's hands. All the callers of pg_get_publication_tables()
> already handle this by JOINing with pg_class.
>
I agree with Bharath. Also I would like to add one more point. We do have this:
+ /* Skip if the relation has been concurrently dropped. */
+ if (!OidIsValid(schemaid))
+ continue;
So if a table is dropped before we could access its explicitly
mentioned column-list, above should handle it right even without
'try_table_open' done? Moreover, pg_publication_rel will not return
any row for dropped table, so we should be good. For 'table_infos',
the case was different, as we saved the list in first call and are
using that in subsequent call? But that is not the case with
pg_publication_rel.
> > It also introduces inconsistent behavior between tables published with and without column lists.
>
> These two paths do different things - one needs the relation open, the
> other doesn't. For callers, the outcome is the same: any stale OID
> gets filtered out by their pg_class JOIN.
>
> > To resolve the race condition completely, I think we should try to open the table regardless of whether a column
listis specified.
>
> IMO, that would add unnecessary locking overhead in a path that
> doesn't need it. The bug is the ERROR, and this patch fixes it.
>
> In short, when no column list is specified,
> pg_get_publication_tables() needs to open the relation to enumerate
> all publishable columns and return them as an int2vector in the attrs
> output column. It currently uses table_open() for this, which errors
> out with concurrent table drops. This patch fixes it by using
> try_table_open() and skipping the relation if it's been dropped.
>
> Thoughts?
>
> --
> Bharath Rupireddy
> Amazon Web Services: https://aws.amazon.com
> On May 1, 2026, at 09:30, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> On Wed, Apr 29, 2026 at 8:41 PM Chao Li <li.evan.chao@gmail.com> wrote:
>>
>>>
<v5-0001-PG16-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-Fix-pg_get_publication_tables-race-with-concurren.patch><v5-0001-PG18-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-PG17-Fix-pg_get_publication_tables-race-with-conc.txt>
>>
>> I am afraid this is only a partial fix.
>
> Thanks for reviewing it. Please find my responses below.
>
>> ```
>> @@ -1599,12 +1621,18 @@ pg_get_publication_tables(FunctionCallInfo fcinfo, ArrayType *pubnames,
>> /* Show all columns when the column list is not specified. */
>> if (nulls[2])
>> {
>> - Relation rel = table_open(relid, AccessShareLock);
>> + Relation rel = try_table_open(relid, AccessShareLock);
>> int nattnums = 0;
>> int16 *attnums;
>> - TupleDesc desc = RelationGetDescr(rel);
>> + TupleDesc desc;
>> int i;
>>
>> + /* Skip if the relation has been concurrently dropped. */
>> + if (rel == NULL)
>> + continue;
>> ```
>>
>> This change uses try_table_open() to detect whether a table has been dropped, but try_table_open() is only called
whenthe column list is not specified. If a table is included in the publication with an explicit column list, then even
ifit is dropped concurrently, it may still be returned by pg_get_publication_tables().
>
> Right. The try_table_open() is only needed there because that's the
> only code path that actually opens the relation (to enumerate its
> columns). The column list path reads from the pg_publication_rel
> catalog - it never calls table_open(), so it cannot hit the ERROR.
>
>> So this patch removes the “could not open relation with OID” error, but it does not fully ensure the accuracy of the
returnedtable list.
>
> IMO, no function returning table OIDs can guarantee they remain valid
> - a drop can happen right after we return the OID, and accuracy is in
> the caller's hands. All the callers of pg_get_publication_tables()
> already handle this by JOINing with pg_class.
>
Fair.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
> On May 4, 2026, at 13:08, shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, May 1, 2026 at 7:00 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
>>
>> Hi,
>>
>> On Wed, Apr 29, 2026 at 8:41 PM Chao Li <li.evan.chao@gmail.com> wrote:
>>>
>>>>
<v5-0001-PG16-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-Fix-pg_get_publication_tables-race-with-concurren.patch><v5-0001-PG18-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-PG17-Fix-pg_get_publication_tables-race-with-conc.txt>
>>>
>>> I am afraid this is only a partial fix.
>>
>> Thanks for reviewing it. Please find my responses below.
>>
>>> ```
>>> @@ -1599,12 +1621,18 @@ pg_get_publication_tables(FunctionCallInfo fcinfo, ArrayType *pubnames,
>>> /* Show all columns when the column list is not specified. */
>>> if (nulls[2])
>>> {
>>> - Relation rel = table_open(relid, AccessShareLock);
>>> + Relation rel = try_table_open(relid, AccessShareLock);
>>> int nattnums = 0;
>>> int16 *attnums;
>>> - TupleDesc desc = RelationGetDescr(rel);
>>> + TupleDesc desc;
>>> int i;
>>>
>>> + /* Skip if the relation has been concurrently dropped. */
>>> + if (rel == NULL)
>>> + continue;
>>> ```
>>>
>>> This change uses try_table_open() to detect whether a table has been dropped, but try_table_open() is only called
whenthe column list is not specified. If a table is included in the publication with an explicit column list, then even
ifit is dropped concurrently, it may still be returned by pg_get_publication_tables().
>>
>> Right. The try_table_open() is only needed there because that's the
>> only code path that actually opens the relation (to enumerate its
>> columns). The column list path reads from the pg_publication_rel
>> catalog - it never calls table_open(), so it cannot hit the ERROR.
>>
>>> So this patch removes the “could not open relation with OID” error, but it does not fully ensure the accuracy of
thereturned table list.
>>
>> IMO, no function returning table OIDs can guarantee they remain valid
>> - a drop can happen right after we return the OID, and accuracy is in
>> the caller's hands. All the callers of pg_get_publication_tables()
>> already handle this by JOINing with pg_class.
>>
>
> I agree with Bharath. Also I would like to add one more point. We do have this:
>
> + /* Skip if the relation has been concurrently dropped. */
> + if (!OidIsValid(schemaid))
> + continue;
>
Actually, this is the other comment I have. Why the comment says “if the relation has been dropped”, but the actual
checkis on schema id?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
От
shveta malik
Дата:
On Tue, May 5, 2026 at 8:40 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
>
>
> > On May 4, 2026, at 13:08, shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Fri, May 1, 2026 at 7:00 AM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> On Wed, Apr 29, 2026 at 8:41 PM Chao Li <li.evan.chao@gmail.com> wrote:
> >>>
> >>>>
<v5-0001-PG16-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-Fix-pg_get_publication_tables-race-with-concurren.patch><v5-0001-PG18-Fix-pg_get_publication_tables-race-with-conc.txt><v5-0001-PG17-Fix-pg_get_publication_tables-race-with-conc.txt>
> >>>
> >>> I am afraid this is only a partial fix.
> >>
> >> Thanks for reviewing it. Please find my responses below.
> >>
> >>> ```
> >>> @@ -1599,12 +1621,18 @@ pg_get_publication_tables(FunctionCallInfo fcinfo, ArrayType *pubnames,
> >>> /* Show all columns when the column list is not specified. */
> >>> if (nulls[2])
> >>> {
> >>> - Relation rel = table_open(relid, AccessShareLock);
> >>> + Relation rel = try_table_open(relid, AccessShareLock);
> >>> int nattnums = 0;
> >>> int16 *attnums;
> >>> - TupleDesc desc = RelationGetDescr(rel);
> >>> + TupleDesc desc;
> >>> int i;
> >>>
> >>> + /* Skip if the relation has been concurrently dropped. */
> >>> + if (rel == NULL)
> >>> + continue;
> >>> ```
> >>>
> >>> This change uses try_table_open() to detect whether a table has been dropped, but try_table_open() is only called
whenthe column list is not specified. If a table is included in the publication with an explicit column list, then even
ifit is dropped concurrently, it may still be returned by pg_get_publication_tables().
> >>
> >> Right. The try_table_open() is only needed there because that's the
> >> only code path that actually opens the relation (to enumerate its
> >> columns). The column list path reads from the pg_publication_rel
> >> catalog - it never calls table_open(), so it cannot hit the ERROR.
> >>
> >>> So this patch removes the “could not open relation with OID” error, but it does not fully ensure the accuracy of
thereturned table list.
> >>
> >> IMO, no function returning table OIDs can guarantee they remain valid
> >> - a drop can happen right after we return the OID, and accuracy is in
> >> the caller's hands. All the callers of pg_get_publication_tables()
> >> already handle this by JOINing with pg_class.
> >>
> >
> > I agree with Bharath. Also I would like to add one more point. We do have this:
> >
> > + /* Skip if the relation has been concurrently dropped. */
> > + if (!OidIsValid(schemaid))
> > + continue;
> >
>
> Actually, this is the other comment I have. Why the comment says “if the relation has been dropped”, but the actual
checkis on schema id?
>
Okay, I see your point. Shall we tweak it to the below to make it more
understandable?
Oid schemaid;
/*
* get_rel_namespace() returns InvalidOid if the relation no longer exists
* (e.g., dropped concurrently). Skip such entries.
*/
if (!OidIsValid(schemaid = get_rel_namespace(relid)))
continue;
thanks
Shveta