Обсуждение: [PATCH] Fix stale relation close in sequence synchronization
Hi,
I found a crash in the logical replication sequence synchronization worker
when the publisher returns NULL sequence data for a sequence after at least
one sequence in the same sync batch has already been processed.
One way to reproduce this is to use a subscription that connects to the
publisher as a replication role that can read one published sequence but
cannot read another one. pg_get_sequence_data() returns NULLs for the
inaccessible sequence. In get_and_validate_seq_info(), that path returns
COPYSEQ_SKIPPED before assigning a new value to the Relation output argument.
copy_sequences() then still sees the Relation pointer left from the previous
row and calls table_close() on it again. On a cassert build, this trips:
TRAP: failed Assert("rel->rd_refcnt > 0"); "relcache.c" file
The attached patch clears the output Relation pointer at the start of
get_and_validate_seq_info() and clears the local pointer in copy_sequences()
after closing it. That keeps early returns from reusing a relation from a
previous row.
The patch also adds a TAP test to 036_sequences.pl.
Thoughts?
Regards,
Ayush
I found a crash in the logical replication sequence synchronization worker
when the publisher returns NULL sequence data for a sequence after at least
one sequence in the same sync batch has already been processed.
One way to reproduce this is to use a subscription that connects to the
publisher as a replication role that can read one published sequence but
cannot read another one. pg_get_sequence_data() returns NULLs for the
inaccessible sequence. In get_and_validate_seq_info(), that path returns
COPYSEQ_SKIPPED before assigning a new value to the Relation output argument.
copy_sequences() then still sees the Relation pointer left from the previous
row and calls table_close() on it again. On a cassert build, this trips:
TRAP: failed Assert("rel->rd_refcnt > 0"); "relcache.c" file
The attached patch clears the output Relation pointer at the start of
get_and_validate_seq_info() and clears the local pointer in copy_sequences()
after closing it. That keeps early returns from reusing a relation from a
previous row.
The patch also adds a TAP test to 036_sequences.pl.
Thoughts?
Regards,
Ayush
Вложения
Dear Ayush, > I found a crash in the logical replication sequence synchronization worker > when the publisher returns NULL sequence data for a sequence after at least > one sequence in the same sync batch has already been processed. Good catch. I confirmed the HEAD can crash with your added test. > The attached patch clears the output Relation pointer at the start of > get_and_validate_seq_info() and clears the local pointer in copy_sequences() > after closing it. That keeps early returns from reusing a relation from a > previous row. To confirm; can't we declare the sequence_rel in the inner-loop? My first impression was the bug caused by the wrong lifetime. Are there any other thoughts around here? Added Vignesh in CC because he was a primary author. Best regards, Hayato Kuroda FUJITSU LIMITED
Hi,
On Tue, 28 Apr 2026 at 17:44, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
Dear Ayush,
> I found a crash in the logical replication sequence synchronization worker
> when the publisher returns NULL sequence data for a sequence after at least
> one sequence in the same sync batch has already been processed.
Good catch. I confirmed the HEAD can crash with your added test.
Thanks for checking and confirming the crash.
> The attached patch clears the output Relation pointer at the start of
> get_and_validate_seq_info() and clears the local pointer in copy_sequences()
> after closing it. That keeps early returns from reusing a relation from a
> previous row.
To confirm; can't we declare the sequence_rel in the inner-loop? My first
impression was the bug caused by the wrong lifetime. Are there any other
thoughts around here?
I agree that declaring sequence_rel in the tuple-processing loop is
cleaner. The relation belongs only to the current publisher result row,
so limiting the variable's lifetime makes the intended ownership clearer
and prevents any value from carrying over to the next row.
cleaner. The relation belongs only to the current publisher result row,
so limiting the variable's lifetime makes the intended ownership clearer
and prevents any value from carrying over to the next row.
I have kept the initialization of the output argument in
get_and_validate_seq_info(), so every return path leaves it in a defined
state. In v2, the caller-side pointer is declared inside the inner loop,
and the explicit reset after table_close() is no longer needed.
Attached is v2 with that change.
get_and_validate_seq_info(), so every return path leaves it in a defined
state. In v2, the caller-side pointer is declared inside the inner loop,
and the explicit reset after table_close() is no longer needed.
Attached is v2 with that change.
Regards,
Ayush
Ayush
Вложения
On Tue, 28 Apr 2026 at 18:04, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
>
> Hi,
>
>
> On Tue, 28 Apr 2026 at 17:44, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>>
>> Dear Ayush,
>>
>> > I found a crash in the logical replication sequence synchronization worker
>> > when the publisher returns NULL sequence data for a sequence after at least
>> > one sequence in the same sync batch has already been processed.
>>
>> Good catch. I confirmed the HEAD can crash with your added test.
>
>
> Thanks for checking and confirming the crash.
>
>>
>>
>> > The attached patch clears the output Relation pointer at the start of
>> > get_and_validate_seq_info() and clears the local pointer in copy_sequences()
>> > after closing it. That keeps early returns from reusing a relation from a
>> > previous row.
>>
>> To confirm; can't we declare the sequence_rel in the inner-loop? My first
>> impression was the bug caused by the wrong lifetime. Are there any other
>> thoughts around here?
>
>
> I agree that declaring sequence_rel in the tuple-processing loop is
> cleaner. The relation belongs only to the current publisher result row,
> so limiting the variable's lifetime makes the intended ownership clearer
> and prevents any value from carrying over to the next row.
>
> I have kept the initialization of the output argument in
> get_and_validate_seq_info(), so every return path leaves it in a defined
> state. In v2, the caller-side pointer is declared inside the inner loop,
> and the explicit reset after table_close() is no longer needed.
>
> Attached is v2 with that change.
Thanks for finding and reporting the issue. I was able to reproduce
the issue with the steps you provided.
Few comments:
1) Here "SELECT nextval('regress_no_select');" and "REVOKE ALL ON
SEQUENCE regress_no_select FROM PUBLIC;" is not required for this test
case:
+$node_publisher->safe_psql(
+ 'postgres', qq(
+ CREATE ROLE regress_seq_repl LOGIN REPLICATION;
+ CREATE SEQUENCE regress_no_select;
+ SELECT nextval('regress_no_select');
+ GRANT USAGE ON SCHEMA public TO regress_seq_repl;
+ GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO regress_seq_repl;
+ REVOKE ALL ON SEQUENCE regress_no_select FROM PUBLIC;
+ REVOKE ALL ON SEQUENCE regress_no_select FROM regress_seq_repl;
+));
2) Since the comment about “dropped concurrently” has been removed,
could you merge that context into the new wording:
/*
- * last_value can be NULL if the sequence was dropped concurrently (see
- * pg_get_sequence_data()).
+ * The sequence data can be NULL if it is not accessible on the publisher
+ * (see pg_get_sequence_data()).
*/
Something like:
/*
* The sequence data can be NULL due to insufficient privileges or if
* the sequence was dropped concurrently (see pg_get_sequence_data()).
*/
3) Can we change this:
##########
# A NULL sequence data row from the publisher must not make the subscriber
# close the previously synchronized sequence relation again.
##########
To something like:
##########
# Ensure that insufficient privileges on the publisher for a sequence
# do not disrupt the subscriber. The subscriber should log a warning
# and continue retrying.
##########
Regards,
Vignesh
Hi,
Thanks for reviewing and confirming the issue. On Tue, 28 Apr 2026 at 18:48, vignesh C <vignesh21@gmail.com> wrote:
Thanks for finding and reporting the issue. I was able to reproduce
the issue with the steps you provided.
Few comments:
1) Here "SELECT nextval('regress_no_select');" and "REVOKE ALL ON
SEQUENCE regress_no_select FROM PUBLIC;" is not required for this test
case:
+$node_publisher->safe_psql(
+ 'postgres', qq(
+ CREATE ROLE regress_seq_repl LOGIN REPLICATION;
+ CREATE SEQUENCE regress_no_select;
+ SELECT nextval('regress_no_select');
+ GRANT USAGE ON SCHEMA public TO regress_seq_repl;
+ GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO regress_seq_repl;
+ REVOKE ALL ON SEQUENCE regress_no_select FROM PUBLIC;
+ REVOKE ALL ON SEQUENCE regress_no_select FROM regress_seq_repl;
+));
Agreed. Removed both statements from the test.
2) Since the comment about “dropped concurrently” has been removed,
could you merge that context into the new wording:
concurrent drops.
3) Can we change this:
##########
# A NULL sequence data row from the publisher must not make the subscriber
# close the previously synchronized sequence relation again.
##########
To something like:
##########
# Ensure that insufficient privileges on the publisher for a sequence
# do not disrupt the subscriber. The subscriber should log a warning
# and continue retrying.
##########
Done. I used that wording for the TAP test block comment.
Attached is v3 with these changes.
Regards,
Ayush
Attached is v3 with these changes.
Regards,
Ayush
Вложения
On Tue, 28 Apr 2026 at 19:05, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
>
> Hi,
>
> Thanks for reviewing and confirming the issue.
>
> Attached is v3 with these changes.
Few comments:
1) Since we are setting the variable to NULL for every sequence now,
this is not required:
@@ -246,6 +246,8 @@ get_and_validate_seq_info(TupleTableSlot *slot,
Relation *sequence_rel,
Form_pg_sequence local_seq;
LogicalRepSequenceInfo *seqinfo_local;
+ *sequence_rel = NULL;
+
*seqidx = DatumGetInt32(slot_getattr(slot, ++col, &isnull));
Assert(!isnull);
2) Creating a subscription is costly as it has more work to do as it
has to sync all relations and requires more processes to be started,
instead shall we use "ALTER SUBSCRIPTION ... SET CONNECTION" followed
by "ALTER SUBSCRIPTION ... REFRESH SEQUENCES"
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION regress_seq_sub_no_select CONNECTION
'$publisher_limited_connstr' PUBLICATION regress_seq_pub WITH
(disable_on_error = true)"
+);
3) You can use sequence name as regress_s5 to be consistent with the
other sequence names nearby or alternatively you can change the privs
of an already existing sequence:
+ CREATE ROLE regress_seq_repl LOGIN REPLICATION;
+ CREATE SEQUENCE regress_no_select;
+ GRANT USAGE ON SCHEMA public TO regress_seq_repl;
4) I feel this is not required:
+$result = $node_subscriber->safe_psql('postgres', 'SELECT 1');
+is($result, '1',
+ 'subscriber remains running after publisher returns NULL
sequence data');
Regards,
Vignesh
Hi,
Thanks for the review.
On Thu, 30 Apr 2026 at 10:00, vignesh C <vignesh21@gmail.com> wrote:
Few comments:
1) Since we are setting the variable to NULL for every sequence now,
this is not required:
@@ -246,6 +246,8 @@ get_and_validate_seq_info(TupleTableSlot *slot,
Relation *sequence_rel,
Form_pg_sequence local_seq;
LogicalRepSequenceInfo *seqinfo_local;
+ *sequence_rel = NULL;
+
*seqidx = DatumGetInt32(slot_getattr(slot, ++col, &isnull));
Assert(!isnull);
I had added it as defence-in-depth, but yeah can be removed.
2) Creating a subscription is costly as it has more work to do as it
has to sync all relations and requires more processes to be started,
instead shall we use "ALTER SUBSCRIPTION ... SET CONNECTION" followed
by "ALTER SUBSCRIPTION ... REFRESH SEQUENCES"
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION regress_seq_sub_no_select CONNECTION
'$publisher_limited_connstr' PUBLICATION regress_seq_pub WITH
(disable_on_error = true)"
+);
Makes sense.
3) You can use sequence name as regress_s5 to be consistent with the
other sequence names nearby or alternatively you can change the privs
of an already existing sequence:
+ CREATE ROLE regress_seq_repl LOGIN REPLICATION;
+ CREATE SEQUENCE regress_no_select;
+ GRANT USAGE ON SCHEMA public TO regress_seq_repl;
Sounds good.
4) I feel this is not required:
+$result = $node_subscriber->safe_psql('postgres', 'SELECT 1');
+is($result, '1',
+ 'subscriber remains running after publisher returns NULL
sequence data');
Yeah, you are right.
I've made these changes, attaching patch v4.
I've made these changes, attaching patch v4.
Please review and let me know.
Regards,
Ayush
Ayush
Вложения
On Thu, 30 Apr 2026 at 10:23, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
>
> Hi,
>
> Thanks for the review.
>
> On Thu, 30 Apr 2026 at 10:00, vignesh C <vignesh21@gmail.com> wrote:
>>
>>
>>
>> Few comments:
>> 1) Since we are setting the variable to NULL for every sequence now,
>> this is not required:
>> @@ -246,6 +246,8 @@ get_and_validate_seq_info(TupleTableSlot *slot,
>> Relation *sequence_rel,
>> Form_pg_sequence local_seq;
>> LogicalRepSequenceInfo *seqinfo_local;
>>
>> + *sequence_rel = NULL;
>> +
>> *seqidx = DatumGetInt32(slot_getattr(slot, ++col, &isnull));
>> Assert(!isnull);
>
>
> I had added it as defence-in-depth, but yeah can be removed.
>
>>
>>
>> 2) Creating a subscription is costly as it has more work to do as it
>> has to sync all relations and requires more processes to be started,
>> instead shall we use "ALTER SUBSCRIPTION ... SET CONNECTION" followed
>> by "ALTER SUBSCRIPTION ... REFRESH SEQUENCES"
>> +$node_subscriber->safe_psql(
>> + 'postgres',
>> + "CREATE SUBSCRIPTION regress_seq_sub_no_select CONNECTION
>> '$publisher_limited_connstr' PUBLICATION regress_seq_pub WITH
>> (disable_on_error = true)"
>> +);
>
>
> Makes sense.
>>
>>
>> 3) You can use sequence name as regress_s5 to be consistent with the
>> other sequence names nearby or alternatively you can change the privs
>> of an already existing sequence:
>> + CREATE ROLE regress_seq_repl LOGIN REPLICATION;
>> + CREATE SEQUENCE regress_no_select;
>> + GRANT USAGE ON SCHEMA public TO regress_seq_repl;
>
>
> Sounds good.
>
>>
>>
>> 4) I feel this is not required:
>> +$result = $node_subscriber->safe_psql('postgres', 'SELECT 1');
>> +is($result, '1',
>> + 'subscriber remains running after publisher returns NULL
>> sequence data');
>
>
> Yeah, you are right.
>
> I've made these changes, attaching patch v4.
> Please review and let me know.
Thanks for the updated patch. After this error occurs, it currently
emits the message 'missing sequence on publisher
("public.regress_no_select")'. This can be misleading, as it suggests
the sequence does not exist, whereas the actual cause may be
insufficient privileges. As a result, users may find it difficult to
diagnose and resolve the issue.
Consider updating the error reported in report_sequence_errors to
convey that the failure could be due to either a missing sequence or
insufficient privileges (e.g., "missing sequence or insufficient
privilege on publisher"). While making this change, also update the
missing_seqs_idx variable name and the associated comments to
accurately reflect the broader scope of the condition.
Regards,
Vignesh
Hi,
On Thu, 30 Apr 2026 at 11:27, vignesh C <vignesh21@gmail.com> wrote:
Consider updating the error reported in report_sequence_errors to
convey that the failure could be due to either a missing sequence or
insufficient privileges (e.g., "missing sequence or insufficient
privilege on publisher"). While making this change, also update the
missing_seqs_idx variable name and the associated comments to
accurately reflect the broader scope of the condition.
sequence on publisher", so it covers both cases where the sequence is
actually missing and cases where the publisher-side role cannot access
its sequence data.
I also renamed the related variables from missing_* to unavailable_*,
and updated the associated comments and DEBUG message to avoid implying
that the sequence is necessarily absent on the publisher.
The TAP expectations for both the dropped-sequence case and the
insufficient-privileges case have been updated to match the new wording.
Attached is v5 with these changes.
Thoughts?
Regards,
Ayush
Thoughts?
Regards,
Ayush
Вложения
On Thu, Apr 30, 2026 at 12:16 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote: > > On Thu, 30 Apr 2026 at 11:27, vignesh C <vignesh21@gmail.com> wrote: >> >> >> Consider updating the error reported in report_sequence_errors to >> convey that the failure could be due to either a missing sequence or >> insufficient privileges (e.g., "missing sequence or insufficient >> privilege on publisher"). While making this change, also update the >> missing_seqs_idx variable name and the associated comments to >> accurately reflect the broader scope of the condition. > > > Agreed. > Sequence sync can also be skipped if the same sequence is a temporary sequence from some other session on publisher. So, the change in v5 was also incomplete if we want to make it more generic than now. I have pushed v4 for now but if you or Vignesh would like to make the message more generic, feel free to propose a new patch. -- With Regards, Amit Kapila.
This seems to be causing the below buildfarm failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2026-04-30%2018%3A50%3A23 regards, Ajin Cherian Fujitsu Australia
On Fri, 1 May 2026 at 08:58, Ajin Cherian <itsajin@gmail.com> wrote: > > This seems to be causing the below buildfarm failure: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2026-04-30%2018%3A50%3A23 Thanks for reporting this, we are analyzing this. We will analyze and propose a patch for the same. Regards, Vignesh
Hi,
On Fri, 1 May 2026 at 09:54, vignesh C <vignesh21@gmail.com> wrote:
On Fri, 1 May 2026 at 08:58, Ajin Cherian <itsajin@gmail.com> wrote:
>
> This seems to be causing the below buildfarm failure:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2026-04-30%2018%3A50%3A23
Thanks for reporting this, we are analyzing this. We will analyze and
propose a patch for the same.
From what I see it the failure is
not in the sequence-copy logic itself. It happens before
the test reaches the privilege check: after the test
switches the subscription connection to user=regress_seq_repl,
Windows tries SSPI auth for that login role and rejects it.
I could not run a native Windows SSPI test locally,
but this follows the existing TAP harness pattern
used by tests that authenticate as non-default roles.
The failure occurs because regress_seq_repl is used in
the subscription connection string but is not passed via
auth_extra, so pg_regress --config-auth does not add a
Windows SSPI ident mapping for it.
Meanwhile I'll try to set up a windows machine
and run the test
meson test -C build subscription/036_sequences --print-errorlogs --verbose
meson test -C build subscription/036_sequences --print-errorlogs --verbose
Regards,
Ayush
Вложения
On Fri, 1 May 2026 at 10:45, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote: > > Hi, > > On Fri, 1 May 2026 at 09:54, vignesh C <vignesh21@gmail.com> wrote: >> >> On Fri, 1 May 2026 at 08:58, Ajin Cherian <itsajin@gmail.com> wrote: >> > >> > This seems to be causing the below buildfarm failure: >> > >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2026-04-30%2018%3A50%3A23 >> >> Thanks for reporting this, we are analyzing this. We will analyze and >> propose a patch for the same. >> > > From what I see it the failure is > not in the sequence-copy logic itself. It happens before > the test reaches the privilege check: after the test > switches the subscription connection to user=regress_seq_repl, > Windows tries SSPI auth for that login role and rejects it. > > I could not run a native Windows SSPI test locally, > but this follows the existing TAP harness pattern > used by tests that authenticate as non-default roles. > The failure occurs because regress_seq_repl is used in > the subscription connection string but is not passed via > auth_extra, so pg_regress --config-auth does not add a > Windows SSPI ident mapping for it. > > Meanwhile I'll try to set up a windows machine > and run the test > > meson test -C build subscription/036_sequences --print-errorlogs --verbose Thanks, the patch worked in my environment. We have made a similar fix earlier at commit "def0ce3370689b939c6d7a3c3eb824d69989ef6e". Can you add one comment to say something like: # Make sure pg_hba.conf is set up to allow connections from regress_seq_repl. # This is only needed on Windows machines that don't use UNIX sockets. +$node_publisher->init( + allows_streaming => 'logical', + auth_extra => [ '--create-role' => 'regress_seq_repl' ]); Regards Vignesh
Hi,
Thanks for confirming.
Thanks for confirming.
On Fri, 1 May 2026 at 11:47, vignesh C <vignesh21@gmail.com> wrote:
Thanks, the patch worked in my environment. We have made a similar fix
earlier at commit "def0ce3370689b939c6d7a3c3eb824d69989ef6e".
Can you add one comment to say something like:
# Make sure pg_hba.conf is set up to allow connections from regress_seq_repl.
# This is only needed on Windows machines that don't use UNIX sockets.
+$node_publisher->init(
+ allows_streaming => 'logical',
+ auth_extra => [ '--create-role' => 'regress_seq_repl' ]);
Added the comment, attaching v2 patch.
Regards,
Ayush
Ayush
Вложения
On Fri, May 1, 2026 at 11:55 AM Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote: > > Thanks for confirming. > > On Fri, 1 May 2026 at 11:47, vignesh C <vignesh21@gmail.com> wrote: >> >> >> Thanks, the patch worked in my environment. We have made a similar fix >> earlier at commit "def0ce3370689b939c6d7a3c3eb824d69989ef6e". >> Can you add one comment to say something like: >> # Make sure pg_hba.conf is set up to allow connections from regress_seq_repl. >> # This is only needed on Windows machines that don't use UNIX sockets. >> >> +$node_publisher->init( >> + allows_streaming => 'logical', >> + auth_extra => [ '--create-role' => 'regress_seq_repl' ]); > > > Added the comment, attaching v2 patch. > Thanks for the investigation. Pushed. -- With Regards, Amit Kapila.