Discussion: Re: pgsql: Prevent invalidation of newly synced replication slots.

Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Robert Haas
Date:
On Tue, Jan 27, 2026 at 12:56 AM Amit Kapila <akapila@postgresql.org> wrote:
> Prevent invalidation of newly synced replication slots.

This commit has broken CI for me. On the "Windows - Server 2022, VS
2019 - Meson & ninja" build, the following shows up in
046_checkpoint_logical_slot_standby.log:

2026-01-27 13:44:44.421 GMT startup[5172] FATAL:  could not rename
file "backup_label" to "backup_label.old": Permission denied

I imagine this is going to break CI for everybody else too, as well as cfbot.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 27, 2026 at 12:56 AM Amit Kapila <akapila@postgresql.org> wrote:
>> Prevent invalidation of newly synced replication slots.

> This commit has broken CI for me.

Hmm, I wonder why the buildfarm seems fine with it ... I'm prepared
to believe a Windows-only problem, but at least hamerkop has run
since 851f664.

            regards, tom lane



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Tom Lane
Date:
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> This commit has broken CI for me.

> Hmm, I wonder why the buildfarm seems fine with it ... I'm prepared
> to believe a Windows-only problem, but at least hamerkop has run
> since 851f664.

D'oh: hamerkop doesn't run any TAP tests, let alone ones that require
--enable-injection-points.  So that success proves nothing.

Our other Windows animals (drongo, fairywren, unicorn) seem to be
configured with -Dtap_tests=enabled, but nothing about injection
points, so they will also skip 046_checkpoint_logical_slot.
Seems like a bit of a blind spot in the buildfarm.

            regards, tom lane



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Robert Haas
Date:
On Tue, Jan 27, 2026 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Jan 27, 2026 at 12:56 AM Amit Kapila <akapila@postgresql.org> wrote:
> >> Prevent invalidation of newly synced replication slots.
>
> > This commit has broken CI for me.
>
> Hmm, I wonder why the buildfarm seems fine with it ... I'm prepared
> to believe a Windows-only problem, but at least hamerkop has run
> since 851f664.

I don't understand it, either. There's a bunch of error codes that we
map to EACCES in _dosmaperr, but I don't know why any of those
problems would have occurred here:

ERROR_ACCESS_DENIED, EACCES
ERROR_CURRENT_DIRECTORY, EACCES
ERROR_LOCK_VIOLATION, EACCES
ERROR_SHARING_VIOLATION, EACCES
ERROR_NETWORK_ACCESS_DENIED, EACCES
ERROR_CANNOT_MAKE, EACCES
ERROR_FAIL_I24, EACCES
ERROR_DRIVE_LOCKED, EACCES
ERROR_SEEK_ON_DEVICE, EACCES
ERROR_NOT_LOCKED, EACCES
ERROR_LOCK_FAILED, EACCES

(Side note: Wouldn't it make a lot of sense to go back and kill
_dosmaperr in favor of displaying the actual Windows error code string?)

What's also puzzling is that what this test is doing seems to be
totally standard. 040_standby_failover_slots_sync.pl does this:

my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
$standby1->init_from_backup(
        $primary, $backup_name,
        has_streaming => 1,
        has_restoring => 1);

And 046_checkpoint_logical_slot.pl does this:

my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup(
    $primary, $backup_name,
    has_streaming => 1,
    has_restoring => 1);

So why is 046 failing and 040 is fine? I have no idea.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> What's also puzzling is that what this test is doing seems to be
> totally standard.

Yeah.  I do notice something interesting when running it here:
046_checkpoint_logical_slot_mike.log shows that we are triggering
quite a few checkpoints (via pg_switch_wal()) in quick succession
on the primary.  I wonder if that is somehow tickling a Windows
filesystem restriction.

            regards, tom lane



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Thomas Munro
Date:
On Wed, Jan 28, 2026 at 3:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I imagine this is going to break CI for everybody else too, as well as cfbot.

Just by the way, on that last point, we trained cfbot to watch out for
CI pass/fail in this account:

https://github.com/postgres/postgres/commits/master/

and then use the most recent pass as the base commit when applying
patches to make test branches.  So if master is broken for a while, it
no longer takes all the cfbot runs with it.  Mentioning just in case
anyone is confused by that...

As for what's happening... hmm, there are a few holes in the "shared
locking" stuff you get with the flags we use.  For example you can't
unlink a directory that contains a file that has been unlinked but
someone still holds open.  Doesn't seem to be the case here.  But I
wonder if you can't rename("old", "new") where "new" is a file that
has already been unlinked (or renamed over) that someone still holds
open, or something like that...



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Andres Freund
Date:
Hi,

On 2026-01-27 10:51:58 -0500, Robert Haas wrote:
> I don't understand it, either. There's a bunch of error codes that we
> map to EACCES in _dosmaperr, but I don't know why any of those
> problems would have occurred here:
> 
> ERROR_ACCESS_DENIED, EACCES
> ERROR_CURRENT_DIRECTORY, EACCES
> ERROR_LOCK_VIOLATION, EACCES
> ERROR_SHARING_VIOLATION, EACCES
> ERROR_NETWORK_ACCESS_DENIED, EACCES
> ERROR_CANNOT_MAKE, EACCES
> ERROR_FAIL_I24, EACCES
> ERROR_DRIVE_LOCKED, EACCES
> ERROR_SEEK_ON_DEVICE, EACCES
> ERROR_NOT_LOCKED, EACCES
> ERROR_LOCK_FAILED, EACCES
> 
> (Side note: Wouldn't it make a lot of sense to go back and kill
> _dosmaperr in favor of displaying the actual Windows error code string?)

It'd be great to somehow preserve the original error message alongside the
mapping, but I don't really see how we could just give up on our mapping. We
rely on e.g. knowing that a read failed due to ENOENT, not
ERROR_FILE_NOT_FOUND or whatnot.


> What's also puzzling is that what this test is doing seems to be
> totally standard. 040_standby_failover_slots_sync.pl does this:
> 
> my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
> $standby1->init_from_backup(
>         $primary, $backup_name,
>         has_streaming => 1,
>         has_restoring => 1);
> 
> And 046_checkpoint_logical_slot.pl does this:
> 
> my $standby = PostgreSQL::Test::Cluster->new('standby');
> $standby->init_from_backup(
>     $primary, $backup_name,
>     has_streaming => 1,
>     has_restoring => 1);
> 
> So why is 046 failing and 040 is fine? I have no idea.

046 does a fair bit of stuff before the base backup is taken, I guess?
But what that concretely could be, I have no idea.

It'd be one thing if it failed while creating the base backup, but the fact
that the base backup is created fine and it then fails during startup is just
plain odd.  The typical sharing-violation issue seems like it'd require that
we somehow are not waiting for pg_basebackup to actually have terminated?

Greetings,

Andres Freund



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Robert Haas
Date:
On Tue, Jan 27, 2026 at 11:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > What's also puzzling is that what this test is doing seems to be
> > totally standard.
>
> Yeah.  I do notice something interesting when running it here:
> 046_checkpoint_logical_slot_mike.log shows that we are triggering
> quite a few checkpoints (via pg_switch_wal()) in quick succession
> on the primary.  I wonder if that is somehow tickling a Windows
> filesystem restriction.

Maybe, but it seems unlikely to me that this would mess up the
standby, since it's a totally different node. What I kind of wonder is
if somehow there's still a process that has backup_label open, or has
closed it so recently that Windows hasn't released the lock yet. However, I
don't see why that would affect this test case and not others.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Andres Freund
Date:
Hi,

On 2026-01-28 05:16:13 +1300, Thomas Munro wrote:
> On Wed, Jan 28, 2026 at 3:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I imagine this is going to break CI for everybody else too, as well as cfbot.
> 
> Just by the way, on that last point, we trained cfbot to watch out for
> CI pass/fail in this account:
> 
> https://github.com/postgres/postgres/commits/master/
> 
> and then use the most recent pass as the base commit when applying
> patches to make test branches.  So if master is broken for a while, it
> no longer takes all the cfbot runs with it.  Mentioning just in case
> anyone is confused by that...

Ah. I was indeed confused by that for a bit.


> But I wonder if you can't rename("old", "new") where "new" is a file that
> has already been unlinked (or renamed over) that someone still holds open,
> or something like that...

I don't see a source of that that would be specific to this test though :(. We
do wait for pg_basebackup to have shut down, which wrote backup_label (which
was "manufactured" during streaming by basebackup.c).


Perhaps we should crank up the log level in the test? No idea if it'll help, but
right now I don't even know where to start looking.

Greetings,

Andres Freund



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Greg Burd
Date:
> On Jan 27, 2026, at 10:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> This commit has broken CI for me.
>
>> Hmm, I wonder why the buildfarm seems fine with it ... I'm prepared
>> to believe a Windows-only problem, but at least hamerkop has run
>> since 851f664.
>
> D'oh: hamerkop doesn't run any TAP tests, let alone ones that require
> --enable-injection-points.  So that success proves nothing.
>
> Our other Windows animals (drongo, fairywren, unicorn) seem to be
> configured with -Dtap_tests=enabled, but nothing about injection
> points, so they will also skip 046_checkpoint_logical_slot.
> Seems like a bit of a blind spot in the buildfarm.
>
> regards, tom lane
>


I'll see if I can update unicorn today to enable injection points to add some coverage on Win11/ARM64/MSVC.  No
promises that it will be diagnostic at all, but it seems like a good idea.


-Dinjection_points=true


-greg


Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Robert Haas
Date:
On Tue, Jan 27, 2026 at 11:53 AM Greg Burd <greg@burd.me> wrote:
> I'll see if I can update unicorn today to enable injection points to add some coverage on Win11/ARM64/MSVC.  No
> promises that it will be diagnostic at all, but it seems like a good idea.
> -Dinjection_points=true

Sounds good!

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Robert Haas
Date:
On Tue, Jan 27, 2026 at 11:37 AM Andres Freund <andres@anarazel.de> wrote:
> > But I wonder if you can't rename("old", "new") where "new" is a file that
> > has already been unlinked (or renamed over) that someone still holds open,
> > or something like that...
>
> I don't see a source of that that would be specific to this test though :(. We
> do wait for pg_basebackup to have shut down, which wrote backup_label (which
> was "manufactured" during streaming by basebackup.c).
>
> Perhaps we should crank up log level in the test? No idea if it'll help, but
> right now I don't even know where to start looking.

I tried sticking a pg_sleep(30) in just before starting the standby
node, and that didn't help, so it doesn't seem like it's a race
condition.
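
[A minimal sketch of the experiment described above, assuming the delay is
issued as a query against the primary just before the standby is started;
the node variable names follow the snippets quoted elsewhere in this thread,
and the snippet is illustrative only, not the exact hunk that was tested:]

# Illustrative only: hold the test script for ~30 seconds before bringing
# up the standby, to rule out a race with anything on the primary side that
# might still have backup_label (or its directory) open.
$primary->safe_psql('postgres', 'SELECT pg_sleep(30);');

$standby->start;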

Here's what the standby log file looks like with log_min_messages=DEBUG2:

2026-01-27 17:19:25.262 GMT postmaster[4932] DEBUG:  registering
background worker "logical replication launcher"
2026-01-27 17:19:25.264 GMT postmaster[4932] DEBUG:  dynamic shared
memory system will support 229 segments
2026-01-27 17:19:25.264 GMT postmaster[4932] DEBUG:  created dynamic
shared memory control segment 3769552926 (9176 bytes)
2026-01-27 17:19:25.266 GMT postmaster[4932] DEBUG:  max_safe_fds =
990, usable_fds = 1000, already_open = 3
2026-01-27 17:19:25.268 GMT postmaster[4932] LOG:  starting PostgreSQL
19devel on x86_64-windows, compiled by msvc-19.29.30159, 64-bit
2026-01-27 17:19:25.271 GMT postmaster[4932] LOG:  listening on Unix
socket "C:/Windows/TEMP/3xesO1s4ba/.s.PGSQL.17575"
2026-01-27 17:19:25.273 GMT postmaster[4932] DEBUG:  updating PMState
from PM_INIT to PM_STARTUP
2026-01-27 17:19:25.273 GMT postmaster[4932] DEBUG:  assigned pm child
slot 57 for io worker
2026-01-27 17:19:25.275 GMT postmaster[4932] DEBUG:  assigned pm child
slot 58 for io worker
2026-01-27 17:19:25.277 GMT postmaster[4932] DEBUG:  assigned pm child
slot 59 for io worker
2026-01-27 17:19:25.278 GMT postmaster[4932] DEBUG:  assigned pm child
slot 56 for checkpointer
2026-01-27 17:19:25.280 GMT postmaster[4932] DEBUG:  assigned pm child
slot 55 for background writer
2026-01-27 17:19:25.281 GMT postmaster[4932] DEBUG:  assigned pm child
slot 89 for startup
2026-01-27 17:19:25.308 GMT checkpointer[6560] DEBUG:  checkpointer
updated shared memory configuration values
2026-01-27 17:19:25.314 GMT startup[2488] LOG:  database system was
interrupted; last known up at 2026-01-27 17:19:21 GMT
2026-01-27 17:19:25.317 GMT startup[2488] DEBUG:  removing all
temporary WAL segments
The system cannot find the file specified.
2026-01-27 17:19:25.336 GMT startup[2488] DEBUG:  could not restore
file "00000002.history" from archive: child process exited with exit
code 1
2026-01-27 17:19:25.337 GMT startup[2488] DEBUG:  backup time
2026-01-27 17:19:21 GMT in file "backup_label"
2026-01-27 17:19:25.337 GMT startup[2488] DEBUG:  backup label
pg_basebackup base backup in file "backup_label"
2026-01-27 17:19:25.337 GMT startup[2488] DEBUG:  backup timeline 1 in
file "backup_label"
2026-01-27 17:19:25.337 GMT startup[2488] LOG:  starting backup
recovery with redo LSN 0/2A000028, checkpoint LSN 0/2A000080, on
timeline ID 1
The system cannot find the file specified.
2026-01-27 17:19:25.352 GMT startup[2488] DEBUG:  could not restore
file "00000001000000000000002A" from archive: child process exited
with exit code 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG:  checkpoint record is
at 0/2A000080
2026-01-27 17:19:25.353 GMT startup[2488] LOG:  entering standby mode
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG:  redo record is at
0/2A000028; shutdown false
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG:  next transaction ID:
769; next OID: 24576
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG:  next MultiXactId: 1;
next MultiXactOffset: 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG:  oldest unfrozen
transaction ID: 760, in database 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG:  oldest MultiXactId:
1, in database 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG:  commit timestamp Xid
oldest/newest: 0/0
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG:  transaction ID wrap
limit is 2147484407, limited by database with OID 1
2026-01-27 17:19:25.353 GMT startup[2488] DEBUG:  MultiXactId wrap
limit is 2147483648, limited by database with OID 1
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG:  starting up replication slots
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG:  xmin required by
slots: data 0, catalog 0
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG:  starting up
replication origin progress state
2026-01-27 17:19:25.354 GMT startup[2488] DEBUG:  didn't need to
unlink permanent stats file "pg_stat/pgstat.stat" - didn't exist
2026-01-27 17:19:38.938 GMT startup[2488] FATAL:  could not rename
file "backup_label" to "backup_label.old": Permission denied
2026-01-27 17:19:38.983 GMT postmaster[4932] DEBUG:  releasing pm child slot 89
2026-01-27 17:19:38.983 GMT postmaster[4932] LOG:  startup process
(PID 2488) exited with exit code 1
2026-01-27 17:19:38.983 GMT postmaster[4932] LOG:  aborting startup
due to startup process failure
2026-01-27 17:19:38.983 GMT postmaster[4932] DEBUG:  cleaning up
dynamic shared memory control segment with ID 3769552926
2026-01-27 17:19:38.985 GMT postmaster[4932] LOG:  database system is shut down

Unfortunately, I don't see any clues there. The "The system cannot
find the file specified." messages look like they might be a clue, but
I think they are not, because they also occur in
040_standby_failover_slots_sync_standby1.log, and that test passes. At
the point where this log file shows the FATAL error, that log file
continues thus:

2026-01-27 17:18:36.905 GMT startup[1420] DEBUG:  resetting unlogged
relations: cleanup 1 init 0
2026-01-27 17:18:36.906 GMT startup[1420] DEBUG:  initializing for hot standby
2026-01-27 17:18:36.906 GMT startup[1420] LOG:  redo starts at 0/02000028
2026-01-27 17:18:36.906 GMT startup[1420] DEBUG:  recovery snapshots
are now enabled
2026-01-27 17:18:36.906 GMT startup[1420] CONTEXT:  WAL redo at
0/02000048 for Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid
768 oldestRunningXid 769
2026-01-27 17:18:36.907 GMT startup[1420] DEBUG:  end of backup record reached
2026-01-27 17:18:36.907 GMT startup[1420] CONTEXT:  WAL redo at
0/02000100 for XLOG/BACKUP_END: 0/02000028
2026-01-27 17:18:36.907 GMT startup[1420] DEBUG:  end of backup reached

Which again seems totally normal.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Robert Haas
Date:
On Tue, Jan 27, 2026 at 12:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
> 2026-01-27 17:19:25.354 GMT startup[2488] DEBUG:  didn't need to
> unlink permanent stats file "pg_stat/pgstat.stat" - didn't exist
> 2026-01-27 17:19:38.938 GMT startup[2488] FATAL:  could not rename
> file "backup_label" to "backup_label.old": Permission denied

Andrey Borodin pointed out to me off-list that there's a retry loop in
pgrename(). The 13 second delay between the above two log messages
almost certainly means that retry loop is iterating until it hits its
10 second timeout. This almost certainly means that the underlying
Windows error is ERROR_ACCESS_DENIED, ERROR_SHARING_VIOLATION, or
ERROR_LOCK_VIOLATION, and that somebody else has the file open. But
nothing other than Perl touches that directory before we try to start
the standby:

my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup(
        $primary, $backup_name,
        has_streaming => 1,
        has_restoring => 1);
$standby->append_conf(
        'postgresql.conf', qq(
hot_standby_feedback = on
primary_slot_name = 'phys_slot'
primary_conninfo = '$connstr_1 dbname=postgres'
log_min_messages = 'debug2'
));
$standby->start;

As far as I can see, only init_from_backup() touches the backup_label
file, and that just copies the directory using RecursiveCopy.pm, which
as far as I can tell is quite careful about closing file handles. So I
still have no idea what's happening here.
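
[For context, the handle-careful pattern being referred to looks roughly like
the following; a simplified sketch with a made-up helper name, not
RecursiveCopy.pm's actual code:]

use strict;
use warnings;

# Simplified sketch of a per-file copy that is careful about handles: both
# the source and destination handles are lexical and are explicitly closed
# before the function returns, so nothing stays open once the copy is done.
sub copy_one_file
{
    my ($src, $dst) = @_;

    open(my $in, '<', $src)  or die "could not open \"$src\": $!";
    open(my $out, '>', $dst) or die "could not open \"$dst\": $!";
    binmode($in);
    binmode($out);

    while (read($in, my $buf, 65536))
    {
        print $out $buf or die "could not write \"$dst\": $!";
    }

    close($in)  or die "could not close \"$src\": $!";
    close($out) or die "could not close \"$dst\": $!";
}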

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Andres Freund
Date:
Hi,

On 2026-01-27 12:42:51 -0500, Robert Haas wrote:
> I tried sticking a pg_sleep(30) in just before starting the standby
> node, and that didn't help, so it doesn't seem like it's a race
> condition.

Interesting.

It could be worth trying to run the test in isolation, without all the other
concurrent tests.

Greg, have you tried to repro it interactively?

Bryan, you seem to have become the resident windows expert...


> 2026-01-27 17:19:25.337 GMT startup[2488] LOG:  starting backup
> recovery with redo LSN 0/2A000028, checkpoint LSN 0/2A000080, on
> timeline ID 1
> The system cannot find the file specified.
> 2026-01-27 17:19:25.352 GMT startup[2488] DEBUG:  could not restore
> file "00000001000000000000002A" from archive: child process exited
> with exit code 1

I think that must be a message from "copy" (which we seem to be using for
restore_command on windows).

I don't know why the standby is created with has_restoring => 1. But it
shouldn't be related to the issue, I think?

Greetings,

Andres Freund



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Amit Kapila
Date:
On Tue, Jan 27, 2026 at 8:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jan 27, 2026 at 12:56 AM Amit Kapila <akapila@postgresql.org> wrote:
> > Prevent invalidation of newly synced replication slots.
>
> This commit has broken CI for me. On the "Windows - Server 2022, VS
> 2019 - Meson & ninja" build, the following shows up in
> 046_checkpoint_logical_slot_standby.log:
>
> 2026-01-27 13:44:44.421 GMT startup[5172] FATAL:  could not rename
> file "backup_label" to "backup_label.old": Permission denied
>
> I imagine this is going to break CI for everybody else too, as well as cfbot.
>

I'll try to reproduce and look into it.

--
With Regards,
Amit Kapila.



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Amit Kapila
Date:
On Tue, Jan 27, 2026 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jan 27, 2026 at 12:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > 2026-01-27 17:19:25.354 GMT startup[2488] DEBUG:  didn't need to
> > unlink permanent stats file "pg_stat/pgstat.stat" - didn't exist
> > 2026-01-27 17:19:38.938 GMT startup[2488] FATAL:  could not rename
> > file "backup_label" to "backup_label.old": Permission denied
>
> Andrey Borodin pointed out to me off-list that there's a retry loop in
> pgrename(). The 13 second delay between the above two log messages
> almost certainly means that retry loop is iterating until it hits its
> 10 second timeout.
>

Yes, this is correct. I am able to reproduce it. In pgrename(), we use
the MoveFileEx() Windows API, which fails with error code 32; _dosmaperr
maps that to errno 13 (EACCES), following the ERROR_SHARING_VIOLATION,
EACCES entry in the doserrors struct.

> This almost certainly means that the underlying
> Windows error is ERROR_ACCESS_DENIED, ERROR_SHARING_VIOLATION, or
> ERROR_LOCK_VIOLATION, and that somebody else has the file open.
>

It is ERROR_SHARING_VIOLATION.

> But
> nothing other than Perl touches that directory before we try to start
> the standby:
>
> my $standby = PostgreSQL::Test::Cluster->new('standby');
> $standby->init_from_backup(
>         $primary, $backup_name,
>         has_streaming => 1,
>         has_restoring => 1);
> $standby->append_conf(
>         'postgresql.conf', qq(
> hot_standby_feedback = on
> primary_slot_name = 'phys_slot'
> primary_conninfo = '$connstr_1 dbname=postgres'
> log_min_messages = 'debug2'
> ));
> $standby->start;
>
> As far as I can see, only init_from_backup() touches the backup_label
> file, and that just copies the directory using RecursiveCopy.pm, which
> as far as I can tell is quite careful about closing file handles. So I
> still have no idea what's happening here.
>

It is not clear to me either why a similar test like
040_standby_failover_slots_sync is successful and
046_checkpoint_logical_slot is failing. I am still thinking about it
but thought of sharing the information I could gather by debugging.

Do let me know if you can think of any other information worth gathering
that could be of help here.

--
With Regards,
Amit Kapila.



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Thomas Munro
Date:
On Tue, Jan 27, 2026 at 5:37 PM Andres Freund <andres@anarazel.de> wrote:
> On 2026-01-28 05:16:13 +1300, Thomas Munro wrote:
> > But I wonder if you can't rename("old", "new") where "new" is a file that
> > has already been unlinked (or renamed over) that someone still holds open,
> > or something like that...
>
> I don't see a source of that that would be specific to this test though :(. We
> do wait for pg_basebackup to have shut down, which wrote backup_label (which
> was "manufactured" during streaming by basebackup.c).

I have no specific ideas, but just in case it's helpful for this
discussion, I looked at my old test suite[1] where I tried to
catalogue all the edge conditions around this sort of stuff
empirically, and saw that rename() always fails like that if the file
is open (that is, it doesn't require a more complicated sequence with
an earlier unlink/rename of the new name):

+ /*
+ * Windows can't rename over an open non-unlinked file, even with
+ * have_posix_unlink_semantics.
+ */
+ pgwin32_dirmod_loops = 2; /* minimize looping to fail fast in testing */
+ PG_EXPECT_SYS(rename(path, path2) == -1,
+  "Windows: can't rename name1.txt -> name2.txt while name2.txt is open");
+ PG_EXPECT_EQ(errno, EACCES);

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Andrey Borodin
Date:

> On 28 Jan 2026, at 10:47, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Do let me know if you can think of any other information worth gathering
> that could be of help here.

Interestingly, increasing the timeout in pgrename() to 500 seconds fixes "Windows - Server 2022, VS 2019 - Meson & ninja",
but does not fix "Windows - Server 2022, MinGW64 - Meson".

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 467b50d6f09..da38e37aa45 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -88,7 +88,7 @@ pgrename(const char *from, const char *to)
                        return -1;
 #endif
-               if (++loops > 100)              /* time out after 10 sec */
+               if (++loops > 5000)             /* time out after 10 sec */
                        return -1;
                pg_usleep(100000);              /* us */
        }


Best regards, Andrey Borodin.


Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Amit Kapila
Date:
On Wed, Jan 28, 2026 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> It is not clear to me either why a similar test like
> 040_standby_failover_slots_sync is successful and
> 046_checkpoint_logical_slot is failing. I am still thinking about it
> but thought of sharing the information I could gather by debugging.
>

It seems there is some interaction with the previous test in the same file
that is causing this failure, as we are using the primary node from the
previous test. When I tried to comment out get_changes and its
corresponding injection_point in the previous test, as attached, the
entire test passed. I think if we use a freshly created primary node,
this test will pass, but I wanted to spend some more time to see
how/why the previous test is causing this issue.

--
With Regards,
Amit Kapila.

Attachments

Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Amit Kapila
Date:
On Tue, Jan 27, 2026 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:
>
> I don't know why the standby is created with has_restoring => 1.
>

This is not required. I think this is a copy-paste oversight.

> But it
> shouldn't be related to the issue, I think?
>

Yeah, I tried without this as well, apart from other experiments.

--
With Regards,
Amit Kapila.



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Amit Kapila
Date:
On Wed, Jan 28, 2026 at 4:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 28, 2026 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > It is not clear to me either why a similar test like
> > 040_standby_failover_slots_sync is successful and
> > 046_checkpoint_logical_slot is failing. I am still thinking about it
> > but thought of sharing the information I could gather by debugging.
> >
>
> It seems there is some interaction with the previous test in the same file
> that is causing this failure, as we are using the primary node from the
> previous test. When I tried to comment out get_changes and its
> corresponding injection_point in the previous test, as attached, the
> entire test passed. I think if we use a freshly created primary node,
> this test will pass, but I wanted to spend some more time to see
> how/why the previous test is causing this issue.
>

I noticed that the previous test didn't quit the background psql
session used for the concurrent checkpoint. By quitting that background
session, the test passed for me consistently. See attached. It is
written in comments atop background_psql: "Be sure to "quit" the
returned object when done with it.". Now, this background session
doesn't directly access the backup_label file but it could be
accessing one of the parent directories where backup_label is present.
One gen-AI tool says the following: "In Windows, MoveFileEx (Error 32:
ERROR_SHARING_VIOLATION) can fail if a process is accessing the file's
parent directory in a way that creates a lock. While the error message
usually points to the file itself, the parent folder is a critical
part of the operation.". I admit that I don't know the internals of
MoveFileEx, so I can't say with complete conviction, but the attached
sounds like a reasonable fix. Can anyone else who can reproduce the
issue test the attached patch once and share the results?

Does this fix/theory sound plausible?
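
[A minimal sketch of the kind of change being proposed, with a hypothetical
handle name ($checkpoint_session); the attached patch is what actually
implements it:]

# The previous test keeps a session from background_psql() to run a
# CHECKPOINT concurrently with the injection point; the key change is the
# explicit quit once the session is no longer needed.
my $checkpoint_session = $node->background_psql('postgres');
$checkpoint_session->query_until(
    qr/starting_checkpoint/,
    q(\echo starting_checkpoint
CHECKPOINT;));

# ... remainder of the test ...

# Quit the background session, as requested by the comments atop
# background_psql; otherwise the psql process lingers into the next test
# and keeps whatever it holds open.
$checkpoint_session->quit;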

--
With Regards,
Amit Kapila.

Attachments

Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Robert Haas
Date:
On Wed, Jan 28, 2026 at 7:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Does this fix/theory sound plausible?

I wondered about this yesterday, too. I didn't actually understand how
the existence of the background psql could be causing the failure, but
I thought it might be. However, I couldn't figure out the correct
incantation to get rid of it in my testing, as I thought I would need
to detach the injection point first or something.

If it fixes it for you, I would suggest committing promptly. I think
we are too dependent on CI now to leave it broken for any period of
time, and indeed I suggest getting set up so that you test your
commits against it before committing.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Amit Kapila
Date:
On Wed, Jan 28, 2026 at 6:28 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jan 28, 2026 at 7:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Does this fix/theory sound plausible?
>
> I wondered about this yesterday, too. I didn't actually understand how
> the existence of the background psql could be causing the failure, but
> I thought it might be. However, I couldn't figure out the correct
> incantation to get rid of it in my testing, as I thought I would need
> to detach the injection point first or something.
>

Yeah, it would be better to quit these sessions after the test is
complete because there are two other background sessions as well. I
used the same method to quit these sessions as used in
src/test/modules/test_misc/t/005_timeouts.pl. The attached passes for
me on both Linux and Windows (checked on HEAD only as of now). I'll do
some more testing on back branches as well and push tomorrow morning
if there are no more comments.
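
[Roughly, the 005_timeouts.pl-style teardown being borrowed; the injection
point name and the $background_session handle below are placeholders, and the
exact calls and ordering follow that test rather than this sketch:]

# Placeholder injection point name; the real test releases whichever point
# the background session is parked on.  Waking and then detaching lets the
# blocked backend finish, after which quitting the session reaps psql.
$node->safe_psql('postgres',
    "SELECT injection_points_wakeup('some-injection-point');");
$node->safe_psql('postgres',
    "SELECT injection_points_detach('some-injection-point');");
$background_session->quit;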

--
With Regards,
Amit Kapila.

Attachments

Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Andres Freund
Date:
Hi,

On 2026-01-28 18:05:10 +0530, Amit Kapila wrote:
> I noticed that the previous test didn't quit the background psql
> session used for the concurrent checkpoint. By quitting that background
> session, the test passed for me consistently. See attached. It is
> written in comments atop background_psql: "Be sure to "quit" the
> returned object when done with it.". Now, this background session
> doesn't directly access the backup_label file but it could be
> accessing one of the parent directories where backup_label is present.

Hm. I've seen (and complained about [1]) weird errors when not shutting down
IPC::Run processes - mostly the test hanging at the end though.


> One gen-AI tool says the following: "In Windows, MoveFileEx (Error 32:
> ERROR_SHARING_VIOLATION) can fail if a process is accessing the file's
> parent directory in a way that creates a lock. While the error message
> usually points to the file itself, the parent folder is a critical
> part of the operation.".

I don't see how that could plausibly be the reason - after all, we have a lot
of other files open in the relevant directories.  But: it seems to fix
the problem for you, so it's worth going for it, as it's the right thing to do
anyway.


I think it'd be worth, separately from committing the workaround, trying to
figure out what's holding the file open. Andrey observed that the tests pass
for him with a much longer timeout. If you can reproduce it locally, I'd try
to use something like [2] to see what has handles open to the relevant files,
while waiting for the timeout.

Greetings,

Andres Freund

[1] https://postgr.es/m/20240619030727.ldp3mcrjbd5fqwj5%40awork3.anarazel.de
[2] https://learn.microsoft.com/en-us/sysinternals/downloads/handle



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
"Greg Burd"
Date:
On Tue, Jan 27, 2026, at 1:17 PM, Andres Freund wrote:
> Hi,
>
> On 2026-01-27 12:42:51 -0500, Robert Haas wrote:
>> I tried sticking a pg_sleep(30) in just before starting the standby
>> node, and that didn't help, so it doesn't seem like it's a race
>> condition.
>
> Interesting.
>
> It could be worth trying to run the test in isolation, without all the other
> concurrent tests.
>
> Greg, have you tried to repro it interactively?

Nope, not yet.  I'm working on my ailing animals now and updated unicorn to include injection points.

-greg

> Bryan, you seem to have become the resident windows expert...
>
>
>> 2026-01-27 17:19:25.337 GMT startup[2488] LOG:  starting backup
>> recovery with redo LSN 0/2A000028, checkpoint LSN 0/2A000080, on
>> timeline ID 1
>> The system cannot find the file specified.
>> 2026-01-27 17:19:25.352 GMT startup[2488] DEBUG:  could not restore
>> file "00000001000000000000002A" from archive: child process exited
>> with exit code 1
>
> I think that must be a message from "copy" (which we seem to be using for
> restore_command on windows).
>
> I don't know why the standby is created with has_restoring => 1. But it
> shouldn't be related to the issue, I think?
>
> Greetings,
>
> Andres Freund



Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Andrey Borodin
Date:

> On 28 Jan 2026, at 21:53, Andres Freund <andres@anarazel.de> wrote:
>
> Andrey observed that the tests pass
> for him with a much longer timeout.

Unfortunately, I was wrong. The job "Windows - Server 2022, MinGW64 - Meson" which failed yesterday did not fail today.
But it did not succeed either; CirrusCI simply seems not to have run it, and I do not understand why.
Anyway, I cannot prove that it is a race condition. On the contrary, the test fails deterministically with any large
timeout (pg_ctl will bail out).


Best regards, Andrey Borodin.


Re: pgsql: Prevent invalidation of newly synced replication slots.

From:
Amit Kapila
Date:
On Wed, Jan 28, 2026 at 10:24 PM Andres Freund <andres@anarazel.de> wrote:
>
> I think it'd be worth, separately from committing the workaround, trying to
> figure out what's holding the file open. Andrey observed that the tests pass
> for him with a much longer timeout. If you can reproduce it locally, I'd try
> to use something like [2] to see what has handles open to the relevant files,
> while waiting for the timeout.
>

Thanks for the suggestion. I did some experiments using handle.exe,
and below are the results. To get them, I added a long sleep
before the rename of the backup_label file.

After Fix:
==========
handle.exe D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label

Nthandle v5.0 - Handle viewer
Copyright (C) 1997-2022 Mark Russinovich
Sysinternals - www.sysinternals.com

No matching handles found.

Before Fix:
==========
handle.exe D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label

Nthandle v5.0 - Handle viewer
Copyright (C) 1997-2022 Mark Russinovich
Sysinternals - www.sysinternals.com

perl.exe           pid: 33784  type: File           30C: D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label
pg_ctl.exe         pid: 51236  type: File           30C: D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label
cmd.exe            pid: 35332  type: File           30C: D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label
postgres.exe       pid: 48200  type: File           30C: D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label
postgres.exe       pid: 7420   type: File           30C: D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label
postgres.exe       pid: 17160  type: File           30C: D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label
postgres.exe       pid: 56192  type: File           30C: D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label
postgres.exe       pid: 53892  type: File           30C: D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label
postgres.exe       pid: 44732  type: File           30C: D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label
postgres.exe       pid: 43488  type: File           30C: D:\Workspace\Postgresql\head\postgresql\build\testrun\recovery\046_checkpoint_logical_slot\data\t_046_checkpoint_logical_slot_standby_data\pgdata\backup_label

All the shown postgres processes are various standby processes. Below
are details of each postgres process:

43488: startup process
XLogCtl->SharedRecoveryState RECOVERY_STATE_ARCHIVE (1)

44732: bgwriter:
XLogCtl->SharedRecoveryState RECOVERY_STATE_ARCHIVE (1)

53892: checkpointer
XLogCtl->SharedRecoveryState RECOVERY_STATE_ARCHIVE (1)

56192: aio-worker
XLogCtl->SharedRecoveryState RECOVERY_STATE_ARCHIVE (1)

17160: aio-worker
XLogCtl->SharedRecoveryState RECOVERY_STATE_ARCHIVE (1)

7420: aio-worker
XLogCtl->SharedRecoveryState RECOVERY_STATE_ARCHIVE (1)

48200: postmaster
XLogCtl->SharedRecoveryState RECOVERY_STATE_ARCHIVE (1)

I printed XLogCtl->SharedRecoveryState to show that all of them are standby processes.

The results are a bit strange in the sense that unfinished psql
sessions on the primary could lead to standby processes being shown in
the handle.exe output as holding the file.

Note: I have access to this environment till tomorrow noon, so I can
try to investigate a bit tomorrow if there are more questions related
to the above experiment.

--
With Regards,
Amit Kapila.