Обсуждение: BUG #18985: fast shutdown does not close connections from qlik data gateway data movement aka. replicate

Поиск
Список
Период
Сортировка

BUG #18985: fast shutdown does not close connections from qlik data gateway data movement aka. replicate

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      18985
Logged by:          Erik Dobak
Email address:      erik.dobak@gmail.com
PostgreSQL version: 14.8
Operating system:   Windows server 2022
Description:

We use postgresql 14.18 at a windows server 2022 as a repository db for the
Qlik Sense product. Additionally we connect to the postgresql db from a
remote server running Qlik Data Gateway Data Movement server which
replicates data and changes to various cloud DBs by using replication slots.

When we try to shutdown the postgresql db service (per windows services or
pg_ctl -m fast) almost all connections to the posgresql server are closed
but the Qlik Data Gateway Data Movement connections can be seen as
ESTABLISHED when using the netstat -ano command.

From the posgresql logs -> this happens when i try to shutdown postgresql:

2025-07-15 08:20:36.676 UTC [14740] LOG:  could not receive data from
client: An existing connection was forcibly closed by the remote host.
2025-07-15 08:20:43.980 UTC [14516] LOG:  received fast shutdown request
2025-07-15 08:20:43.986 UTC [14516] LOG:  aborting any active transactions
2025-07-15 08:20:43.986 UTC [14604] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.987 UTC [8020] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.987 UTC [13800] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.987 UTC [15232] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.987 UTC [13600] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.987 UTC [10176] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.987 UTC [7648] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.988 UTC [2272] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.988 UTC [11072] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.988 UTC [13304] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.987 UTC [12656] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.988 UTC [15496] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.988 UTC [2276] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.988 UTC [13256] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.988 UTC [10828] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.988 UTC [6016] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:43.988 UTC [13232] FATAL:  terminating connection due to
administrator command
2025-07-15 08:20:44.047 UTC [14516] LOG:  background worker "logical
replication launcher" (PID 6400) exited with exit code 1
2025-07-15 08:20:44.053 UTC [5448] LOG:  shutting down
2025-07-15 08:20:56.512 UTC [4364] LOG:  connection received: host=::1
port=52514
2025-07-15 08:20:56.512 UTC [4364] FATAL:  the database system is shutting
down
2025-07-15 08:20:56.547 UTC [13968] LOG:  connection received:
host=127.0.0.1 port=52515
2025-07-15 08:20:56.547 UTC [13968] FATAL:  the database system is shutting
down
2025-07-15 08:21:56.486 UTC [10456] LOG:  connection received: host=::1
port=52543
2025-07-15 08:21:56.487 UTC [10456] FATAL:  the database system is shutting
down
2025-07-15 08:21:56.522 UTC [4812] LOG:  connection received: host=127.0.0.1
port=52544
2025-07-15 08:21:56.522 UTC [4812] FATAL:  the database system is shutting
down
2025-07-15 08:21:56.561 UTC [15444] LOG:  connection received: host=::1
port=52545
2025-07-15 08:21:56.562 UTC [15444] FATAL:  the database system is shutting
down
2025-07-15 08:21:56.595 UTC [12844] LOG:  connection received:
host=127.0.0.1 port=52546
2025-07-15 08:21:56.595 UTC [12844] FATAL:  the database system is shutting
down
2025-07-15 08:22:56.505 UTC [15004] LOG:  connection received: host=::1
port=52574
...


this did go on for 30minutes or more. then i checked with netstat (we are
using port 4432 instead of 5432 for postgresdb):

PS C:\Users\redacted> netstat -ano|findstr 4432
  TCP    0.0.0.0:4432           0.0.0.0:0              LISTENING       14516
  TCP    redacted_postgres_ip:4432         readacted_dgdm_ip:35818
ESTABLISHED     14516
  TCP    redacted_postgres_ip:4432         redacted_dgdm_ip:35886
ESTABLISHED     14516
  TCP    [::]:4432              [::]:0                 LISTENING       14516

here you can see 2 active network connection even though we have initiated a
fast shutdown 30mins or more before.

i decided to shutdown our Qlik Data Gateway Data Movement service on the
other server and immediately the postgresDB did stop correctly:

...
2025-07-15 08:50:51.799 UTC [5448] LOG:  checkpoint starting: shutdown
immediate
2025-07-15 08:50:51.859 UTC [5448] LOG:  checkpoint complete: wrote 9
buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.004 s,
sync=0.019 s, total=0.067 s; sync files=7, longest=0.004 s, average=0.003 s;
distance=38 kB, estimate=22505 kB
2025-07-15 08:50:51.910 UTC [14516] LOG:  database system is shut down.

-------

According to https://www.postgresql.org/docs/14/app-pg-ctl.html :
...
“Fast” mode (the default) does not wait for clients to disconnect and will
terminate an online backup in progress. All active transactions are rolled
back and clients are forcibly disconnected, then the server is shut down.
...

In our case there are at least 2 client connection not forcibly disconnected
and this is why i open this bug.


On Tuesday, July 15, 2025, PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      18985
Logged by:          Erik Dobak
Email address:      erik.dobak@gmail.com
PostgreSQL version: 14.8
Operating system:   Windows server 2022
Description:       

We use postgresql 14.18 at a windows server 2022 as a repository db for the
Qlik Sense product. Additionally we connect to the postgresql db from a
remote server running Qlik Data Gateway Data Movement server which
replicates data and changes to various cloud DBs by using replication slots.

Can you replicate this on a supported version of PostgreSQL?

David J. 


On Tue, 15 Jul 2025, 16:35 David G. Johnston, <david.g.johnston@gmail.com> wrote:
On Tuesday, July 15, 2025, PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      18985
Logged by:          Erik Dobak
Email address:      erik.dobak@gmail.com
PostgreSQL version: 14.8
Operating system:   Windows server 2022
Description:       

We use postgresql 14.18 at a windows server 2022 as a repository db for the
Qlik Sense product. Additionally we connect to the postgresql db from a
remote server running Qlik Data Gateway Data Movement server which
replicates data and changes to various cloud DBs by using replication slots.

Can you replicate this on a supported version of PostgreSQL?

David J. 

14.18 is afaik supported. it is runn:ng on 14.8. so yes i could replicate this on a supported version.

E
ah yes now i see. we could reproduce it on 14.8 and 14.18. i

E

On Tue, 15 Jul 2025, 16:35 David G. Johnston, <david.g.johnston@gmail.com> wrote:
On Tuesday, July 15, 2025, PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      18985
Logged by:          Erik Dobak
Email address:      erik.dobak@gmail.com
PostgreSQL version: 14.8
Operating system:   Windows server 2022
Description:       

We use postgresql 14.18 at a windows server 2022 as a repository db for the
Qlik Sense product. Additionally we connect to the postgresql db from a
remote server running Qlik Data Gateway Data Movement server which
replicates data and changes to various cloud DBs by using replication slots.

Can you replicate this on a supported version of PostgreSQL?

David J. 
=?UTF-8?Q?Erik_Dob=C3=A1k?= <erik.dobak@gmail.com> writes:
> On Tue, 15 Jul 2025, 16:35 David G. Johnston, <david.g.johnston@gmail.com>
> wrote:
>> Can you replicate this on a supported version of PostgreSQL?

> 14.18 is afaik supported. it is runn:ng on 14.8. so yes i could replicate
> this on a supported version.

I think what David is actually wondering about is "Qlik Data Gateway
Data Movement".  That is not our code, and if it's using a custom
logical-replication plugin then the plugin seems like the likely
thing to blame.  We have not heard reports of this happening with
our stock replication code.

            regards, tom lane






On Tue, 15 Jul 2025, 18:08 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
Erik Dobák <erik.dobak@gmail.com> writes:
> On Tue, 15 Jul 2025, 16:35 David G. Johnston, <david.g.johnston@gmail.com>
> wrote:
>> Can you replicate this on a supported version of PostgreSQL?

> 14.18 is afaik supported. it is runn:ng on 14.8. so yes i could replicate
> this on a supported version.

I think what David is actually wondering about is "Qlik Data Gateway
Data Movement".  That is not our code, and if it's using a custom
logical-replication plugin then the plugin seems like the likely
thing to blame.  We have not heard reports of this happening with
our stock replication code.

                        regards, tom lane

hi Tom,

not sure what you mean by "plugin".

dgdm is connecting to an untempered postgres db = standard postres db installation afaik.

did not know about plugins in postgres. how would one check if there is some?

E
On Tuesday, July 15, 2025, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Erik Dobák <erik.dobak@gmail.com> writes:
> On Tue, 15 Jul 2025, 16:35 David G. Johnston, <david.g.johnston@gmail.com>
> wrote:
>> Can you replicate this on a supported version of PostgreSQL?

> 14.18 is afaik supported. it is runn:ng on 14.8. so yes i could replicate
> this on a supported version.

I think what David is actually wondering about is "Qlik Data Gateway
Data Movement".  That is not our code, and if it's using a custom
logical-replication plugin then the plugin seems like the likely
thing to blame.  We have not heard reports of this happening with
our stock replication code.

                        

No, I meant the reported 14.8 PG version.

The observed behavior of the server seems like a bug no matter how well-coded or not the software trying to make the connecting is.  The server should be able to close and disallow clients no matter what they are doing.

David J.

"David G. Johnston" <david.g.johnston@gmail.com> writes:
> The observed behavior of the server seems like a bug no matter how
> well-coded or not the software trying to make the connecting is.  The
> server should be able to close and disallow clients no matter what they are
> doing.

We will SIGKILL recalcitrant children during an "immediate" shutdown.
"Fast" shutdown is not defined to do that, and should not do so
because it'd imply having to do an unclean restart later.

I just tested logical replication both ways between current master and
v14 branch tip, and in either case the publishing server stops pretty
much instantaneously after "pg_ctl stop -m fast".  So I don't see
anything broken about our code, and I remain of the opinion that Qlik
is probably at fault.

            regards, tom lane



=?UTF-8?Q?Erik_Dob=C3=A1k?= <erik.dobak@gmail.com> writes:
> did not know about plugins in postgres. how would one check if there is
> some?

Check the pg_replication_slots view on the publisher for the slot(s)
being used for Qlik replication.  If the "plugin" column says
something other than "pgoutput" then a nonstandard plugin is being
used.

            regards, tom lane





On Tue, Jul 15, 2025, 11:00 Tom Lane <tgl@sss.pgh.pa.us> wrote:

I just tested logical replication both ways between current master and
v14 branch tip, and in either case the publishing server stops pretty
much instantaneously after "pg_ctl stop -m fast".  So I don't see
anything broken about our code, and I remain of the opinion that Qlik
is probably at fault.

Ok, yeah.  Looking more closely the reconnection attempt in the log doesn't look related to this so those hung connections are not playing nicely when told to go away.  Which indeed in fast mode we have to rely upon.

David J.



On Tue, 15 Jul 2025, 20:00 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> The observed behavior of the server seems like a bug no matter how
> well-coded or not the software trying to make the connecting is.  The
> server should be able to close and disallow clients no matter what they are
> doing.

We will SIGKILL recalcitrant children during an "immediate" shutdown.
"Fast" shutdown is not defined to do that, and should not do so
because it'd imply having to do an unclean restart later.

I just tested logical replication both ways between current master and
v14 branch tip, and in either case the publishing server stops pretty
much instantaneously after "pg_ctl stop -m fast".  So I don't see
anything broken about our code, and I remain of the opinion that Qlik
is probably at fault.

                        regards, tom lane
thank you Tom,

based on the documentation mentioned in my initial post i was thinking that the connections are closed during " fast" shutdown immediately.

i will forward this to Qlik then.

is there a better documentation that would explain the process of shutdown than the link i posted above?

regards

E
nice thanks i will study that.
E

On Tue, 15 Jul 2025, 20:27 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
Erik Dobák <erik.dobak@gmail.com> writes:
> did not know about plugins in postgres. how would one check if there is
> some?

Check the pg_replication_slots view on the publisher for the slot(s)
being used for Qlik replication.  If the "plugin" column says
something other than "pgoutput" then a nonstandard plugin is being
used.

                        regards, tom lane
Hi,

This is likely related to the issue I've reported[1]: A logical
walsender may be stuck at 100% CPU during shutdown, trying to read an
incomplete FPI_FOR_HINT record and blocking the shutdown sequence. By
stopping the logical replication's target, the impacted walsender
exited, unblocking the shutdown.

There are similar reports of failover being stuck on projects like patroni[2].

I've provided a way to reproduce the issue in the linked thread, along
with a tentative patch.

Regards,
Anthonin Bonnefoy

[1]: https://www.postgresql.org/message-id/flat/CAO6_Xqo3co3BuUVEVzkaBVw9LidBgeeQ_2hfxeLMQcXwovB3GQ%40mail.gmail.com
[2]: https://github.com/patroni/patroni/issues/3522