Обсуждение: BUG #18985: fast shutdown does not close connections from qlik data gateway data movement aka. replicate
BUG #18985: fast shutdown does not close connections from qlik data gateway data movement aka. replicate
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 18985 Logged by: Erik Dobak Email address: erik.dobak@gmail.com PostgreSQL version: 14.8 Operating system: Windows server 2022 Description: We use postgresql 14.18 at a windows server 2022 as a repository db for the Qlik Sense product. Additionally we connect to the postgresql db from a remote server running Qlik Data Gateway Data Movement server which replicates data and changes to various cloud DBs by using replication slots. When we try to shutdown the postgresql db service (per windows services or pg_ctl -m fast) almost all connections to the posgresql server are closed but the Qlik Data Gateway Data Movement connections can be seen as ESTABLISHED when using the netstat -ano command. From the posgresql logs -> this happens when i try to shutdown postgresql: 2025-07-15 08:20:36.676 UTC [14740] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host. 2025-07-15 08:20:43.980 UTC [14516] LOG: received fast shutdown request 2025-07-15 08:20:43.986 UTC [14516] LOG: aborting any active transactions 2025-07-15 08:20:43.986 UTC [14604] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.987 UTC [8020] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.987 UTC [13800] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.987 UTC [15232] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.987 UTC [13600] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.987 UTC [10176] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.987 UTC [7648] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.988 UTC [2272] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.988 UTC [11072] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.988 UTC [13304] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.987 UTC [12656] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.988 UTC [15496] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.988 UTC [2276] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.988 UTC [13256] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.988 UTC [10828] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.988 UTC [6016] FATAL: terminating connection due to administrator command 2025-07-15 08:20:43.988 UTC [13232] FATAL: terminating connection due to administrator command 2025-07-15 08:20:44.047 UTC [14516] LOG: background worker "logical replication launcher" (PID 6400) exited with exit code 1 2025-07-15 08:20:44.053 UTC [5448] LOG: shutting down 2025-07-15 08:20:56.512 UTC [4364] LOG: connection received: host=::1 port=52514 2025-07-15 08:20:56.512 UTC [4364] FATAL: the database system is shutting down 2025-07-15 08:20:56.547 UTC [13968] LOG: connection received: host=127.0.0.1 port=52515 2025-07-15 08:20:56.547 UTC [13968] FATAL: the database system is shutting down 2025-07-15 08:21:56.486 UTC [10456] LOG: connection received: host=::1 port=52543 2025-07-15 08:21:56.487 UTC [10456] FATAL: the database system is shutting down 2025-07-15 08:21:56.522 UTC [4812] LOG: connection received: host=127.0.0.1 port=52544 2025-07-15 08:21:56.522 UTC [4812] FATAL: the database system is shutting down 2025-07-15 08:21:56.561 UTC [15444] LOG: connection received: host=::1 port=52545 2025-07-15 08:21:56.562 UTC [15444] FATAL: the database system is shutting down 2025-07-15 08:21:56.595 UTC [12844] LOG: connection received: host=127.0.0.1 port=52546 2025-07-15 08:21:56.595 UTC [12844] FATAL: the database system is shutting down 2025-07-15 08:22:56.505 UTC [15004] LOG: connection received: host=::1 port=52574 ... this did go on for 30minutes or more. then i checked with netstat (we are using port 4432 instead of 5432 for postgresdb): PS C:\Users\redacted> netstat -ano|findstr 4432 TCP 0.0.0.0:4432 0.0.0.0:0 LISTENING 14516 TCP redacted_postgres_ip:4432 readacted_dgdm_ip:35818 ESTABLISHED 14516 TCP redacted_postgres_ip:4432 redacted_dgdm_ip:35886 ESTABLISHED 14516 TCP [::]:4432 [::]:0 LISTENING 14516 here you can see 2 active network connection even though we have initiated a fast shutdown 30mins or more before. i decided to shutdown our Qlik Data Gateway Data Movement service on the other server and immediately the postgresDB did stop correctly: ... 2025-07-15 08:50:51.799 UTC [5448] LOG: checkpoint starting: shutdown immediate 2025-07-15 08:50:51.859 UTC [5448] LOG: checkpoint complete: wrote 9 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.004 s, sync=0.019 s, total=0.067 s; sync files=7, longest=0.004 s, average=0.003 s; distance=38 kB, estimate=22505 kB 2025-07-15 08:50:51.910 UTC [14516] LOG: database system is shut down. ------- According to https://www.postgresql.org/docs/14/app-pg-ctl.html : ... “Fast” mode (the default) does not wait for clients to disconnect and will terminate an online backup in progress. All active transactions are rolled back and clients are forcibly disconnected, then the server is shut down. ... In our case there are at least 2 client connection not forcibly disconnected and this is why i open this bug.
Re: BUG #18985: fast shutdown does not close connections from qlik data gateway data movement aka. replicate
От
"David G. Johnston"
Дата:
On Tuesday, July 15, 2025, PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:
Bug reference: 18985
Logged by: Erik Dobak
Email address: erik.dobak@gmail.com
PostgreSQL version: 14.8
Operating system: Windows server 2022
Description:
We use postgresql 14.18 at a windows server 2022 as a repository db for the
Qlik Sense product. Additionally we connect to the postgresql db from a
remote server running Qlik Data Gateway Data Movement server which
replicates data and changes to various cloud DBs by using replication slots.
Can you replicate this on a supported version of PostgreSQL?
David J.
On Tue, 15 Jul 2025, 16:35 David G. Johnston, <david.g.johnston@gmail.com> wrote:
On Tuesday, July 15, 2025, PG Bug reporting form <noreply@postgresql.org> wrote:The following bug has been logged on the website:
Bug reference: 18985
Logged by: Erik Dobak
Email address: erik.dobak@gmail.com
PostgreSQL version: 14.8
Operating system: Windows server 2022
Description:
We use postgresql 14.18 at a windows server 2022 as a repository db for the
Qlik Sense product. Additionally we connect to the postgresql db from a
remote server running Qlik Data Gateway Data Movement server which
replicates data and changes to various cloud DBs by using replication slots.Can you replicate this on a supported version of PostgreSQL?David J.
14.18 is afaik supported. it is runn:ng on 14.8. so yes i could replicate this on a supported version.
E
ah yes now i see. we could reproduce it on 14.8 and 14.18. i
E
On Tue, 15 Jul 2025, 16:35 David G. Johnston, <david.g.johnston@gmail.com> wrote:
On Tuesday, July 15, 2025, PG Bug reporting form <noreply@postgresql.org> wrote:The following bug has been logged on the website:
Bug reference: 18985
Logged by: Erik Dobak
Email address: erik.dobak@gmail.com
PostgreSQL version: 14.8
Operating system: Windows server 2022
Description:
We use postgresql 14.18 at a windows server 2022 as a repository db for the
Qlik Sense product. Additionally we connect to the postgresql db from a
remote server running Qlik Data Gateway Data Movement server which
replicates data and changes to various cloud DBs by using replication slots.Can you replicate this on a supported version of PostgreSQL?David J.
=?UTF-8?Q?Erik_Dob=C3=A1k?= <erik.dobak@gmail.com> writes:
> On Tue, 15 Jul 2025, 16:35 David G. Johnston, <david.g.johnston@gmail.com>
> wrote:
>> Can you replicate this on a supported version of PostgreSQL?
> 14.18 is afaik supported. it is runn:ng on 14.8. so yes i could replicate
> this on a supported version.
I think what David is actually wondering about is "Qlik Data Gateway
Data Movement". That is not our code, and if it's using a custom
logical-replication plugin then the plugin seems like the likely
thing to blame. We have not heard reports of this happening with
our stock replication code.
regards, tom lane
On Tue, 15 Jul 2025, 18:08 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
Erik Dobák <erik.dobak@gmail.com> writes:
> On Tue, 15 Jul 2025, 16:35 David G. Johnston, <david.g.johnston@gmail.com>
> wrote:
>> Can you replicate this on a supported version of PostgreSQL?
> 14.18 is afaik supported. it is runn:ng on 14.8. so yes i could replicate
> this on a supported version.
I think what David is actually wondering about is "Qlik Data Gateway
Data Movement". That is not our code, and if it's using a custom
logical-replication plugin then the plugin seems like the likely
thing to blame. We have not heard reports of this happening with
our stock replication code.
regards, tom lane
hi Tom,
not sure what you mean by "plugin".
dgdm is connecting to an untempered postgres db = standard postres db installation afaik.
did not know about plugins in postgres. how would one check if there is some?
E
Re: BUG #18985: fast shutdown does not close connections from qlik data gateway data movement aka. replicate
От
"David G. Johnston"
Дата:
On Tuesday, July 15, 2025, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Erik Dobák <erik.dobak@gmail.com> writes:
> On Tue, 15 Jul 2025, 16:35 David G. Johnston, <david.g.johnston@gmail.com>
> wrote:
>> Can you replicate this on a supported version of PostgreSQL?
> 14.18 is afaik supported. it is runn:ng on 14.8. so yes i could replicate
> this on a supported version.
I think what David is actually wondering about is "Qlik Data Gateway
Data Movement". That is not our code, and if it's using a custom
logical-replication plugin then the plugin seems like the likely
thing to blame. We have not heard reports of this happening with
our stock replication code.
No, I meant the reported 14.8 PG version.
The observed behavior of the server seems like a bug no matter how well-coded or not the software trying to make the connecting is. The server should be able to close and disallow clients no matter what they are doing.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> The observed behavior of the server seems like a bug no matter how
> well-coded or not the software trying to make the connecting is. The
> server should be able to close and disallow clients no matter what they are
> doing.
We will SIGKILL recalcitrant children during an "immediate" shutdown.
"Fast" shutdown is not defined to do that, and should not do so
because it'd imply having to do an unclean restart later.
I just tested logical replication both ways between current master and
v14 branch tip, and in either case the publishing server stops pretty
much instantaneously after "pg_ctl stop -m fast". So I don't see
anything broken about our code, and I remain of the opinion that Qlik
is probably at fault.
regards, tom lane
=?UTF-8?Q?Erik_Dob=C3=A1k?= <erik.dobak@gmail.com> writes:
> did not know about plugins in postgres. how would one check if there is
> some?
Check the pg_replication_slots view on the publisher for the slot(s)
being used for Qlik replication. If the "plugin" column says
something other than "pgoutput" then a nonstandard plugin is being
used.
regards, tom lane
Re: BUG #18985: fast shutdown does not close connections from qlik data gateway data movement aka. replicate
От
"David G. Johnston"
Дата:
On Tue, Jul 15, 2025, 11:00 Tom Lane <tgl@sss.pgh.pa.us> wrote:
I just tested logical replication both ways between current master and
v14 branch tip, and in either case the publishing server stops pretty
much instantaneously after "pg_ctl stop -m fast". So I don't see
anything broken about our code, and I remain of the opinion that Qlik
is probably at fault.
Ok, yeah. Looking more closely the reconnection attempt in the log doesn't look related to this so those hung connections are not playing nicely when told to go away. Which indeed in fast mode we have to rely upon.
David J.
On Tue, 15 Jul 2025, 20:00 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> The observed behavior of the server seems like a bug no matter how
> well-coded or not the software trying to make the connecting is. The
> server should be able to close and disallow clients no matter what they are
> doing.
We will SIGKILL recalcitrant children during an "immediate" shutdown.
"Fast" shutdown is not defined to do that, and should not do so
because it'd imply having to do an unclean restart later.
I just tested logical replication both ways between current master and
v14 branch tip, and in either case the publishing server stops pretty
much instantaneously after "pg_ctl stop -m fast". So I don't see
anything broken about our code, and I remain of the opinion that Qlik
is probably at fault.
regards, tom lane
thank you Tom,
based on the documentation mentioned in my initial post i was thinking that the connections are closed during " fast" shutdown immediately.
i will forward this to Qlik then.
is there a better documentation that would explain the process of shutdown than the link i posted above?
regards
E
nice thanks i will study that.
E
On Tue, 15 Jul 2025, 20:27 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
Erik Dobák <erik.dobak@gmail.com> writes:
> did not know about plugins in postgres. how would one check if there is
> some?
Check the pg_replication_slots view on the publisher for the slot(s)
being used for Qlik replication. If the "plugin" column says
something other than "pgoutput" then a nonstandard plugin is being
used.
regards, tom lane
Re: BUG #18985: fast shutdown does not close connections from qlik data gateway data movement aka. replicate
От
Anthonin Bonnefoy
Дата:
Hi, This is likely related to the issue I've reported[1]: A logical walsender may be stuck at 100% CPU during shutdown, trying to read an incomplete FPI_FOR_HINT record and blocking the shutdown sequence. By stopping the logical replication's target, the impacted walsender exited, unblocking the shutdown. There are similar reports of failover being stuck on projects like patroni[2]. I've provided a way to reproduce the issue in the linked thread, along with a tentative patch. Regards, Anthonin Bonnefoy [1]: https://www.postgresql.org/message-id/flat/CAO6_Xqo3co3BuUVEVzkaBVw9LidBgeeQ_2hfxeLMQcXwovB3GQ%40mail.gmail.com [2]: https://github.com/patroni/patroni/issues/3522