Thread: Logical WAL receiver (background worker) not detecting when publisher node has died
Logical WAL receiver (background worker) not detecting when publisher node has died
From
Achilleas Mantzios
Date:
Dear List,

Coming back from: https://www.postgresql.org/message-id/ae8812c3-d138-73b7-537a-a273e15ef6e1%40matrix.gatewaynet.com

and having got absolutely no helpful answer from our infrastructure people, I would like to ask:

Is it at all possible that the primary (publisher node) has crashed while, on the subscriber node, the logical WAL receiver carries on happily as if there were no problem at all: no messages in the log, no timeouts, acting as if nothing had happened?

(In the meantime, the second standby instantly detected the crash of the primary and immediately started reconnection attempts.)
> On Nov 23, 2018, at 12:20 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>
> Dear List,
> Coming back from : https://www.postgresql.org/message-id/ae8812c3-d138-73b7-537a-a273e15ef6e1%40matrix.gatewaynet.com
>
> and having got absolutely no helpful answer from our infrastructure people, I would like to ask :
>
> Is it on earth possible that the primary (publisher node) has crashed while on the subscriber node the logical wal receiver goes on happily like there is no problem at all, no messages in log, no timeouts, acts as if nothing happened ?
>
> (in the meantime the second standby instantly detected the crash of the primary and immediately restarted re-connection attempts)

Not that I can think of, unless it is a bug.

If it happens again, you can try killing the WAL receiver session via Postgres and, if that fails, use tcpkill to terminate the session.

It would be good to know the actual cause, though, so collect as much information as you can before terminating the session.

Interesting, but not sure if it’s related: https://www.evanjones.ca/tcp-stuck-connection-mystery.html
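To make the "kill the WAL receiver session via Postgres" suggestion concrete, here is a minimal Python sketch. It assumes PostgreSQL 10 or later on the subscriber, where the `pg_stat_subscription` view lists each logical replication worker with its `pid` and `last_msg_receipt_time`. The helper name `stalled_workers`, the five-minute silence threshold, and the sample rows are hypothetical and not taken from the thread. The database calls are left out, so only the decision logic runs here.

```python
from datetime import datetime, timedelta, timezone

# Run on the subscriber: one row per running logical replication worker.
STATUS_SQL = """
SELECT subname, pid, last_msg_receipt_time
FROM pg_stat_subscription
WHERE pid IS NOT NULL
"""

# pg_terminate_backend() asks the process with the given pid to exit.
TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"


def stalled_workers(rows, now, max_silence=timedelta(minutes=5)):
    """Return (subname, pid) for workers that have received nothing recently.

    rows: iterable of (subname, pid, last_msg_receipt_time) tuples,
          shaped like the result of STATUS_SQL.
    max_silence: assumed threshold; tune it to your own traffic pattern.
    """
    stalled = []
    for subname, pid, last_receipt in rows:
        if last_receipt is None or now - last_receipt > max_silence:
            stalled.append((subname, pid))
    return stalled


if __name__ == "__main__":
    now = datetime(2018, 11, 23, 12, 0, tzinfo=timezone.utc)
    sample = [
        ("sub_a", 101, now - timedelta(seconds=30)),   # recently active
        ("sub_b", 102, now - timedelta(minutes=20)),   # silent for too long
        ("sub_c", 103, None),                          # never received anything
    ]
    for subname, pid in stalled_workers(sample, now):
        # In a real session you would first record the diagnostics,
        # then execute TERMINATE_SQL with this pid.
        print(f"{subname}: candidate for pg_terminate_backend({pid})")
```

As advised above, capture what you can about the stuck session (for example the full `pg_stat_subscription` row and the socket state) before executing the terminate statement.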
> On Nov 23, 2018, at 8:00 PM, Rui DeSousa <rui@crazybean.net> wrote:
>
>> On Nov 23, 2018, at 12:20 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>>
>> Dear List,
>> Coming back from : https://www.postgresql.org/message-id/ae8812c3-d138-73b7-537a-a273e15ef6e1%40matrix.gatewaynet.com
>>
>> and having got absolutely no helpful answer from our infrastructure people, I would like to ask :
>>
>> Is it on earth possible that the primary (publisher node) has crushed while on the subscriber node the logical wal receiver goes on happily like there is no problem at all, no messages in log, no timeouts, acts if nothing happen ?
>>
>> (in the meantime the second standby instantly detected the crush of the primary and immediately restarted re-connection attempts)
>>
>
> Not that I can think of without it being a bug.
>
> If it happens again; you can try killing the WAL receiver session via Postgres and if that fails then using tcpkill to terminate the session.
>
> It would be good to know the actual cause though and collect as much information before terminating the session.
>
> Interested; but not sure if it’s related: https://www.evanjones.ca/tcp-stuck-connection-mystery.html
>

Same problem, but no solution; keepalive is not working: https://superuser.com/questions/1021988/connection-remains-flagged-as-established-even-if-host-is-unconnected
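Since TCP keepalive comes up here, the following Python sketch shows one way to request keepalive probes explicitly on the subscriber's connection to the publisher. It relies on the libpq connection parameters `keepalives`, `keepalives_idle`, `keepalives_interval` and `keepalives_count`, and on `ALTER SUBSCRIPTION ... CONNECTION`. The helper names, the subscription name `my_sub`, the host name, and the timing values are hypothetical. Nothing in this thread establishes that these settings resolve the reported hang, so treat it as something to try, not a confirmed fix.

```python
def with_keepalives(conninfo, idle=60, interval=10, count=3):
    """Append libpq TCP keepalive parameters to a key=value conninfo string.

    idle/interval are seconds, count is the number of unanswered probes
    tolerated before the connection is considered dead. The defaults here
    are illustrative assumptions, not recommended values.
    """
    extra = {
        "keepalives": 1,
        "keepalives_idle": idle,
        "keepalives_interval": interval,
        "keepalives_count": count,
    }
    parts = [conninfo.strip()] + [f"{key}={value}" for key, value in extra.items()]
    return " ".join(parts)


def alter_subscription_sql(subname, conninfo):
    """Build an ALTER SUBSCRIPTION statement carrying the new conninfo.

    Assumes subname is a plain, trusted identifier; single quotes inside
    the conninfo are doubled so the SQL string literal stays valid.
    """
    quoted = conninfo.replace("'", "''")
    return f"ALTER SUBSCRIPTION {subname} CONNECTION '{quoted}';"


if __name__ == "__main__":
    conninfo = with_keepalives("host=publisher.example dbname=app user=repl")
    print(alter_subscription_sql("my_sub", conninfo))
```

With these example values the operating system would start probing after 60 seconds of silence and give up after three unanswered probes sent 10 seconds apart, provided the platform honors the settings.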