Discussion: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"


PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:
Hello,
while setting up logical replication since August we had seen early on the need to increase wal_receiver_timeout and wal_sender_timeout from 60 sec to 5 mins, otherwise the synchronization would never take place.
This Sunday (yesterday) we had an incident caused by the wal sender terminating (on Friday) after reaching its timeout (5 mins). This left the replication slot retaining WALs till our production primary server ran out of space. (This is not connected with the wal fill-up of the previous Sunday, nor does it explain why it happened; still in the dark about this one.)

I got the following messages on the publisher (primary) host :

10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG:  terminating walsender process due to replication timeout
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT:  slot "data_for_testsmadb_pub", output plugin "pgoutput", in the change callback, associated LSN 13DF/393BF7F0
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.066 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG:  disconnection: session time: 49:47:17.937 user=repmgr database=dynacom host=10.9.0.77 port=48650

By querying pg_stat_subscription all 3 timestamps were about 5 mins prior to the above walsender termination.
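
For reference, those timestamps can be pulled with something along these lines (column names as in the PostgreSQL 10 pg_stat_subscription view):

SELECT subname, received_lsn, last_msg_send_time,
       last_msg_receipt_time, latest_end_time
FROM pg_stat_subscription;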

I didn't get *any* ERROR/FATAL message on the subscriber side. We have built a monitoring system that notifies us about problems, conflicts such as unique key violations, etc., but we had no alerts coming from the subscriber logs. So this went unnoticed. Also the scripts about disk size growth fired, but this was already too late.

So after I got the alerts about our primary being dead, I had to free the space in pg_wal (keeping all the WALs elsewhere), consult pg_controldata about the last checkpoint, bring the primary up, and then make sure that all standbys were up and running. Our physical standby easily caught up, but the logical one (which was part of the cause) of course didn't. So I used the moved WALs, moving them back to pg_wal (of the primary) in bulks (as the subscriber asked) till I managed to bring the subscriber node in sync with the publisher, sans the last WAL (the one created at the time of the PANIC), which I had to bring from our WAL archiving server. This pretty much brought everything up and running again.

Lessons learned:
* monitoring the subscriber is not enough; the publisher's replication slots need watching too (a sketch of such a check follows below)
* a walsender (sending WAL across for logical decoding) having a timeout may be fatal
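
A minimal sketch of a publisher-side check (pg_replication_slots, pg_current_wal_lsn() and pg_wal_lsn_diff() exist as of PostgreSQL 10; the 2 GB threshold is only an example value to alert on):

SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active
   OR pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 2 * 1024^3;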

I have seen howtos recommending disabling those timeouts :
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html

The postgresql docs are quite "laconic" about the use of the two timeout parameters, both of which default to 1 min:

wal_sender_timeout (integer)
     Terminate replication connections that are inactive longer than the specified number of milliseconds. This is useful for the sending server to detect a standby crash or network outage. A value of zero disables the timeout mechanism. This parameter can only be set in the postgresql.conf file or on the server command line. The default value is 60 seconds.

wal_receiver_timeout (integer)
     Terminate replication connections that are inactive longer than the specified number of milliseconds. This is useful for the receiving standby server to detect a primary node crash or network outage. A value of zero disables the timeout mechanism. This parameter can only be set in the postgresql.conf file or on the server command line. The default value is 60 seconds.

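As an example, the 5-minute values used in this thread would look roughly like this in postgresql.conf (a sketch; wal_sender_timeout matters on the publisher, wal_receiver_timeout on the subscriber):

wal_sender_timeout = '5min'      # publisher side, default 60s, 0 disables
wal_receiver_timeout = '5min'    # subscriber side, default 60s, 0 disables
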
So what the postgresql manual basically says to the user is:

- We give you a mechanism to detect failures, we set the default timeout at 60 seconds, and you are responsible for monitoring this and acting accordingly, or for writing an automated tool to handle such events (to do what???); otherwise set it to 0 but be prepared, in case of permanent problems, to lose availability when you run out of disk space.

So is there any way to restart the WAL sender? Is there any way to tell postgresql to retry after this specified amount of time? Otherwise what is the purpose of the LOG message (which is not even an ERROR)? Should a restart of the subscriber or the publisher node remedy this?

-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt




> On Nov 12, 2018, at 5:41 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>
> This Sunday (yesterday) we had an incident caused by the wal sender terminating (on Friday) after reaching its timeout (5 mins). This left the replication slot retaining WALs till our production primary server ran out of space. (This is not connected with the wal fill-up of the previous Sunday, nor does it explain why it happened; still in the dark about this one.)

This sounds like there was a network related issue.  Did the WAL receiver time out too or did it remain “connected”?  If the downstream server did not detect the network issue and thus failed to drop the abandoned connection and reconnect on its own, then this is normal behavior as the replication slot would not have been active.

I’m a bit confused, as I thought you stated before that you checked the replication slots and they were active and moving forward; right?

> - We give you a mechanism to detect failures, we set the default timeout at 60 seconds, and you are responsible for monitoring this and acting accordingly, or for writing an automated tool to handle such events (to do what???); otherwise set it to 0 but be prepared, in case of permanent problems, to lose availability when you run out of disk space.
>
> So is there any way to restart the WAL sender? Is there any way to tell postgresql to retry after this specified amount of time? Otherwise what is the purpose of the LOG message (which is not even an ERROR)? Should a restart of the subscriber or the publisher node remedy this?

wal_sender_timeout and wal_receiver_timeout are timeouts: Postgres will terminate the connection and the downstream server will reconnect on its own (as long as it terminates its own connection — wal_receiver_timeout).

This is very useful when you have an overzealous firewall that drops connections due to idle sessions without sending resets, or any other situation that causes a network connection issue.

Disabling the timeout seems like a really bad idea; the end result then would depend on your TCP/IP stack, which can take a day or so to detect an abandoned connection unless TCP keep alive is enabled.  And I would recommend setting up TCP keep alive to detect abandoned sessions, i.e. again a firewall dropping a user session without a reset (can happen on a long running query, as the connection would look idle to the firewall), or a user closing the lid on their laptop and heading home for the day while being logged in — without the timeouts their session will continue to run active queries and/or hold on to open transactions until the OS terminates the session or some other timeout is reached.

Did you check to see if you have any long running queries or open transactions that are holding on to an xmin?






Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:
On 12/11/18 6:30 p.m., Rui DeSousa wrote:
>
>> On Nov 12, 2018, at 5:41 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>>
>> This Sunday (yesterday) we had an incident caused by the wal sender terminating (on Friday) after reaching its timeout (5 mins). This left the replication slot retaining WALs till our production primary server ran out of space. (This is not connected with the wal fill-up of the previous Sunday, nor does it explain why it happened; still in the dark about this one.)
> This sounds like there was a network related issue.

yes, apparently, or a transaction was large enough so that it didn't get through within the timeout limit. However I counted about 24 WAL files in this period, so it would be at most about 400MB (24 × 16MB ≈ 384MB). Which means the speed fell down to ~1.3MB/sec over the 5-minute timeout, which is pretty slow for connections within the same cloud provider.

> Did the WAL receiver time out too or did it remain “connected”?

I mention it in the original message:
"By querying pg_stat_subscription all 3 timestamps were about 5 mins prior to the above walsender termination.
I didn't get *any* ERROR/FATAL message on the subscriber side".

Or to be more precise, I didn't get any related message on the subscriber side. Here is what the subscriber logs contain for the said time window:

  [106486] 2018-11-09 14:55:16.996 EET  @ line:468 LOG:  checkpoint complete: wrote 17564 buffers (1.7%); 0 WAL file(s) added, 0 removed, 43 recycled; write=610.394 s, sync=0.044 s, total=610.521 s; sync files=507, longest=0.011 s, average=0.000 s; distance=564195 kB, estimate=712814 kB
  [106486] 2018-11-09 14:55:19.093 EET  @ line:469 LOG:  checkpoint starting: xlog
10.9.202.211(49220) [18191] 2018-11-09 14:59:35.772 EET [unknown] [unknown]@[unknown] line:1 LOG:  connection received: host=10.9.202.211 port=49220
.......
  [106486] 2018-11-09 15:06:56.846 EET  @ line:470 LOG:  checkpoint complete: wrote 9034 buffers (0.9%); 0 WAL file(s) added, 0 removed, 24 recycled; write=697.701 s, sync=0.003 s, total=697.752 s; sync files=414, longest=0.002 s, average=0.000 s; distance=722719 kB, estimate=722719 kB
.......
  [106486] 2018-11-09 15:10:19.880 EET  @ line:471 LOG:  checkpoint starting: time
  [106486] 2018-11-09 15:23:49.049 EET  @ line:472 LOG:  checkpoint complete: wrote 9554 buffers (0.9%); 0 WAL file(s) added, 0 removed, 0 recycled; write=809.158 s, sync=0.003 s, total=809.169 s; sync files=453, longest=0.001 s, average=0.000 s; distance=327761 kB, estimate=683223 kB


> If the downstream server did not detect the network issue and thus failed to drop the abandoned connection and reconnect on its own, then this is normal behavior as the replication slot would not have been active.

The subscriber did not detect the walsender termination. It was only after the subscriber node restart (this Sunday) that it started again asking for WAL segments.
The replication slot lives on the publisher node; it was apparently inactive since the walsender terminated on Friday (till the PANIC).

>
> I’m a bit confused, as I thought you stated before that you checked the replication slots and they were active and moving forward; right?

I have sent two separate emails with two separate subjects, reporting two distinct incidents. In this incident, I could not check any replication slot because the primary had PANIC'ed.

>
>> - We give you a mechanism to detect failures, we set the default timeout at 60 seconds, and you are responsible for monitoring this and acting accordingly, or for writing an automated tool to handle such events (to do what???); otherwise set it to 0 but be prepared, in case of permanent problems, to lose availability when you run out of disk space.
>>
>> So is there any way to restart the WAL sender? Is there any way to tell postgresql to retry after this specified amount of time? Otherwise what is the purpose of the LOG message (which is not even an ERROR)? Should a restart of the subscriber or the publisher node remedy this?
> wal_sender_timeout and wal_receiver_timeout are timeouts: Postgres will terminate the connection and the downstream server will reconnect on its own (as long as it terminates its own connection — wal_receiver_timeout).

This didn't happen. Nothing replication-related coming from the subscriber logs since the walsender termination.

> Did you check to see if you have any long running queries or open transactions that are holding on to an xmin?

https://www.postgresql.org/docs/10/protocol-logical-replication.html
An uncommitted transaction won't make it to the subscriber.

>
>
>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt



Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:
On 12/11/18 12:41 p.m., Achilleas Mantzios wrote:
> Hello,
> while setting up logical replication since August we had seen early on the need to increase wal_receiver_timeout and wal_sender_timeout from 60 sec to 5 mins, otherwise the synchronization would never take place.
> This Sunday (yesterday) we had an incident caused by the wal sender terminating (on Friday) after reaching its timeout (5 mins). This left the replication slot retaining WALs till our production primary server ran out of space. (This is not connected with the wal fill-up of the previous Sunday, nor does it explain why it happened; still in the dark about this one.)
>
> I got the following messages on the publisher (primary) host :
>
> 10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG:  terminating walsender process due to replication timeout
> 10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT:  slot "data_for_testsmadb_pub", output plugin "pgoutput", in the change callback, associated LSN 13DF/393BF7F0
> 10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.066 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG:  disconnection: session time: 49:47:17.937 user=repmgr database=dynacom host=10.9.0.77 port=48650
>
> By querying pg_stat_subscription all 3 timestamps were about 5 mins prior to the above walsender termination.
>
> I didn't get *any* ERROR/FATAL message on the subscriber side. We have built a monitoring system that notifies us about problems, conflicts such as unique key violations, etc., but we had no alerts coming from the subscriber logs. So this went unnoticed. Also the scripts about disk size growth fired, but this was already too late.


Just a thought :
could it be related to https://groups.google.com/a/2ndquadrant.com/forum/#!topic/bdr-list/YDG1_MuvfVM , i.e. :
https://commitfest.postgresql.org/14/1151/
?
Is there a way for the WAL receiver to not have detected the termination of the replication stream?
Shouldn't normally the WAL receiver detect this and try again in wal_retrieve_retry_interval ?
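
(For reference, the setting in question and its default — just a sketch of the postgresql.conf line:)

wal_retrieve_retry_interval = 5s    # default; how long to wait before retrying to retrieve WAL after a failed attempt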

We checked with our cloud provider, they replied that there were no outages at all.


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt



Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:
On 13/11/18 2:00 p.m., Achilleas Mantzios wrote:
> On 12/11/18 12:41 p.m., Achilleas Mantzios wrote:
>> Hello,
>> while setting up logical replication since August we had seen early on the need to increase wal_receiver_timeout and wal_sender_timeout from 60 sec to 5 mins, otherwise the synchronization would never take place.
>> This Sunday (yesterday) we had an incident caused by the wal sender terminating (on Friday) after reaching its timeout (5 mins). This left the replication slot retaining WALs till our production primary server ran out of space. (This is not connected with the wal fill-up of the previous Sunday, nor does it explain why it happened; still in the dark about this one.)
>>
>> I got the following messages on the publisher (primary) host :
>>
>> 10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG:  terminating walsender process due to replication timeout
>> 10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT:  slot "data_for_testsmadb_pub", output plugin "pgoutput", in the change callback, associated LSN 13DF/393BF7F0
>> 10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.066 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG:  disconnection: session time: 49:47:17.937 user=repmgr database=dynacom host=10.9.0.77 port=48650
>>
>> By querying pg_stat_subscription all 3 timestamps were about 5 mins prior to the above walsender termination.
>>
>> I didn't get *any* ERROR/FATAL message on the subscriber side. We have built a monitoring system that notifies us about problems, conflicts such as unique key violations, etc., but we had no alerts coming from the subscriber logs. So this went unnoticed. Also the scripts about disk size growth fired, but this was already too late.
>
>
> Just a thought :
> could it be related to https://groups.google.com/a/2ndquadrant.com/forum/#!topic/bdr-list/YDG1_MuvfVM , i.e. : https://commitfest.postgresql.org/14/1151/

Just saw the above patch was already committed to 10.2, so it must be something else.

> ?
> Is there a way for the WAL receiver to not have detected the termination of the replication stream?
> Shouldn't normally the WAL receiver detect this and try again in wal_retrieve_retry_interval ?
>
> We checked with our cloud provider, they replied that there were no outages at all.
>
>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt





On Nov 13, 2018, at 7:00 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

Is there a way for the WAL receiver to not have detected the termination of the replication stream?

The teardown of the network socket on the upstream server should send a reset packet to the downstream server and at that point the WAL receiver would close its connection.  Are there any firewalls, routers, rules, etc. between the nodes that could have dropped the packet?


Shouldn't normally the WAL receiver detect this and try again in wal_retrieve_retry_interval ?

Not really… if the connection has already been torn down, the upstream server would send another reset packet on the next request and in this case it would.  However, if request packets are not reaching the upstream server, i.e. due to a firewall silently dropping the packets (personally I believe a firewall should always send reset packets to friendly hosts), then what happens is the TCP/IP send queue builds up with the request packets instead — at this point waiting on the OS to terminate the connection, which can take a day or two depending on your TCP/IP settings.

What you want to use instead is wal_receiver_timeout to detect the given case where upstream server either no longer exists or the firewall, etc is silently dropping packets.


Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:
On 13/11/18 5:35 p.m., Rui DeSousa wrote:


On Nov 13, 2018, at 7:00 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

Is there a way for the WAL receiver to not have detected the termination of the replication stream?

The teardown of the network socket on the upstream server should send a reset packet to the downstream server and at that point the WAL receiver would close its connection.  Is there any firewalls, router, rules, etc between the nodes that could have dropped the packet?

No



Shouldn't normally the WAL receiver detect this and try again in wal_retrieve_retry_interval ?

Not really… if the connection has already been torn down, the upstream server would send another reset packet on the next request and in this case it would.  However, if request packets are not reaching the upstream server, i.e. due to a firewall silently dropping the packets (personally I believe a firewall should always send reset packets to friendly hosts), then what happens is the TCP/IP send queue builds up with the request packets instead — at this point waiting on the OS to terminate the connection, which can take a day or two depending on your TCP/IP settings.


Again no dropping, no firewall.

What you want to use instead is wal_receiver_timeout to detect the given case where upstream server either no longer exists or the firewall, etc is silently dropping packets.

Once again from my original message :
"while setting up logical replication since August we had seen early on the need to increase max_receiver_timeout and max_sender_timeout from 60sec to 5mins"

So with wal_receiver_timeout='5 min', the receiver never detected any timeout.





-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt


On Nov 13, 2018, at 11:06 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

On 13/11/18 5:35 p.m., Rui DeSousa wrote:


On Nov 13, 2018, at 7:00 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

Is there a way for the WAL receiver to not have detected the termination of the replication stream?

The teardown of the network socket on the upstream server should send a reset packet to the downstream server and at that point the WAL receiver would close its connection.  Is there any firewalls, router, rules, etc between the nodes that could have dropped the packet?

No


If both nodes are up and able to communicate with each other then the OS should tear down the connection; why is that not happening?  You stated that you’re in the cloud; cloud providers use a software network and all endpoints have ACLs — a.k.a. a firewall. I would also check to see if iptables is running… it’s on by default.

Have there been changes to how you deploy new ACLs, etc? 

The fact that the WAL sender terminated and the WAL receiver did not points to the WAL receiver not being able to communicate with the upstream server.



Shouldn't normally the WAL receiver detect this and try again in wal_retrieve_retry_interval ?

Not really… if the connection has already been torn down, the upstream server would send another reset packet on the next request and in this case it would.  However, if request packets are not reaching the upstream server, i.e. due to a firewall silently dropping the packets (personally I believe a firewall should always send reset packets to friendly hosts), then what happens is the TCP/IP send queue builds up with the request packets instead — at this point waiting on the OS to terminate the connection, which can take a day or two depending on your TCP/IP settings.


Again no dropping, no firewall.


Again, if both nodes are up and are able to communicate then this should get resolved on its own.

What you want to use instead is wal_receiver_timeout to detect the given case where upstream server either no longer exists or the firewall, etc is silently dropping packets.

Once again from my original message :
"while setting up logical replication since August we had seen early on the need to increase max_receiver_timeout and max_sender_timeout from 60sec to 5mins"

So with wal_receiver_timeout='5 min', the receiver never detected any timeout.


It should have reached the timeout and the connection should be torn down.  It could be that the send queue is backed up and Postgres is hung on trying to tear down the connection.  Again, a network related issue.

I’ve actually run into this issue before, where I was unable to terminate an idle session from an application node that had an open transaction.  The TCP/IP send queue was backlogged, thus it would not terminate; one might consider this a bug or not.  The application server was not reachable due to the firewall dropping the connection and not sending a reset packet on further attempts to communicate with the application server; I had to get a network admin to drop the connection, as bouncing Postgres was not an option on a production system and kill -9 is a bad idea too.

Assuming this is going to happen again — I would advise you to get from both nodes the state of the connection and the tcp/ip queue information from netstat.   If you can, also do a tcpdump on the connection to see what each of the nodes is doing; that would give you more insight.

I would also advise looking into TCP/IP keep alive.

Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:
On 14/11/18 3:52 a.m., Rui DeSousa wrote:


On Nov 13, 2018, at 11:06 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

On 13/11/18 5:35 p.m., Rui DeSousa wrote:


On Nov 13, 2018, at 7:00 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

Is there a way for the WAL receiver to not have detected the termination of the replication stream?

The teardown of the network socket on the upstream server should send a reset packet to the downstream server and at that point the WAL receiver would close its connection.  Is there any firewalls, router, rules, etc between the nodes that could have dropped the packet?

No


If both nodes are up and able to communicate with each other then the OS should teardown the connection; why is that not happening?  You stated that you’re in the cloud; cloud providers use a software network and all endpoints have ACLs — a.k.a firewall. I would also check to see if iptables is running… it on by default.

Have there been changes to how you deploy new ACLs, etc? 

The fact that WAL sender terminated and WAL receiver did not; points to the WAL receiver not being able to communicate with the upstream server.



Shouldn't normally the WAL receiver detect this and try again in wal_retrieve_retry_interval ?

Not really… if the connection has already been torn down, the upstream server would send another reset packet on the next request and in this case it would.  However, if request packets are not reaching the upstream server, i.e. due to a firewall silently dropping the packets (personally I believe a firewall should always send reset packets to friendly hosts), then what happens is the TCP/IP send queue builds up with the request packets instead — at this point waiting on the OS to terminate the connection, which can take a day or two depending on your TCP/IP settings.


Again no dropping, no firewall.


Again, if both nodes are up and are able to communicate then this should get resolved on its own.

What you want to use instead is wal_receiver_timeout to detect the given case where upstream server either no longer exists or the firewall, etc is silently dropping packets.

Once again from my original message :
"while setting up logical replication since August we had seen early on the need to increase max_receiver_timeout and max_sender_timeout from 60sec to 5mins"

So with wal_receiver_timeout='5 min', the receiver never detected any timeout.


It should have reached the timeout and the connection should be torn down.  It could be that the send queue is backed up and Postgres is hung on trying to teardown the connection.  Again, network related issue.

I’ve actually ran into this issue before; where I was unable to terminate an idle session from an application node that had an open transaction.  The TCP/IP send queue was backlogged thus it would not terminate; one might consider this a bug or not.  The application server was not reachable due to firewall dropping the connection and not sending a reset packet on further attempts to communicate with the application server; I had to get a network admin to drop the connection as bouncing Postgres was not an option on a production system and kill -9 is a bad idea too.  

Assuming this is going to happen again — I would advise you to get from both nodes the state of the connection and the tcp/ip queue information from netstat.   If you can, also do a tcpdump on the connection to see what each of the nodes is doing; that would give you more insight.

I would also advise looking into TCP/IP keep alive.

Our sysadms (seasoned linux/network guys : we have been working here for more than 10 yrs) were absolute in that we run no firewall or other traffic shaping system between the two hosts. (if we did the problem would manifest itself earlier).  Can you recommend what to look for exactly regarding both TCP stacks ? The subscriber node is a clone of the primary. We have :

# sysctl -a | grep -i keepaliv
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

dynacom=# show tcp_keepalives_count ;
 tcp_keepalives_count
----------------------
 0
(1 row)

dynacom=# show tcp_keepalives_idle ;
 tcp_keepalives_idle
---------------------
 0
(1 row)

dynacom=# show tcp_keepalives_interval ;
 tcp_keepalives_interval
-------------------------
 0
(1 row)


Also, in addition to what you say (netstat, tcpdump), if I detect such a case (even with the primary panic'ed -- yeah, this would take quite some nerves to do) I will connect with gdb and take a stack trace to know what the worker is doing and why it doesn't restart.
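
For the record, a rough sketch of that (the PID would come from pg_stat_subscription or ps; gdb and debug symbols assumed to be installed):

ps -ef | grep 'logical replication worker'      # find the worker's PID
gdb -p <pid> -batch -ex bt                      # attach, print a backtrace, exit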



-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

> On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>
> Our sysadms (seasoned linux/network guys : we have been working here for more than 10 yrs) were absolute in that we run no firewall or other traffic shaping system between the two hosts. (if we did the problem would manifest itself earlier).  Can you recommend what to look for exactly regarding both TCP stacks ? The subscriber node is a clone of the primary. We have :
>
> # sysctl -a | grep -i keepaliv
> net.ipv4.tcp_keepalive_intvl = 75
> net.ipv4.tcp_keepalive_probes = 9
> net.ipv4.tcp_keepalive_time = 7200
>

Those keep alive settings are linux’s defaults and work out to be 18 hours before an abandoned connection is dropped.  So, the WAL receiver should have corrected itself after that time.  For reference, I terminate abandoned sessions within 15 mins as they take up valuable database resources and could potentially hold on to locks, snapshots, etc.

I haven’t used Postgres keep alive setting as I find the OS handles it just fine.


> Also, in addition to what you say (netstat, tcpdump), if I detect such a case (even with the primary panic'ed -- yeah, this would take quite some nerves to do) I will connect with gdb and take a stack trace to know what the worker is doing and why it doesn't restart.
>
>

Do a netstat -an; that will show you all the network connections and their current state.  If you do this on both systems you should find corresponding entries/states for the replication stream.

i.e. sample output:

Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
tcp4       0      0 10.10.10.1.50604          10.10.10.2.5432       ESTABLISHED


In your case, what I think might have happened is that the upstream server would not have an entry, or it would be in one of the FIN states, where the downstream server would have an EST connection with a Send-Q backlog.  If the servers were communicating then the upstream server would have responded with a reset packet to the downstream, thus forcing the session to terminate.

Using root account you could have seen what was transpiring on the given connection; i.e.

tcpdump -i eth0 port 50604




Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:
On 14/11/18 6:24 p.m., Rui DeSousa wrote:
>
>> On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>>
>> Our sysadms (seasoned linux/network guys : we have been working here for more than 10 yrs) were absolute in that we run no firewall or other traffic shaping system between the two hosts. (if we did the problem would manifest itself earlier).  Can you recommend what to look for exactly regarding both TCP stacks ? The subscriber node is a clone of the primary. We have :
>>
>> # sysctl -a | grep -i keepaliv
>> net.ipv4.tcp_keepalive_intvl = 75
>> net.ipv4.tcp_keepalive_probes = 9
>> net.ipv4.tcp_keepalive_time = 7200
>>
> Those keep alive settings are linux’s defaults and work out to be 18 hours before an abandoned connection is dropped.  So, the WAL receiver should have corrected itself after that time.  For reference, I terminate abandoned sessions within 15 mins as they take up valuable database resources and could potentially hold on to locks, snapshots, etc.

Didn't happen. Friday 15:01 + '18 hrs' = Saturday 09:01. The subscription was stuck on Friday 15:01; unfortunately I didn't keep the whole output of pg_stat_subscription (which I took on Sunday morning). And on Sunday morning the primary, having run out of space, PANIC'ed.

>
> I haven’t used Postgres keep alive setting as I find the OS handles it just fine.
>
>
>> Also, in addition to what you say (netstat, tcpdump), if I detect such a case (even with the primary panic'ed -- yeah, this would take quite some nerves to do) I will connect with gdb and take a stack trace to know what the worker is doing and why it doesn't restart.
>>
>>
> Do a netstat -an; that will show you all the network connections and their current state.  If you do this on both systems you should find corresponding entries/states for the replication stream.
>
> i.e. sample output:
>
> Active Internet connections (including servers)
> Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
> tcp4       0      0 10.10.10.1.50604          10.10.10.2.5432       ESTABLISHED
>
>
> In your case, what I think might have happened is that the upstream server would not have an entry, or it would be in one of the FIN states, where the downstream server would have an EST connection with a Send-Q backlog.  If the servers were communicating then the upstream server would have responded with a reset packet to the downstream, thus forcing the session to terminate.
>
> Using root account you could have seen what was transpiring on the given connection; i.e.
>
> tcpdump -i eth0 port 50604
>
>
>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt





On Nov 15, 2018, at 2:20 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

On 14/11/18 6:24 p.m., Rui DeSousa wrote:

On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

Our sysadms (seasoned linux/network guys : we have been working here for more than 10 yrs) were absolute in that we run no firewall or other traffic shaping system between the two hosts. (if we did the problem would manifest itself earlier).  Can you recommend what to look for exactly regarding both TCP stacks ? The subscriber node is a clone of the primary. We have :

# sysctl -a | grep -i keepaliv
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

Those keep alive settings are linux’s defaults and work out to be 18 hours before the abandon connection is dropped.  So, the WAL receiver should have corrected itself after that time.  For reference, I run terminating abandon session within 15 mins as they take-up valuable database resources and could potentially hold on to locks, snapshots, etc.

Didn't happen. Friday 15:01 + '18 hrs' = Saturday 09:01 . The subscription was stuck on Friday 15:01, unfortunately I didn't keep the whole output of the pg_stat_subscription (which I took on Sunday morning). And in Sunday morning the primary, having run out of space, PANIC'ed .

Apparently TCP keep alive is not enabled; looks like we’re missing the following but don’t look for it :).  So the connection would stick around until it gets a reset packet which could be days if not longer especially if packets are being dropped.  I know you say there is no firewall; but the upstream server would send a reset packet if connection was terminated and/or Postgres was down — so there is something dropping packets.

net.inet.tcp.always_keepalive=1

A quick google and it looks like Linux defaults to not enabling keep alive whereas FreeBSD enables it by default and globally regardless of application request.  For Linux, Postgres will need to request it. You will need to setup the keep alive parameters in the Postgres configuration and restart the server.

The keep alive setup will allow WAL receiver to detect the broken connection resulting in it terminating the current connection and attempt to establish a new connection.
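
A sketch of what that configuration might look like (these are the standard PostgreSQL GUCs and apply to server-side sockets such as the walsender's; for the subscriber's outgoing connection the equivalent libpq options — keepalives_idle, keepalives_interval, keepalives_count — can go in the subscription's connection string; the values below are only examples, 0 means "use the OS defaults"):

tcp_keepalives_idle = 300        # seconds of idle time before the first keepalive probe
tcp_keepalives_interval = 30     # seconds between probes
tcp_keepalives_count = 3         # unanswered probes before the connection is considered dead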


Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:
On 15/11/18 4:48 p.m., Rui DeSousa wrote:


On Nov 15, 2018, at 2:20 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

On 14/11/18 6:24 p.m., Rui DeSousa wrote:

On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

Our sysadms (seasoned linux/network guys : we have been working here for more than 10 yrs) were absolute in that we run no firewall or other traffic shaping system between the two hosts. (if we did the problem would manifest itself earlier).  Can you recommend what to look for exactly regarding both TCP stacks ? The subscriber node is a clone of the primary. We have :

# sysctl -a | grep -i keepaliv
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

Those keep alive settings are linux’s defaults and work out to be 18 hours before the abandon connection is dropped.  So, the WAL receiver should have corrected itself after that time.  For reference, I run terminating abandon session within 15 mins as they take-up valuable database resources and could potentially hold on to locks, snapshots, etc.

Didn't happen. Friday 15:01 + '18 hrs' = Saturday 09:01 . The subscription was stuck on Friday 15:01, unfortunately I didn't keep the whole output of the pg_stat_subscription (which I took on Sunday morning). And in Sunday morning the primary, having run out of space, PANIC'ed .

Apparently TCP keep alive is not enabled; looks like we’re missing the following but don’t look for it :).  So the connection would stick around until it gets a reset packet which could be days if not longer especially if packets are being dropped.  I know you say there is no firewall; but the upstream server would send a reset packet if connection was terminated and/or Postgres was down — so there is something dropping packets.

net.inet.tcp.always_keepalive=1

This setting is from FreeBSD. I have tested changing the settings on my PostgreSQL 11.1 on my FreeBSD 11.2-RELEASE-p3, and this would have no effect at all to the postgresql settings, they remained all three of them at zero. This is completely irrelevant with my problem but anyway.


A quick google and it looks like Linux defaults to not enabling keep alive whereas FreeBSD enables it by default and globally regardless of application request.  For Linux, Postgres will need to request it. You will need to setup the keep alive parameters in the Postgres configuration and restart the server.

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
So according to the official Linux docs, these are the three parameters that govern TCP keepalive in Linux, which on both of the said systems are set as:
root@TEST-smadb:/var/lib/pgsql# sysctl -a | grep keep
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
root@TEST-smadb:/var/lib/pgsql#


The keep alive setup will allow WAL receiver to detect the broken connection resulting in it terminating the current connection and attempt to establish a new connection.

So from the looks of this, keep alive is enabled. (Also don't confuse the WAL receiver with the logical worker; different programs, albeit similar.)

People, any other theories as to why the subscriber kept acting like the publisher was there, while it wasn't?
Is there any way (by network means?) to mock this behavior, in order to fool the replication worker into thinking the sender is there?





-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt


On Nov 16, 2018, at 3:18 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

net.inet.tcp.always_keepalive=1

This setting is from FreeBSD. I have tested changing the settings on my PostgreSQL 11.1 on my FreeBSD 11.2-RELEASE-p3, and this would have no effect at all to the postgresql settings, they remained all three of them at zero. This is completely irrelevant with my problem but anyway.


That is what I stated; you don’t need it.  It is that in Linux the application has to enable it, and I don’t know of a kernel setting for Linux like the one in FreeBSD.


A quick google and it looks like Linux defaults to not enabling keep alive whereas FreeBSD enables it by default and globally regardless of application request.  For Linux, Postgres will need to request it. You will need to setup the keep alive parameters in the Postgres configuration and restart the server.

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
So according to the official Linux docs, three are the parameters that govern TCP keepalive in Linux, which in both the said systems are set as :
root@TEST-smadb:/var/lib/pgsql# sysctl -a | grep keep
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
root@TEST-smadb:/var/lib/pgsql#


That does not mean the connection has TCP keep alive enabled; it just means that if an application requests it, those would be the default settings if it doesn’t provide its own.  Those settings would be too large anyway; you want to be able to detect a broken connection much quicker than 18 hours.


The keep alive setup will allow WAL receiver to detect the broken connection resulting in it terminating the current connection and attempt to establish a new connection.

So from looks of this, keep alive is enabled. (Also don't confuse WAL receiver with logical worker, different programs, albeit similar).

I don’t believe it’s enabled; have you checked to see that you are getting keep alive packets?  If it was enabled it would have terminated after 18 hours.

Is there any way (by network means?) to mock this behavior in order to fool the replication worker like the sender is there?

Put a firewall in-between the servers and drop the packets without sending resets.


Have a read here:

Section 4.2


The RFC states TCP keep alive should be off by default; FreeBSD changed that back in 1999 and I believe Linux still follows the RFC:




> On Nov 16, 2018, at 3:18 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>
>>
>> net.inet.tcp.always_keepalive=1
>
> This setting is from FreeBSD. I have tested changing the settings on my PostgreSQL 11.1 on my FreeBSD 11.2-RELEASE-p3, and this would have no effect at all to the postgresql settings, they remained all three of them at zero.

That setting is not going to change Postgresql settings, it’s not possible; however, it overrides the connection and forces TCP keep alive even if the application attempts to disable it.  If you were to watch the connection you would see the keep alive packets.

> This is completely irrelevant with my problem but anyway.

You have a network related issue, and this would detect and resolve the issue without fixing the network issue.  You have to fix at least one or both of them.

Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:
On 16/11/18 5:36 p.m., Rui DeSousa wrote:
>
>> On Nov 16, 2018, at 3:18 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>>
>>> net.inet.tcp.always_keepalive=1
>> This setting is from FreeBSD. I have tested changing the settings on my PostgreSQL 11.1 on my FreeBSD 11.2-RELEASE-p3, and this would have no effect at all to the postgresql settings, they remained all three of them at zero.
> That setting is not going to change Postgresql settings, it’s not possible; however, it overrides the connection and forces TCP keep alive even if the application attempts to disable it.  If you were to watch the connection you would see the keep alive packets.


Note : this is just a small parenthesis to the search for the cause of 
the problem.

So I changed all three settings in POSTGRESQL 11.1 (by ALTER SYSTEM + 
pg_ctl restart) which runs on a FreeBSD 11 and none of those seemed to 
have changed, via SHOW or :

postgres# select name,setting from pg_settings where name ~* 'keepalive';

           name           | setting
-------------------------+---------
  tcp_keepalives_count    | 0
  tcp_keepalives_idle     | 0
  tcp_keepalives_interval | 0
(3 rows)

repeated the test on a Linux machine with PostgreSQL 10.5, guess what, 
same outcome, it seems that both FreeBSD and Linux trust the OS for 
those settings.

Docs don't imply this behavior, docs say that if those values are set to 
0 then the OS values are used.


>
>> This is completely irrelevant with my problem but anyway.
> You have a network related issue, and this would detect and resolve the issue without fixing the network issue.  You have to fix at least one or both of them.


I mean FreeBSD is irrelevant to the problem. The problem was on Linux.



Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:


On 16/11/18 5:29 p.m., Rui DeSousa wrote:


On Nov 16, 2018, at 3:18 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

net.inet.tcp.always_keepalive=1

This setting is from FreeBSD. I have tested changing the settings on my PostgreSQL 11.1 on my FreeBSD 11.2-RELEASE-p3, and this would have no effect at all to the postgresql settings, they remained all three of them at zero. This is completely irrelevant with my problem but anyway.


That is what I stated; you don’t need it.  It is that in Linux the application has to enable it and I don’t know of a kernel setting for Linux like the one in FreeBSD


You may read the PostgreSQL backend sources (grep for SO_KEEPALIVE), the code supports KEEPALIVE.
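
For example (file locations as they appear in the PostgreSQL 10/11 source tree, roughly):

grep -n SO_KEEPALIVE src/backend/libpq/pqcomm.c          # server-side (backend/walsender) sockets
grep -ni keepalive src/interfaces/libpq/fe-connect.c     # client side (libpq, used by the logical replication worker)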




A quick google and it looks like Linux defaults to not enabling keep alive whereas FreeBSD enables it by default and globally regardless of application request.  For Linux, Postgres will need to request it. You will need to setup the keep alive parameters in the Postgres configuration and restart the server.

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
So according to the official Linux docs, three are the parameters that govern TCP keepalive in Linux, which in both the said systems are set as :
root@TEST-smadb:/var/lib/pgsql# sysctl -a | grep keep
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
root@TEST-smadb:/var/lib/pgsql#


That does not mean the connection has TCP keep alive enabled; it just means that if an application requests it those would be the defaults setting if it doesn’t provide its own.  Those setting would be too large anyway; you want to be able to detect a broken connection much quicker than 18 hours.


I checked on a bare minimal default installation (after tweaking the kernel tunables to smaller values, of course): keepalive msgs are sent and ACK'ed at the specified intervals, checked with wireshark, port 5432. You should test this yourself.
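
In case anyone wants to repeat the check without wireshark, a rough tcpdump equivalent (the interface name is just an example; on an otherwise idle link the keepalive probes and their ACKs show up as tiny, payload-less segments repeating at the configured interval):

tcpdump -nn -i eth0 'tcp port 5432 and less 70'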




The keep alive setup will allow WAL receiver to detect the broken connection resulting in it terminating the current connection and attempt to establish a new connection.

So from looks of this, keep alive is enabled. (Also don't confuse WAL receiver with logical worker, different programs, albeit similar).

I don’t believe it’s enabled; have you check to see that you getting keep alive packets?  If it was enabled it would have terminated after 18 hours.


See above. In the meantime, it would be nice if someone from the hackers list would chime in to clear things up, just to be sure.

Which means that, since PostgreSQL *supports* KEEPALIVE and the logical worker kept happy like nothing happened, I guess *something* was mocking the KEEPALIVE ACKs??????



Is there any way (by network means?) to mock this behavior in order to fool the replication worker like the sender is there?

Put a firewall in-between the servers and drop the packets without sending resets.


Have a read here:

Section 4.2


The RFC states TCP keep alive should be off by default; FreeBSD changed that back in 1999 and I believe Linux still follows the RFC:





On Nov 17, 2018, at 6:07 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

You may read the PostgreSQL backend sources (grep for SO_KEEPALIVE), the code supports KEEPALIVE.


Postgres supports it; but the question is, is it on for the given connection?


I checked on a bare minimal default installation, (after tweaking the kernel tunables to smaller values of course), keepalive msgs are sent and ACK'ed at the specified intervals, checked with wireshark, port 5432. You should test this yourself.



I just configured Postgres with streaming replication using the following versions and TCP keep alive was enabled by default for the WAL receiver connection and also psql connections.

Linux debian 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
PostgreSQL 10.6 (Debian 10.6-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

root@debian:~# netstat -anp --timers | grep -e Timer -e  EST | grep -e Timer -e 5432
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 10.6.6.101:47546        10.6.6.100:5432         ESTABLISHED 989/telnet           off (0.00/0/0)
tcp        0      0 10.6.6.101:47544        10.6.6.100:5432         ESTABLISHED 953/psql             keepalive (7103.36/0/0)
tcp        0      0 10.6.6.101:47542        10.6.6.100:5432         ESTABLISHED 922/postgres: 10/ma  keepalive (7088.03/0/0)


As you can see from the above, telnet does not enable keep alive on the connection.  I would check the troubled system with the above netstat command to verify that keep alive is in fact enabled on the WAL receiver connection.

If it’s enabled the connection should have terminated after the 18 hours, and hopefully less now with your new setting.  I have no idea why it wouldn’t terminate and reconnect other than tcp keep alive is either off or a bug in Linux/Postgres.


 


On Nov 17, 2018, at 3:47 PM, Rui DeSousa <rui@crazybean.net> wrote:



On Nov 17, 2018, at 6:07 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

You may read the PostgreSQL backend sources (grep for SO_KEEPALIVE), the code supports KEEPALIVE.


Postgres supports it; but the question is it on for the given connection? 


I checked on a bare minimal default installation, (after tweaking the kernel tunables to smaller values of course), keepalive msgs are sent and ACK'ed at the specified intervals, checked with wireshark, port 5432. You should test this yourself.



I just configured Postgres with streaming replication using the following versions and TCP keep alive was enabled by default for the WAL receiver connection and also psql connections.

Linux debian 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
PostgreSQL 10.6 (Debian 10.6-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

root@debian:~# netstat -anp --timers | grep -e Timer -e  EST | grep -e Timer -e 5432
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 10.6.6.101:47546        10.6.6.100:5432         ESTABLISHED 989/telnet           off (0.00/0/0)
tcp        0      0 10.6.6.101:47544        10.6.6.100:5432         ESTABLISHED 953/psql             keepalive (7103.36/0/0)
tcp        0      0 10.6.6.101:47542        10.6.6.100:5432         ESTABLISHED 922/postgres: 10/ma  keepalive (7088.03/0/0)


As you can see from above; telnet does not enable keep alive on the connection.  I would check the troubled system with the above netstat command to verify that keep alive is in fact enabled on the WAL receiver connection.

If it’s enabled the connection should have terminated after the 18 hours, and hopefully less now with your new setting.  I have no idea why it wouldn’t terminate and reconnect other than tcp keep alive is either off or a bug in Linux/Postgres.


 

I will also add that testing this configuration worked as expected.  When disconnecting the primary node from the network, the Postgres processes were able to detect the fault and the WAL receiver reconnected as soon as the primary was back online without issue.

What version of Postgres and OS are you using?

Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From:
Achilleas Mantzios
Date:


On 17/11/18 10:47 p.m., Rui DeSousa wrote:


On Nov 17, 2018, at 6:07 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

You may read the PostgreSQL backend sources (grep for SO_KEEPALIVE), the code supports KEEPALIVE.


Postgres supports it; but the question is it on for the given connection?


As we both saw this is the default.




I checked on a bare minimal default installation, (after tweaking the kernel tunables to smaller values of course), keepalive msgs are sent and ACK'ed at the specified intervals, checked with wireshark, port 5432. You should test this yourself.



I just configured Postgres with streaming replication using the following versions and TCP keep alive was enabled by default for the WAL receiver connection and also psql connections.

Linux debian 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
PostgreSQL 10.6 (Debian 10.6-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

root@debian:~# netstat -anp --timers | grep -e Timer -e  EST | grep -e Timer -e 5432
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 10.6.6.101:47546        10.6.6.100:5432         ESTABLISHED 989/telnet           off (0.00/0/0)
tcp        0      0 10.6.6.101:47544        10.6.6.100:5432         ESTABLISHED 953/psql             keepalive (7103.36/0/0)
tcp        0      0 10.6.6.101:47542        10.6.6.100:5432         ESTABLISHED 922/postgres: 10/ma  keepalive (7088.03/0/0)


As you can see from above; telnet does not enable keep alive on the connection.  I would check the troubled system with the above netstat command to verify that keep alive is in fact enabled on the WAL receiver connection.


On the very same system (with *no* configuration/nothing changed — the kernel sysctl keepalive settings I tested were on another test machine), just now:

root@TEST-smadb:/var/lib/pgsql# uname -a
Linux TEST-smadb 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
root@TEST-smadb:/var/lib/pgsql#
root@TEST-smadb:/var/lib/pgsql# netstat -anp --timers | grep 10.9.0.20 | grep 5432
tcp        0      0 10.9.0.77:51307         10.9.0.20:5432          ESTABLISHED 34400/postgres: bgw keepalive (5711.60/0/0)
root@TEST-smadb:/var/lib/pgsql#
root@TEST-smadb:/var/lib/pgsql# ps aux | grep 34400
postgres  34400  0.5 17.1 19618680 5648984 ?    Ssl  Nov16  13:52 postgres: bgworker: logical replication worker for subscription 185231525 
root      70384  0.0  0.0  12708  1620 pts/1    S+   08:29   0:00 grep 34400
root@TEST-smadb:/var/lib/pgsql#
root@TEST-smadb:/var/lib/pgsql# su - postgres -c 'psql -c "select version()"'
                                         version                                        
-----------------------------------------------------------------------------------------
 PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
(1 row)

root@TEST-smadb:/var/lib/pgsql#


If it’s enabled the connection should have terminated after the 18 hours


Yes.


and hopefully less now with your new setting. 


This was on a  test node, the settings are still at the default.


 
I have no idea why it wouldn’t terminate and reconnect other than tcp keep alive is either off or a bug in Linux/Postgres.


I may post the logs of both the primary and the subscriber (filtering out some sensitive business content, of course). It is striking that the subscriber node, immediately after restart, starts asking for WALs, so this is not network/firewall related, at least not in a static fashion. I mean, if a firewall was in between with some rules then it could have had a (catastrophic) effect on the whole course of replication, not just those specific hours. So I too think that there may be a bug in PostgreSQL (but the logical worker shares code with the wal receiver, and this has been tested for so long) or the OS (again unlikely) *or* something weird may have happened, like some VM migration, e.g. cloning, on the cloud provider side. I don't have direct contact with them; our admin guys say nothing abnormal happened, the cloud provider says nothing abnormal happened, however my boss told me that they were doing some migrations from one site to another. That might explain some things.

I will ask again. Thank you a lot!




 

Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

От
Achilleas Mantzios
Дата:
On 14/11/18 6:24 p.m., Rui DeSousa wrote:
>
>> On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>>
>> Our sysadms (seasoned linux/network guys : we have been working here for more than 10 yrs) were absolute in that we
>> run no firewall or other traffic shaping system between the two hosts. (if we did the problem would manifest itself
>> earlier). Can you recommend what to look for exactly regarding both TCP stacks ? The subscriber node is a clone of the
>> primary. We have :
>>
>> # sysctl -a | grep -i keepaliv
>> net.ipv4.tcp_keepalive_intvl = 75
>> net.ipv4.tcp_keepalive_probes = 9
>> net.ipv4.tcp_keepalive_time = 7200
>>
> Those keep alive settings are linux’s defaults and work out to be 18 hours before the abandon connection is dropped.
> So, the WAL receiver should have corrected itself after that time.  For reference, I run terminating abandon session
> within 15 mins as they take-up valuable database resources and could potentially hold on to locks, snapshots, etc.


Hey, I was reading the docs, it seems it means :

net.ipv4.tcp_keepalive_time + net.ipv4.tcp_keepalive_intvl * net.ipv4.tcp_keepalive_probes = 2hrs 11 Mins 15 Secs, rather than 18 Hrs
 


>
> I haven’t used Postgres keep alive setting as I find the OS handles it just fine.
>
>
>> Also in addition to what you say (netstat, tcpdump) if I detect such a case (even with the primary panic'ed -- yeah
>> this would take quite some nerves to do this) I will connect with gdb and take a stack trace to know what the worker is
>> doing and why doesn't it restart.
>>
>>
> Do a netstat -an; that will show you all the network connections and their current state.  If you do this on both
> systems you should find corresponding entries/states for the replication stream.
>
> i.e. sample output:
>
> Active Internet connections (including servers)
> Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
> tcp4       0      0 10.10.10.1.50604          10.10.10.2.5432       ESTABLISHED
>
>
> In your case; what I think might have happened is that upstream server would not have an entry or it would be in one
> of the FIN states where the downstream server would have EST connection with the Send-Q backlog.  If the servers were
> communicating then the upstream server would have responded with a reset packet to the downstream thus forcing the
> session to terminate.
>
> Using root account you could have seen what was transpiring on the given connection; i.e.
>
> tcpdump -i eth0 port 50604
>
>
>
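
(On the gdb idea quoted above: a minimal way to grab a stack trace from a live worker, purely as a sketch, using PID 34400 from the earlier ps output as a stand-in; debug symbols make the backtrace far more useful:)

gdb --batch -p 34400 -ex 'bt'    # attach, print the backtrace, then detach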




Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"

От
Rui DeSousa
Дата:

On Nov 20, 2018, at 3:34 PM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

Hey, I was reading the docs, it seems it means :

net.ipv4.tcp_keepalive_time + net.ipv4.tcp_keepalive_intvl * net.ipv4.tcp_keepalive_probes = 2hrs 11 Mins 15 Secs, rather than 18 Hrs

Yeah, that’s correct.  I wonder why it didn’t terminate.

Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"

От
Achilleas Mantzios
Дата:


On 20/11/18 10:48 p.m., Rui DeSousa wrote:


On Nov 20, 2018, at 3:34 PM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

Hey, I was reading the docs, it seems it means :

net.ipv4.tcp_keepalive_time + net.ipv4.tcp_keepalive_intvl * net.ipv4.tcp_keepalive_probes = 2hrs 11 Mins 15 Secs, rather than 18 Hrs

Yeah, that’s correct.  I wonder why it didn’t terminate.


Most probably because another clone had been created (cloud migration magic); that's my theory, albeit not confirmed by the provider. The logical worker (walreceiver) was still alive and happy even after the primary crashed. I have the logs from the other standby, and it immediately detected the problem (PANIC on the primary) and retried. There was no firewall dropping packets; in every test I did, the logical bgworker detects any problem *instantly*, and retries after 5 secs by default.
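
(If I'm reading the docs right, that 5-second retry looks like wal_retrieve_retry_interval -- an assumption on my part; a quick way to confirm the value on the subscriber, just a sketch:)

su - postgres -c "psql -c \"SHOW wal_retrieve_retry_interval;\""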

Re: PostgreSQL 10.5 : Logical replication timeout results in PANICin pg_wal "No space left on device"

От
Alvaro Herrera
Дата:
On 2018-Nov-14, Rui DeSousa wrote:

> > On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
> > 
> > Our sysadms (seasoned linux/network guys : we have been working here
> > for more than 10 yrs) were absolute in that we run no firewall or
> > other traffic shaping system between the two hosts. (if we did the
> > problem would manifest itself earlier).  Can you recommend what to
> > look for exactly regarding both TCP stacks ? The subscriber node is
> > a clone of the primary. We have :
> > 
> > # sysctl -a | grep -i keepaliv
> > net.ipv4.tcp_keepalive_intvl = 75
> > net.ipv4.tcp_keepalive_probes = 9
> > net.ipv4.tcp_keepalive_time = 7200
> 
> Those keep alive settings are linux’s defaults and work out to be 18
> hours before the abandon connection is dropped.  So, the WAL receiver
> should have corrected itself after that time.  For reference, I run
> terminating abandon session within 15 mins as they take-up valuable
> database resources and could potentially hold on to locks, snapshots,
> etc.

Where does your 18h figure come from?  As I understand it, these numbers
mean "wait 7200 seconds, then send 9 probes 75 seconds apart", kill the
connection if no reply is obtained.  So that works out to about 131
minutes (modulo fencepost bug).  Certainly not 18 hours ...
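
(A quick sanity check of that arithmetic, using the sysctl values quoted above:)

awk 'BEGIN { s = 7200 + 9*75; printf "%d seconds = %dh %dm %ds\n", s, s/3600, (s%3600)/60, s%60 }'
# prints: 7875 seconds = 2h 11m 15s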

Now ... I have seen Linux kernel code that seemed to me to cause network
transmission get stuck *in the kernel* without any way out.  Now I'm not
a kernel expert and I don't know if this applies to your case (maybe it
got fixed already), but it was definitely some process that was stuck
with "wchan" set to a network kernel call and way beyond TCP keepalives.
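
(For anyone wanting to check for that condition, the kernel wait channel of a process can be inspected directly; a sketch, using the worker PID 34400 from the earlier ps output purely as an example:)

cat /proc/34400/wchan; echo              # the kernel symbol the process is currently sleeping in
ps -o pid,state,wchan:40,cmd -p 34400    # same information, with process state and command line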

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"

От
Achilleas Mantzios
Дата:
Hi Alvaro!

On 23/11/18 1:10 p.m., Alvaro Herrera wrote:
> On 2018-Nov-14, Rui DeSousa wrote:
>
>>> On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>>>
>>> Our sysadms (seasoned linux/network guys : we have been working here
>>> for more than 10 yrs) were absolute in that we run no firewall or
>>> other traffic shaping system between the two hosts. (if we did the
>>> problem would manifest itself earlier).  Can you recommend what to
>>> look for exactly regarding both TCP stacks ? The subscriber node is
>>> a clone of the primary. We have :
>>>
>>> # sysctl -a | grep -i keepaliv
>>> net.ipv4.tcp_keepalive_intvl = 75
>>> net.ipv4.tcp_keepalive_probes = 9
>>> net.ipv4.tcp_keepalive_time = 7200
>> Those keep alive settings are linux’s defaults and work out to be 18
>> hours before the abandon connection is dropped.  So, the WAL receiver
>> should have corrected itself after that time.  For reference, I run
>> terminating abandon session within 15 mins as they take-up valuable
>> database resources and could potentially hold on to locks, snapshots,
>> etc.
> Where does your 18h figure come from?  As I understand it, these numbers
> mean "wait 7200 seconds, then send 9 probes 75 seconds apart", kill the
> connection if no reply is obtained.  So that works out to about 131
> minutes (modulo fencepost bug).  Certainly not 18 hours ...

Thanks, yes it sums up to 2 Hrs 11 Mins. The moments after the primary crashed I didn't have the nerves/patience/guts to wait that long and actually prove that the subscriber listened happily to a ghost/stuck connection.

>
> Now ... I have seen Linux kernel code that seemed to me to cause network
> transmission get stuck *in the kernel* without any way out.  Now I'm not
> a kernel expert and I don't know if this applies to your case (maybe it
> got fixed already), but it was definitely some process that was stuck
> with "wchan" set to a network kernel call and way beyond TCP keepalives.

It seems we'll have to upgrade our systems/kernels ASAP.  Thanks a lot!

>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt