Обсуждение: pgBackRest : "ERROR: [042]: unexpected EOF reading line from remoteprocess on" and stale pgbackrest processes hanging
Hello List, I had this error today with postgresql 10.6 and pgbackrest 2.07. I was running test backups and during a backup The command given to the repo host (pg1-host=10.9.0.77) was : pgbackrest@smadb2cs:~$ pgbackrest --stanza=dynacomtest backup I noticed that the (two) ssh login sessions to the repo host just froze. No disconnections, just froze. So some network problemsurely occurred. On the repo host pgbackrest log I got : 2019-01-16 12:20:38.372 P00 ERROR: [042]: unexpected EOF reading line from remote process on '10.9.0.77' on the pg host (10.9.0.77) I noticed that the pgbackrest backup processes had died, the only ones running was spawn fromthe archive-push , however on the repo host the pgbackrest processes were alive, when in fact the backup was stalled. On the repo I got : root@smadb2cs:/home/pgbackrest# netstat -apn | grep 10.9.0.77 tcp 0 88 10.9.1.22:43837 10.9.0.77:22 ESTABLISHED 11918/ssh tcp 0 88 10.9.1.22:43838 10.9.0.77:22 ESTABLISHED 11920/ssh tcp 0 88 10.9.1.22:43839 10.9.0.77:22 ESTABLISHED 11919/ssh tcp 0 88 10.9.1.22:43840 10.9.0.77:22 ESTABLISHED 11922/ssh root@smadb2cs:/home/pgbackrest# netstat -anp --timers | grep 0.77 tcp 0 88 10.9.1.22:43837 10.9.0.77:22 ESTABLISHED 11918/ssh on (65.50/12/0) tcp 0 88 10.9.1.22:43838 10.9.0.77:22 ESTABLISHED 11920/ssh on (69.60/12/0) tcp 0 88 10.9.1.22:43839 10.9.0.77:22 ESTABLISHED 11919/ssh on (65.50/12/0) tcp 0 88 10.9.1.22:43840 10.9.0.77:22 ESTABLISHED 11922/ssh on (61.40/12/0) In order to start a new backup I had to kill the 4 processes. Is this behavior normal ? Would the 4 processes be killed due to KEEPALIVE settings eventually ? -- Achilleas Mantzios IT DEV Lead IT DEPT Dynacom Tankers Mgmt
Hello so should I post an issue to github? or just wait to see if this happens again? On 16/1/19 5:24 μ.μ., Achilleas Mantzios wrote: > Hello List, > > I had this error today with postgresql 10.6 and pgbackrest 2.07. I was > running test backups and during a backup > The command given to the repo host (pg1-host=10.9.0.77) was : > pgbackrest@smadb2cs:~$ pgbackrest --stanza=dynacomtest backup > I noticed that the (two) ssh login sessions to the repo host just > froze. No disconnections, just froze. So some network problem surely > occurred. > On the repo host pgbackrest log I got : > 2019-01-16 12:20:38.372 P00 ERROR: [042]: unexpected EOF reading line > from remote process on '10.9.0.77' > on the pg host (10.9.0.77) I noticed that the pgbackrest backup > processes had died, the only ones running was spawn from the > archive-push , however on the repo host the pgbackrest processes were > alive, when in fact the backup was stalled. > On the repo I got : > root@smadb2cs:/home/pgbackrest# netstat -apn | grep 10.9.0.77 > tcp 0 88 10.9.1.22:43837 10.9.0.77:22 ESTABLISHED 11918/ssh > tcp 0 88 10.9.1.22:43838 10.9.0.77:22 ESTABLISHED 11920/ssh > tcp 0 88 10.9.1.22:43839 10.9.0.77:22 ESTABLISHED 11919/ssh > tcp 0 88 10.9.1.22:43840 10.9.0.77:22 ESTABLISHED 11922/ssh > root@smadb2cs:/home/pgbackrest# netstat -anp --timers | grep 0.77 > tcp 0 88 10.9.1.22:43837 10.9.0.77:22 ESTABLISHED > 11918/ssh on (65.50/12/0) > tcp 0 88 10.9.1.22:43838 10.9.0.77:22 ESTABLISHED > 11920/ssh on (69.60/12/0) > tcp 0 88 10.9.1.22:43839 10.9.0.77:22 ESTABLISHED > 11919/ssh on (65.50/12/0) > tcp 0 88 10.9.1.22:43840 10.9.0.77:22 ESTABLISHED > 11922/ssh on (61.40/12/0) > > In order to start a new backup I had to kill the 4 processes. > > Is this behavior normal ? Would the 4 processes be killed due to > KEEPALIVE settings eventually ? >
Greetings, * Achilleas Mantzios (achill@matrix.gatewaynet.com) wrote: > I had this error today with postgresql 10.6 and pgbackrest 2.07. I was running test backups and during a backup > The command given to the repo host (pg1-host=10.9.0.77) was : > pgbackrest@smadb2cs:~$ pgbackrest --stanza=dynacomtest backup > I noticed that the (two) ssh login sessions to the repo host just froze. No disconnections, just froze. So some networkproblem surely occurred. > On the repo host pgbackrest log I got : > 2019-01-16 12:20:38.372 P00 ERROR: [042]: unexpected EOF reading line from remote process on '10.9.0.77' > on the pg host (10.9.0.77) I noticed that the pgbackrest backup processes > had died, the only ones running was spawn from the archive-push , however on > the repo host the pgbackrest processes were alive, when in fact the backup > was stalled. > On the repo I got : > root@smadb2cs:/home/pgbackrest# netstat -apn | grep 10.9.0.77 > tcp 0 88 10.9.1.22:43837 10.9.0.77:22 ESTABLISHED 11918/ssh > tcp 0 88 10.9.1.22:43838 10.9.0.77:22 ESTABLISHED 11920/ssh > tcp 0 88 10.9.1.22:43839 10.9.0.77:22 ESTABLISHED 11919/ssh > tcp 0 88 10.9.1.22:43840 10.9.0.77:22 ESTABLISHED 11922/ssh > root@smadb2cs:/home/pgbackrest# netstat -anp --timers | grep 0.77 > tcp 0 88 10.9.1.22:43837 10.9.0.77:22 ESTABLISHED 11918/ssh on (65.50/12/0) > tcp 0 88 10.9.1.22:43838 10.9.0.77:22 ESTABLISHED 11920/ssh on (69.60/12/0) > tcp 0 88 10.9.1.22:43839 10.9.0.77:22 ESTABLISHED 11919/ssh on (65.50/12/0) > tcp 0 88 10.9.1.22:43840 10.9.0.77:22 ESTABLISHED 11922/ssh on (61.40/12/0) This looks like the kernel thinks the connections are still alive. > In order to start a new backup I had to kill the 4 processes. > > Is this behavior normal ? Would the 4 processes be killed due to KEEPALIVE settings eventually ? Yes, they almost certainly would have timed out eventually- but how long were you already waiting..? The timeouts in pgbackrest are not terribly long, in general. Thanks! Stephen
Вложения
Hello Stephen On 16/1/19 10:27 μ.μ., Stephen Frost wrote: > Greetings, > > * Achilleas Mantzios (achill@matrix.gatewaynet.com) wrote: >> I had this error today with postgresql 10.6 and pgbackrest 2.07. I was running test backups and during a backup >> The command given to the repo host (pg1-host=10.9.0.77) was : >> pgbackrest@smadb2cs:~$ pgbackrest --stanza=dynacomtest backup >> I noticed that the (two) ssh login sessions to the repo host just froze. No disconnections, just froze. So some networkproblem surely occurred. >> On the repo host pgbackrest log I got : >> 2019-01-16 12:20:38.372 P00 ERROR: [042]: unexpected EOF reading line from remote process on '10.9.0.77' >> on the pg host (10.9.0.77) I noticed that the pgbackrest backup processes >> had died, the only ones running was spawn from the archive-push , however on >> the repo host the pgbackrest processes were alive, when in fact the backup >> was stalled. >> On the repo I got : >> root@smadb2cs:/home/pgbackrest# netstat -apn | grep 10.9.0.77 >> tcp 0 88 10.9.1.22:43837 10.9.0.77:22 ESTABLISHED 11918/ssh >> tcp 0 88 10.9.1.22:43838 10.9.0.77:22 ESTABLISHED 11920/ssh >> tcp 0 88 10.9.1.22:43839 10.9.0.77:22 ESTABLISHED 11919/ssh >> tcp 0 88 10.9.1.22:43840 10.9.0.77:22 ESTABLISHED 11922/ssh >> root@smadb2cs:/home/pgbackrest# netstat -anp --timers | grep 0.77 >> tcp 0 88 10.9.1.22:43837 10.9.0.77:22 ESTABLISHED 11918/ssh on (65.50/12/0) >> tcp 0 88 10.9.1.22:43838 10.9.0.77:22 ESTABLISHED 11920/ssh on (69.60/12/0) >> tcp 0 88 10.9.1.22:43839 10.9.0.77:22 ESTABLISHED 11919/ssh on (65.50/12/0) >> tcp 0 88 10.9.1.22:43840 10.9.0.77:22 ESTABLISHED 11922/ssh on (61.40/12/0) > This looks like the kernel thinks the connections are still alive. > >> In order to start a new backup I had to kill the 4 processes. >> >> Is this behavior normal ? Would the 4 processes be killed due to KEEPALIVE settings eventually ? > Yes, they almost certainly would have timed out eventually- but how > long were you already waiting..? The timeouts in pgbackrest are not > terribly long, in general. Looking at the log I waited 9 minutes, before manually killing. > > Thanks! > > Stephen
On 1/16/19 10:27 PM, Stephen Frost wrote: > * Achilleas Mantzios (achill@matrix.gatewaynet.com) wrote: >> >> Is this behavior normal ? Would the 4 processes be killed due to KEEPALIVE settings eventually ? > > Yes, they almost certainly would have timed out eventually- but how > long were you already waiting..? The timeouts in pgbackrest are not > terribly long, in general. By default the remote processes will shutdown after 30 minutes if they have not gotten a message from the main process. This is controlled by the `protocol-timeout` option. It's a bit unusual for them to get stuck like this -- usually they get an eof and exit immediately. Definitely looks like something was up with SSH. Regards, -- -David david@pgmasters.net