Thread: BUG #15078: Unable to receive data from WAL Stream Error

BUG #15078: Unable to receive data from WAL Stream Error

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15078
Logged by:          MAHESH KOLLA
Email address:      mkolla@transunion.com
PostgreSQL version: 9.6.3
Operating system:   RHEL
Description:

< 2018-02-19 01:37:48.847 CST > FATAL:  could not receive data from WAL
stream: SSL SYSCALL error: Connection timed out

sh: dev/null: No such file or directory
sh: dev/null: No such file or directory
< 2018-02-19 01:37:48.860 CST > LOG:  invalid resource manager ID 48 at
15/2D69E848
< 2018-02-19 01:37:48.916 CST > LOG:  started streaming WAL from primary at
15/2D000000 on timeline 1



Re: BUG #15078: Unable to receive data from WAL Stream Error

From
Eric Radman
Date:
On Thu, Feb 22, 2018 at 05:22:09PM +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      15078
> Logged by:          MAHESH KOLLA
> Email address:      mkolla@transunion.com
> PostgreSQL version: 9.6.3
> Operating system:   RHEL
> Description:        

Why are you submitting the same bug report twice?


> < 2018-02-19 01:37:48.847 CST > FATAL:  could not receive data from WAL
> stream: SSL SYSCALL error: Connection timed out
> 
> sh: dev/null: No such file or directory
> sh: dev/null: No such file or directory
> < 2018-02-19 01:37:48.860 CST > LOG:  invalid resource manager ID 48 at
> 15/2D69E848
> < 2018-02-19 01:37:48.916 CST > LOG:  started streaming WAL from primary at
> 15/2D000000 on timeline 1

Is the primary server accepting SSL connections?  What does pg_isready
report using the same connection parameters?


-- 
Eric Radman  |  http://eradman.com


RE: BUG #15078: Unable to receive data from WAL Stream Error

From
"Kolla, Mahesh"
Date:
Hi Eric,

Thank you for the response. The primary is accepting SSL connections. The standby stays in sync with the primary all the time, but we receive these FATAL errors 3 to 4 times a day, which makes us worry about possible data corruption.

This is my official email address, so I raised the report again. Apologies for that.

Thank you
Mahesh Kolla


Re: BUG #15078: Unable to receive data from WAL Stream Error

From
Tom Lane
Date:
"Kolla, Mahesh" <Mahesh.Kolla@transunion.com> writes:
> Thank you for the response. Primary is accepting the ssl connections
> .Stand by is in sync with primary all the time but we receive FATALS
> sometimes 3 to 4 times a day which make us worry if having any data
> corruption

I doubt this is a PG bug; it sounds more like a networking problem.
Maybe there's a router or firewall in between that is timing out your
connections too easily.  If you can't adjust the network infrastructure,
you could try enabling TCP keepalives (with a shorter repeat
interval than the default) on the replication connections.
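
For readers unfamiliar with what "enabling TCP keepalives" means at the socket level, here is a minimal sketch in Python. It is only an illustration: the helper name `enable_keepalive` and the numbers 60/15/3 are assumptions chosen for the example, and the per-socket tuning options are platform-specific (they exist on Linux). PostgreSQL exposes the same knobs as the server settings `tcp_keepalives_idle`, `tcp_keepalives_interval` and `tcp_keepalives_count`, and as the libpq connection parameters `keepalives_idle`, `keepalives_interval` and `keepalives_count`, which a standby can put in its `primary_conninfo`.

```python
import socket


def enable_keepalive(sock, idle=60, interval=15, count=3):
    """Turn on TCP keepalives for one socket; the numbers are illustrative."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are platform-specific (present on Linux),
    # so only set the ones this platform actually exposes.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


# Usage: configure a fresh socket and confirm that keepalives are switched on.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With keepalives on, an otherwise idle connection carries periodic probe traffic, which is what keeps an intermediate device from treating it as abandoned.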

            regards, tom lane


RE: BUG #15078: Unable to receive data from WAL Stream Error

From
"Kolla, Mahesh"
Date:
Hello Tom,

Please let me know why we are getting the associated LOG entries below, saying "invalid resource manager", together with the FATAL message "Unable to receive data from WAL Stream Error":

> sh: dev/null: No such file or directory
> sh: dev/null: No such file or directory
> < 2018-02-19 01:37:48.860 CST > LOG:  invalid resource manager ID 48 at
> 15/2D69E848

Please kindly suggest a value for tcp_keepalives_idle, as it is presently set to 0.

Thank you
Mahesh Kolla

RE: BUG #15078: Unable to receive data from WAL Stream Error

From
"Kolla, Mahesh"
Date:
Hello Tom,

We got information from the network team saying that the primary and standby communicate through a switch; there is no firewall at all.

Please let me know why we are getting the associated LOG entries below, saying "invalid resource manager", together with the FATAL message "Unable to receive data from WAL Stream Error" as well. It looks like a bug.

> sh: dev/null: No such file or directory
> sh: dev/null: No such file or directory
> < 2018-02-19 01:37:48.860 CST > LOG:  invalid resource manager ID 48 at
> 15/2D69E848

Thank you
Mahesh Kolla


Re: BUG #15078: Unable to receive data from WAL Stream Error

From
Tomas Vondra
Date:
On 02/22/2018 07:43 PM, Kolla, Mahesh wrote:
> Hello Tom ,
> 
> Please let me know why we are getting below associated LOGS saying
> invalid resource manager with the FATAL message Unable to receive data
> from WAL Stream Error
> 
>> sh: dev/null: No such file or directory
>> sh: dev/null: No such file or directory
>> < 2018-02-19 01:37:48.860 CST > LOG:  invalid resource manager ID 48 at
>> 15/2D69E848
> 

I believe that essentially means the WAL is corrupted in some way,
possibly due to a network issue. I don't think I've seen such an error
message before though, so I'm not sure.

FWIW it's really hard to investigate issues when you only copy three
lines, two of which are errors in your shell script. That provides no
context whatsoever.

> Please kindly suggest us a value for tcp_keepalives_idle as it is
> presently set to 0
> 

That really depends on your networking configuration, but you can try this:

tcp_keepalives_idle = 60
tcp_keepalives_interval = 15
tcp_keepalives_count = 3

which essentially probes the connection after 60 seconds of inactivity;
if the other side does not respond within 15 seconds it'll try again,
and it will consider the connection gone after 3 failures.
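
To put a number on those settings, here is a back-of-the-envelope sketch. It assumes the usual keepalive behaviour (first probe after the idle period, then one probe per interval, connection dropped after the given count of unanswered probes); the function name is made up for illustration, and real timings are only approximate.

```python
def keepalive_detection_seconds(idle, interval, count):
    """Approximate time for an idle, silently broken connection to be declared dead."""
    return idle + interval * count


# Suggested values from this message: 60 + 15 * 3
suggested = keepalive_detection_seconds(idle=60, interval=15, count=3)

# Typical Linux system defaults (7200 s idle, 75 s interval, 9 probes), which
# apply when the PostgreSQL settings are left at 0: 7200 + 75 * 9
linux_default = keepalive_detection_seconds(idle=7200, interval=75, count=9)

print(suggested)      # 105
print(linux_default)  # 7875
```

So with the values above a dead peer is noticed in under two minutes, whereas with the setting at 0 on a stock Linux system it can take more than two hours.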

But it's unclear if this really is a networking issue, so hard to say if
this improves the situation.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


RE: BUG #15078: Unable to receive data from WAL Stream Error

From
"Kolla, Mahesh"
Date:
Hello Tomas,

Thank you for the suggestions.

We are not getting any other errors in our standby database. This error is written directly to postgres.log by the logger process. We are not running any shell script for it.

The "dev/null: No such file or directory" message may be caused by this command in the recovery.conf file:
restore_command='cp /archive/%f %p 2>/dev/null'

Please let me know whether this gives any clue to the problem.
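
One detail worth checking: the log line names `dev/null` without a leading slash, while the command quoted above redirects to `/dev/null`. A possible explanation, which is only an assumption and is not confirmed anywhere in this thread, is that the redirect target the shell actually sees is the relative path `dev/null`. The sketch below shows how a shell treats the two spellings when run from a directory that has no `dev` subdirectory; the exact wording of the error message differs between shells.

```python
import subprocess
import tempfile

# Run both redirects from an empty scratch directory, so that a relative
# "dev/null" cannot resolve to anything.
with tempfile.TemporaryDirectory() as scratch:
    relative = subprocess.run(["sh", "-c", "true 2>dev/null"],
                              cwd=scratch, capture_output=True, text=True)
    absolute = subprocess.run(["sh", "-c", "true 2>/dev/null"],
                              cwd=scratch, capture_output=True, text=True)

# The relative form fails inside the shell before the command runs, and the
# shell's complaint mentions "dev/null"; the absolute form succeeds silently.
print(relative.returncode, relative.stderr.strip())
print(absolute.returncode, absolute.stderr.strip())
```

If the relative form reproduces the message from the log, comparing it against the literal `restore_command` line in recovery.conf would be a quick way to rule this out.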

Thank you
Mahesh Kolla

