Discussion: WAL receive process dies

WAL receive process dies

From:
Patrick Krecker
Date:
We have a periodic network connectivity issue (unrelated to Postgres) that is causing the replication to fail. 

We are running Postgres 9.3 using streaming replication. We also have WAL archives available to be replayed with restore_command. Typically when I bring up a slave it copies over WAL archives for a while before connecting via streaming replication.

When I notice the machine is behind in replication, I also notice that the WAL receiver process has died. There didn't seem to be any information in the logs about it. The solution thus far has been to restart Postgres. I have two questions:

1. It seems that Postgres does not fall back to copying WAL archives with its restore_command. I just want to confirm that this is what Postgres is supposed to do when its connection via streaming replication times out.
2. Is it possible to restart replication after the WAL receiver process has died without restarting Postgres?

Thanks!
Patrick

Re: WAL receive process dies

From:
Craig Ringer
Date:
On 08/28/2014 09:39 AM, Patrick Krecker wrote:
> We have a periodic network connectivity issue (unrelated to Postgres)
> that is causing the replication to fail.
>
> We are running Postgres 9.3 using streaming replication. We also have
> WAL archives available to be replayed with restore_command. Typically
> when I bring up a slave it copies over WAL archives for a while before
> connecting via streaming replication.
>
> When I notice the machine is behind in replication, I also notice that
> the WAL receiver process has died. There didn't seem to be any
> information in the logs about it.

What did you search for?

Do you have core dumps enabled? That'd be a good first step. (Exactly
how to do this depends on the OS/distro/version, but you basically want
to set "ulimit -c unlimited" on some ancestor of the postmaster).

> 1. It seems that Postgres does not fall back to copying WAL archives
> with its restore_command. I just want to confirm that this is what
> Postgres is supposed to do when its connection via streaming replication
> times out.

It should fall back.
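
For reference, a 9.3 standby normally has both sources configured in recovery.conf, and the startup process is supposed to retry restore_command whenever the stream drops. A minimal sketch with placeholder values (not your actual file):

standby_mode = 'on'
primary_conninfo = 'host=127.0.0.1 port=5445 user=replicator'
restore_command = 'cp /path/to/wal_archive/%f "%p"'    # used whenever streaming is unavailable
recovery_target_timeline = 'latest'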

> 2. Is it possible to restart replication after the WAL receiver process
> has died without restarting Postgres?

PostgreSQL should do so itself.

Please show your recovery.conf (appropriately redacted) and
postgresql.conf for the replica, and complete logs for the time period
of interest. You'll want to upload the logs somewhere and link to them;
do not attach them to an email to the list.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WAL receive process dies

From:
Patrick Krecker
Date:
Hi Craig -- Sorry for the late response, I've been tied up with some other things for the last day. Just to give some context, this is a machine that sits in our office and replicates from another read slave in production via a tunnel set up with spiped. The spiped tunnel is working and postgres is still stuck (it has been stuck since 8-25).

The last moment replication was working was 2014-08-25 22:06:05.03972. We have a table called replication_time, with one column and one row holding a timestamp that is updated every second, so it's easy to tell the last time this machine was in sync with production.
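
The heartbeat itself is nothing special; roughly (the DDL and the per-second job are paraphrased here):

# once, on the master
psql -c "CREATE TABLE replication_time (modtime timestamp NOT NULL); INSERT INTO replication_time VALUES (now())"
# the upstart job then does, once per second
psql -c "UPDATE replication_time SET modtime = now()"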


Currently the WAL receive process is still not running. Interestingly, another pg instance running on the same machine is replicating just fine.

A note about that: there is another instance running on that machine, and there is a definite race condition with restore_wal_s3.py, which writes the file to /tmp before copying it to the destination requested by Postgres (I only discovered this today; this is not generally how we run our servers). So, if both instances are restoring at the same time, they will step on each other's WAL archives being unzipped in /tmp and bad things will happen. Interestingly, though, I checked the logs for the other instance and there is no activity on that day: it does not appear that WAL replay was invoked or that the WAL receiver timed out.

As for enabling the core dump, it seems that it needs to be done when Postgres starts, and I thought I would leave it running in its "stuck" state for now. However, if you know how to enable it on a running process, let me know. We are running Ubuntu 13.10.
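
(It looks like prlimit from util-linux 2.21+ can raise the limit on an already-running process, if our 13.10 box has it; the pid-file path here is just an example:)

PGPID=$(head -1 /var/lib/postgresql/9.3/main/postmaster.pid)
grep core /proc/$PGPID/limits                   # current limit of the running postmaster
sudo prlimit --pid $PGPID --core=unlimited      # children forked afterwards, e.g. a respawned walreceiver, inherit it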


On Wed, Aug 27, 2014 at 11:30 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 08/28/2014 09:39 AM, Patrick Krecker wrote:
> We have a periodic network connectivity issue (unrelated to Postgres)
> that is causing the replication to fail.
>
> We are running Postgres 9.3 using streaming replication. We also have
> WAL archives available to be replayed with restore_command. Typically
> when I bring up a slave it copies over WAL archives for a while before
> connecting via streaming replication.
>
> When I notice the machine is behind in replication, I also notice that
> the WAL receiver process has died. There didn't seem to be any
> information in the logs about it.

What did you search for?

Do you have core dumps enabled? That'd be a good first step. (Exactly
how to do this depends on the OS/distro/version, but you basically want
to set "ulimit -c unlimited" on some ancestor of the postmaster).

> 1. It seems that Postgres does not fall back to copying WAL archives
> with its restore_command. I just want to confirm that this is what
> Postgres is supposed to do when its connection via streaming replication
> times out.

It should fall back.

> 2. Is it possible to restart replication after the WAL receiver process
> has died without restarting Postgres?

PostgreSQL should do so itself.

Please show your recovery.conf (appropriately redacted) and
postgresql.conf for the replica, and complete logs for the time period
of interest. You'll want to upload the logs somewhere then link to them,
do not attach them to an email to the list.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WAL receive process dies

From:
Andres Freund
Date:
On 2014-08-29 13:04:43 -0700, Patrick Krecker wrote:
> Hi Craig -- Sorry for the late response, I've been tied up with some other
> things for the last day. Just to give some context, this is a machine that
> sits in our office and replicates from another read slave in production via
> a tunnel set up with spiped. The spiped tunnel is working and postgres is
> still stuck (it has been stuck since 8-25).
>
> The last moment that replication was working was  2014-08-25
> 22:06:05.03972. We have a table called replication_time with one column and
> one row that has a timestamp that is updated every second, so it's easy to
> tell the last time this machine was in sync with production.
>
> recovery.conf: http://pastie.org/private/dfmystgf0wxgtmahiita
> logs: http://pastie.org/private/qt1ixycayvdsxafrzj0l0q

The problem is this log entry:

2014-08-27 18:44:27 PDT ERROR:  requested starting point 175/28000000 is
ahead of the WAL flush position of this server 174/B76D16A8

That's the walreceiver connecting to the upstream node and asking for
WAL. Somehow the standby has gotten *considerably* ahead of the node
it's trying to receive WAL from.

Are you sure primary_conninfo / the spiped tunnel connects to the right
postgres instance? Did you possibly have a failover or something like that?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WAL receive process dies

From:
Patrick Krecker
Date:

On Fri, Aug 29, 2014 at 2:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-08-29 13:04:43 -0700, Patrick Krecker wrote:
> Hi Craig -- Sorry for the late response, I've been tied up with some other
> things for the last day. Just to give some context, this is a machine that
> sits in our office and replicates from another read slave in production via
> a tunnel set up with spiped. The spiped tunnel is working and postgres is
> still stuck (it has been stuck since 8-25).
>
> The last moment that replication was working was  2014-08-25
> 22:06:05.03972. We have a table called replication_time with one column and
> one row that has a timestamp that is updated every second, so it's easy to
> tell the last time this machine was in sync with production.
>
> recovery.conf: http://pastie.org/private/dfmystgf0wxgtmahiita
> logs: http://pastie.org/private/qt1ixycayvdsxafrzj0l0q

The problem is this log entry:

2014-08-27 18:44:27 PDT ERROR:  requested starting point 175/28000000 is
ahead of the WAL flush position of this server 174/B76D16A8

That's the walreceiver connecting to the upstream node and asking for
WAL. Somehow the standby has gotten *considerably* ahead of the node
it's trying to receive WAL from.

Are you sure primary_conninfo / the spiped tunnel connects to the right
postgres instance? Did you possibly have a failover or something like that?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

We didn't have a failover. We just have one master and replica in different AZs on Amazon EC2.

I ran the following on the local endpoint of spiped:

while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c "select modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location() from replication_time";  done;

And the same command on production and I was able to verify that the xlogs for a given point in time were the same (modtime is updated every second by an upstart job):

spiped from office -> production:
          modtime           | pg_last_xlog_receive_location | pg_last_xlog_replay_location
----------------------------+-------------------------------+------------------------------
 2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8

Ran directly on production replica:
          modtime           | pg_last_xlog_receive_location | pg_last_xlog_replay_location
----------------------------+-------------------------------+------------------------------
 2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8

To me, this is sufficient proof that spiped is indeed talking to the machine I think it's talking to (also lsof reports the correct hostname).
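
The lsof check amounted to something like this (port numbers as in our setup, where spiped listens locally on 5445):

sudo lsof -iTCP:5445 -sTCP:ESTABLISHED             # where the local spiped endpoint actually connects
sudo lsof -a -c postgres -iTCP -sTCP:ESTABLISHED   # which remote the walreceiver is attached to, when it is running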

I created another basebackup from the currently stuck postgres instance on another machine and I also get this error:

2014-08-29 15:27:30 PDT FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 177/2D000000 is ahead of the WAL flush position of this server 174/B76D16A8

However, this new instance is able to fetch logs from S3 and replay them without issue.

Is it possible that the data dir on the stuck instance is just corrupt? It is not impossible for this to have happened at some point in the past due to the race condition in fetching logs from S3 I mentioned above.
 

Re: WAL receive process dies

From:
Andres Freund
Date:
[FWIW: proper quoting makes answering easier and thus more likely]

On 2014-08-29 15:37:51 -0700, Patrick Krecker wrote:
> I ran the following on the local endpoint of spiped:
>
> while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c "select
> modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location()
> from replication_time";  done;
>
> And the same command on production and I was able to verify that the xlogs
> for a given point in time were the same (modtime is updated every second by
> an upstart job):
>
> spiped from office -> production:
>           modtime           | pg_last_xlog_receive_location |
> pg_last_xlog_replay_location
> ----------------------------+-------------------------------+------------------------------
>  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>
> Ran directly on production replica:
>           modtime           | pg_last_xlog_receive_location |
> pg_last_xlog_replay_location
> ----------------------------+-------------------------------+------------------------------
>  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>
> To me, this is sufficient proof that spiped is indeed talking to the
> machine I think it's talking to (also lsof reports the correct hostname).
>
> I created another basebackup from the currently stuck postgres instance on
> another machine and I also get this error:
>
> 2014-08-29 15:27:30 PDT FATAL:  could not receive data from WAL stream:
> ERROR:  requested starting point 177/2D000000 is ahead of the WAL flush
> position of this server 174/B76D16A8

Uh, this indicates that the machine you're talking to is *not* one of
the above, as it has a flush position of '174/B76D16A8' - not something
that's really possible when the node actually is at '177/2E80C9F8'.

Could you run, on the standby that's having problems, the following
command:
psql 'host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WAL receive process dies

From:
Patrick Krecker
Date:



On Fri, Aug 29, 2014 at 3:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
[FWIW: proper quoting makes answering easier and thus more likely]

On 2014-08-29 15:37:51 -0700, Patrick Krecker wrote:
> I ran the following on the local endpoint of spiped:
>
> while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c "select
> modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location()
> from replication_time";  done;
>
> And the same command on production and I was able to verify that the xlogs
> for a given point in time were the same (modtime is updated every second by
> an upstart job):
>
> spiped from office -> production:
>           modtime           | pg_last_xlog_receive_location |
> pg_last_xlog_replay_location
> ----------------------------+-------------------------------+------------------------------
>  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>
> Ran directly on production replica:
>           modtime           | pg_last_xlog_receive_location |
> pg_last_xlog_replay_location
> ----------------------------+-------------------------------+------------------------------
>  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>
> To me, this is sufficient proof that spiped is indeed talking to the
> machine I think it's talking to (also lsof reports the correct hostname).
>
> I created another basebackup from the currently stuck postgres instance on
> another machine and I also get this error:
>
> 2014-08-29 15:27:30 PDT FATAL:  could not receive data from WAL stream:
> ERROR:  requested starting point 177/2D000000 is ahead of the WAL flush
> position of this server 174/B76D16A8

Uh. this indicates that the machine you're talking to is *not* one of
the above as it has a flush position of '174/B76D16A8' - not something
that's really possible when the node actually is at '177/2E80C9F8'.

Could you run, on the standby that's having problems, the following
command:
psql 'host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

RE: quoting, I wonder if Gmail is messing it up somehow? Or am I doing something else wrong? Sorry :(

First, I apologize for the misleading information, but when I made another basebackup and tried to use it, I configured the machine to cascade from the stuck replica, *not* from the spiped endpoint. When I properly connected it to the spiped endpoint it synced up fine, giving this log line:

2014-08-29 16:16:21 PDT LOG:  started streaming WAL from primary at 177/4F000000 on timeline 1

The command as you gave it reported a syntax error, but I googled a bit and ran this one:

psql 'replication=1 dbname=XXX host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'

And it gave me this output:

      systemid       | timeline |   xlogpos
---------------------+----------+--------------
 5964163898407843711 |        1 | 177/53091990
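
For completeness, the same identifier can be read off a cluster locally, without a replication connection (binary and data-directory paths are examples):

/usr/lib/postgresql/9.3/bin/pg_controldata /var/lib/postgresql/9.3/main | grep 'system identifier'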

Re: WAL receive process dies

From:
Chris Hundt
Date:
(One of Patrick's coworkers here; thanks a lot for your assistance)

Just in case you wanted this as well, I ran

psql 'replication=1 dbname=XXX host=127.0.0.1 port=5432 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'

(5432 is the stuck replica and 5445 is the pipe to the working replica)

      systemid       | timeline |   xlogpos    
---------------------+----------+--------------
 5964163898407843711 |        1 | 174/B76D16A8


On Fri, Aug 29, 2014 at 7:26 PM, Patrick Krecker <patrick@judicata.com> wrote:



On Fri, Aug 29, 2014 at 3:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
[FWIW: proper quoting makes answering easier and thus more likely]

On 2014-08-29 15:37:51 -0700, Patrick Krecker wrote:
> I ran the following on the local endpoint of spiped:
>
> while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c "select
> modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location()
> from replication_time";  done;
>
> And the same command on production and I was able to verify that the xlogs
> for a given point in time were the same (modtime is updated every second by
> an upstart job):
>
> spiped from office -> production:
>           modtime           | pg_last_xlog_receive_location |
> pg_last_xlog_replay_location
> ----------------------------+-------------------------------+------------------------------
>  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>
> Ran directly on production replica:
>           modtime           | pg_last_xlog_receive_location |
> pg_last_xlog_replay_location
> ----------------------------+-------------------------------+------------------------------
>  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>
> To me, this is sufficient proof that spiped is indeed talking to the
> machine I think it's talking to (also lsof reports the correct hostname).
>
> I created another basebackup from the currently stuck postgres instance on
> another machine and I also get this error:
>
> 2014-08-29 15:27:30 PDT FATAL:  could not receive data from WAL stream:
> ERROR:  requested starting point 177/2D000000 is ahead of the WAL flush
> position of this server 174/B76D16A8

Uh. this indicates that the machine you're talking to is *not* one of
the above as it has a flush position of '174/B76D16A8' - not something
that's really possible when the node actually is at '177/2E80C9F8'.

Could you run, on the standby that's having problems, the following
command:
psql 'host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

RE: quoting, I wonder if Gmail is messing it up somehow? Or am I doing something else wrong? Sorry :(

First, I apologize for the misleading information, but when I made another basebackup and tried to use it, I configured the machine to cascade from the stuck replica, *not* from the spiped endpoint. When I properly connected it to the spiped endpoint it synced up fine, giving this log line:

2014-08-29 16:16:21 PDT LOG:  started streaming WAL from primary at 177/4F000000 on timeline 1

The command as you gave it reported a syntax error, but I googled a bit and ran this one:

psql 'replication=1 dbname=XXX host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'

And it gave me this output:

      systemid       | timeline |   xlogpos
---------------------+----------+--------------
 5964163898407843711 |        1 | 177/53091990

Re: WAL receive process dies

From:
Patrick Krecker
Date:



On Sat, Aug 30, 2014 at 7:04 AM, Chris Hundt <chris@judicata.com> wrote:
(One of Patrick's coworkers here; thanks a lot for your assistance)

Just in case you wanted this as well, I ran

psql 'replication=1 dbname=XXX host=127.0.0.1 port=5432 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'

(5432 is the stuck replica and 5445 is the pipe to the working replica)

      systemid       | timeline |   xlogpos    
---------------------+----------+--------------
 5964163898407843711 |        1 | 174/B76D16A8


On Fri, Aug 29, 2014 at 7:26 PM, Patrick Krecker <patrick@judicata.com> wrote:



On Fri, Aug 29, 2014 at 3:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
[FWIW: proper quoting makes answering easier and thus more likely]

On 2014-08-29 15:37:51 -0700, Patrick Krecker wrote:
> I ran the following on the local endpoint of spiped:
>
> while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c "select
> modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location()
> from replication_time";  done;
>
> And the same command on production and I was able to verify that the xlogs
> for a given point in time were the same (modtime is updated every second by
> an upstart job):
>
> spiped from office -> production:
>           modtime           | pg_last_xlog_receive_location |
> pg_last_xlog_replay_location
> ----------------------------+-------------------------------+------------------------------
>  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>
> Ran directly on production replica:
>           modtime           | pg_last_xlog_receive_location |
> pg_last_xlog_replay_location
> ----------------------------+-------------------------------+------------------------------
>  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>
> To me, this is sufficient proof that spiped is indeed talking to the
> machine I think it's talking to (also lsof reports the correct hostname).
>
> I created another basebackup from the currently stuck postgres instance on
> another machine and I also get this error:
>
> 2014-08-29 15:27:30 PDT FATAL:  could not receive data from WAL stream:
> ERROR:  requested starting point 177/2D000000 is ahead of the WAL flush
> position of this server 174/B76D16A8

Uh. this indicates that the machine you're talking to is *not* one of
the above as it has a flush position of '174/B76D16A8' - not something
that's really possible when the node actually is at '177/2E80C9F8'.

Could you run, on the standby that's having problems, the following
command:
psql 'host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

RE: quoting, I wonder if Gmail is messing it up somehow? Or am I doing something else wrong? Sorry :(

First, I apologize for the misleading information, but when I made another basebackup and tried to use it, I configured the machine to cascade from the stuck replica, *not* from the spiped endpoint. When I properly connected it to the spiped endpoint it synced up fine, giving this log line:

2014-08-29 16:16:21 PDT LOG:  started streaming WAL from primary at 177/4F000000 on timeline 1

The command as you gave it reported a syntax error, but I googled a bit and ran this one:

psql 'replication=1 dbname=XXX host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'

And it gave me this output:

      systemid       | timeline |   xlogpos
---------------------+----------+--------------
 5964163898407843711 |        1 | 177/53091990


I restarted the server and it promptly started fetching WAL archives from our backups [1]. I am still at a loss as to why the binary replication couldn't reconnect and why Postgres didn't fall back to fetching WAL archives with its restore_command.
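
(A crude cron check along these lines would at least surface the condition sooner next time; the thresholds, connection settings, and pgrep pattern here are only examples:)

#!/bin/sh
# alert if the walreceiver is gone or the heartbeat is more than 5 minutes old
pgrep -f 'wal receiver process' >/dev/null || echo "walreceiver not running"
LAG=$(psql -Atc "SELECT extract(epoch FROM now() - modtime)::int FROM replication_time")
[ -n "$LAG" ] && [ "$LAG" -gt 300 ] && echo "standby is ${LAG}s behind"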