Обсуждение: Slave promotion problem...

Поиск
Список
Период
Сортировка

Slave promotion problem...

От
marin@kset.org
Дата:
Last week we had some problems on the master server which caused a
failover on the slave (the master was completely unresponsive due to
reasons still unknown). The slave received the promote signal (pg_ctl
promote) and logged that in the logs:
2015-08-28 23:05:10 UTC [6]: [50-1] user=,db= LOG:  received promote
request
2015-08-28 23:05:10 UTC [467]: [2-1] user=,db= FATAL:  terminating
walreceiver process due to administrator command

5 hours later the slave still didn't promote. Meanwhile we fixed the
master and restarted it. The slave was restarted and it behaved just
like the promote signal didn't arrive, connecting to the master as a
regular slave.

Because of maintenance we had to issue a failover a few days after, and
this time the failover was successful:
2015-08-30 19:40:08 UTC [312]: [2-1] user=,db= LOG:  replication
terminated by primary server
2015-08-30 19:40:08 UTC [312]: [3-1] user=,db= DETAIL:  End of WAL
reached on timeline 3 at 1AC/4D000090.
2015-08-30 19:40:08 UTC [312]: [4-1] user=,db= FATAL:  could not send
end-of-streaming message to primary: no COPY in progress
2015-08-30 19:40:08 UTC [6]: [34-1] user=,db= LOG:  invalid record
length at 1AC/4D000090
2015-08-30 19:40:10 UTC [6]: [35-1] user=,db= LOG:  received promote
request
2015-08-30 19:40:13 UTC [6]: [36-1] user=,db= LOG:  redo done at
1AC/4D000028
2015-08-30 19:40:13 UTC [6]: [37-1] user=,db= LOG:  last completed
transaction was at log time 2015-08-30 19:40:07.18114+00
2015-08-30 19:40:14 UTC [6]: [38-1] user=,db= LOG:  selected new
timeline ID: 4
2015-08-30 19:40:14 UTC [6]: [39-1] user=,db= LOG:  restored log file
"00000003.history" from archive
2015-08-30 19:40:14 UTC [6]: [40-1] user=,db= LOG:  archive recovery
complete
2015-08-30 19:40:14 UTC [6]: [41-1] user=,db= LOG:  MultiXact member
wraparound protections are now enabled
2015-08-30 19:40:14 UTC [29303]: [1-1] user=,db= LOG:  autovacuum
launcher started
2015-08-30 19:40:14 UTC [1]: [4-1] user=,db= LOG:  database system is
ready to accept connections

I am unsure if this promotion failure is a bug/glitch, but the promote
procedure is automated and tested a couple of hundred times so I am
certain we initiated the promote correctly.
Looking in the internet I haven't found anything similar. Does anybody
know any reason why the slave didn't promote after receiving the promote
signal? Looking at the data it seems like the slave aborted the promote
process.

Both instances are 9.4.4 connected with streaming replication.

Regards,
Mladen Marinović


Re: Slave promotion problem...

От
Martín Marqués
Дата:
El 31/08/15 a las 03:29, marin@kset.org escribió:
> Last week we had some problems on the master server which caused a
> failover on the slave (the master was completely unresponsive due to
> reasons still unknown). The slave received the promote signal (pg_ctl
> promote) and logged that in the logs:
> 2015-08-28 23:05:10 UTC [6]: [50-1] user=,db= LOG:  received promote
> request
> 2015-08-28 23:05:10 UTC [467]: [2-1] user=,db= FATAL:  terminating
> walreceiver process due to administrator command
>
> 5 hours later the slave still didn't promote. Meanwhile we fixed the
> master and restarted it. The slave was restarted and it behaved just
> like the promote signal didn't arrive, connecting to the master as a
> regular slave.

Aren't there any further logs after the walreceiver termination?

Up to here everything looks fine, but we have no idea on what was logged
afterwards.

> I am unsure if this promotion failure is a bug/glitch, but the promote
> procedure is automated and tested a couple of hundred times so I am
> certain we initiated the promote correctly.

Are you using homemade scripts? Maybe you need to test them more
thoroughly, with different environment parameters.

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Slave promotion problem...

От
marin@kset.org
Дата:
On 2015-08-31 14:38, Martín Marqués wrote:
> El 31/08/15 a las 03:29, marin@kset.org escribió:
>> Last week we had some problems on the master server which caused a
>> failover on the slave (the master was completely unresponsive due to
>> reasons still unknown). The slave received the promote signal (pg_ctl
>> promote) and logged that in the logs:
>> 2015-08-28 23:05:10 UTC [6]: [50-1] user=,db= LOG:  received promote
>> request
>> 2015-08-28 23:05:10 UTC [467]: [2-1] user=,db= FATAL:  terminating
>> walreceiver process due to administrator command
>>
>> 5 hours later the slave still didn't promote. Meanwhile we fixed the
>> master and restarted it. The slave was restarted and it behaved just
>> like the promote signal didn't arrive, connecting to the master as a
>> regular slave.
>
> Aren't there any further logs after the walreceiver termination?
> Up to here everything looks fine, but we have no idea on what was
> logged
> afterwards.
There are logs (quite a few, cca. 5 hours of it), every second something
like this:
2015-08-28 23:05:12 UTC [79867]: [1-1] user=[unknown],db=[unknown] LOG:
connection received: host=[local]
2015-08-28 23:05:12 UTC [79867]: [2-1] user=postgres,db=postgres LOG:
connection authorized: user=postgres database=postgres
This logs the connection of the process that probes the server is alive.

I was expecting to see something like:
redo done at xxxxx
last completed transaction was at log time xxxxxxx

But those lines didn't appear after 5 hours. As I understand, these are
written before the server uses the restore_command to get WAL and
history files from the archive.

>
>> I am unsure if this promotion failure is a bug/glitch, but the promote
>> procedure is automated and tested a couple of hundred times so I am
>> certain we initiated the promote correctly.
>
> Are you using homemade scripts? Maybe you need to test them more
> thoroughly, with different environment parameters.

We use a custom script for the restore_command, but is seems that it was
not invoked.

Regards,
Mladen Marinović



Re: Slave promotion problem...

От
marin@kset.org
Дата:
On 2015-08-31 16:05, marin@kset.org wrote:
> On 2015-08-31 14:38, Martín Marqués wrote:
>> El 31/08/15 a las 03:29, marin@kset.org escribió:
>>> Last week we had some problems on the master server which caused a
>>> failover on the slave (the master was completely unresponsive due to
>>> reasons still unknown). The slave received the promote signal (pg_ctl
>>> promote) and logged that in the logs:
>>> 2015-08-28 23:05:10 UTC [6]: [50-1] user=,db= LOG:  received promote
>>> request
>>> 2015-08-28 23:05:10 UTC [467]: [2-1] user=,db= FATAL:  terminating
>>> walreceiver process due to administrator command
>>>
>>> 5 hours later the slave still didn't promote. Meanwhile we fixed the
>>> master and restarted it. The slave was restarted and it behaved just
>>> like the promote signal didn't arrive, connecting to the master as a
>>> regular slave.
>>
>> Aren't there any further logs after the walreceiver termination?
>> Up to here everything looks fine, but we have no idea on what was
>> logged
>> afterwards.
> There are logs (quite a few, cca. 5 hours of it), every second
> something like this:
> 2015-08-28 23:05:12 UTC [79867]: [1-1] user=[unknown],db=[unknown]
> LOG:  connection received: host=[local]
> 2015-08-28 23:05:12 UTC [79867]: [2-1] user=postgres,db=postgres LOG:
> connection authorized: user=postgres database=postgres
> This logs the connection of the process that probes the server is
> alive.
>
> I was expecting to see something like:
> redo done at xxxxx
> last completed transaction was at log time xxxxxxx
>
> But those lines didn't appear after 5 hours. As I understand, these
> are written before the server uses the restore_command to get WAL and
> history files from the archive.

After careful testing of two scenarios:
1. Master is alive (or at least the slave thinks so), slave is signaled
to promote
2. Master is dead, slave is signaled to promote.
We have come to a conclusion that in the scenario 1. the slave requires
an additional WAL file from the archive before the "redo done at xxxxxx"
message. Knowing that we established that the problem was in the
restore_command script (wrapper around several rsync commands) which
hanged indefinitely.


>>
>>> I am unsure if this promotion failure is a bug/glitch, but the
>>> promote
>>> procedure is automated and tested a couple of hundred times so I am
>>> certain we initiated the promote correctly.
>>
>> Are you using homemade scripts? Maybe you need to test them more
>> thoroughly, with different environment parameters.
>
> We use a custom script for the restore_command, but is seems that it
> was not invoked.
>
> Regards,
> Mladen Marinović