Re: Standby corruption after master is restarted

From: Tomas Vondra
Subject: Re: Standby corruption after master is restarted
Msg-id: ce06163c-58ed-5dda-ea5c-138c86b62132@2ndquadrant.com
In response to: Re: Standby corruption after master is restarted  (Emre Hasegeli <emre@hasegeli.com>)
Responses: Re: Standby corruption after master is restarted  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
           Re: Standby corruption after master is restarted  (Emre Hasegeli <emre@hasegeli.com>)
List: pgsql-bugs

Hi Emre,

On 03/28/2018 07:50 PM, Emre Hasegeli wrote:
> We experienced this issue again, this time on production.  The primary
> instance was in a loop of being killed by Linux OOM-killer and being
> restarted in 1 minute intervals.  The corruption only happened on one
> of the two standbys.  The primary and the other standby have no
> problems.  Only the primary was killed and restarted, the standbys were not.
> There weren't any unusual settings, "fsync" was not disabled.  Here is
> the information I collected.
> 

I've been trying to reproduce this by running a master with a couple of
replicas, and randomly restarting the master (while pgbench is running).
But so far no luck, so I guess something else is required to reproduce
the behavior ...
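
For reference, a test loop of this kind can be sketched roughly as below. This is only an illustration under assumptions, not the exact script used: the data directory path, database name, client count, pause range and the choice of "pg_ctl restart -m immediate" (which only approximates a kill by the OOM killer) are all hypothetical.

```python
import random
import subprocess
import time


def restart_command(pgdata, mode="immediate"):
    """Build a pg_ctl restart command; 'immediate' skips the shutdown checkpoint."""
    return ["pg_ctl", "restart", "-D", pgdata, "-m", mode, "-w"]


def restart_delays(count, low=30.0, high=120.0, seed=None):
    """Return `count` random pauses (in seconds) between master restarts."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(count)]


def run(pgdata, dbname, restarts=20):
    """Keep pgbench writing to the master and restart it at random intervals."""
    for delay in restart_delays(restarts):
        # pgbench clients abort when their connections drop,
        # so a fresh pgbench run is started for every cycle.
        bench = subprocess.Popen(["pgbench", "-c", "8", "-T", "3600", dbname])
        time.sleep(delay)
        subprocess.run(restart_command(pgdata), check=True)
        if bench.poll() is None:
            bench.terminate()
        bench.wait()


if __name__ == "__main__":
    # Hypothetical path and database name; adjust to the test cluster.
    run("/var/lib/postgresql/master", "bench")
```

The replicas are left running throughout, and their logs would be watched for the "invalid resource manager ID" and "invalid memory alloc request size" messages.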

> The logs at the time standby broke:
> 
>> 2018-03-28 14:00:30 UTC [3693-67] LOG:  invalid resource manager ID 39 at 1DFB/D43BE688
>> 2018-03-28 14:00:30 UTC [25347-1] LOG:  started streaming WAL from primary at 1DFB/D4000000 on timeline 5
>> 2018-03-28 14:00:59 UTC [3748-357177] LOG:  restartpoint starting: time
>> 2018-03-28 14:01:23 UTC [25347-2] FATAL:  could not receive data from WAL stream: SSL SYSCALL error: EOF detected
>> 2018-03-28 14:01:24 UTC [3693-68] FATAL:  invalid memory alloc request size 1916035072
> 
> And from the next try:
> 
>> 2018-03-28 14:02:15 UTC [26808-5] LOG:  consistent recovery state reached at 1DFB/D6BDDFF8
>> 2018-03-28 14:02:15 UTC [26808-6] FATAL:  invalid memory alloc request size 191603507
> 

In the initial report (from August 2017) you shared pg_xlogdump output,
showing that the corrupted WAL record is an FPI_FOR_HINT right after
CHECKPOINT_SHUTDOWN. Was it the same case this time?
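
To check that, the LSN from the log has to be mapped to a WAL segment file first. A minimal sketch of that arithmetic, assuming the default 16MB segment size and the timeline reported in the log (the function names are made up for illustration):

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default; assumed unchanged at build time


def parse_lsn(lsn):
    """Convert an LSN such as '1DFB/D43BE688' into a 64-bit byte position."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    return (high << 32) | low


def wal_segment_name(lsn, timeline, seg_size=WAL_SEGMENT_SIZE):
    """Return the name of the WAL segment file that contains the given LSN."""
    segno = parse_lsn(lsn) // seg_size
    segs_per_id = 0x100000000 // seg_size  # segments per 4GB "xlog id"
    return "%08X%08X%08X" % (timeline, segno // segs_per_id, segno % segs_per_id)


def segment_offset(lsn, seg_size=WAL_SEGMENT_SIZE):
    """Return the byte offset of the LSN inside its segment file."""
    return parse_lsn(lsn) % seg_size


if __name__ == "__main__":
    print(wal_segment_name("1DFB/D43BE688", 5))     # 0000000500001DFB000000D4
    print(hex(segment_offset("1DFB/D43BE688")))     # 0x3be688
```

That file is the one to hand to pg_xlogdump (for example with its --start option set to the LSN), on both the master and the broken standby, so the records around the invalid one can be compared.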

BTW which versions are we talking about? I see the initial report
mentioned catversion 201608131, this one mentions 201510051, so I'm
guessing 9.6 and 9.5. Which minor versions?
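
The guess is just a lookup on the "Catalog version number" line of pg_controldata output, roughly as in the sketch below. The table contains only the two values mentioned in this thread, and the helper names are hypothetical.

```python
import re

# Only the two catversions from this thread; other releases are omitted.
CATVERSION_TO_MAJOR = {
    201608131: "9.6",
    201510051: "9.5",
}


def catversion_from_controldata(text):
    """Extract the catalog version number from pg_controldata output."""
    match = re.search(r"^Catalog version number:\s+(\d+)\s*$", text, re.MULTILINE)
    return int(match.group(1)) if match else None


def major_version(catversion):
    """Map a catversion to a major release, or 'unknown' if not in the table."""
    return CATVERSION_TO_MAJOR.get(catversion, "unknown")


if __name__ == "__main__":
    sample = "Catalog version number:               201608131\n"
    print(major_version(catversion_from_controldata(sample)))  # 9.6
```

The catversion does not change within a major release, so it cannot tell the minor version; that has to come from "SELECT version()" or "postgres --version".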

Is the master under load (accepting writes) before shutdown?

How was it restarted, actually? I see you're mentioning OOM killer, so I
guess "kill -9". What about the first report - was it the same case, or
was it restarted "nicely" using pg_ctl?

Could the replica receive the WAL in some other way - say, from a WAL
archive? What archive/restore commands do you use?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

