[HACKERS] Crash on promotion when recovery.conf is renamed

From	Magnus Hagander
Subject	[HACKERS] Crash on promotion when recovery.conf is renamed
Date	15 December 2016, 14:44:15
Msg-id	CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
Replies	Re: [HACKERS] Crash on promotion when recovery.conf is renamed (Heikki Linnakangas <hlinnaka@iki.fi>)
List	pgsql-hackers

I had a system where the recovery.conf file was renamed "out of the way" at some point, and then the system was promoted. This is obviously operator error, but it seems like something we should handle.

What happens now is that the nonexistence of recovery.conf is a FATAL error. I wonder if it should just be a WARNING, at least in the case of ENOENT?

What happens is this.

Log output:

2016-12-15 09:36:46.265 CET [25437] LOG: received promote request
2016-12-15 09:36:46.265 CET [25438] FATAL: terminating walreceiver process due to administrator command
mha@mha-laptop:~/postgresql/inst/head$ 2016-12-15 09:36:46.265 CET [25437] LOG: invalid record length at 0/5015168: wanted 24, got 0
2016-12-15 09:36:46.265 CET [25437] LOG: redo done at 0/5015130
2016-12-15 09:36:46.265 CET [25437] LOG: last completed transaction was at log time 2016-12-15 09:36:19.27125+01
2016-12-15 09:36:46.276 CET [25437] LOG: selected new timeline ID: 2
2016-12-15 09:36:46.429 CET [25437] FATAL: could not open file "recovery.conf": No such file or directory
2016-12-15 09:36:46.429 CET [25436] LOG: startup process (PID 25437) exited with exit code 1
2016-12-15 09:36:46.429 CET [25436] LOG: terminating any other active server processes
2016-12-15 09:36:46.429 CET [25456] WARNING: terminating connection because of crash of another server process
2016-12-15 09:36:46.429 CET [25456] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2016-12-15 09:36:46.429 CET [25456] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2016-12-15 09:36:46.431 CET [25436] LOG: database system is shut down

So we can see it switches to timeline 2. Looking in pg_wal (or pg_xlog -- customer system was on 9.5, but this is reproducible in HEAD):

-rw------- 1 mha mha 16777216 Dec 15 09:36 000000010000000000000004
-rw------- 1 mha mha 16777216 Dec 15 09:36 000000010000000000000005
-rw------- 1 mha mha 16777216 Dec 15 09:36 000000020000000000000005
-rw------- 1 mha mha 41 Dec 15 09:36 00000002.history
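For readers less familiar with the naming scheme, here is a rough sketch of how the file names in this listing relate to the LSNs in the log. It assumes the default 16 MB segment size (which matches the 16777216-byte files in the listing) and the usual layout of 8 hex digits of timeline ID followed by 16 hex digits of segment position. The helper names are made up for illustration and are not PostgreSQL functions.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def parse_lsn(text):
    """Convert an LSN written as 'high/low' hex, e.g. '0/5015168', to an int."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline, lsn, segment_size=WAL_SEGMENT_SIZE):
    """Return the 24-hex-digit name of the segment holding `lsn` on `timeline`."""
    segments_per_xlog_id = 0x100000000 // segment_size
    segment_number = parse_lsn(lsn) // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_xlog_id,
        segment_number % segments_per_xlog_id,
    )


def timeline_of(file_name):
    """Read the timeline ID from a segment or .history file name."""
    return int(file_name[:8], 16)
```

With these, the end-of-WAL position 0/5015168 from the log maps to segment ...05 on either timeline, which is why both 000000010000000000000005 and 000000020000000000000005 show up once timeline 2 has been selected.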

However, according to pg_controldata, we are still on timeline 1:

Latest checkpoint location: 0/4000060
Prior checkpoint location: 0/4000060
Latest checkpoint's REDO location: 0/4000028
Latest checkpoint's REDO WAL file: 000000010000000000000004
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Minimum recovery ending location: 0/5015168
Min recovery ending loc's timeline: 1

But since we have a history file for timeline 2 in the data directory (and neatly archived), the data directory is inconsistent with what pg_controldata reports. This means, for example, that any other standbys you try to connect to this cluster will simply fail, because they try to join on timeline 2, which doesn't actually exist.
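One way to spot this state from the outside might be to compare the control file's timeline with the newest history file present. The sketch below does that on plain text input; the function names are hypothetical, and the check is only a heuristic, since a history file for a newer timeline can presumably also be present for legitimate reasons (for instance on a standby that fetched it from the archive but has not switched yet).

```python
import re


def control_timeline(controldata_output):
    """Extract "Latest checkpoint's TimeLineID" from pg_controldata text."""
    match = re.search(
        r"^Latest checkpoint's TimeLineID:\s*(\d+)\s*$",
        controldata_output,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("TimeLineID line not found")
    return int(match.group(1))


def newest_history_timeline(file_names):
    """Return the highest timeline with a <TLI>.history file, or None."""
    timelines = [
        int(name[:8], 16)
        for name in file_names
        if re.fullmatch(r"[0-9A-F]{8}\.history", name)
    ]
    return max(timelines, default=None)


def history_ahead_of_control(controldata_output, file_names):
    """True when a history file names a newer timeline than the control file."""
    newest = newest_history_timeline(file_names)
    return newest is not None and newest > control_timeline(controldata_output)
```

In practice the two inputs would come from running pg_controldata and listing pg_wal (or pg_xlog); that plumbing is left out here.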

I wonder if there might be more corner cases like this, but in this particular one it seems easy enough to just say that failing to rename recovery.conf because it didn't exist is safe.

If the rename of recovery.conf fails for some other reason, for example a permissions error, we don't want to ignore it. But we also really don't want to end up with this kind of inconsistent data directory, IMO. I don't know that code well enough to suggest how to fix it, though; hoping for input from someone who knows it better?
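To make the ENOENT-versus-everything-else distinction concrete, here is a minimal sketch in Python rather than the server's C. It assumes the end-of-recovery step is a rename of recovery.conf to recovery.done; the function name and the logger are invented for the example, and this is not a proposed patch. It also says nothing about the harder question of avoiding the half-switched timeline when the rename fails for other reasons.

```python
import errno
import logging
import os

log = logging.getLogger("promotion_sketch")  # hypothetical logger name


def retire_recovery_conf(data_dir):
    """Rename recovery.conf to recovery.done.

    Returns True if the rename happened, False if recovery.conf was already
    missing (treated as a warning). Any other failure, such as a permissions
    error, is re-raised so the caller cannot silently ignore it.
    """
    source = os.path.join(data_dir, "recovery.conf")
    target = os.path.join(data_dir, "recovery.done")
    try:
        os.rename(source, target)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            log.warning("recovery.conf not found in %s, nothing to rename", data_dir)
            return False
        raise  # EACCES, EPERM, EIO, ...: still an error
    return True
```

The point of the shape is only that the missing-file case is downgraded while every other errno still propagates.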

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
