[BUGS] BUG #14538: streaming replication same wal missing

Поиск

Список

Период

Сортировка

От	vodevsh@gmail.com
Тема	[BUGS] BUG #14538: streaming replication same wal missing
Дата	9 февраля 2017 г. 16:58:08
Msg-id	20170209135808.1405.57354@wrigleys.postgresql.org обсуждение
Ответы	Re: [BUGS] BUG #14538: streaming replication same wal missing
Список	pgsql-bugs

Дерево обсуждения

The following bug has been logged on the website:

Bug reference:      14538
Logged by:          Vladimir Svedov
Email address:      vodevsh@gmail.com
PostgreSQL version: 9.3.11
Operating system:   Linux
Description:

Hi,
I posted formatted text as a question
here:http://dba.stackexchange.com/questions/163735/streaming-replication-same-wal-missing

Several days ago I saw this in master log:

 replica 2017-02-06 16:10:55 UTC ERROR:  requested WAL segment
000000030000096D00000052 has already been removed
This message was repeating until I stopped slave.

Checking pg_stat_activity on prod showed autovacuum task on table ~250GB, I
decided that it produced huge amount of WALs. Checking them confirmed my
assumption - all wal_keep_segments=500 were last minute. So I thought that
network bandwidth was not sufficient to send WALs fast enough to slave to
replay them. Master has archive_command = cd ., so no way back.

I did not panic, I recreate replication. I:

delete all data files on slave
pg_start_backup() on prod
rsync -zahv data
pg_start slave
pg_stop_backup() on prod and check logs.
Imagine my desperation to see that slave has this:

-bash-4.2$ head -n 20 /pg/data93/pg_log/postgresql-2017-02-09_121514.log
LOG:  could not open usermap file "/pg/data93/pg_ident.conf": No such file
or directory
LOG:  database system was shut down in recovery at 2017-02-09 12:15:07 UTC
LOG:  entering standby mode
LOG:  redo starts at 982/39074640
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  consistent recovery state reached at 982/3AEC60F8
LOG:  unexpected pageaddr 97C/83EC8000 in log segment
00000003000009820000003A, offset 15499264
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 982/3A000000 on timeline 3
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
and so on... 3 days passed! That segment has gone, but why is it looking for
it??? I copied whole data directory.

All pages are readable (pg_dump works on all dbs), changes from master go to
slave. ERROR keeps appearing each several seconds on slave.

Please, tell me I did something stupid and it is not a bug. It happened on
prod.


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

В списке pgsql-bugs по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

[BUGS] BUG #14538: streaming replication same wal missing