figuring out a streaming replication failure

From: Scott Ribe
Subject: figuring out a streaming replication failure
Msg-id: 3BED8921-6122-47E2-B14A-7537364D1175@elevated-dev.com
List: pgsql-general
The standby log:

->  2010-11-14 17:40:16 MST - 887 -LOG:  database system was shut down in recovery at 2010-11-14 17:40:10 MST
->  2010-11-14 17:40:16 MST - 887 -LOG:  entering standby mode
->  2010-11-14 17:40:16 MST - 887 -LOG:  consistent recovery state reached at 3/3988FF8
->  2010-11-14 17:40:16 MST - 887 -LOG:  redo starts at 3/3988F68
->  2010-11-14 17:40:16 MST - 887 -LOG:  invalid record length at 3/3988FF8
->  2010-11-14 17:40:16 MST - 885 -LOG:  database system is ready to accept read only connections
->  2010-11-14 17:40:16 MST - 890 -LOG:  streaming replication successfully connected to primary
->  2010-11-15 02:24:26 MST - 890 -FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 000000010000000300000004 has already been removed

->  2010-11-15 02:24:26 MST - 887 -LOG:  unexpected pageaddr 2/B9BF2000 in log file 3, segment 4, offset 12525568
->  2010-11-15 02:24:27 MST - 2790 -LOG:  streaming replication successfully connected to primary
->  2010-11-15 02:24:27 MST - 2790 -FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 000000010000000300000004 has already been removed

->  2010-11-15 02:24:32 MST - 2791 -LOG:  streaming replication successfully connected to primary
->  2010-11-15 02:24:32 MST - 2791 -FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 000000010000000300000004 has already been removed

...

Now, the standby is geographically isolated from the master, so replication runs over an internet connection. It's not a shock that, with a large enough update and wal_keep_segments not large enough, the speed of the disk would outrun the speed of the network sufficiently for this to happen.
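For context, the retention window those settings buy is modest; a sketch of the relevant lines from the master's postgresql.conf as it stood (the arithmetic in the comment is mine):

    # postgresql.conf on the master
    wal_keep_segments   = 64   # 64 segments * 16 MB each = at most ~1 GB of WAL retained for standbys
    checkpoint_segments = 16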

But as far as I know there was almost no write activity at 2am, no active users at all, no batch processing. There is a pg_dumpall that kicks off at 2am, and these errors start about the same time that it finished. I also did the original sync and standby launch immediately after a mass update, before autovacuum had a chance to run, so at some point there would be a lot of tuples marked dead.
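If the dead-tuple theory matters, this is roughly how I'd check for the backlog; a sketch against the standard statistics view, with an arbitrary limit:

    -- tables with the most dead tuples awaiting vacuum
    SELECT relname, n_dead_tup, n_live_tup
      FROM pg_stat_user_tables
     ORDER BY n_dead_tup DESC
     LIMIT 10;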

wal_keep_segments was at 64, the first segment still around was 000000010000000300000010, and checkpoint_segments was 16. In the midst of the pg_dumpall, the master logs do show messages about checkpoints occurring too frequently. The 70-ish log segments still around show mod times right around 2:23, progressing a second or so each, whereas they were created over a much longer period going back to the day before.
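For the record, the mod times came from looking at the files directly; a superuser can pull the same thing from SQL (a sketch using the built-in pg_ls_dir and pg_stat_file functions; the regex just filters out non-segment entries like archive_status):

    -- WAL segments in pg_xlog, newest-modified first (run as superuser on the master)
    SELECT f AS segment,
           (pg_stat_file('pg_xlog/' || f)).modification AS mtime
      FROM pg_ls_dir('pg_xlog') f
     WHERE f ~ '^[0-9A-F]{24}$'
     ORDER BY mtime DESC;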

One question: what happened here? Why were log files created the day before being updated?

One suggestion: would it be possible to not delete WAL segments that are needed by a currently attached standby?
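In the meantime, the only safety net I know of is a WAL archive the standby can fall back to when streaming loses a segment; a sketch of that setup (the archive path is hypothetical):

    # postgresql.conf on the master -- keep a copy of every completed segment
    archive_mode    = on
    archive_command = 'cp %p /mnt/wal_archive/%f'      # illustrative path

    # recovery.conf on the standby -- fetch missed segments from the archive
    standby_mode     = 'on'
    primary_conninfo = 'host=master.example.com user=replication'
    restore_command  = 'cp /mnt/wal_archive/%f %p'     # illustrative path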

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice




