Re: SR fails to send existing WAL file after off-line copy

From Robert Haas
Subject Re: SR fails to send existing WAL file after off-line copy
Date
Msg-id AANLkTin1=k=OYrCfeMqrJuXa_+0312SWuoEqaF1adiDp@mail.gmail.com
In response to SR fails to send existing WAL file after off-line copy  (Greg Smith <greg@2ndquadrant.com>)
Responses Re: SR fails to send existing WAL file after off-line copy  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
On Sun, Oct 31, 2010 at 5:31 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Which is confusing because that file is certainly on the master still, and
> hasn't even been considered archived yet much less removed:
>
> [master@pyramid pg_log]$ ls -l $PGDATA/pg_xlog
> -rw------- 1 master master 16777216 Oct 31 16:29 000000010000000000000000
> drwx------ 2 master master     4096 Oct  4 12:28 archive_status
> [master@pyramid pg_log]$ ls -l $PGDATA/pg_xlog/archive_status/
> total 0
>
> So why isn't SR handing that data over?  Is there some weird unhandled
> corner case this exposes, but that wasn't encountered by the systems the
> tutorial was tried out on?  I'm not familiar enough with the SR internals to
> reason out what's going wrong myself yet.  Wanted to validate that Matt's
> report wasn't a unique one though, with a bit more detail included about the
> state the system gets into, and one potential fix (increasing
> wal_keep_segments) already tried without improvement.

There seem to be two cases in the code that can generate that error.
One, attempting to open the file returns ENOENT.  Two, after the data
has been read, the last-removed position returned by
XLogGetLastRemoved is at or beyond the data we think we just read,
implying that it was overwritten while we were in the process of
reading it.
Does your installation have debugging symbols?  Can you figure out
which case is triggering (inside XLogRead) and what the values of log,
seg, lastRemovedLog, and lastRemovedSeg are?
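
To make the two cases concrete, here is a rough, self-contained model
of the decision. It is not the PostgreSQL source: every name in it
(WalReadOutcome, SegmentId, classify_wal_read) is made up for
illustration, and it assumes the second check fires when the segment
just read is at or before the last-removed segment in (log, seg)
order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only; none of these names come from PostgreSQL. */
typedef enum {
    WAL_READ_OK,
    WAL_ERR_OPEN_ENOENT,        /* case one: opening the file gave ENOENT */
    WAL_ERR_REMOVED_AFTER_READ  /* case two: post-read removal check failed */
} WalReadOutcome;

typedef struct {
    uint32_t log;   /* stands in for log / lastRemovedLog */
    uint32_t seg;   /* stands in for seg / lastRemovedSeg */
} SegmentId;

/* True when a is at or before b in (log, seg) order (assumed comparison). */
static bool segment_at_or_before(SegmentId a, SegmentId b)
{
    return a.log < b.log || (a.log == b.log && a.seg <= b.seg);
}

/* Report which of the two cases, if either, would raise the error. */
static WalReadOutcome classify_wal_read(bool open_found_file,
                                        SegmentId just_read,
                                        SegmentId last_removed)
{
    if (!open_found_file)
        return WAL_ERR_OPEN_ENOENT;
    if (segment_at_or_before(just_read, last_removed))
        return WAL_ERR_REMOVED_AFTER_READ;
    return WAL_READ_OK;
}
```

Plugging the four values observed in the debugger into
classify_wal_read shows which branch would have produced the message,
under the assumptions above.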

Even if you lack debugging symbols, if you have gdb, you might be able
to figure out which case is triggering by looking at whether
XLogGetLastRemoved gets called before the error message is printed
(put a breakpoint on that function).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

