Re: Completely broken replica after PANIC: WAL contains references to invalid pages
From | Andres Freund |
---|---|
Subject | Re: Completely broken replica after PANIC: WAL contains references to invalid pages |
Date | |
Msg-id | 20130330172144.GI28736@alap2.anarazel.de |
In reply to | Re: Completely broken replica after PANIC: WAL contains references to invalid pages (Sergey Konoplev <gray.ru@gmail.com>) |
Responses | Re: Completely broken replica after PANIC: WAL contains references to invalid pages (Simon Riggs <simon@2ndQuadrant.com>) |
List | pgsql-bugs |
On 2013-03-29 14:53:26 -0700, Sergey Konoplev wrote:
> On Fri, Mar 29, 2013 at 2:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I have to admit, I find it a bit confusing that so many people report a
> > bug and then immediately destroy all evidence of the bug. Just seems to
> > happen a bit too frequently.
>
> You see, businesses usually need it up ASAP again. Sorry, I should have
> noted down the output of pg_controldata straight after it got broken, I
> just did not get around to it.

But the business will also need the standby working correctly in case of
a critical incident on the primary. So it should have quite an interest
in fixing bugs in that area. Yes, I realize that's not always easy to
do :(.

> > That's not a pg_controldata output from the broken replica though, or is
> > it? I guess it's from a new standby?
>
> That was the output from the standby that was rsync-ed on top of the
> broken one. I thought you might find something useful in it.

> Can I test your guess some other way? And what was the guess?

Don't think you can easily test it. And after reading more code I am
pretty sure my original guess was bogus. As was my second. And third ;)

But I think I see what could be going on:

During HS we maintain pg_subtrans so we can deal with more than
PGPROC_MAX_CACHED_SUBXIDS in one TX. For that we need to regularly
extend subtrans so the pages are initialized when we set up the
topxid<->subxid mapping in ProcArrayApplyXidAssignment(). The call to
ExtendSUBTRANS happens in RecordKnownAssignedTransactionIds(), which is
called from several places, including ProcArrayApplyXidAssignment().
The logic it uses is:

    if (TransactionIdFollows(xid, latestObservedXid))
    {
        TransactionId next_expected_xid;

        /*
         * Extend clog and subtrans like we do in GetNewTransactionId() during
         * normal operation using individual extend steps. Typical case
         * requires almost no activity.
         */
        next_expected_xid = latestObservedXid;
        TransactionIdAdvance(next_expected_xid);
        while (TransactionIdPrecedesOrEquals(next_expected_xid, xid))
        {
            ExtendCLOG(next_expected_xid);
            ExtendSUBTRANS(next_expected_xid);

            TransactionIdAdvance(next_expected_xid);
        }

So if the xid is later than latestObservedXid we extend subtrans one by
one. So far so good. But we initialize latestObservedXid in
ProcArrayApplyRecoveryInfo() when consistency is initially reached:

    latestObservedXid = running->nextXid;
    TransactionIdRetreat(latestObservedXid);

Before that, subtrans has initially been started up with:

    if (wasShutdown)
        oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
    else
        oldestActiveXID = checkPoint.oldestActiveXid;
    ...
    StartupSUBTRANS(oldestActiveXID);

That means it's only initialized up to checkPoint.oldestActiveXid. As it
can take some time until we reach consistency, it seems rather plausible
that there will now be a gap in initialized pages: from
checkPoint.oldestActiveXid to running->nextXid, if there are pages in
between.

Does that explanation sound about right to anybody else?

I'll provide a patch for the issue in a while, for now I'll try to
reproduce it.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
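
To make the suspected gap concrete, the sequence described above can be sketched as a small toy model. This is not PostgreSQL code: every `toy_` name is hypothetical, the page size of 2048 xids assumes the default 8 kB block with 4-byte xids, xid wraparound is ignored, and the startup step follows the description in this mail (pages set up only as far as oldestActiveXid) rather than the actual StartupSUBTRANS() source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy model only; none of these names exist in PostgreSQL.
 * Assumes 2048 xids per subtrans page and ignores xid wraparound.
 */
#define TOY_XIDS_PER_PAGE 2048u
#define TOY_MAX_PAGES     64u

typedef uint32_t ToyXid;

typedef struct
{
    bool   page_initialized[TOY_MAX_PAGES];
    ToyXid latest_observed_xid;
} ToySubtrans;

unsigned
toy_xid_to_page(ToyXid xid)
{
    return xid / TOY_XIDS_PER_PAGE;
}

/* Startup as described in the mail: pages exist only up to oldestActiveXid. */
void
toy_startup(ToySubtrans *s, ToyXid oldest_active_xid)
{
    assert(toy_xid_to_page(oldest_active_xid) < TOY_MAX_PAGES);
    memset(s, 0, sizeof(*s));
    for (unsigned p = 0; p <= toy_xid_to_page(oldest_active_xid); p++)
        s->page_initialized[p] = true;
}

/* Consistency reached: latestObservedXid = running->nextXid - 1. */
void
toy_reach_consistency(ToySubtrans *s, ToyXid running_next_xid)
{
    s->latest_observed_xid = running_next_xid - 1;
}

/* One-by-one extension; a page is set up when its first xid is observed. */
void
toy_record_known_assigned(ToySubtrans *s, ToyXid xid)
{
    assert(toy_xid_to_page(xid) < TOY_MAX_PAGES);
    while (s->latest_observed_xid < xid)
    {
        s->latest_observed_xid++;
        if (s->latest_observed_xid % TOY_XIDS_PER_PAGE == 0)
            s->page_initialized[toy_xid_to_page(s->latest_observed_xid)] = true;
    }
}

/* Run the whole scenario; count pages up to xid's page never initialized. */
unsigned
toy_gap_pages(ToyXid oldest_active_xid, ToyXid running_next_xid, ToyXid xid)
{
    ToySubtrans s;
    unsigned    missing = 0;

    toy_startup(&s, oldest_active_xid);
    toy_reach_consistency(&s, running_next_xid);
    toy_record_known_assigned(&s, xid);

    for (unsigned p = 0; p <= toy_xid_to_page(xid); p++)
        if (!s.page_initialized[p])
            missing++;
    return missing;
}
```

Under these assumptions, `toy_gap_pages(5000, 12000, 20000)` counts three pages (3, 4 and 5) that neither step set up, including the page that holds xid 12000 itself. `toy_gap_pages(5000, 5100, 20000)` counts none, because nextXid is still on the page that startup covered and the one-by-one extension picks up every later page boundary.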