Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

Поиск
Список
Период
Сортировка
От Jeff Janes
Тема Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
Дата
Msg-id CAMkU=1wuVt42q-coO40smhFGYQj2pnFycY+jNcdS99SsoLSrZg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)  (Andres Freund <andres@anarazel.de>)
Ответы Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
On Tue, May 10, 2016 at 2:00 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-10 09:19:16 -0700, Andres Freund wrote:
>> On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
>> > On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
>> > > The easy way to trigger this problem would be to have an oid wraparound
>> > > - but the WAL shows that that's not the case here.  I've not figured
>> > > that one out entirely (and won't tonight). But I do see WAL records
>> > > like:
>> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
>> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693
>> > > i.e. two NEXTOID records allocating the same range, which obviously
>> > > doesn't seem right.  There's also every now and then close by ranges:
>> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID
3311455
>> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:    7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID
3311461
>
>> > It seems to me that the real question
>> > here is how you're getting two calls to XLogPutNextOid() with the same
>> > value of ShmemVariableCache->nextOid, and the answer, as it seems to
>> > me, must be that LWLocks are broken.
>>
>> There likely were a bunch of crashes in between, Jeff's test suite
>> triggers them at a high rate. It seems a lot more likely than that an
>> lwlock bug only materializes in the oid counter.  Investigating.
>
> void
> CreateCheckPoint(int flags)
> {
> ...
>     /*
>      * An end-of-recovery checkpoint is really a shutdown checkpoint, just
>      * issued at a different time.
>      */
>     if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
>         shutdown = true;
>     else
>         shutdown = false;
> ...
>
>     LWLockAcquire(OidGenLock, LW_SHARED);
>     checkPoint.nextOid = ShmemVariableCache->nextOid;
>     if (!shutdown)
>         checkPoint.nextOid += ShmemVariableCache->oidCount;
>     LWLockRelease(OidGenLock);
> ...
>     recptr = XLogInsert(RM_XLOG_ID,
>                         shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
>                         XLOG_CHECKPOINT_ONLINE);
> ...
> }
>
> I think that's to blame here.  Looking at the relevant WAL record shows:
>
> pg_xlogdump -p /data/freund/jj -s 2/12004018 -e 2/1327EA28|grep -E 'CHECKPOINT|NEXTOID'
> rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
> rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 2/12023C38, prev 2/12023C00, desc:
CHECKPOINT_ONLINEredo 2/12000120; /* ... */ oid 4294501; /* ... */ online
 

By my understanding, this is the point at which the crash occurred.

> rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 2/1327A798, prev 2/1327A768, desc:
CHECKPOINT_SHUTDOWNredo 2/1327A798; /* ... */ oid 4294501; /* ... */ shutdown
 
> rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693
>
> (note that end-of-recovery checkpoints are logged as shutdown
> checkpoints, pretty annoying imo)
>
> So I think the issue is that the 2/12023C38 checkpoint *starts* before
> the first NEXTOID, and thus gets an earlier nextoid.


But isn't CreateCheckPoint called at the end of the checkpoint, not
the start of it?

I don't understand how it could be out-of-date at that point.  But
obviously it is.


Cheers,

Jeff



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Vladimir Gordiychuk
Дата:
Сообщение: Re: Stopping logical replication protocol
Следующее
От: Tom Lane
Дата:
Сообщение: Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)