Bugs in TOAST handling, OID assignment and redo recovery

From Pavan Deolasee
Subject Bugs in TOAST handling, OID assignment and redo recovery
Date
Msg-id CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-=h4OLosXHf9w@mail.gmail.com
Responses Re: Bugs in TOAST handling, OID assignment and redo recovery  (Bernd Helmle <mailings@oopsware.de>)
Re: Bugs in TOAST handling, OID assignment and redo recovery  (Heikki Linnakangas <hlinnaka@iki.fi>)
Re: Bugs in TOAST handling, OID assignment and redo recovery  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
Hello,

One of our 2ndQuadrant support customers recently reported a sudden rush of TOAST errors after a crash recovery, nearly causing an outage. Most errors read like this:

ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn

While we could bring the cluster back to normal quickly using some workarounds, I investigated the issue in more detail and identified two long-standing bugs, one in TOAST handling and one in redo recovery.

Immediately after the crash recovery, the server started issuing OIDs that conflicted with the OIDs issued just prior to the crash. The XLOG_NEXTOID and buffer-LSN interlock is supposed to guard against that, but a bug in redo recovery breaks that protection. We shall see that in a minute.

When duplicate OIDs are issued, either because of a bug in redo recovery or because of OID counter wraparound, GetNewOidWithIndex() is supposed to protect us by first scanning the toast table to check whether a tuple with the candidate chunk_id (toast value) already exists. If so, a new OID is requested, and this repeats until a non-conflicting OID is found.
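
For reference, the retry loop looks roughly like this (a simplified sketch of GetNewOidWithIndex() in src/backend/catalog/catalog.c, not a verbatim copy):

    Oid
    GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn)
    {
        Oid         newOid;
        SnapshotData SnapshotDirty;
        SysScanDesc scan;
        ScanKeyData key;
        bool        collides;

        InitDirtySnapshot(SnapshotDirty);

        do
        {
            /* ask the OID generator for the next candidate */
            newOid = GetNewObjectId();

            ScanKeyInit(&key, oidcolumn,
                        BTEqualStrategyNumber, F_OIDEQ,
                        ObjectIdGetDatum(newOid));

            /*
             * Probe the unique index with a dirty snapshot.  Live and
             * in-progress tuples count as collisions, but a tuple recently
             * deleted by a committed transaction does not -- which is
             * exactly the hole described below.
             */
            scan = systable_beginscan(relation, indexId, true,
                                      &SnapshotDirty, 1, &key);
            collides = HeapTupleIsValid(systable_getnext(scan));
            systable_endscan(scan);
        } while (collides);

        return newOid;
    }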

This mostly works, except in one corner case. GetNewOidWithIndex() uses SnapshotDirty to look for a conflicting OID, but this fails if the existing toast tuple was recently deleted by another, committed transaction. If OldestXmin has not advanced enough, such tuples are RECENTLY_DEAD and fail the SnapshotDirty qualifications, so toast_save_datum() happily goes ahead and inserts a conflicting toast tuple. The UNIQUE index on <chunk_id, chunk_seq> does not complain either, because it too sees the other tuple as RECENTLY_DEAD. At this point we have two toast tuples with the same <chunk_id, chunk_seq>: one of them RECENTLY_DEAD and the other LIVE.

Later, when toast_fetch_datum() and friends scan the toast table for the chunk_id (toast value), they use SnapshotToast to retrieve the tuples. This special snapshot does no visibility checks on the toast tuple, assuming that since the main heap tuple is visible, the toast tuple(s) must be visible too. Unfortunately this snapshot also sees the RECENTLY_DEAD toast tuple and thus returns duplicate toast tuples, leading to errors such as the one above. The same issue can also lead to other kinds of errors in tuptoaster.c.

ISTM that the right fix is to teach toast_save_datum() to check for the existence of a duplicate chunk_id by scanning the toast table with the same SnapshotToast that is later used to fetch the tuples. We already do that for the toast-rewrite case, but not for regular inserts. I propose to do it for the regular path too, as done in the attached patch.
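
To illustrate the idea (this is a sketch of the approach, not the attached patch; the helper name and the snapshot setup are mine, modelled on the existing rewrite-path check):

    /*
     * Hypothetical helper: does any toast chunk with this valueid exist, as
     * seen by the same kind of toast snapshot that toast_fetch_datum() uses?
     * Unlike the SnapshotDirty probe in GetNewOidWithIndex(), this also
     * counts RECENTLY_DEAD chunks as conflicts.
     */
    static bool
    toast_valueid_visible(Relation toastrel, Relation toastidx, Oid valueid)
    {
        ScanKeyData key;
        SysScanDesc scan;
        SnapshotData snap;
        bool        collides;

        /* assumption: set up SnapshotToast the way toast_fetch_datum() does */
        init_toast_snapshot(&snap);

        ScanKeyInit(&key,
                    (AttrNumber) 1,     /* chunk_id column */
                    BTEqualStrategyNumber, F_OIDEQ,
                    ObjectIdGetDatum(valueid));

        scan = systable_beginscan(toastrel, RelationGetRelid(toastidx), true,
                                  &snap, 1, &key);
        collides = HeapTupleIsValid(systable_getnext(scan));
        systable_endscan(scan);

        return collides;
    }

    /* ... and in toast_save_datum(), for the regular (non-rewrite) path: */
    do
    {
        toast_pointer.va_valueid =
            GetNewOidWithIndex(toastrel,
                               RelationGetRelid(toastidx),
                               (AttrNumber) 1);
    } while (toast_valueid_visible(toastrel, toastidx,
                                   toast_pointer.va_valueid));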

Duplicate OIDs can also be generated when the OID counter wraps around, and that alone can cause this problem:

- OID counter wraps around
- An old toast tuple is deleted and the deleting transaction commits
- OldestXmin is held back by an open transaction
- The same OID is then inserted again into the toast table
- The toasted value is fetched before the table is vacuumed or the index pointer is killed, so duplicates are returned

This is probably quite a corner case, but not impossible, given that we already have reports of such problems. The problem gets accentuated after a crash recovery or on a hot standby, though, because of another, somewhat unrelated bug, which is what manifested in our case.

An interesting case to consider is the following sequence of events:

1. An ONLINE checkpoint starts. We note down the nextOid counter, correctly accounting for the cached values too. CreateCheckPoint() does this:

    LWLockAcquire(OidGenLock, LW_SHARED);
    checkPoint.nextOid = ShmemVariableCache->nextOid;
    if (!shutdown)
        checkPoint.nextOid += ShmemVariableCache->oidCount;
    LWLockRelease(OidGenLock);
 
2. The cached OIDs are then consumed, and a new set of OIDs is allocated, generating an XLOG_NEXTOID WAL record.

3. More OIDs are consumed from the newly allocated range. Those OIDs make it to disk only after following the buffer-LSN interlock, so the XLOG_NEXTOID record gets flushed to disk as well (see the sketch just below).
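
   (For context, the interlock is just the WAL-before-data rule; roughly this happens in the buffer-flush path, with the variable names here being illustrative:)

    /*
     * When a dirty data page is written out, the page's LSN points at (or
     * after) the WAL record of the last change to it, and that record was
     * written after the XLOG_NEXTOID record that allocated the OIDs it
     * contains.  Flushing WAL up to the page LSN therefore forces
     * XLOG_NEXTOID to disk before the page itself.
     */
    XLogRecPtr  lsn = PageGetLSN(page);

    XLogFlush(lsn);     /* WAL up to the page's LSN must reach disk first */
    /* only then is the data page itself handed to the storage layer */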

4. The checkpoint finishes and writes an ONLINE checkpoint record with the nextOid value saved in step 1.

5. CRASH

Redo recovery starts at an LSN written before step 1. It initialises ShmemVariableCache->nextOid with the value stored in the checkpoint record and starts replaying WAL records. When it sees the XLOG_NEXTOID record generated at step 2, it correctly advances the OID counter to cover the new range.
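
(The XLOG_NEXTOID redo path does roughly this, simplified from xlog_redo(), so at that point the counter is correct:)

        Oid     nextOid;

        memcpy(&nextOid, XLogRecGetData(record), sizeof(Oid));
        LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
        ShmemVariableCache->nextOid = nextOid;
        ShmemVariableCache->oidCount = 0;
        LWLockRelease(OidGenLock);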

But while applying the ONLINE checkpoint record written at step 4, we do this:

        /* ... but still treat OID counter as exact */
        LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
        ShmemVariableCache->nextOid = checkPoint.nextOid;
        ShmemVariableCache->oidCount = 0;
        LWLockRelease(OidGenLock);

So effectively we overwrite the nextOid counter and drag it backwards again. If we don't see another XLOG_NEXTOID record before crash recovery ends, the server starts up with a stale value and generates duplicate OIDs immediately after the restart.

I don't think this is an unusual case, and I'm surprised we didn't notice it before. Most likely the other protections in the system against wrapped-around OIDs masked the bug. The TOAST table, though, escapes those protections in the specific case discussed above, which brought the bug out into the open.

I think the right fix is to simply ignore the nextOid counter while replaying the ONLINE checkpoint record. We must already have initialised ShmemVariableCache->nextOid to the value stored in this (or some earlier) checkpoint record when redo recovery started. As and when we see XLOG_NEXTOID records, we maintain the shared-memory counter correctly. If we don't see any XLOG_NEXTOID record, the value we started with must be the correct value. I see no problem even when the OID counter wraps around during redo recovery; XLOG_NEXTOID should record that correctly.
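
A minimal sketch of what that could look like at the replay site quoted above (this is the idea, not necessarily the attached patch verbatim):

        /*
         * Proposed: no longer "treat OID counter as exact" when replaying an
         * ONLINE checkpoint record, i.e. drop the two assignments
         *
         *     ShmemVariableCache->nextOid = checkPoint.nextOid;
         *     ShmemVariableCache->oidCount = 0;
         *
         * The counter was initialised from the checkpoint redo started from,
         * and every XLOG_NEXTOID record replayed since then has kept it up to
         * date, so copying checkPoint.nextOid here can only drag it backwards
         * past OIDs already handed out (and WAL-logged) before the crash.
         */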

Curiously, neither REINDEX nor VACUUM will fix the errors until OldestXmin advances enough that the RECENTLY_DEAD tuples are ignored by REINDEX or removed by VACUUM. Once OldestXmin has advanced, a REINDEX or VACUUM should clear these transient errors; until then, once manifested, the errors persist.

We tested this at least as far back as 9.1, and the bugs exist there too. In fact, these bugs have probably existed forever, though I did not check very old releases.

Attached are a simple reproducer and a proposed fix to address both bugs. We should consider back-patching it to all supported releases.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments
