[bug fix] Cascaded standby cannot start after a clean shutdown

Поиск
Список
Период
Сортировка
От Tsunakawa, Takayuki
Тема [bug fix] Cascaded standby cannot start after a clean shutdown
Дата
Msg-id 0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
обсуждение исходный текст
Ответы Re: [bug fix] Cascaded standby cannot start after a clean shutdown
Список pgsql-hackers
Hello,

Our customer encountered a rare bug of PostgreSQL which prevents a cascaded standby from starting up.  The attached
patchis a fix for it.  I hope this will be back-patched.  I'll add this to the next CF.
 


PROBLEM
==============================

The PostgreSQL version is 9.5.  The cluster consists of a master, a cascading standby (SB1), and a cascaded standby
(SB2). The WAL flows like this: master -> SB1 -> SB2.
 

The user shut down SB2 and tried to restart it, but failed with the following message:

    FATAL:  invalid memory alloc request size 3075129344

The master and SB1 continued to run.


CAUSE
==============================

total_len in the code below was about 3GB, so palloc() rejected the memory allocation request.

[xlogreader.c]
    record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
    total_len = record->xl_tot_len;
...
    /*
     * Enlarge readRecordBuf as needed.
     */
    if (total_len > state->readRecordBufSize &&
        !allocate_recordbuf(state, total_len))
    {

Why was XLogRecord->xl_tot_len so big?  That's because the XLOG reader read the garbage portion in a WAL file, which is
immediatelyafter the end of valid WAL records.
 

Why was there the garbage?  Because the cascading standby sends just the valid WAL records, not the whole WAL blocks,
tothe cascaded standby.  When the cascaded standby receives those WAL records and write them in a recycled WAL file,
thelast WAL block in the file contains the mix of valid WAL records and the garbage left over.
 

Why did the XLOG reader try to allocate memory for a garbage?  Doesn't it stop reading the WAL?  As the following code
shows,when the WAL record header crosses a WAL block boundary, the XLOG reader first allocates memory for the whole
record,and then checks the validity of the record header as soon as it reads the entire header.
 

[xlogreader.c]
    /*
     * If the whole record header is on this page, validate it immediately.
     * Otherwise do just a basic sanity check on xl_tot_len, and validate the
     * rest of the header after reading it from the next page.  The xl_tot_len
     * check is necessary here to ensure that we enter the "Need to reassemble
     * record" code path below; otherwise we might fail to apply
     * ValidXLogRecordHeader at all.
     */
    if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
    {
        if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
                                   randAccess))
            goto err;
        gotheader = true;
    }
    else
    {
        /* XXX: more validation should be done here */
        if (total_len < SizeOfXLogRecord)
        {
            report_invalid_record(state,
                                  "invalid record length at %X/%X: wanted %u, got %u",
                                  (uint32) (RecPtr >> 32), (uint32) RecPtr,
                                  (uint32) SizeOfXLogRecord, total_len);
            goto err;
        }
        gotheader = false;
    }


FIX
==============================

One idea is to defer the memory allocation until the entire record header is read and ValidXLogRecordHeader() is
called. However, ValidXLogRecordHeader() might misjudge the validity if the garbage contains xl_prev seeming correct.
 

So, I modified walreceiver to zero-fill the last partial WAL block.  This is the guard that the master does in
AdvanceXLInsertBuffer()as below:
 

        /*
         * Be sure to re-zero the buffer so that bytes beyond what we've
         * written will look like zeroes and not valid XLOG records...
         */
        MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

FYI, the following unsolved problem may be solved, too.

https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com


Regards
Takayuki Tsunakawa


Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: TODO item: WAL replay of CREATE TABLESPACE with differingdirectory structure
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: CALL stmt, ERROR: unrecognized node type: 113 bug