Re: New WAL code dumps core trivially on replay of bad data

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: New WAL code dumps core trivially on replay of bad data
Дата
Msg-id 5031DBA7.6010309@enterprisedb.com
обсуждение исходный текст
Ответ на Re: New WAL code dumps core trivially on replay of bad data  (Amit kapila <amit.kapila@huawei.com>)
Ответы Re: New WAL code dumps core trivially on replay of bad data
Список pgsql-hackers
On 18.08.2012 08:52, Amit kapila wrote:
> Tom Lane Sent: Saturday, August 18, 2012 7:16 AM
>
>> so it merrily tries to compute a checksum on a gigabyte worth of data,
>> and soon falls off the end of memory.
>
>> In reality, inspection of the WAL file suggests that this is the end of
>> valid data and what should have happened is that replay just stopped.
>> The xl_len and so forth shown above are just garbage from off the end of
>> what was actually read from the file (everything beyond offset 0xcebff8
>> in file 4 is in fact zeroes).
>
>> I'm not sure whether this is just a matter of having failed to
>> sanity-check that xl_tot_len is at least SizeOfXLogRecord, or whether
>> there is a deeper problem with the new design of continuation records
>> that makes it impossible to validate records safely.
>
> Earlier there was a check related to total length in ReadRecord, before it calls RecordIsValid()
>       if (record->xl_tot_len<  SizeOfXLogRecord + record->xl_len ||
>                 record->xl_tot_len>  SizeOfXLogRecord + record->xl_len +
>                           XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
>
> I think that missing check of total length has caused this problem. However now this check will be different.

That check still exists, in ValidXLogRecordHeader(). However, we now 
allocate the buffer for the whole record before that check, based on 
xl_tot_len, if the record header is split across pages. The theory in 
allocating the buffer is that a bogus xl_tot_len field will cause the 
malloc() to fail, returning NULL, and we treat that the same as a broken 
header. However, with memory overcommit, what happens is that the 
malloc() succeeds, but the process is killed when it actually tries to 
use all that memory.

I think we need to delay the allocation of the record buffer. We need to 
read and validate the whole record header first, like we did before, 
before we trust xl_tot_len enough to call malloc() with it. I'll take a 
shot at doing that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Craig Ringer
Дата:
Сообщение: Re: [PATCH] Docs: Make notes on sequences and rollback more obvious
Следующее
От: Rushabh Lathia
Дата:
Сообщение: Primary Key Constraint on inheritance table not getting route to child tables