Re: Logging corruption error codes

From Andrey Borodin
Subject Re: Logging corruption error codes
Date
Msg-id FB0BEAE7-F856-44D6-9130-C8EFD964D1D0@yandex-team.ru
In response to Re: Logging corruption error codes  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Responses Re: Logging corruption error codes  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-bugs

> On 22 July 2019, at 16:16, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 2019-06-20 11:57, Andrey Borodin wrote:
>> We are fine-tuning our data corruption monitoring and found out that many corruption cases do not report proper errorcode.
>> This makes automatic log analyzer way too smart program.
>> We think that corruption error codes should be given in cases when B-tree or TOAST do not know how to interpret data.
>> PFA patch with cases that we have found in logs and consider evidence of corruption.
>>
>> Best regards, Andrey Borodin.
>
> Should we use errmsg_internal() in the adjusted calls, so that the error
> messages are not picked up for translation?  I could go either way, but
> it's something that should be considered.

Thanks for looking into this.

From my POV, these messages provide meaningful information for coping with corruption, but they are definitely internal.
The translations already cover some messages about TOAST chunks, mention B-trees many times, and touch on many other
internal things, so I'm unsure what the status of these messages should be.
Such messages should be rare enough, and those to whom they are addressed should be familiar with English.

We've encountered a few more messages that potentially follow data corruption. In our test environment, we were
experimenting with a custom Linux kernel that had a page cache bug. The bug manifested itself as stale page versions
reappearing. This causes various kinds of data corruption, always undetected by data checksums (do we want a Merkle tree?).

Besides the messages in this patch, we also saw:
could not read block 1751 in file "base/16452/358336": Bad address  // Most likely not just data corruption, but a hardware fault
t_xmin is uncommitted in tuple to be updated // Probably on-disk corruption
failed to re-find parent key in index // Probably index corruption
left link changed unexpectedly in block // Probably on-disk data corruption
right sibling 45056 of block * is not next child * of block * in index // Definitely index corruption

Should I add corruption codes for these messages to the patch, or start a separate discussion about them?
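
To illustrate why the error code matters for monitoring, here is a minimal sketch of a log check. It is not our actual analyzer. It assumes that log_line_prefix ends with %e, so the SQLSTATE appears right before the severity; the function name classify and the sample lines in the usage note are hypothetical. XX001 (data_corrupted) and XX002 (index_corrupted) are the existing corruption SQLSTATEs, and XX000 (internal_error) is what a plain elog(ERROR) reports.

```python
import re

# SQLSTATE codes that PostgreSQL defines for corruption (class XX).
CORRUPTION_SQLSTATES = {
    "XX001": "data_corrupted",
    "XX002": "index_corrupted",
}

# Assumed line shape (log_line_prefix ending in "%e "), for example:
#   2019-07-22 16:16:00 UTC [1234] XX002 ERROR:  some message
LINE_RE = re.compile(
    r"\b(?P<sqlstate>[0-9A-Z]{5}) (?P<level>ERROR|FATAL|PANIC):\s+(?P<msg>.*)"
)


def classify(line):
    """Return the corruption kind for a log line, or None if it is not flagged."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    # No message-text matching: the SQLSTATE alone decides.
    return CORRUPTION_SQLSTATES.get(m.group("sqlstate"))
```

With a corruption code attached, a hypothetical line such as `[1234] XX002 ERROR:  failed to re-find parent key in index` is classified as index_corrupted. The same message logged with XX000 returns None, so the analyzer would have to match message text instead, which is the "too smart" part we want to avoid.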

Thanks!

Best regards, Andrey Borodin.

