Re: Add some more corruption error codes to relcache

From	Kirk Wolak
Subject	Re: Add some more corruption error codes to relcache
Date	June 27, 2023, 06:32:52
Msg-id	CACLU5mQtNWiK9E-paTM=7BSZKuEBpsK17JfVwYigS2kz3P5fJA@mail.gmail.com
In reply to	Add some more corruption error codes to relcache ("Andrey M. Borodin" <x4mmm@yandex-team.ru>)
List	pgsql-hackers

On Fri, Jun 16, 2023 at 9:18 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

Hi hackers,

From time to time, relcache errors reveal catalog corruption. For example, I recently observed the following:
1. The filesystem or NVMe disk zeroed out the leading 160Kb of a catalog index. This type of corruption passes data_checksums undetected.
2. RelationBuildTupleDesc() was failing with "catalog is missing 1 attribute(s) for relid 2662".
3. We monitor corruption error codes and alert on-call DBAs when we see one, but the message is not marked as XX001 or XX002. It is XX000, which also occurs from time to time for reasons less critical than data corruption.
4. High-availability automation switched the primary to another host, and the other monitoring checks did not fire either.

This particular case is not very illustrative: in fact, we had index corruption that looked like catalog corruption.
Still, it seems to me that catalog inconsistencies (such as relnatts != number of pg_attribute rows) could be marked with ERRCODE_DATA_CORRUPTED.
In my experience, this error code has proved to be a good indicator for early corruption detection.

What do you think?
What other subsystems can be improved in the same manner?

Best regards, Andrey Borodin.
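
The monitoring described in point 3 of the quoted message could look roughly like the sketch below. It is an illustration only, not the setup Andrey actually runs: the function name `classify_log_line`, the three result labels, and the assumption that the SQLSTATE appears as a standalone token in each log line (as it would with `%e` in `log_line_prefix`) are all hypothetical. The only facts taken from PostgreSQL are the codes themselves: XX000 is internal_error, XX001 is data_corrupted, and XX002 is index_corrupted.

```python
import re

# SQLSTATE codes in PostgreSQL's class XX that indicate corruption.
CORRUPTION_SQLSTATES = {
    "XX001": "data_corrupted",
    "XX002": "index_corrupted",
}
# Generic internal error; raised for many reasons unrelated to corruption.
GENERIC_INTERNAL_ERROR = "XX000"

# Assumption: the SQLSTATE shows up as a standalone five-character token.
SQLSTATE_TOKEN = re.compile(r"\b(XX[0-9A-Z]{3})\b")


def classify_log_line(line: str) -> str:
    """Return 'page' for corruption codes, 'review' for XX000, else 'ignore'."""
    match = SQLSTATE_TOKEN.search(line)
    if match is None:
        return "ignore"
    code = match.group(1)
    if code in CORRUPTION_SQLSTATES:
        return "page"    # wake up the on-call DBA
    if code == GENERIC_INTERNAL_ERROR:
        return "review"  # too noisy to page on, which is the gap described above
    return "ignore"
```

With a classifier like this, the "catalog is missing 1 attribute(s)" message lands in the low-priority bucket because it carries XX000. Reporting it with ERRCODE_DATA_CORRUPTED instead would move it into the paging bucket without any change on the monitoring side.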

Andrey, I think this is a good idea. But your item #1 sounds familiar. There was a thread about someone creating and dropping lots of databases who found some kind of race condition that would zero out pg_ catalog entries, just as you describe. I think the symptom he found was that relations could not be found and/or the DB would not start. I just spent 30 minutes looking for it, but my "search-fu" is apparently failing.

Which leads me to ask whether there is a way to detect the corrupting write (writing all zeroes to the file when we know better? a zeroed-out header where one cannot exist?). Hoping this triggers a bright idea on your end...

Kirk...
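
One way to make the question above concrete is an offline scan for all-zero blocks, sketched below. This is not an existing PostgreSQL facility and it does not catch the write as it happens; it only finds the result afterwards. The function name `find_zeroed_blocks` is hypothetical, and the 8192-byte block size is PostgreSQL's default, assumed here rather than read from the cluster.

```python
from typing import List

BLOCK_SIZE = 8192  # assumption: the cluster uses PostgreSQL's default block size


def find_zeroed_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the indexes of complete blocks that consist entirely of zero bytes."""
    zero_block = bytes(block_size)
    zeroed = []
    # Walk complete blocks only; a trailing partial block is ignored.
    for offset in range(0, len(data) - block_size + 1, block_size):
        if data[offset:offset + block_size] == zero_block:
            zeroed.append(offset // block_size)
    return zeroed
```

A zeroed block is not proof of corruption on its own, since a freshly extended page can legitimately be all zeros, so the result is a list of candidates rather than a verdict. The "header where one cannot exist" case is the stronger signal: in a B-tree index, block 0 holds the metapage, so block 0 appearing in the result for such an index should never be expected.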
