Re: new heapcheck contrib module

From: Robert Haas
Subject: Re: new heapcheck contrib module
Msg-id: CA+TgmoYTDcf5MJrSBCSB6iLnGzh4pE7nCBBVBYGP-7D0CwzuHw@mail.gmail.com
In reply to: Re: new heapcheck contrib module  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: new heapcheck contrib module  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers

On Wed, May 13, 2020 at 5:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Do you recall seeing corruption resulting in segfaults in production?

I have seen that, I believe. I think it's more common to fail with
errors about not being able to palloc more than 1GB, not being able to
look up an xid or mxid, etc., but I am pretty sure I've seen multiple
cases involving seg faults, too. Unfortunately for my credibility, I
can't remember the details right now.

> I personally don't recall seeing that. If it happened, the segfaults
> themselves probably wouldn't be the main concern.

I don't really agree. Hypothetically speaking, suppose you corrupt
your only copy of a critical table in such a way that every time you
select from it, the system seg faults. A user in this situation might
ask questions like:

1. How did my table get corrupted?
2. Why do I only have one copy of it?
3. How do I retrieve the non-corrupted portion of my data from that
table and get back up and running?

In the grand scheme of things, #1 and #2 are the most important
questions, but when something like this actually happens, #3 tends to
be the most urgent question, and it's a lot harder to get the
uncorrupted data out if the system keeps crashing.

Also, a seg fault tends to lead customers to think that the database
has a bug, rather than that the database is corrupted.

Slightly off-topic here, but I think our error reporting in this area
is pretty lame. I've learned over the years that when a customer
reports that they get a complaint about a too-large memory allocation
every time they access a table, they've probably got a corrupted
varlena header. However, that's extremely non-obvious to a typical
user. We should try to report errors indicative of corruption in a way
that gives the user some clue that corruption has happened. Peter made
a stab at improving things there by adding
errcode(ERRCODE_DATA_CORRUPTED) in a bunch of places, but a lot of
users will never see the error code, only the message, and a lot of
corruption still produces errors that weren't changed by that
commit.
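
To illustrate why a damaged varlena header can surface as a complaint
about a too-large allocation, here is a minimal Python sketch. It
assumes the little-endian layout of an uncompressed 4-byte varlena
header (low two bits zero, total length in the upper 30 bits). The
function names decode_varlena_4b and is_plausible are hypothetical and
are not part of PostgreSQL. How such a length turns into an actual
allocation request depends on the code path; the sketch only shows the
decoding and a simple sanity check.

```python
import struct

# PostgreSQL's MaxAllocSize: 1 GB minus 1 byte.
MAX_ALLOC_SIZE = 0x3FFFFFFF


def decode_varlena_4b(header: bytes) -> int:
    """Return the total datum length stored in an uncompressed 4-byte
    varlena header (little-endian layout assumed)."""
    (word,) = struct.unpack("<I", header[:4])
    if word & 0x03 != 0:
        # Other bit patterns mean a short, compressed, or external datum.
        raise ValueError("not an uncompressed 4-byte varlena header")
    return word >> 2


def is_plausible(length: int, bytes_available: int) -> bool:
    """Hypothetical check: the datum must cover its own 4-byte header
    and must fit in the bytes left in the tuple."""
    return 4 <= length <= bytes_available


if __name__ == "__main__":
    good = struct.pack("<I", 12 << 2)   # header claiming a 12-byte datum
    bad = b"\xfc\xff\xff\xff"           # garbage with the low bits clear

    for raw in (good, bad):
        length = decode_varlena_4b(raw)
        # Pretend 100 bytes remain in the tuple.
        print(length, is_plausible(length, 100))
```

The garbage header decodes to a length just under 1GB, which is far
more than any heap page can hold, so a check against the bytes actually
available catches it before anything tries to allocate or copy that
much.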

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


