Re: ERROR: unexpected chunk number 0 (expected 1) for toast value76753264 in pg_toast_10920100

Поиск
Список
Период
Сортировка
От Pavan Deolasee
Тема Re: ERROR: unexpected chunk number 0 (expected 1) for toast value76753264 in pg_toast_10920100
Дата
Msg-id CABOikdPJUv5EyxRZnJY2oA7EKi=bZdoryXNmUGK3dFAQ08tZdw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: ERROR: unexpected chunk number 0 (expected 1) for toast value76753264 in pg_toast_10920100  (Pavan Deolasee <pavan.deolasee@gmail.com>)
Список pgsql-general


On Fri, Apr 6, 2018 at 8:55 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:


On Fri, Apr 6, 2018 at 2:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
adsj@novozymes.com (Adam Sjøgren) writes:
>> [... still waiting for the result, I will return with what it said
>> when the server does ...]

> It did eventually finish, with the same result:

Huh.  So what we have here, apparently, is that regular MVCC snapshots
think there is exactly one copy of the 1698936148/0 row, but TOAST fetches
think there is more than one.  This is darn odd, not least because we
never do UPDATEs in toast tables, only inserts and deletes, so there
certainly shouldn't be update chains there.

It seems like you've got some corner case wherein SnapshotToast sees a row
that isn't visible according to MVCC --- probably a row left over from
some previous cycle of life.  That is, I'm imagining the OID counter
wrapped around and we've reused a toast OID, but for some reason there's
still a row in the table with that OID.  I'm not sure offhand how we could
get into such a state.  Alvaro, does this ring any bells (remembering that
this is 9.3)?

FWIW one of our support customers reported a very similar TOAST table corruption issue last week which nearly caused an outage. After a lot of analysis, I think I've now fully understood the reasons behind the corruption, the underlying bug(s) and possible remedy. I am currently working on writing a reproducible test case to demonstrate the problem and writing the fix. More details on that soon.


I've posted a reproducer and a proposed fix to -hackers [1]

In the particular case that I investigated, a database crash/recovery was involved. But I think we should be able to create a scenario where OID wrap-around or a standby promotion triggers the problem.  I don't know if any of that was involved in the cases reported on this thread, but I've a strong suspicion that the underlying bug is probably the same.

Thanks,
Pavan


--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

В списке pgsql-general по дате отправления:

Предыдущее
От: hmidi slim
Дата:
Сообщение: Using enum instead of join tables
Следующее
От: Alexandre Arruda
Дата:
Сообщение: Re: ERROR: found multixact from before relminmxid