Re: [HACKERS] emergency outage requiring database restart
From: Merlin Moncure
Subject: Re: [HACKERS] emergency outage requiring database restart
Date:
Msg-id: CAHyXU0zk7KpHARJ+ErqnxD+6-kBnnyYb8dnUEpHESwKjGWvd=Q@mail.gmail.com
In reply to: Re: [HACKERS] emergency outage requiring database restart (Ants Aasma <ants.aasma@eesti.ee>)
List: pgsql-hackers
On Thu, Aug 10, 2017 at 12:01 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>> Still getting checksum failures. Over the last 30 days, I see the
>>>> following. Since enabling checksums, FWICT none of the damage is
>>>> permanent and rolls back with the transaction. So creepy!
>>>
>>> The checksums still only differ in the least significant digits, which
>>> pretty much means that there is a block number mismatch. So if you
>>> rule out the filesystem not doing its job correctly and transposing
>>> blocks, it could be something else that results in blocks getting
>>> read from a location that happens to differ by a small multiple of
>>> the page size. Maybe somebody is racily mucking with table fds between
>>> seeking and reading. That would explain the issue disappearing after a
>>> retry.
>>>
>>> Maybe you can arrange for the RelFileNode and block number to be
>>> logged for the checksum failures and check what the actual checksums
>>> are in the data files surrounding the failed page. If the requested
>>> block number contains something else entirely, but the page that
>>> follows contains the expected checksum value, then it would support
>>> this theory.
>>
>> Will do. The main challenge is getting a hand-compiled server to swap
>> in so that libdir continues to work. Getting access to the server is
>> difficult, as is getting a maintenance window. I'll post back ASAP.
>
> As a new data point, we just had a customer with an issue that I think
> might be related. The issue was reasonably repeatable by running a
> report on the standby system. It manifested itself first as "could
> not open relation" and/or "column is not in index" errors, followed a
> few minutes later by a PANIC from the startup process due to "specified
> item offset is too large", "invalid max offset number", or "page X of
> relation base/16384/1259 is uninitialized". I took a look at the xlog
> dump and it was completely fine. For instance, in the "specified item
> offset is too large" case there was an INSERT_LEAF redo record
> inserting the preceding offset just a couple hundred kilobytes back.
> Restarting the server sometimes successfully applied the offending
> WAL; sometimes it failed with other corruption errors. The offending
> relations were always pg_class or pg_class_oid_index. Replacing plsh
> functions with dummy plpgsql functions made the problem go away;
> reintroducing plsh functions made it reappear.

Fantastic. I was never able to attempt to apply the O_CLOEXEC patch (see
upthread) because access to the system is highly limited and compiling a
replacement binary was a bit of a headache. IIRC this was the best theory
on the table as to the underlying cause, and we ought to try that first,
right?

Reminder: I was able to completely eliminate all damage (but had to
handle the occasional unexpected rollback) by enabling checksums.

merlin
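P.S. For anyone following along: the "differs only in the least
significant digits" pattern is exactly what the block number mixing in
the page checksum predicts. Paraphrasing src/include/storage/checksum_impl.h
from memory (a sketch with a made-up function name, not the exact
source), the tail end of pg_checksum_page() does roughly this:

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /* Sketch: after the checksum of the page contents is computed,
     * the block number is XORed in so that a page that comes back
     * from the wrong file offset fails verification. */
    uint16_t
    checksum_with_blkno(uint32_t page_checksum, BlockNumber blkno)
    {
        uint32_t checksum = page_checksum ^ blkno;

        /* Reduce to 16 bits with an offset of one to avoid zero. */
        return (uint16_t) ((checksum % 65535) + 1);
    }

When a read lands a few blocks away from where it should, only the low
bits of blkno differ before the XOR, so the expected and computed
checksums usually come out numerically close -- the pattern reported
above.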
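P.P.S. On the theory the O_CLOEXEC patch targets: IIRC the backend reads
data blocks with separate lseek() and read() calls, fork()ed children
share the file offset of any inherited fd, and without FD_CLOEXEC the fd
also survives exec(), so a shell spawned by plsh can end up holding it.
A hypothetical illustration of both halves (read_block() and
open_relation() are made-up names, not PostgreSQL functions):

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Vulnerable pattern: if another process sharing the open file
     * description moves the offset between the lseek() and the read(),
     * we silently get some other block -- off by a small multiple of
     * BLCKSZ, and a retry succeeds, matching the symptoms above. */
    ssize_t
    read_block(int fd, uint32_t blkno, char *buf)
    {
        if (lseek(fd, (off_t) blkno * BLCKSZ, SEEK_SET) < 0)
            return -1;
        return read(fd, buf, BLCKSZ);
    }

    /* The patch's idea: open with O_CLOEXEC so exec()'d children
     * (such as a plsh shell) never see the fd at all. */
    int
    open_relation(const char *path)
    {
        return open(path, O_RDWR | O_CLOEXEC);
    }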