Re: emergency outage requiring database restart

From Merlin Moncure
Subject Re: emergency outage requiring database restart
Msg-id CAHyXU0zCezq3Zq63GEvDYebW6j8tXoKM4mk54d3jSrQDzyDMNA@mail.gmail.com
In reply to Re: emergency outage requiring database restart  (Merlin Moncure <mmoncure@gmail.com>)
Replies Re: emergency outage requiring database restart  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Merlin Moncure wrote:
>>
>>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
>>> castaging-#   SELECT ...
>>> ERROR:  42809: "pg_cast_oid_index" is an index
>>> LINE 11:   FROM ApartmentSample s
>>>                 ^
>>> LOCATION:  heap_openrv_extended, heapam.c:1304
>>>
>>> should I be restoring from backups?
>>
>> It's pretty clear to me that you've got catalog corruption here.  You
>> can try to fix things manually as they emerge, but that sounds like a
>> fool's errand.
>
> Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
> to be to the system catalogs with at least two critical tables dropped
> or inaccessible in some fashion.  A lot of the OIDs seemed to be
> pointing at the wrong thing.  Couple more datapoints here.
>
> *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
> *) Another database on the same cluster was not impacted.  However
> it's more olap style and may not have been written to during the
> outage
>
> Now, this infrastructure running this system is running maybe 100ish
> postgres clusters and maybe 1000ish sql server instances with
> approximately zero unexplained data corruption issues in the 5 years
> I've been here.  Having said that, this definitely smells and feels
> like something on the infrastructure side.  I'll follow up if I have
> any useful info.

After a thorough investigation I now have credible evidence that the
damage did not originate in the database itself.  Specifically, this
database lives on the same volume as the operating system (I know, I
know), and something unrelated to the database consumed disk space very
rapidly and exhausted the volume -- fast enough that sar didn't pick it
up.  Oh well :-) -- thanks for the help.

merlin
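
A note on the "fast enough that sar didn't pick it up" point: sar is commonly scheduled to sample at intervals of minutes, so a volume that fills in seconds can slip between samples. What follows is only a minimal sketch of a higher-frequency free-space poll, not anything used in the thread. The names `free_fraction` and `watch_volume`, the 10% threshold, and the one-second interval are all illustrative assumptions.

```python
import shutil
import time
from typing import Callable, List, Tuple


def free_fraction(path: str) -> float:
    """Return the fraction of the volume holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def watch_volume(
    path: str,
    threshold: float = 0.10,   # assumed alert level: under 10% free
    interval: float = 1.0,     # assumed poll period in seconds
    samples: int = 5,
    probe: Callable[[str], float] = free_fraction,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, float, bool]]:
    """Poll free space `samples` times.

    Returns a list of (sample_index, free_fraction, below_threshold).
    `probe` and `sleep` are injectable so the loop can be exercised
    without touching a real disk or waiting in real time.
    """
    readings = []
    for i in range(samples):
        frac = probe(path)
        readings.append((i, frac, frac < threshold))
        if i < samples - 1:
            sleep(interval)
    return readings


if __name__ == "__main__":
    # Hypothetical usage: watch the root volume for a few seconds.
    for index, frac, alert in watch_volume("/", samples=3):
        flag = "LOW" if alert else "ok"
        print(f"sample {index}: {frac:.1%} free [{flag}]")
```

A poll like this only narrows the window; it does not say which process consumed the space, and it does nothing about the underlying issue of the database sharing a volume with the operating system.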


