Re: bug, bad memory, or bad disk?

From: Merlin Moncure
Subject: Re: bug, bad memory, or bad disk?
Msg-id: CAHyXU0zjiWxZBVm_4ANod08WUXuX9qFO7=+tp2TCDsdQzrGQzA@mail.gmail.com
In response to: bug, bad memory, or bad disk?  (Ben Chobot <bench@silentmedia.com>)
Responses: Re: bug, bad memory, or bad disk?  (Amit Kapila <amit.kapila@huawei.com>)
List: pgsql-general

On Fri, Feb 15, 2013 at 8:08 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Friday, February 15, 2013 1:33 AM Ben Chobot wrote:
>
>> 2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1] ERROR:  invalid memory alloc request size 1968078400
>> 2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1] ERROR:  invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
>> 2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1] ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1] ERROR:  invalid memory alloc request size 1968078400
>> 2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1] ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1] ERROR:  invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
>> 2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1] ERROR:  invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945
>
>> There didn't seem to be much correlation to which files were affected,
>> and this was a critical server, so once we realized a simple reindex
>> wasn't going to solve things, we shut it down and brought up a slave as
>> the new master db.
>
>> While that seemed to fix these issues, we soon noticed problems with
>> missing clog files. The missing clogs were outside the range of the
>> existing clogs, so we tried using dummy clog files. It didn't help, and
>> running pg_check we found that one block of one table was definitely
>> corrupt. Worse, that corruption had spread to all our replicas.
>
> Can you check whether the corrupted block is from one of the relations
> mentioned in your errors? This is just to reconfirm.
>
>> I know this is a little sparse on details, but my questions are:
>
>> 1. What kind of fault should I be looking to fix? Because it spread to
>> all the replicas, both those that stream and those that replicate by
>> replaying wals in the wal archive, I assume it's not a storage issue.
>> (My understanding is that streaming replicas stream their changes from
>> memory, not from wals.)
>
>   Streaming replicas stream their changes from WALs.

Yeah.  This smells like disk corruption to me, but it really could be
anything.  Unfortunately, it can spread to the replicas, especially if
you're not timely about taking the master down.  Page checksums (a
proposed feature) are a way of dealing with this problem.
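
As a rough illustration of why the quoted errors look like garbage data
rather than ordinary missing files, the numbers in the log can be checked
against PostgreSQL's usual limits. This is only a sketch: the constants
below are assumed stock-build defaults (8 kB pages, 1 GB segment files,
MaxAllocSize of 1 GB minus one byte), and the helper names are made up
for this example.

```python
# Assumed defaults for a stock PostgreSQL build; a custom build may differ.
BLCKSZ = 8192                # bytes per page
RELSEG_SIZE = 131072         # pages per 1 GB segment file
MAX_ALLOC_SIZE = 0x3FFFFFFF  # MaxAllocSize: 1 GB minus one byte


def segment_for_block(block_no: int) -> int:
    """Segment file suffix (0 means no suffix) that would hold block_no."""
    return block_no // RELSEG_SIZE


def relation_bytes_needed(block_no: int) -> int:
    """Minimum relation size in bytes for block_no to exist at all."""
    return (block_no + 1) * BLCKSZ


def alloc_request_is_valid(size: int) -> bool:
    """True if a palloc-style request of this size is within the limit."""
    return 0 <= size <= MAX_ALLOC_SIZE


# Values copied from the quoted log lines.
target_block = 3936767042  # "could not open file ... (target block ...)"
alloc_size = 1968078400    # "invalid memory alloc request size"

print("segment suffix needed:", segment_for_block(target_block))
print("relation size needed (TiB):",
      round(relation_bytes_needed(target_block) / 2**40, 2))
print("alloc request valid:", alloc_request_is_valid(alloc_size))
```

Under those assumptions the target block would live in segment 30035 of
a relation of roughly 29 TiB, while the error is already raised for the
.1 segment, and the allocation request is well over the 1 GB limit.
Both are consistent with a corrupted pointer or length field being read
from a damaged page, though they do not say where the damage came from.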

The biggest issue is the missing clog files -- did you have more than
one replica? Were they missing on all of them?
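
When comparing clog files across replicas, it can help to know which
transaction IDs each file covers. The sketch below assumes the default
layout (8 kB pages, two status bits per transaction, 32 pages per
segment file, four-digit hex file names under pg_clog); the function
names are hypothetical.

```python
# Assumed defaults; each clog byte stores the status of 4 transactions.
CLOG_XACTS_PER_PAGE = 8192 * 4
SLRU_PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = CLOG_XACTS_PER_PAGE * SLRU_PAGES_PER_SEGMENT  # 1,048,576


def clog_segment_name(xid: int) -> str:
    """File name under pg_clog that holds the commit status of xid."""
    return format(xid // XACTS_PER_SEGMENT, "04X")


def xid_range_for_segment(name: str) -> tuple[int, int]:
    """First and last transaction ID covered by a clog segment file."""
    first = int(name, 16) * XACTS_PER_SEGMENT
    return first, first + XACTS_PER_SEGMENT - 1


print(clog_segment_name(1048576))
print(xid_range_for_segment("0001"))
```

A missing file whose range lies far outside the transaction IDs the
cluster has actually used would suggest that some tuple carries a bogus
xmin or xmax, rather than that a real clog file was lost.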

merlin
