Re: 9.0.4 Data corruption issue

From Ken Caruso
Subject Re: 9.0.4 Data corruption issue
Date
Msg-id CAMg8r_o-ZT5-3JHaMbL4FRQcy0ZxwO9V+JxW2EyPB4CyMzy48g@mail.gmail.com
In reply to Re: 9.0.4 Data corruption issue  (Cédric Villemain <cedric.villemain.debian@gmail.com>)
List pgsql-admin


On Sun, Jul 17, 2011 at 3:04 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote:
2011/7/17 Ken Caruso <ken@ipl31.net>:
>
>
> On Sat, Jul 16, 2011 at 2:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Ken Caruso <ken@ipl31.net> writes:
>> > Sorry, the actual error reported by CLUSTER is:
>>
>> > gpup=> cluster verbose tablename;
>> > INFO:  clustering "dbname.tablename"
>> > WARNING:  could not write block 12125253 of base/2651908/652397108
>> > DETAIL:  Multiple failures --- write error might be permanent.
>> > ERROR:  could not open file "base/2651908/652397108.1" (target block
>> > 12125253): No such file or directory
>> > CONTEXT:  writing block 12125253 of relation base/2651908/652397108
>>
>> Hmm ... it looks like you've got a dirty buffer in shared memory that
>> corresponds to a block that no longer exists on disk; in fact, the whole
>> table segment it belonged to is gone.  Or maybe the block or file number
>> in the shared buffer header is corrupted somehow.
>>
>> I imagine you're seeing errors like this during each checkpoint attempt?
>
> Hi Tom,
> Thanks for the reply.
> Yes, I tried a pg_start_backup() to force a checkpoint and it failed with a
> similar error.
>
>>
>> I can't think of any very good way to clean that up.  What I'd try here
>> is a forced database shutdown (immediate-mode stop) and see if it starts
>> up cleanly.  It might be that whatever caused this has also corrupted
>> the back WAL and so WAL replay will result in the same or similar error.
>> In that case you'll be forced to do a pg_resetxlog to get the DB to come
>> up again.  If so, a dump and reload and some manual consistency checking
>> would be indicated :-(
>
> Before seeing this message, I restarted Postgres and it was able to get to a
> consistent state at which point I reclustered the db without error and
> everything appears to be fine. Any idea what caused this? Was it something
> to do with the Vacuum Full?

Block number 12125253 is bigger than any block we can find in
base/2651908/652397108.1
Is the table size in the 100GB range or the 2-3GB range?
That should help us decide: in the former case, at least one
segment has probably disappeared; in the latter, the shared
buffers have become corrupted.
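
[Editor's note, not part of the original exchange: with a default PostgreSQL build (8 kB blocks, 1 GB segment files, i.e. RELSEG_SIZE = 131072 blocks), block 12125253 would live in segment file .92 and implies a table of roughly 92.5 GiB, which is why the question above distinguishes a ~100GB table from a 2-3GB one. A quick sanity-check of that arithmetic:]

```python
# Sketch assuming default build constants: BLCKSZ = 8192, 1 GB segments.
BLOCK_SIZE = 8192                                # BLCKSZ
SEGMENT_BYTES = 1 << 30                          # RELSEG_SIZE * BLCKSZ
BLOCKS_PER_SEGMENT = SEGMENT_BYTES // BLOCK_SIZE # 131072 blocks per file

block = 12125253                                 # from the error message

segment = block // BLOCKS_PER_SEGMENT            # which .N segment holds it
offset_in_segment = block % BLOCKS_PER_SEGMENT   # block index within that file
min_table_bytes = (block + 1) * BLOCK_SIZE       # table must be at least this big

print(segment)                  # 92 -> base/2651908/652397108.92
print(min_table_bytes / 2**30)  # ~92.5 GiB
```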

The DB was in the 200GB-300GB range when this happened. What would cause the segment to go missing? Just wondering if there is any further action I should take, like filing a bug, or if this is a known issue. Thanks for everyone's help.

-Ken

Ken, you didn't change RELSEG_SIZE, right? (It needs to be changed in the
source code before compiling PostgreSQL yourself.)
In both cases a hardware check is welcome, I believe.
--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
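
[Editor's note, not part of the original exchange: one way to spot a vanished segment like the one suspected above is to look for gaps in the relation's on-disk segment files, which are named `<relfilenode>`, `<relfilenode>.1`, `<relfilenode>.2`, and so on. A rough sketch with a hypothetical `missing_segments` helper; the directory path is illustrative and a fuller check would also verify that every non-final segment is exactly 1 GB:]

```python
import glob
import os

SEGMENT_BYTES = 1 << 30  # default: RELSEG_SIZE (131072 blocks) * BLCKSZ (8192)

def missing_segments(directory, relfilenode):
    """Return segment numbers absent from an otherwise contiguous run.

    Segment 0 is named '<relfilenode>', segment N is '<relfilenode>.N'.
    A gap (e.g. '.1' missing while '.2' exists) suggests a lost file.
    """
    present = set()
    for path in glob.glob(os.path.join(directory, relfilenode + "*")):
        name = os.path.basename(path)
        if name == relfilenode:
            present.add(0)
        else:
            suffix = name.rsplit(".", 1)[-1]
            if suffix.isdigit():          # skips the _fsm / _vm fork files
                present.add(int(suffix))
    if not present:
        return []
    return [n for n in range(max(present) + 1) if n not in present]

# Hypothetical paths; substitute your real $PGDATA and relfilenode.
print(missing_segments("/var/lib/postgresql/data/base/2651908", "652397108"))
```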

In the pgsql-admin list, by send date:

Previous
From: "Kevin Grittner"
Date:
Message: Re: Replicating privileges from one user to another
Next
From: Ken Caruso
Date:
Message: Bloat and Slow Vacuum Time on Toast