Thread: Could not open file "pg_clog/...."


Could not open file "pg_clog/...."

From: "Markus Wollny"
Date:
Hello!

Recently one of my PostgreSQL servers has started throwing error
messages like these:

ERROR:  could not access status of transaction 3489956864
DETAIL:  Could not open file "pg_clog/0D00": Datei oder Verzeichnis
nicht gefunden. (file not found)

The machine in question doesn't show any signs of a hardware defect:
we're running a RAID-10 over 10 disks for this partition on a 3Ware
hardware RAID controller with a battery backup unit, and the controller
doesn't show any defects at all. We're running PostgreSQL 8.3.5 on that
box, the kernel is 2.6.18-6-amd64 of Debian Etch, and the PostgreSQL
binaries were compiled from source on that machine.

I searched the lists, and though I couldn't find an exact hint as to
what's causing this, I found a suggestion for a more or less hotfix
solution: create a file of the required size filled with zeroes and
then put it into the clog directory, i.e.
dd bs=262144 count=1 if=/dev/zero of=/tmp/pg_clog_replacements/0002
chown postgres.daemon /tmp/pg_clog_replacements/0002
chmod 600 /tmp/pg_clog_replacements/0002
mv /tmp/pg_clog_replacements/0002 /var/lib/pgsql/data/pg_clog
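
Incidentally, the segment name in the error message lines up with the
transaction ID if you assume the stock 8.3 clog layout - two status bits
per transaction, 8 kB pages, 32 pages per 256 kB segment. Only a rough
sketch, but:

# which pg_clog segment holds a given transaction ID
XID=3489956864
SEGMENT=$(( XID / (8192 * 4 * 32) ))   # 1048576 transactions per segment file
printf 'pg_clog/%04X\n' "$SEGMENT"     # -> pg_clog/0D00 for the XID above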

I know that I'd be losing some transactions, but in our use case this
is not critical. Anyway, this made the problem go away for a while, but
now I'm getting those messages again - and indeed the clog files in
question appear to be missing altogether. And what's worse, the
workaround no longer works properly but makes PostgreSQL crash:

magazine=# vacuum analyze pcaction.article;
PANIC:  corrupted item pointer: 5
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

And from the logfile:

<2009-05-12 11:38:09 CEST - 6606: [local]@magazine>PANIC:  corrupted
item pointer: 5
<2009-05-12 11:38:09 CEST - 6606: [local]@magazine>STATEMENT:  vacuum
analyze pcaction.article;
<2009-05-12 11:38:09 CEST - 29178: @>LOG:  server process (PID 6606) was
terminated by signal 6: Aborted
<2009-05-12 11:38:09 CEST - 29178: @>LOG:  terminating any other active
server processes
<2009-05-12 11:38:09 CEST - 6607:
192.168.222.134(57292)@magazine>WARNING:  terminating connection because
of crash of another server process
<2009-05-12 11:38:09 CEST - 6607:
192.168.222.134(57292)@magazine>DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted
shared memory.
<2009-05-12 11:38:09 CEST - 6569: 192.168.222.134(57214)@bluebox>HINT:
In a moment you should be able to reconnect to the database and repeat
your command.
[...]
<2009-05-12 11:38:09 CEST - 29178: @>LOG:  all server processes
terminated; reinitializing
<2009-05-12 11:38:09 CEST - 6619: @>LOG:  database system was
interrupted; last known up at 2009-05-12 11:37:51 CEST
<2009-05-12 11:38:09 CEST - 6619: @>LOG:  database system was not
properly shut down; automatic recovery in progress
<2009-05-12 11:38:09 CEST - 6619: @>LOG:  redo starts at 172/8B4EE118
<2009-05-12 11:38:09 CEST - 6619: @>LOG:  record with zero length at
172/8B6AD510
<2009-05-12 11:38:09 CEST - 6619: @>LOG:  redo done at 172/8B6AD4E0
<2009-05-12 11:38:09 CEST - 6619: @>LOG:  last completed transaction was
at log time 2009-05-12 11:38:09.550175+02
<2009-05-12 11:38:09 CEST - 6619: @>LOG:  checkpoint starting: shutdown
immediate
<2009-05-12 11:38:09 CEST - 6619: @>LOG:  checkpoint complete: wrote 351
buffers (1.1%); 0 transaction log file(s) added, 0 removed, 2 recycled;
write=0.008s, sync=0.000 s, total=0.009 s
<2009-05-12 11:38:09 CEST - 6622: @>LOG:  autovacuum launcher started
<2009-05-12 11:38:09 CEST - 29178: @>LOG:  database system is ready to
accept connections

Now, what exactly is causing those missing clog files, what can I do to
prevent this, and what can I do to recover my database cluster, as this
issue seems to prevent proper dumps at the moment?

Kind regards

   Markus




Re: Could not open file "pg_clog/...."

From: Glyn Astill
Date:
--- On Tue, 12/5/09, Markus Wollny <Markus.Wollny@computec.de> wrote:

> From: Markus Wollny <Markus.Wollny@computec.de>
> Subject: [GENERAL] Could not open file "pg_clog/...."
> To: pgsql-general@postgresql.org
> Date: Tuesday, 12 May, 2009, 11:04 AM
> Hello!
>
> Recently one of my PostgreSQL servers has started throwing error
> messages like these:
>
> ERROR:  could not access status of transaction 3489956864
> DETAIL:  Could not open file "pg_clog/0D00": Datei oder Verzeichnis
> nicht gefunden. (file not found)
>
> The machine in question doesn't show any signs of a hardware defect:
> we're running a RAID-10 over 10 disks for this partition on a 3Ware
> hardware RAID controller with a battery backup unit, and the
> controller doesn't show any defects at all. We're running PostgreSQL
> 8.3.5 on that box, the kernel is 2.6.18-6-amd64 of Debian Etch, and
> the PostgreSQL binaries were compiled from source on that machine.

Apart from not being on the latest release, you have a very similar setup to mine.

>
> I know that I'd be losing some transactions, but in our use case this
> is not critical. Anyway, this made the problem go away for a while,
> but now I'm getting those messages again - and indeed the clog files
> in question appear to be missing altogether. And what's worse, the
> workaround no longer works properly but makes PostgreSQL crash:
>

The first thing I would have done if I'd been forced to do that (if
there was no other option?) would be a dump / restore directly
afterwards, then pick through it for any inconsistencies.
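
Something like this, say - dump to a custom-format file and restore it
into a scratch database to see what survives (the scratch database name
below is just a placeholder):

pg_dump -Fc -f magazine.dump magazine
createdb magazine_check
pg_restore -d magazine_check magazine.dump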

> Now, what exactly is causing those missing clog files, what can I do
> to prevent this, and what can I do to recover my database cluster, as
> this issue seems to prevent proper dumps at the moment?

Probably wait for the bigwigs to reply, but perhaps a reindex may get you going.
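
Something along these lines, I mean - the table name is just the one
from your error message, so adjust to taste:

psql -d magazine -c 'REINDEX TABLE pcaction.article;'
# or, more heavy-handed, rebuild every index in the database:
reindexdb magazine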

I'd definitely be starting with a fresh database once I got out of the hole, though...




Re: Could not open file "pg_clog/...."

From: "Markus Wollny"
Date:
Hi!

> -----Original Message-----
> From: Glyn Astill [mailto:glynastill@yahoo.co.uk]
> Sent: Tuesday, 12 May 2009 12:33
> To: pgsql-general@postgresql.org; Markus Wollny

> The first thing I would have done if I'd been forced to do that (if
> there was no other option?) would be a dump / restore directly
> afterwards, then pick through it for any inconsistencies.

That's a lot of data - somewhere around 43 GB at the moment. And pg_dump
seems to fail altogether on the affected databases, so the pg_clog issue
actually means that I cannot make any current backups.

> Probably wait for the bigwigs to reply, but perhaps a reindex may get
> you going.

Tried that, but it also makes PostgreSQL crash, so no luck there either.
I also dropped template0, recreated it from template1, did a VACUUM
FREEZE on it, marked it as template again and disallowed connections.
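
That was roughly the standard template0 recreation procedure - sketched
below from memory rather than as a literal transcript:

psql -d postgres -c "UPDATE pg_database SET datistemplate = false WHERE datname = 'template0';"
psql -d postgres -c "DROP DATABASE template0;"
psql -d postgres -c "CREATE DATABASE template0 TEMPLATE template1;"
psql -d template0 -c "VACUUM FREEZE;"
psql -d postgres -c "UPDATE pg_database SET datistemplate = true, datallowconn = false WHERE datname = 'template0';"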

> I'd definitely be starting with a fresh database once I got out of
> the hole, though...

Yes, but that'll be a night shift, and I need some way to actually get
at a working dump now...

Kind regards

   Markus




Re: Could not open file "pg_clog/...."

From: Glyn Astill
Date:
--- On Tue, 12/5/09, Markus Wollny <Markus.Wollny@computec.de> wrote:

> From: Markus Wollny <Markus.Wollny@computec.de>
> Subject: AW: [GENERAL] Could not open file "pg_clog/...."
> To: glynastill@yahoo.co.uk, pgsql-general@postgresql.org
> Date: Tuesday, 12 May, 2009, 11:52 AM
> Hi!
>
> > -----Original Message-----
> > From: Glyn Astill [mailto:glynastill@yahoo.co.uk]
> > Sent: Tuesday, 12 May 2009 12:33
> > To: pgsql-general@postgresql.org; Markus Wollny
>
> > The first thing I would have done if I'd been forced to do that (if
> > there was no other option?) would be a dump / restore directly
> > afterwards, then pick through it for any inconsistencies.
>
> That's a lot of data - somewhere around 43 GB at the moment. And
> pg_dump seems to fail altogether on the affected databases, so the
> pg_clog issue actually means that I cannot make any current backups.
>
> > Probably wait for the bigwigs to reply, but perhaps a reindex may
> > get you going.
>
> Tried that, but it also makes PostgreSQL crash, so no luck there
> either. I also dropped template0, recreated it from template1, did a
> VACUUM FREEZE on it, marked it as template again and disallowed
> connections.
>
> > I'd definitely be starting with a fresh database once I got out of
> > the hole, though...
>
> Yes, but that'll be a night shift, and I need some way to actually
> get at a working dump now...
>

It appears to be failing on the pcaction.article table. Could you get
away without that? Perhaps, and it'd be a long shot, you'd be able to
dump the rest of the data with it gone?
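
Something like this, perhaps - pg_dump has had exclusion switches since
8.2, so you could skip just that table, or the whole schema if more of
it turns out to be damaged (the dump file names below are just
placeholders):

pg_dump -Fc -T pcaction.article -f magazine_partial.dump magazine
# or leave out the whole schema:
pg_dump -Fc -N pcaction -f magazine_no_pcaction.dump magazine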

I'm going to duck out of this now though, and I think you should
probably wait until someone a little more knowledgeable replies.




Re: Could not open file "pg_clog/...."

From: "Markus Wollny"
Date:
Hi!

> It appears to be failing on the pcaction.article table. Could you get
> away without that? Perhaps, and it'd be a long shot, you'd be able to
> dump the rest of the data with it gone?
>
> I'm going to duck out of this now though, and I think you should
> probably wait until someone a little more knowledgeable replies.

Yes, I could get away with that without any problems, as this table
isn't really needed any more (just some remains of a previous site
version, which I didn't yet drop so we could still take a look at it if
the need should arise). It is somewhat funny that this should happen on
a table that is hardly accessed at all, whereas the more popular
database objects seem to be in order. I am not yet sure, though, that
this is the only affected object, but I'll try to dump that DB without
the pcaction schema and see what happens.

Kind regards

   Markus




Re: Could not open file "pg_clog/...."

From: Glyn Astill
Date:



--- On Tue, 12/5/09, Glyn Astill <glynastill@yahoo.co.uk> wrote:

> I'm going to duck out of this now though, and I think you should
> probably wait until someone a little more knowledgeable replies.
>

Also see here:

http://archives.postgresql.org/pgsql-general/2006-07/msg01147.php




Re: Could not open file "pg_clog/...."

From: Alvaro Herrera
Date:
Markus Wollny wrote:

> magazine=# vacuum analyze pcaction.article;
> PANIC:  corrupted item pointer: 5
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

This error comes from this piece of PageRepairFragmentation:

        /* Need to compact the page the hard way */
        itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nstorage);
        itemidptr = itemidbase;
        totallen = 0;
        for (i = 0; i < nline; i++)
        {
            lp = PageGetItemId(page, i + 1);
            if (ItemIdHasStorage(lp))
            {
                itemidptr->offsetindex = i;
                itemidptr->itemoff = ItemIdGetOffset(lp);
                if (itemidptr->itemoff < (int) pd_upper ||
                    itemidptr->itemoff >= (int) pd_special)
                    ereport(ERROR,
                            (errcode(ERRCODE_DATA_CORRUPTED),
                             errmsg("corrupted item pointer: %u",
                                    itemidptr->itemoff)));
                itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
                totallen += itemidptr->alignedlen;
                itemidptr++;
            }
        }

What it means, AFAICT, is that the page is corrupted beyond recovery.  If you
want to salvage the rest of the data in the table, you could zero out this
page, for which you'll obviously need to know which page it was.  It's not a
trivial patch to add that info to the error message, because
PageRepairFragmentation does not receive it.
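
Mechanically, once you do know the block number, zeroing it out would
look something like this - the block number and file path below are
placeholders, the server should be stopped first, and you'd want a copy
of the file before touching it:

# look up the real block number and the relation's file under
# base/<database oid>/<relfilenode of pcaction.article> first
BLOCK=12345
FILE=$PGDATA/base/16384/16385
dd if=/dev/zero of="$FILE" bs=8192 seek="$BLOCK" count=1 conv=notrunc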

I think it is worth our trouble to report block numbers on this kind of
error; in fact I had started on a patch to add errcontext() to vacuum and
analyze calls, but never finished it.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Could not open file "pg_clog/...."

From: Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Markus Wollny wrote:

> > magazine=# vacuum analyze pcaction.article;
> > PANIC:  corrupted item pointer: 5
> > server closed the connection unexpectedly
> >         This probably means the server terminated abnormally
> >         before or while processing the request.
> > The connection to the server was lost. Attempting reset: Failed.
>
> This error comes from this piece of PageRepairFragmentation:

>                 if (itemidptr->itemoff < (int) pd_upper ||
>                     itemidptr->itemoff >= (int) pd_special)
>                     ereport(ERROR,
>                             (errcode(ERRCODE_DATA_CORRUPTED),
>                              errmsg("corrupted item pointer: %u",
>                                     itemidptr->itemoff)));

I forgot to mention that this ERROR is promoted to PANIC by the
critical section defined in lazy_vacuum_page().

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.