Обсуждение: Could not open file "pg_clog/...."
Hello! Recently one of my PostgreSQL servers has started throwing error messages like these: ERROR: could not access status of transaction 3489956864 DETAIL: Could not open file "pg_clog/0D00": Datei oder Verzeichnis nicht gefunden. (file not found) The machine in question doesn't show any signs of a hardware defect, we're running a RAID-10 over 10 disks for this partition on a 3Ware hardware RAID controller with battery backup unit, the controller doesn't show any defects at all. We're running PostgreSQL 8.3.5 on that box, kernel is 2.6.18-6-amd64 of Debian Etch, the PostgreSQL binaries were compiled from source on that machine. I searched the lists and though I couldn't find an exact hint as to what's causing this, I found a suggestion for a more or less hotfix solution: Create a file of the required size filled with zeroes and then put that into the clog-directory, i.e. dd bs=262144 count=1 if=/dev/zero of=/tmp/pg_clog_replacements/0002 chown postgres.daemon /tmp/pg_clog_replacements/0002 chmod 600 /tmp/pg_clog_replacements/0002 mv /tmp/pg_clog_replacements/0002 /var/lib/pgsql/data/pg_clog I know that I'd be loosing some transactions, but in our use case this is not critical. Anyway, this made the problem go away for a while but now I'm getting those messages again - and indeed the clog-files in question appear to be missing altogether. And what's worse, the workaround no longer works properly but makes PostgreSQL crash: magazine=# vacuum analyze pcaction.article; PANIC: corrupted item pointer: 5 server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. And from the logfile: <2009-05-12 11:38:09 CEST - 6606: [local]@magazine>PANIC: corrupted item pointer: 5 <2009-05-12 11:38:09 CEST - 6606: [local]@magazine>STATEMENT: vacuum analyze pcaction.article; <2009-05-12 11:38:09 CEST - 29178: @>LOG: server process (PID 6606) was terminated by signal 6: Aborted <2009-05-12 11:38:09 CEST - 29178: @>LOG: terminating any other active server processes <2009-05-12 11:38:09 CEST - 6607: 192.168.222.134(57292)@magazine>WARNING: terminating connection because of crash of another server process <2009-05-12 11:38:09 CEST - 6607: 192.168.222.134(57292)@magazine>DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. <2009-05-12 11:38:09 CEST - 6569: 192.168.222.134(57214)@bluebox>HINT: In a moment you should be able to reconnect to the database and repeat your command. [...] <2009-05-12 11:38:09 CEST - 29178: @>LOG: all server processes terminated; reinitializing <2009-05-12 11:38:09 CEST - 6619: @>LOG: database system was interrupted; last known up at 2009-05-12 11:37:51 CEST <2009-05-12 11:38:09 CEST - 6619: @>LOG: database system was not properly shut down; automatic recovery in progress <2009-05-12 11:38:09 CEST - 6619: @>LOG: redo starts at 172/8B4EE118 <2009-05-12 11:38:09 CEST - 6619: @>LOG: record with zero length at 172/8B6AD510 <2009-05-12 11:38:09 CEST - 6619: @>LOG: redo done at 172/8B6AD4E0 <2009-05-12 11:38:09 CEST - 6619: @>LOG: last completed transaction was at log time 2009-05-12 11:38:09.550175+02 <2009-05-12 11:38:09 CEST - 6619: @>LOG: checkpoint starting: shutdown immediate <2009-05-12 11:38:09 CEST - 6619: @>LOG: checkpoint complete: wrote 351 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=0.008s, sync=0.000 s, total=0.009 s <2009-05-12 11:38:09 CEST - 6622: @>LOG: autovacuum launcher started <2009-05-12 11:38:09 CEST - 29178: @>LOG: database system is ready to accept connections Now what exactly is causing those missing clog files, what can I do to prevent this and what can I do to recover my database cluster, as this issue seems to prevent proper dumps at the moment? Kind regards Markus Jede Stimme zahlt, jetzt voten fur die besten Games: www.bamaward.de Computec Media AG Sitz der Gesellschaft und Registergericht: Furth (HRB 8818) Vorstandsmitglieder: Albrecht Hengstenberg (Vorsitzender) und Rainer Rosenbusch Vorsitzender des Aufsichtsrates: Jurg Marquard Umsatzsteuer-Identifikationsnummer: DE 812 575 276
--- On Tue, 12/5/09, Markus Wollny <Markus.Wollny@computec.de> wrote: > From: Markus Wollny <Markus.Wollny@computec.de> > Subject: [GENERAL] Could not open file "pg_clog/...." > To: pgsql-general@postgresql.org > Date: Tuesday, 12 May, 2009, 11:04 AM > Hello! > > Recently one of my PostgreSQL servers has started throwing > error > messages like these: > > ERROR: could not access status of transaction 3489956864 > DETAIL: Could not open file "pg_clog/0D00": > Datei oder Verzeichnis > nicht gefunden. (file not found) > > The machine in question doesn't show any signs of a > hardware defect, > we're running a RAID-10 over 10 disks for this > partition on a 3Ware > hardware RAID controller with battery backup unit, the > controller > doesn't show any defects at all. We're running > PostgreSQL 8.3.5 on that > box, kernel is 2.6.18-6-amd64 of Debian Etch, the > PostgreSQL binaries > were compiled from source on that machine. Apart from not being on the latest release you have a very similar setup to me. > > I know that I'd be loosing some transactions, but in > our use case this > is not critical. Anyway, this made the problem go away for > a while but > now I'm getting those messages again - and indeed the > clog-files in > question appear to be missing altogether. And what's > worse, the > workaround no longer works properly but makes PostgreSQL > crash: > The first thing I would have done if I've been forced to do that (if there was no other option?) would be a dump / restoredirectly afterwards, then pick through for any inconsistencies. > Now what exactly is causing those missing clog files, what > can I do to > prevent this and what can I do to recover my database > cluster, as this > issue seems to prevent proper dumps at the moment? Probably wait for the big-wigs to reply but perhaps a reindex may get you going. I'd definately be starting with a fresh database once I got out of the whole though...
Hi! > -----Ursprüngliche Nachricht----- > Von: Glyn Astill [mailto:glynastill@yahoo.co.uk] > Gesendet: Dienstag, 12. Mai 2009 12:33 > An: pgsql-general@postgresql.org; Markus Wollny > The first thing I would have done if I've been forced to do > that (if there was no other option?) would be a dump / > restore directly afterwards, then pick through for any > inconsistencies. That's a lot of data - somewhere around 43GB at the moment. And pg_dump seems to fail altogether on the affected databases,so the pg_clog issue actually means that I cannot make any current backups. > Probably wait for the big-wigs to reply but perhaps a reindex > may get you going. Tried that, but it also makes PostgreSQL crash, so no luck there either. I also dropped template0, recreated it from template1,did a VACUUM FREEZE on it, marked it as template again and disallowed connections. > I'd definately be starting with a fresh database once I got > out of the whole though... Yes, but that'll be a nightshift and I need some way to actually get at a working dump now... Kind regards Markus Jede Stimme zählt, jetzt voten für die besten Games: www.bamaward.de Computec Media AG Sitz der Gesellschaft und Registergericht: Fürth (HRB 8818) Vorstandsmitglieder: Albrecht Hengstenberg (Vorsitzender) und Rainer Rosenbusch Vorsitzender des Aufsichtsrates: Jürg Marquard Umsatzsteuer-Identifikationsnummer: DE 812 575 276
--- On Tue, 12/5/09, Markus Wollny <Markus.Wollny@computec.de> wrote: > From: Markus Wollny <Markus.Wollny@computec.de> > Subject: AW: [GENERAL] Could not open file "pg_clog/...." > To: glynastill@yahoo.co.uk, pgsql-general@postgresql.org > Date: Tuesday, 12 May, 2009, 11:52 AM > Hi! > > > -----Ursprüngliche Nachricht----- > > Von: Glyn Astill [mailto:glynastill@yahoo.co.uk] > > Gesendet: Dienstag, 12. Mai 2009 12:33 > > An: pgsql-general@postgresql.org; Markus Wollny > > > The first thing I would have done if I've been > forced to do > > that (if there was no other option?) would be a dump / > > > restore directly afterwards, then pick through for any > > > inconsistencies. > > That's a lot of data - somewhere around 43GB at the > moment. And pg_dump seems to fail altogether on the affected > databases, so the pg_clog issue actually means that I cannot > make any current backups. > > > Probably wait for the big-wigs to reply but perhaps a > reindex > > may get you going. > > Tried that, but it also makes PostgreSQL crash, so no luck > there either. I also dropped template0, recreated it from > template1, did a VACUUM FREEZE on it, marked it as template > again and disallowed connections. > > > I'd definately be starting with a fresh database > once I got > > out of the whole though... > > Yes, but that'll be a nightshift and I need some way to > actually get at a working dump now... > It appears to be failing on the pcaction.article table. Could you get away without that? Perhaps, and it'd be a longshot,you'd be able to dump the rest of the data with it gone? I'm going to duck out of this now though, and I think you should probably wait until someone a little more knowlegable replies.
Hi! > It appears to be failing on the pcaction.article table. Could > you get away without that? Perhaps, and it'd be a longshot, > you'd be able to dump the rest of the data with it gone? > > I'm going to duck out of this now though, and I think you > should probably wait until someone a little more knowlegable replies. Yes, I could get away with it without any problems, as this table isn't really needed any more (just some remains of a previous site version, which I didn't yet drop so we could still take a look at it if the need should arise). It is somewhat funny that this should happen on some table that is hardy accessed at all, whereas the more popular database objects seem to be in order. But I am not yet shure, that this one is the only affected object, but I'll try to dump that db without the pcaction.schema, see what happens. Kind regards Markus Jede Stimme zahlt, jetzt voten fur die besten Games: www.bamaward.de Computec Media AG Sitz der Gesellschaft und Registergericht: Furth (HRB 8818) Vorstandsmitglieder: Albrecht Hengstenberg (Vorsitzender) und Rainer Rosenbusch Vorsitzender des Aufsichtsrates: Jurg Marquard Umsatzsteuer-Identifikationsnummer: DE 812 575 276
--- On Tue, 12/5/09, Glyn Astill <glynastill@yahoo.co.uk> wrote: > I'm going to duck out of this now though, and I think > you should probably wait until someone a little more > knowlegable replies. > Also see here: http://archives.postgresql.org/pgsql-general/2006-07/msg01147.php
Markus Wollny wrote: > magazine=# vacuum analyze pcaction.article; > PANIC: corrupted item pointer: 5 > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. This error comes from this piece of PageRepairFragmentation: /* Need to compact the page the hard way */ itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nstorage); itemidptr = itemidbase; totallen = 0; for (i = 0; i < nline; i++) { lp = PageGetItemId(page, i + 1); if (ItemIdHasStorage(lp)) { itemidptr->offsetindex = i; itemidptr->itemoff = ItemIdGetOffset(lp); if (itemidptr->itemoff < (int) pd_upper || itemidptr->itemoff >= (int) pd_special) ereport(ERROR, (errcode(ERRCODE_DATA_CORRUPTED), errmsg("corrupted item pointer: %u", itemidptr->itemoff))); itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp)); totallen += itemidptr->alignedlen; itemidptr++; } } What it means, AFAICT is that the page is corrupted beyond recovery. If you want to salvage the rest of the data on the table, you could zero out this page, for which you'll obviously need to know what page it was. It's not a trivial patch to add that info to the error message, because PageRepairFragmentation does not receive it. I think it is worth our trouble to report block numbers on this kind of errors; and in fact I had started on a patch to add errcontext() to vacuum and analyze calls, but never finished it. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Markus Wollny wrote: > > magazine=# vacuum analyze pcaction.article; > > PANIC: corrupted item pointer: 5 > > server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > The connection to the server was lost. Attempting reset: Failed. > > This error comes from this piece of PageRepairFragmentation: > if (itemidptr->itemoff < (int) pd_upper || > itemidptr->itemoff >= (int) pd_special) > ereport(ERROR, > (errcode(ERRCODE_DATA_CORRUPTED), > errmsg("corrupted item pointer: %u", > itemidptr->itemoff))); I forgot to mention that this ERROR is promoted to PANIC by the critical section defined in lazy_vacuum_page(). -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.