Обсуждение: Unable to startup postgres: Could not read from file "pg_clog/00EC"

Поиск
Список
Период
Сортировка

Unable to startup postgres: Could not read from file "pg_clog/00EC"

От
"Nick Renders"
Дата:
Hello,

Yesterday, we experienced some issues with our Postgres installation 
(v9.6 running on macOS 10.12).
It seems that the machine was automatically rebooted for a yet unknown 
reason, and afterwards we were unable to start the Postgres service.

The postgres log shows the following:

2020-02-04 15:20:41 CET LOG:  database system was interrupted; last 
known up at 2020-02-04 15:18:34 CET
2020-02-04 15:20:43 CET LOG:  database system was not properly shut 
down; automatic recovery in progress
2020-02-04 15:20:44 CET LOG:  invalid record length at 14A/9E426DF8: 
wanted 24, got 0
2020-02-04 15:20:44 CET LOG:  redo is not required
2020-02-04 15:20:44 CET FATAL:  could not access status of transaction 
247890764
2020-02-04 15:20:44 CET DETAIL:  Could not read from file "pg_clog/00EC" 
at offset 106496: Undefined error: 0.
2020-02-04 15:20:44 CET LOG:  startup process (PID 403) exited with exit 
code 1
2020-02-04 15:20:44 CET LOG:  aborting startup due to startup process 
failure
2020-02-04 15:20:44 CET LOG:  database system is shut down


After some searching, I found someone who had had a similar issue and 
was able to resolve it by overwriting the file in pg_clog.
So I tried the following command:

    dd if=/dev/zero of=[dbpath]/pg_clog/00EC bs=256k count=1

and now the service is running again.


But I am worried that there might still be some issues that we haven't 
noticed yet. I also have no idea what caused this error in the first 
place. It might have been the reboot, but maybe the reboot was a result 
of a Postgres issue.

Is there anything specific I should check in our postgres installation / 
database to make sure it is running ok now? Anyway to see what the 
consequences were of purging that one pg_clog file?

Best regards,

Nick Renders



Re: Unable to startup postgres: Could not read from file"pg_clog/00EC"

От
Alvaro Herrera
Дата:
On 2020-Feb-05, Nick Renders wrote:

> Is there anything specific I should check in our postgres installation /
> database to make sure it is running ok now? Anyway to see what the
> consequences were of purging that one pg_clog file?

Losing pg_clog files is pretty bad, and should not happen; then again,
this might have been something else (ie. the file was maybe not lost).
That said, wrongly overwriting files is even worse.

By zeroing an existing pg_clog file, you marked a bunch of transactions
as aborted.  Your data is now probably inconsistent, if not downright
corrupt.  I would be looking for my most recent backup ...

If you're very lucky, your database might be pg_dumpable.  I would try
that, followed by restoring it in a separate clean instance and seeing
what happens.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unable to startup postgres: Could not read from file"pg_clog/00EC"

От
"Nick Renders"
Дата:
Thank you for the feedback, Alvaro.

Unfortunately, the database is no longer "dumpable". We were able to do 
a pg_dump yesterday morning (12 hours after the crash + purging the 
pg_clog) but if we try one now, we get the following error:

    unexpected chunk number 1 (expected 0) for toast value 8282331 in 
pg_toast_38651

Looking at our data, there seem to be 6 tables that have corrupt 
records. Doing a SELECT * for one of those records, will return a 
similar error:

    missing chunk number 0 for toast value 8288522 in pg_toast_5572299


What is the best way to go from here? Is tracking down these corrupt 
records and deleting them the best / only solution?
Is there a way to determine of there are issues with new data (after the 
crash)?

Any help and advice is very much appreciated.

Thanks,


Nick Renders


On 5 Feb 2020, at 12:51, Alvaro Herrera wrote:

> On 2020-Feb-05, Nick Renders wrote:
>
>> Is there anything specific I should check in our postgres 
>> installation /
>> database to make sure it is running ok now? Anyway to see what the
>> consequences were of purging that one pg_clog file?
>
> Losing pg_clog files is pretty bad, and should not happen; then again,
> this might have been something else (ie. the file was maybe not lost).
> That said, wrongly overwriting files is even worse.
>
> By zeroing an existing pg_clog file, you marked a bunch of 
> transactions
> as aborted.  Your data is now probably inconsistent, if not downright
> corrupt.  I would be looking for my most recent backup ...
>
> If you're very lucky, your database might be pg_dumpable.  I would try
> that, followed by restoring it in a separate clean instance and seeing
> what happens.
>
> -- 
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unable to startup postgres: Could not read from file "pg_clog/00EC"

От
Thomas Munro
Дата:
On Fri, Feb 7, 2020 at 1:47 AM Nick Renders <postgres@arcict.com> wrote:
> Thank you for the feedback, Alvaro.
>
> Unfortunately, the database is no longer "dumpable". We were able to do
> a pg_dump yesterday morning (12 hours after the crash + purging the
> pg_clog) but if we try one now, we get the following error:
>
>         unexpected chunk number 1 (expected 0) for toast value 8282331 in
> pg_toast_38651
>
> Looking at our data, there seem to be 6 tables that have corrupt
> records. Doing a SELECT * for one of those records, will return a
> similar error:
>
>         missing chunk number 0 for toast value 8288522 in pg_toast_5572299
>
>
> What is the best way to go from here? Is tracking down these corrupt
> records and deleting them the best / only solution?
> Is there a way to determine of there are issues with new data (after the
> crash)?
>
> Any help and advice is very much appreciated.

This error indicates that the file did exist already, it was just
shorter than we expected:

2020-02-04 15:20:44 CET DETAIL:  Could not read from file
"pg_clog/00EC" at offset 106496: Undefined error: 0.

What was the length of the file before you overwrote it? Are there
00EB and 00ED files, and if so what size?  When your server rebooted,
did crash recovery run or had it shut down cleanly?  Do you know if
the machine lost power, or the kernel crashed, or if it was a normal
reboot?  What are your settings for "fsync" and "wal_sync_method"?
What is the output of pg_controldata -D pgdata?  I wonder if that part
of the clog file was supposed to be created before the checkpoint (ie
the checkpoint is somehow borked), or was supposed to be created
during recovery after that checkpoint (something else is borked, but I
don't know what), or if the xid is somehow corrupted.

Here's a dirty trick that might help rescue some data.  Assuming you
have a copy of the original file before you zeroed it, you could write
a 256kb file full of 0x55 (that's 01010101 and represents 4 commits,
so if you fill the file up with that it means 'all transactions in
this range committed', which is probably closer to the truth than all
zeroes), and then copy the original shorter file over the top of it,
so that at least the range of transactions represented by the earlier
part of the file that did make it to disk are preserved, and we have
just have bogus force-everything-to-look-committed data after that.
But as Alvaro said, this is a pretty bad situation, this is key meta
data used to interpret all other data files, so all bets are off here,
this is restore-from-backups territory.



Re: Unable to startup postgres: Could not read from file"pg_clog/00EC"

От
Marc
Дата:

Alvaro, Thomas,

Putting a 256kb file full of 0x55 that's 01010101 and represents 4 commits
It did the job in being able to restart the server.
According to our data a “better” way, with less garbage.

The “Toast” issues how ever are still present.

To spend our weekend well we setup a new server with version 12.1 but had to fallback on 11.6 ( see other post )

We kept our “old” server active to see if we can learn some more from this hard-times.

Thanks for the help

Marc

On 5 Feb 2020, at 12:14, Nick Renders wrote:

Hello,

Yesterday, we experienced some issues with our Postgres installation (v9.6 running on macOS 10.12).
It seems that the machine was automatically rebooted for a yet unknown reason, and afterwards we were unable to start the Postgres service.

The postgres log shows the following:

2020-02-04 15:20:41 CET LOG: database system was interrupted; last known up at 2020-02-04 15:18:34 CET
2020-02-04 15:20:43 CET LOG: database system was not properly shut down; automatic recovery in progress
2020-02-04 15:20:44 CET LOG: invalid record length at 14A/9E426DF8: wanted 24, got 0
2020-02-04 15:20:44 CET LOG: redo is not required
2020-02-04 15:20:44 CET FATAL: could not access status of transaction 247890764
2020-02-04 15:20:44 CET DETAIL: Could not read from file "pg_clog/00EC" at offset 106496: Undefined error: 0.
2020-02-04 15:20:44 CET LOG: startup process (PID 403) exited with exit code 1
2020-02-04 15:20:44 CET LOG: aborting startup due to startup process failure
2020-02-04 15:20:44 CET LOG: database system is shut down


After some searching, I found someone who had had a similar issue and was able to resolve it by overwriting the file in pg_clog.
So I tried the following command:

dd if=/dev/zero of=[dbpath]/pg_clog/00EC bs=256k count=1

and now the service is running again.


But I am worried that there might still be some issues that we haven't noticed yet. I also have no idea what caused this error in the first place. It might have been the reboot, but maybe the reboot was a result of a Postgres issue.

Is there anything specific I should check in our postgres installation / database to make sure it is running ok now? Anyway to see what the consequences were of purging that one pg_clog file?

Best regards,

Nick Renders



ARC - your Apple Authorised Service partnerH.D. Saviolaan 8
 B-1700 Dilbeek
 Belgium
info@arcict.comwww.arcict.com
tel. : +32 (0)2 466 50 00fax. : +32 (0)2 466 88 33