Обсуждение: WAL recovery failure

Поиск
Список
Период
Сортировка

WAL recovery failure

От
Rick Dicaire
Дата:
Hi folks, we're snapshotting at san level, two volumes containing a
postgres 9.1.5 database, and pg_xlog respectively.
This database cluster is a streaming replicant.
We're stopping the database completely, taking the snapshots,
restarting the database.
I export these snapshots as luns to a VM, mount the disks in the VM,
cp -a the database and pg_xlog contents to
a staging area. Start postgres using -D /staging/9.1/main. Postgres
fails to start.

From log:

2012-09-25 12:41:30.147 CDT,,,9597,,5061ecca.257d,1,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"database system was interrupted while in
recovery at 2012-09-25 12:18:06 CDT",,"This probably means that some
data is corrupted and you will have to use the last backup for
recovery.",,,,,,,""
2012-09-25 12:41:30.155 CDT,,,9597,,5061ecca.257d,2,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"starting archive recovery",,,,,,,,,""
2012-09-25 12:41:30.354 CDT,,,9597,,5061ecca.257d,3,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"restored log file
""000000010000074D000000EE"" from archive",,,,,,,,,""
2012-09-25 12:41:30.529 CDT,,,9597,,5061ecca.257d,4,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"restored log file
""000000010000074D000000E2"" from archive",,,,,,,,,""
2012-09-25 12:41:30.529 CDT,,,9597,,5061ecca.257d,5,,2012-09-25
12:41:30 CDT,,0,PANIC,XX000,"unexpected pageaddr 74D/7C000000 in log
file 1869, segment 226, offset 0",,,,,,,,,""
2012-09-25 12:41:30.530 CDT,,,9595,,5061ecc8.257b,1,,2012-09-25
12:41:28 CDT,,0,LOG,00000,"startup process (PID 9597) was terminated
by signal 6: Aborted",,,,,,,,,""
2012-09-25 12:41:30.530 CDT,,,9595,,5061ecc8.257b,2,,2012-09-25
12:41:28 CDT,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""

/staging/9.1/main/recovery.conf:

restore_command = 'cp /staging/9.1/main/pg_xlog/%f "%p"'

Does the PANIC shown in above log entry indicate a corrupt WAL file?


Re: WAL recovery failure

От
Fujii Masao
Дата:
On Wed, Sep 26, 2012 at 3:22 AM, Rick Dicaire <kritek@gmail.com> wrote:
> Hi folks, we're snapshotting at san level, two volumes containing a
> postgres 9.1.5 database, and pg_xlog respectively.
> This database cluster is a streaming replicant.
> We're stopping the database completely, taking the snapshots,
> restarting the database.
> I export these snapshots as luns to a VM, mount the disks in the VM,
> cp -a the database and pg_xlog contents to
> a staging area. Start postgres using -D /staging/9.1/main. Postgres
> fails to start.
>
> From log:
>
> 2012-09-25 12:41:30.147 CDT,,,9597,,5061ecca.257d,1,,2012-09-25
> 12:41:30 CDT,,0,LOG,00000,"database system was interrupted while in
> recovery at 2012-09-25 12:18:06 CDT",,"This probably means that some
> data is corrupted and you will have to use the last backup for
> recovery.",,,,,,,""
> 2012-09-25 12:41:30.155 CDT,,,9597,,5061ecca.257d,2,,2012-09-25
> 12:41:30 CDT,,0,LOG,00000,"starting archive recovery",,,,,,,,,""
> 2012-09-25 12:41:30.354 CDT,,,9597,,5061ecca.257d,3,,2012-09-25
> 12:41:30 CDT,,0,LOG,00000,"restored log file
> ""000000010000074D000000EE"" from archive",,,,,,,,,""
> 2012-09-25 12:41:30.529 CDT,,,9597,,5061ecca.257d,4,,2012-09-25
> 12:41:30 CDT,,0,LOG,00000,"restored log file
> ""000000010000074D000000E2"" from archive",,,,,,,,,""
> 2012-09-25 12:41:30.529 CDT,,,9597,,5061ecca.257d,5,,2012-09-25
> 12:41:30 CDT,,0,PANIC,XX000,"unexpected pageaddr 74D/7C000000 in log
> file 1869, segment 226, offset 0",,,,,,,,,""
> 2012-09-25 12:41:30.530 CDT,,,9595,,5061ecc8.257b,1,,2012-09-25
> 12:41:28 CDT,,0,LOG,00000,"startup process (PID 9597) was terminated
> by signal 6: Aborted",,,,,,,,,""
> 2012-09-25 12:41:30.530 CDT,,,9595,,5061ecc8.257b,2,,2012-09-25
> 12:41:28 CDT,,0,LOG,00000,"terminating any other active server
> processes",,,,,,,,,""
>
> /staging/9.1/main/recovery.conf:
>
> restore_command = 'cp /staging/9.1/main/pg_xlog/%f "%p"'

/staging/9.1/main/pg_xlog is really archive directory? restore_command
should be set to the command which copies the archived WAL files from
the archive area.

Regards,

--
Fujii Masao


Re: WAL recovery failure

От
Rick Dicaire
Дата:
> /staging/9.1/main/pg_xlog is really archive directory? restore_command
> should be set to the command which copies the archived WAL files from
> the archive area.

Yes. This is the archive directory.

This begs another question, does a replicant write its own wal files to disk?


Re: WAL recovery failure

От
"Albe Laurenz"
Дата:
Rick Dicaire wrote:
>> /staging/9.1/main/pg_xlog is really archive directory?
restore_command
>> should be set to the command which copies the archived WAL files from
>> the archive area.
>
> Yes. This is the archive directory.

But according to what you wrote before you start the server
with -D /staging/9.1/main.
That means that /staging/9.1/main/pg_xlog contains the active
WAL files, right?

So the restore_command 'cp /staging/9.1/main/pg_xlog/%f "%p"'
would just try to cp the files onto themselves, right?

The WAL archive directory must be different from the
active WAL directory.  Check if that's your problem!

Yours,
Laurenz Albe


Re: WAL recovery failure

От
Rick Dicaire
Дата:
On Wed, Sep 26, 2012 at 5:21 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> But according to what you wrote before you start the server
> with -D /staging/9.1/main.
> That means that /staging/9.1/main/pg_xlog contains the active
> WAL files, right?
>
> So the restore_command 'cp /staging/9.1/main/pg_xlog/%f "%p"'
> would just try to cp the files onto themselves, right?
>
> The WAL archive directory must be different from the
> active WAL directory.  Check if that's your problem!

/staging/9.1/main/pg_xlog/ was a symlink to /staging/9.1/pg_xlog/,
removed the link, created
the directory, restarted postgres:

2012-09-26 06:29:20.094 CDT,,,13904,,5062e710.3650,1,,2012-09-26
06:29:20 CDT,,0,LOG,00000,"database system was interrupted while in
recovery at log time 2012-09-25 09:33:28 CDT",,"If this has occurred
more than once some data might be corrupted and you might need to
choose an earlier recovery target.",,,,,,,""
2012-09-26 06:29:20.094 CDT,,,13904,,5062e710.3650,2,,2012-09-26
06:29:20 CDT,,0,LOG,00000,"creating missing WAL directory
""pg_xlog/archive_status""",,,,,,,,,""
2012-09-26 06:29:20.117 CDT,,,13904,,5062e710.3650,3,,2012-09-26
06:29:20 CDT,,0,LOG,00000,"starting archive recovery",,,,,,,,,""
2012-09-26 06:29:20.361 CDT,,,13904,,5062e710.3650,4,,2012-09-26
06:29:20 CDT,,0,LOG,00000,"restored log file
""000000010000074D000000EE"" from archive",,,,,,,,,""
2012-09-26 06:29:20.498 CDT,,,13904,,5062e710.3650,5,,2012-09-26
06:29:20 CDT,,0,LOG,00000,"restored log file
""000000010000074D000000E2"" from archive",,,,,,,,,""
2012-09-26 06:29:20.498 CDT,,,13904,,5062e710.3650,6,,2012-09-26
06:29:20 CDT,,0,PANIC,XX000,"unexpected pageaddr 74D/7C000000 in log
file 1869, segment 226, offset 0",,,,,,,,,""
2012-09-26 06:29:20.499 CDT,,,13902,,5062e70f.364e,1,,2012-09-26
06:29:19 CDT,,0,LOG,00000,"startup process (PID 13904) was terminated
by signal 6: Aborted",,,,,,,,,""
2012-09-26 06:29:20.499 CDT,,,13902,,5062e70f.364e,2,,2012-09-26
06:29:19 CDT,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""

Two WAL files are read in, choking on the third. Is it a corrupt WAL file?