Thread: Continuous archiving fails routinely with "invalid magic number" error

Continuous archiving fails routinely with "invalid magic number" error

From
"Bruce Reed"
Date:
I am having problems getting continuous archiving to work reliably on my Postgres 8.3 system. My primary is running 8.3.1, has archiving turned on, and writes logs to an NFS mount that lives on the backup server. The warm standby is running 8.3.3. Initially things work fine: the primary copies logs to the archive directory correctly, and the warm standby, in recovery mode, plays them back okay. At some point the following appears in the syslog:

May 26 08:58:37 ods2-prod postgres[3827]: [1424-1] LOG:  restored log file "00000001000000C2000000E3" from archive
May 26 08:58:51 ods2-prod postgres[3827]: [1425-1] LOG:  restored log file "00000001000000C2000000E4" from archive
May 26 08:58:52 ods2-prod postgres[3827]: [1426-1] LOG:  restored log file "00000001000000C2000000E5" from archive
May 26 08:58:55 ods2-prod postgres[3827]: [1427-1] LOG:  invalid magic number 0000 in log file 194, segment 229, offset 10780672
May 26 08:58:55 ods2-prod postgres[3827]: [1428-1] LOG:  redo done at C2/E5A45C80
May 26 08:58:55 ods2-prod postgres[3827]: [1429-1] LOG:  last completed transaction was at log time 2009-05-26 08:58:21.804665+00
May 26 08:58:56 ods2-prod postgres[3827]: [1430-1] LOG:  restored log file "00000001000000C2000000E5" from archive
May 26 08:59:11 ods2-prod postgres[3827]: [1431-1] LOG:  selected new timeline ID: 2
May 26 08:59:26 ods2-prod postgres[3827]: [1432-1] LOG:  could not create archive status file "pg_xlog/archive_status/00000002.history.ready": No such file or directory
May 26 08:59:26 ods2-prod postgres[3827]: [1433-1] LOG:  archive recovery complete

I've had this occur three times since setting up the warm standby. The log file looks intact, and I can't see any errors in the primary's syslog at the time the log is generated and copied. There are no NFS or space issues on the backup server either. Can anyone suggest how to go about diagnosing this problem?

Bruce
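
The error line gives the log file and segment in decimal (194 and 229), while the archived file names use hexadecimal. A minimal Python sketch of that mapping, plus a helper to look at the two bytes at the reported offset, might look like this. The function names are hypothetical, and the 8192-byte WAL page size is the compile-time default, assumed here rather than confirmed in the thread.

```python
XLOG_PAGE_SIZE = 8192  # default WAL page size; an assumption, not stated in the thread


def wal_file_name(timeline: int, log_id: int, segment: int) -> str:
    """Build the 24-hex-digit WAL segment file name used by 8.3-era servers."""
    return f"{timeline:08X}{log_id:08X}{segment:08X}"


def read_page_magic(path: str, offset: int) -> bytes:
    """Return the first two bytes at `offset`, where the page magic is stored.

    Returns fewer than two bytes if the file ends before that point.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(2)


if __name__ == "__main__":
    print(wal_file_name(1, 194, 229))  # 00000001000000C2000000E5
    print(hex(10780672))               # 0xa48000
    print(10780672 % XLOG_PAGE_SIZE)   # 0, so the offset is a page boundary
```

Running `read_page_magic` on the archived copy of 00000001000000C2000000E5 at offset 10780672 would show whether the page header there is really zeroed in the archive, or whether the standby only saw zeros at the moment it read the file.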

Re: Continuous archiving fails routinely with "invalid magic number" error

From
"Kevin Grittner"
Date:
"Bruce Reed" <breed@dvdplay.com> wrote:

> May 26 08:58:55 ods2-prod postgres[3827]: [1427-1] LOG:  invalid
> magic number 0000 in log file 194, segment 229, offset 10780672

Try copying to a different name or directory on the target volume and
renaming or moving into place.  It sounds as though maybe the copy is
allocating the file at full size and then copying in the contents, and
the recovery process is sometimes picking it up before it's done.
Either that or the file is being corrupted in transit.

-Kevin

Re: Continuous archiving fails routinely with "invalid magic number" error

From
Bruce Reed
Date:
I can, though I think I'll tweak the pg_standby parameters and see if linking
instead of copying fixes the problem.  pg_standby would be at fault here,
wouldn't it? It seems NFS transport is a common way to manage this, in fact
it's mentioned as a solution on the Continuous Archiving doc page. It's a
fast link with low latency and NFS copy is only slightly slower than local
disk copy. I thought with NFS 3 and synchronous write the remote end can't
put the cart before the horse.

Maybe debug output will turn up something on next failure.

Bruce


On 5/28/09 5:18 PM, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> "Bruce Reed" <breed@dvdplay.com> wrote:
>
>> May 26 08:58:55 ods2-prod postgres[3827]: [1427-1] LOG:  invalid
>> magic number 0000 in log file 194, segment 229, offset 10780672
>
> Try copying to a different name or directory on the target volume and
> renaming or moving into place.  It sounds as though maybe the copy is
> allocating the file at full size and then copying in the contents, and
> the recovery process is sometimes picking it up before it's done.
> Either that or the file is being corrupted in transit.
>
> -Kevin


Re: Continuous archiving fails routinely with "invalid magic number" error

From
Tom Lane
Date:
"Bruce Reed" <breed@dvdplay.com> writes:
> I am having problems getting continuous archiving to work reliably on my
> Postgres 8.3 system. My primary is running 8.3.1, has archiving turned on
> and writes logs to a NFS mount that lives on the backup server. The warm
> standby is running 8.3.3. Initially things work fine with the primary
> copying logs to the archive directory correctly and the warm standby  in
> recovery mode plays them back okay. At some point the following appears in
> the syslog:

I'm wondering whether you are doing anything to ensure that the copy
operation appears atomic to the standby?  This behavior looks about like
what I'd expect if the standby tried to read a WAL file before it'd been
completely copied.  You might try copying files to some temporary name
that doesn't look like a WAL file name, and then renaming them into
place when the copy is complete.
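
The copy-then-rename idea can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not the script used on either server: the function name and the ".partial-" prefix are hypothetical, and the temporary file is placed in the archive directory itself so that the final rename stays within one filesystem.

```python
import os
import shutil


def archive_wal_atomically(src: str, archive_dir: str) -> str:
    """Copy `src` into `archive_dir` under a temporary name, then rename it.

    The standby never sees a file with a WAL-style name until the copy has
    been written and flushed in full.
    """
    name = os.path.basename(src)
    final_path = os.path.join(archive_dir, name)
    # Hypothetical temporary name that does not look like a WAL file name.
    tmp_path = os.path.join(archive_dir, ".partial-" + name)

    # Refuse to overwrite a segment that has already been archived.
    if os.path.exists(final_path):
        raise FileExistsError(final_path)

    with open(src, "rb") as fin, open(tmp_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())  # push the data out before exposing the file

    # Rename within one directory, so readers see either no file or all of it.
    os.rename(tmp_path, final_path)
    return final_path
```

A wrapper like this would be called from archive_command with the segment path and the archive directory as arguments. Whether the rename looks atomic to a reader on another NFS client depends on the NFS setup, so treat that as something to verify rather than a guarantee.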

BTW, I see another issue here:

> May 26 08:59:26 ods2-prod postgres[3827]: [1432-1] LOG:  could not create
> archive status file "pg_xlog/archive_status/00000002.history.ready": No such
> file or directory

Does the archive_status/ subdirectory exist?  If not, create it by hand,
making sure it has the same ownership and permissions as pg_xlog/.
It's fairly easy to forget to recreate this directory when you are
setting up a backup system.  CVS HEAD has some code to recreate it
automatically if it's not there, but that's not in 8.3.

            regards, tom lane
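
For the archive_status/ point above, a small Python sketch of creating the directory by hand with the same ownership and permissions as pg_xlog/ could look like this. The function name is hypothetical, a POSIX system is assumed, and changing ownership to a different user generally requires running as root.

```python
import os
import stat


def ensure_archive_status(pg_xlog_dir: str) -> str:
    """Create pg_xlog/archive_status if missing, matching pg_xlog's owner and mode."""
    status_dir = os.path.join(pg_xlog_dir, "archive_status")
    parent = os.stat(pg_xlog_dir)

    os.makedirs(status_dir, exist_ok=True)
    # Copy the permission bits from pg_xlog/.
    os.chmod(status_dir, stat.S_IMODE(parent.st_mode))

    # Only change ownership when it actually differs from pg_xlog/.
    current = os.stat(status_dir)
    if (current.st_uid, current.st_gid) != (parent.st_uid, parent.st_gid):
        os.chown(status_dir, parent.st_uid, parent.st_gid)
    return status_dir
```

This would be run once against the standby's data directory before starting recovery, for example `ensure_archive_status("/path/to/data/pg_xlog")`, where the path is a placeholder.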