Thread: pg_standby, Restartable Recovery after Hard Failure

pg_standby, Restartable Recovery after Hard Failure

From
"Thomas F. O'Connell"
Date:
Wanting a nice test of restartable recovery and pg_standby in a warm standby server scenario I'm testing, today I pulled the plug on the box where I was using Simon's test_warm_standby test harness. Basically, in this scenario, I had one postgres cluster (primary) against which pgbench was being run and a separate cluster (standby) that had been created from a base backup and then put into continuous recovery using pg_standby. In the middle of this scenario, I literally pulled the plug.

When the box came back up, I restarted primary. Everything came up fine. Then I restarted standby. Here's what I got:

Trigger file            : /tmp/pgsql.trigger.5442
Waiting for WAL file    : ../archive/00000001000000000000000E
WAL file path           : 00000001000000000000000E
Restoring to...         : pg_xlog/RECOVERYXLOG
Sleep interval          : 5 seconds
Max wait interval       : 0 forever
Command for restore     : cp ../archive/00000001000000000000000E pg_xlog/RECOVERYXLOG
Num archived files kept : all files
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
...

So something seems to have misfired in pg_standby. I'm having a hard time telling what might have hung it up. I wound up recovering by touching the trigger file, and standby came up as a running postgres server, but it was behind, probably as far as 00000001000000000000000E. The curious part is that all the files were in the archive, so what state would pulling the plug have set that pg_standby either interpreted incorrectly or failed to interpret?
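For readers following the log above, here is a minimal sketch of the kind of polling loop that output implies: look for the WAL file, then look for the trigger file, then sleep. It is not pg_standby's actual code; the function name `wait_for_wal` and its return values are hypothetical, and only the 5-second sleep and the "0 means forever" convention are taken from the log.

```python
import os
import time


def wait_for_wal(wal_path, trigger_path, sleep_s=5.0, max_wait_s=0.0):
    """Poll until wal_path exists, trigger_path appears, or the wait times out.

    Returns "restore", "trigger", or "timeout" (hypothetical labels).
    max_wait_s == 0 means wait forever, as in "Max wait interval : 0 forever".
    """
    waited = 0.0
    while True:
        if os.path.exists(wal_path):
            return "restore"
        # WAL file not present yet: check for the trigger file.
        if os.path.exists(trigger_path):
            return "trigger"
        if max_wait_s > 0 and waited >= max_wait_s:
            return "timeout"
        time.sleep(sleep_s)
        waited += sleep_s
```

In this sketch a relative path such as ../archive/00000001000000000000000E is resolved against the current working directory of the process doing the check, so the same file can look "not present" from one directory and present from another.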

When I tested a lighter weight version of this scenario merely by killing standby from the command line and then restarting it, it did this:

Trigger file            : /tmp/pgsql.trigger.5442
Waiting for WAL file    : ../archive/00000001.history
WAL file path           : 00000001.history
Restoring to...         : pg_xlog/RECOVERYHISTORY
Sleep interval          : 5 seconds
Max wait interval       : 0 forever
Command for restore     : cp ../archive/00000001.history pg_xlog/RECOVERYHISTORY
Num archived files kept : all files
running restore         :cp: cannot access ../archive/00000001.history
cp: cannot access ../archive/00000001.history
cp: cannot access ../archive/00000001.history
not restored            : history file not found

But then it got back in the game and continued the continuous recovery process. I was able then to complete final recovery, and it seemed caught up.

If anyone can shed light on what might've happened in the hard failure scenario, I'd be interested to know. I've kept the various archive, primary, and standby directories created by test_warm_standby, so I can report on any file contents.
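One way to report on the kept archive directory is to list which segment names in a range are absent. The sketch below is an illustration only: the helper names are hypothetical, and it assumes the pre-9.3 naming scheme with 16 MB segments, where the last eight hex digits run from 00 to FE before the middle field increments.

```python
import os
import re

WAL_NAME = re.compile(r"^[0-9A-F]{24}$")
SEGS_PER_LOG = 0xFF  # assumption: 16 MB segments, so segment numbers run 00..FE


def parse_wal_name(name):
    """Split a 24-hex-digit WAL file name into (timeline, log, segment)."""
    if not WAL_NAME.match(name):
        raise ValueError("not a WAL segment name: %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def next_wal_name(name):
    """Return the name of the segment that follows `name` on the same timeline."""
    timeline, log, seg = parse_wal_name(name)
    seg += 1
    if seg >= SEGS_PER_LOG:
        log, seg = log + 1, 0
    return "%08X%08X%08X" % (timeline, log, seg)


def missing_segments(archive_dir, first, last):
    """Return segment names in first..last (inclusive) absent from archive_dir."""
    first_parts, last_parts = parse_wal_name(first), parse_wal_name(last)
    if first_parts[0] != last_parts[0] or first_parts > last_parts:
        raise ValueError("first and last must be ordered and on one timeline")
    present = set(os.listdir(archive_dir))
    missing = []
    name = first
    while True:
        if name not in present:
            missing.append(name)
        if name == last:
            return missing
        name = next_wal_name(name)
```

This checks presence only. It says nothing about whether each file is complete, which may matter after a power loss.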

It occurs to me that timestamp information might be nice to have in pg_standby with debug mode. I might try patching pg_standby.c if no one beats me to it.
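As an illustration of what timestamped debug output could look like, here is a small sketch. It is written in Python rather than as a patch to pg_standby.c, the function name `debug_line` is hypothetical, and the timestamp format is an assumption. The label width matches the debug lines quoted earlier.

```python
import sys
import time


def debug_line(label, value, now=None, out=sys.stderr):
    """Write one pg_standby-style debug line prefixed with a local timestamp.

    `now` is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    # Pad the label to 24 columns so the colons line up as in the quoted log.
    line = "%s %-24s: %s" % (stamp, label, value)
    out.write(line + "\n")
    return line
```

With a timestamp on every "WAL file not present yet" line, it would be possible to line the standby's output up against the archive's file modification times.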

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance :

615-260-0005

Re: pg_standby, Restartable Recovery after Hard Failure

From
"Simon Riggs"
Date:
On Wed, 2007-04-18 at 18:11 -0500, Thomas F. O'Connell wrote:
> Wanting a nice test of restartable recovery and pg_standby in a warm
> standby server scenario I'm testing, today I pulled the plug on the
> box where I was using Simon's test_warm_standby test harness.
> Basically, in this scenario, I had one postgres cluster (primary)
> against which pgbench was being run and a separate cluster (standby)
> that had been created from a base backup and then put into continuous
> recovery using pg_standby. In the middle of this scenario, I literally
> pulled the plug.

> If anyone can shed light on what might've happened in the hard failure
> scenario, I'd be interested to know. I've kept the various archive,
> primary, and standby directories created by test_warm_standby, so I
> can report on any file contents.

There is a possible error situation for which I recently posted a fix for
8.3. If you use the -k option with a low value then pg_standby can
delete files from the archive too quickly, so that when it tries to
return to the last restartpoint the file for that is missing.

The fix posted passes a new %r parameter to pg_standby, allowing it to
avoid deleting files too early.
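The difference between the two cleanup rules can be sketched as follows. This is an illustration under stated assumptions, not pg_standby's code: both function names are hypothetical, all files are assumed to be on one timeline so that names sort in WAL order, and a keep count of 0 is taken to mean "keep all files", as in the debug output quoted earlier in the thread.

```python
def _wal_names(archived):
    """Keep only 24-character WAL segment names, sorted in WAL order."""
    return sorted(name for name in archived if len(name) == 24)


def removable_by_keep_count(archived, next_needed, keep):
    """Hypothetical -k style rule: keep only the last `keep` files before next_needed."""
    if keep <= 0:
        return []  # 0 means keep all files
    older = [name for name in _wal_names(archived) if name < next_needed]
    return older[:-keep]


def removable_by_restart_file(archived, restart_file):
    """Hypothetical %r style rule: remove only files that sort before restart_file."""
    return [name for name in _wal_names(archived) if name < restart_file]
```

For example, with segments ...0A through ...0E archived, ...0F needed next, and the last restartpoint in ...0B, a keep count of 2 would remove ...0A, ...0B and ...0C, including files still needed to restart, while the restart-file rule would remove only ...0A.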

If that doesn't explain your situation please send me logfiles etc
off-list so I can examine the details some more. Thanks.

> It occurs to me that timestamp information might be nice to have in
> pg_standby with debug mode. I might try patching pg_standby.c if no
> one beats me to it.

I'll look at doing this.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: pg_standby, Restartable Recovery after Hard Failure

From
"Thomas F. O'Connell"
Date:
On Apr 23, 5:36 am, s...@2ndquadrant.com ("Simon Riggs") wrote:
> On Wed, 2007-04-18 at 18:11 -0500, Thomas F. O'Connell wrote:
> > Wanting a nice test of restartable recovery and pg_standby in a warm
> > standby server scenario I'm testing, today I pulled the plug on the
> > box where I was using Simon's test_warm_standby test harness.
> > Basically, in this scenario, I had one postgres cluster (primary)
> > against which pgbench was being run and a separate cluster (standby)
> > that had been created from a base backup and then put into continuous
> > recovery using pg_standby. In the middle of this scenario, I literally
> > pulled the plug.
> > If anyone can shed light on what might've happened in the hard failure
> > scenario, I'd be interested to know. I've kept the various archive,
> > primary, and standby directories created by test_warm_standby, so I
> > can report on any file contents.
>
> There is a possible error situation that I recently posted a fix to for
> 8.3. If you use the -k option with a low value then pg_standby can
> delete files from the archive too quickly, so that when it tries to
> return to the last restartpoint the file for that is missing.
>
> The fix posted passes a new %r parameter to pg_standby, allowing it to
> avoid deleting files too early.
>
> If that doesn't explain your situation please send me logfiles etc
> off-list so I can examine the details some more. Thanks.
>
> > It occurs to me that timestamp information might be nice to have in
> > pg_standby with debug mode. I might try patching pg_standby.c if no
> > one beats me to it.
>
> I'll look at doing this.
>
> --
>   Simon Riggs            
>   EnterpriseDB  http://www.enterprisedb.com

Simon,

I'm using revision 1.3 of pg_standby.c from here:


I don't see the %r parameter there. Is your patch not in HEAD yet, or is there somewhere else I should be looking?

I'm testing it as built against 8.2.3 and wasn't using -k during testing.

In later tests, we had some catastrophic data loss on disk, so I'm wondering if perhaps the hard failure created an anomalous scenario more related to Solaris and our disk subsystem than to pg_standby and continuous recovery.

--
Thomas F. O'Connell
