Discussion: "Stand-in" server recovery techniques

"Stand-in" server recovery techniques

From: "Kevin Grittner"
Date: August 22, 2007
I'm looking for advice on how to best switch from warm standby processing to
"stand-in" production use in our (rather unusual) environment.

THE ENVIRONMENT

We have 72 counties spread around the state, each with their own database,
which contains the official data for that county's court system.  We do PITR
backups to the file server in each county, and replicate the files (both
base backups and WAL files) back to a single central machine which hosts a
"farm" of warm standby PostgreSQL instances -- one for each county, all on
the same machine.  This serves to validate the integrity of the files
brought back from the counties.

Base backups are done weekly, and the WAN speed to a county depends
primarily on the size of its base backup -- we try to get a speed sufficient
to bring back the base backup within 24 hours.  We force WAL files to be
written at least once per hour, although many counties generate them much
more frequently during the work day.
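
For reference, the hourly forcing described above corresponds to the stock
archiving settings; a minimal sketch, with a made-up rsync target rather
than our actual configuration:

    # postgresql.conf on a county server (sketch; target path is invented)
    archive_command = 'rsync -a %p fileserver:/pitr/wal/%f'
    archive_timeout = 3600   # force a WAL segment switch at least hourly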

It turns out that 72 warm standby instances can keep a machine pretty busy,
so we won't want to use that machine for production.  We have two machines
ready to use for temporary "stand-ins" over the WAN in the event of a
catastrophic failure of a county machine.  We will get a database going on
one of these, in production mode.  This would only be done after recovery
efforts on the county machine have failed, and that machine has been powered
down.

We will bring the warm standby machine as current as we can using WAL files,
copy it to a stand-in machine, start our application middle tier and "top
off" the stand-in database with transactions from the transaction repository
our framework maintains (normally current to within a second).

Once this stand-in is running, we will want to become the source of WAL
files which can be applied to the warm standby on "the farm".
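
In other words, the stand-in's archiver would just point back at the farm;
roughly (host and path invented for illustration):

    # postgresql.conf on the stand-in (sketch)
    archive_command = 'rsync -a %p farm:/pitr/county/wal/%f'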

THE QUESTION

How do we create the PostgreSQL instance on the stand-in box?

I see four possibilities:

(1)  Restore the latest base backup and apply all WAL files available.
This is likely to be the slowest option.

(2)  Kick the warm standby on "the farm" into production mode, shut down
the instance, and then copy the instance directory.  This should be
relatively quick and safe, but has the down side of needing to restart
the warm standby from the latest base backup afterwards, if that is even
possible.  It seems like we might need to make a fresh base backup from the
(stopped) instance on the warm standby farm.

(3)  Stop the warm standby while it is in recovery mode, copy the instance
directory, and restart it.  On the stand-in box, have the script specified
by recovery.conf just "exit 1" to kick it immediately into production mode
(see the sketch after this list).  As long as the WAL files generated in
this situation work with the old warm standby, I don't see a down side,
but I'm not sure whether that is a safe assumption.

(4)  Capture information about where the warm standby is at and attempt a
PITR-style copy of the instance while the warm standby is running.  This
one seems riskier than (3) without significant benefit.
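
For (3), the recovery.conf on the stand-in would be trivial; since the
server runs restore_command through the shell, a failing one-liner should
end recovery on the first WAL request:

    # recovery.conf on the stand-in box (sketch)
    restore_command = 'exit 1'   # fail immediately; recovery ends and
                                 # the instance comes up in production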

So, I'm leaning toward option (3).  Does that sound safe and workable?  Will
the WAL files from the stand-in work with the old warm standby?  Did I miss
any good alternatives?

Thanks,

-Kevin



Re: "Stand-in" server recovery techniques

From: "Scott Marlowe"
Date: Wed, Aug 22, 2007 at 5:04 PM
On 8/22/07, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

> THE QUESTION
>
> How do we create the PostgreSQL instance on the stand-in box?
>
> I see four possibilities:
>
> (1)  Restore the latest base backup and apply all WAL files available.
> This is likely to be the slowest option.

I'm leaning towards option 1, because you mentioned that you will be
doing this ONLY in the event the primary on-site server can't come
back up.  So, I'm going to assume that you're going to spend a few
minutes (30 or so) trying to resurrect the primary server.

WHILE doing that, I would have the backup server running the WAL files
to get ready to go.

The nice thing about this setup is that it's the simplest to
implement, hence the least error prone, and you can therefore hand it
off to a worker bee while you work on getting the main server up and
running.

Then, when it's ready, you have your decision time, do you keep trying
to get the real server back up, or apply your remaining transactions
to the new server and go from there.

> (2)  Kick the warm standby on "the farm" into production mode, shut down
> the instance, and then copy the instance directory.  This should be
> relatively quick and safe, but has the down side of needing to restart
> the warm standby from the latest base backup afterwards, if that is even
> possible.  It seems like we might need to make a fresh base backup from the
> (stopped) instance on the warm standby farm.

I'm not sure exactly what you're saying here.  If you're saying what I
think you're saying, then you're already constantly playing the WAL
files as they arrive from off site.  If that's the case then this
option seems quite attractive in terms of getting you back up and
running fast.

Since the standby would now become production, making a snapshot of
the standby before promoting it would mean that you can then use that
snapshot elsewhere to seed a new standby -- if I understand what
you're saying correctly.

Re: "Stand-in" server recovery techniques

From: "Kevin Grittner"
Date:
Thanks for the reply, Scott.

>>> On Wed, Aug 22, 2007 at  5:04 PM, in message
<dcc563d10708221504k59e8961fyb830ec8eb83598fc@mail.gmail.com>, "Scott Marlowe"
<scott.marlowe@gmail.com> wrote:
> On 8/22/07, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>
>> How do we create the PostgreSQL instance on the stand-in box?
>>
>> (1)  Restore the latest base backup and apply all WAL files available.
>> This is likely to be the slowest option.
>
> I'm leaning towards option 1, because you mentioned that you will be
> doing this ONLY in the event the primary on-site server can't come
> back up.  So, I'm going to assume that you're going to spend a few
> minutes (30 or so) trying to resurrect the primary server.
>
> WHILE doing that, I would have the backup server running the WAL files
> to get ready to go.
>
> The nice thing about this setup is that it's the simplest to
> implement, hence the least error prone

Point taken.  That would mean kicking off a process while still in the early
stages of analyzing the problem, and the results would often not be used,
but the cost is minimal.

>> (2)  Kick the warm standby on "the farm" into production mode, shut down
>> the instance, and then copy the instance directory.  This should be
>> relatively quick and safe, but has the down side of needing to restart
>> the warm standby from the latest base backup afterwards, if that is even
>> possible.  It seems like we might need to make a fresh base backup from the
>> (stopped) instance on the warm standby farm.
>
> I'm not sure exactly what you're saying here.  If you're saying what I
> think you're saying, then you're already constantly playing the WAL
> files as they arrive from off site.  If that's the case then this
> option seems quite attractive in terms of getting you back up and
> running fast.
>
> Since the standby would now become production, making a snapshot of
> the standby before making it production would mean that you can then
> use the snapshot elsewhere for the standby.
>
> If I understand what you're saying correctly.

I'm not talking about running production on the warm standby farm.  Given
the load of running all the other warm standby instances, I don't believe
that performance would be acceptable.  What I'm talking about is creating
the semaphore file which would kick the PostgreSQL instance for that county
out of archive recovery mode.  (Our script would see the semaphore file and
exit with "file not found" when it was unable to immediately provide the
requested file.)  The database cluster state reported by pg_controldata
switches from "in archive recovery" to "in production" at that point.  We
would then stop that PostgreSQL instance and scp it to the stand-in server.
We would start that up and be in business again within a few minutes.
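
The semaphore check amounts to something like this (script name and paths
here are invented for illustration, not our actual code):

    #!/bin/sh
    # Sketch of the restore script; wired up in recovery.conf as:
    #   restore_command = 'restore_wal.sh %f %p'
    TRIGGER=/pgfarm/county/failover.trigger
    WAL_DIR=/pgfarm/county/wal

    # Wait for the requested WAL file unless the semaphore says to stop.
    while [ ! -f "$WAL_DIR/$1" ]; do
        if [ -f "$TRIGGER" ]; then
            # Report "file not found"; recovery ends and pg_controldata
            # flips from "in archive recovery" to "in production".
            exit 1
        fi
        sleep 5
    done
    cp "$WAL_DIR/$1" "$2"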

My biggest concern with this approach is how we would get back into a
backed-up state.  I don't want to burden the stand-in machine with a PITR
base backup when users are working remotely and trying to make up for lost
time.  If I could somehow use the copy from the warm standby box to make a
base, that would be ideal.  That seems like it should be possible, since it
is the starting point from which we would then be generating WAL files in
production, but I'm not confident that I'd know what I was doing.
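
Concretely, I imagine something along these lines (paths invented, and
untested):

    # The cleanly stopped instance's directory is itself a cold base
    # backup; archive it and note the checkpoint it starts from.
    tar czf county-base.tar.gz -C /pgfarm county
    pg_controldata /pgfarm/county   # record checkpoint / timeline info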

I could wait for day-end and keep my fingers crossed that the stand-in
didn't also fail, but that's not an entirely comfortable position, even
with our transaction repository.

-Kevin