Обсуждение: "Stand-in" server recovery techniques
I'm looking for advice on how best to switch from warm standby processing to "stand-in" production use in our (rather unusual) environment.

THE ENVIRONMENT

We have 72 counties spread around the state, each with their own database, which contains the official data for that county's court system. We do PITR backups to the file server in each county, and replicate the files (both base backups and WAL files) back to a single central machine which hosts a "farm" of warm standby PostgreSQL instances -- one for each county, all on the same machine. This serves to validate the integrity of the files brought back from the counties.

Base backups are done weekly, and the WAN speed to a county depends primarily on the size of its base backup -- we try to get a speed sufficient to bring back the base backup within 24 hours. We force WAL files to be written at least once per hour, although many counties generate them much more frequently during the work day.

It turns out that 72 warm standby instances can keep a machine pretty busy, so we don't want to use that machine for production. We have two machines ready to use as temporary "stand-ins" over the WAN in the event of a catastrophic failure of a county machine. We would get a database going on one of these, in production mode. This would only be done after recovery efforts on the county machine have failed and that machine has been powered down. We will bring the warm standby as current as we can using WAL files, copy it to a stand-in machine, start our application middle tier, and "top off" the stand-in database with transactions from the transaction repository our framework maintains (normally current to within a second). Once this stand-in is running, we will want it to become the source of WAL files which can be applied to the warm standby on "the farm".

THE QUESTION

How do we create the PostgreSQL instance on the stand-in box?
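(For context on the hourly WAL forcing mentioned above: one standard way to get this, assuming 8.2 or later, is archive_timeout plus an archive_command wrapper. The sketch below is generic, with placeholder paths and names -- not our actual script.)

```shell
#!/bin/sh
# Generic archive_command wrapper (placeholder paths; assumes PostgreSQL 8.2+).
# postgresql.conf would reference it like:
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f'
#   archive_timeout = 3600    # force a segment switch at least hourly

# archive_wal <path-to-segment> <segment-file-name>
archive_wal() {
    wal_path="$1"                              # %p: path to the WAL segment
    wal_name="$2"                              # %f: bare file name
    archive_dir="${ARCHIVE_DIR:-/pgarchive}"   # county file server mount (assumed)

    # Fail rather than overwrite: a non-zero exit tells the server the
    # segment was NOT archived, so it retries instead of losing data.
    [ ! -f "$archive_dir/$wal_name" ] || return 1
    cp "$wal_path" "$archive_dir/$wal_name"
}
```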
I see four possibilities:

(1) Restore the latest base backup and apply all WAL files available. This is likely to be the slowest option.

(2) Kick the warm standby on "the farm" into production mode, shut down the instance, and then copy the instance directory. This should be relatively quick and safe, but has the down side of needing to restart the warm standby from the latest base backup afterwards, if that is even possible. It seems like we might need to make a fresh base backup from the (stopped) instance on the warm standby farm.

(3) Stop the warm standby while it is in recovery mode, copy the instance directory, and restart it. On the stand-in box, have the script specified by recovery.conf just "exit 1" to kick it immediately into production mode. As long as the WAL files generated in this situation work with the old warm standby, I don't see a down side, but I'm not sure that is a safe assumption.

(4) Capture information about where the warm standby is and attempt a PITR-style copy of the instance while the warm standby is running. This one seems riskier than (3) without significant benefit.

So, I'm leaning toward option (3). Does that sound safe and workable? Will the WAL files from the stand-in work with the old warm standby? Did I miss any good alternatives?

Thanks,

-Kevin
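(Option (3)'s recovery.conf script is often a shade more than a bare "exit 1": the common warm-standby pattern is a restore_command that serves archived WAL normally until a trigger file appears, then fails. A sketch, with illustrative paths and names that are assumptions, not anyone's production layout:)

```shell
#!/bin/sh
# Sketch of a trigger-file restore_command, referenced from recovery.conf as:
#   restore_command = '/usr/local/bin/restore_wal.sh %f %p'
# All paths below are illustrative assumptions.

# restore_wal <wal-file-name> <destination-path>
restore_wal() {
    wal_name="$1"                                  # %f: file the server asked for
    dest_path="$2"                                 # %p: where to copy it
    archive_dir="${ARCHIVE_DIR:-/pgarchive}"       # replicated WAL directory (assumed)
    trigger="${TRIGGER_FILE:-/tmp/pg_failover}"    # semaphore file (assumed)

    # Once the trigger exists, report "not found" so recovery ends and the
    # instance comes up in production mode.  On the stand-in box the trigger
    # would be created before the first start -- an immediate failure,
    # equivalent to the bare "exit 1" described above.
    [ ! -f "$trigger" ] || return 1
    cp "$archive_dir/$wal_name" "$dest_path" 2>/dev/null || return 1
}
```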
On 8/22/07, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> THE QUESTION
>
> How do we create the PostgreSQL instance on the stand-in box?
>
> I see four possibilities:
>
> (1) Restore the latest base backup and apply all WAL files available.
> This is likely to be the slowest option.

I'm leaning towards option 1, because you mentioned that you will be doing this ONLY in the event the primary on site server can't come back up. So, I'm going to assume that you're going to spend a few minutes (30 or so) trying to resurrect the primary server.

WHILE doing that, I would have the backup server running the WAL files to get ready to go.

The nice thing about this setup is that it's the simplest to implement, hence the least error prone, and you can therefore hand it off to a worker bee while you work on getting the main server up and running. Then, when it's ready, you have your decision time: do you keep trying to get the real server back up, or apply your remaining transactions to the new server and go from there?

> (2) Kick the warm standby on "the farm" into production mode, shut down
> the instance, and then copy the instance directory.  This should be
> relatively quick and safe, but has the down side of needing to restart
> the warm standby from the latest base backup afterwards, if that is even
> possible.  It seems like we might need to make a fresh base backup from
> the (stopped) instance on the warm standby farm.

I'm not sure exactly what you're saying here. If you're saying what I think you're saying, then you're already constantly playing the WAL files as they arrive from off site. If that's the case then this option seems quite attractive in terms of getting you back up and running fast.

Since the standby would now become production, making a snapshot of the standby before making it production would mean that you can then use the snapshot elsewhere for the standby. If I understand what you're saying correctly.
Thanks for the reply, Scott.

>>> On Wed, Aug 22, 2007 at 5:04 PM, in message
<dcc563d10708221504k59e8961fyb830ec8eb83598fc@mail.gmail.com>, "Scott
Marlowe" <scott.marlowe@gmail.com> wrote:
> On 8/22/07, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>
>> How do we create the PostgreSQL instance on the stand-in box?
>>
>> (1) Restore the latest base backup and apply all WAL files available.
>> This is likely to be the slowest option.
>
> I'm leaning towards option 1, because you mentioned that you will be
> doing this ONLY in the event the primary on site server can't come
> back up.  So, I'm going to assume that you're going to spend a few
> minutes (30 or so) trying to resurrect the primary server.
>
> WHILE doing that, I would have the backup server running the WAL files
> to get ready to go.
>
> The nice thing about this setup is that it's the simplest to
> implement, hence the least error prone

Point taken. That would mean kicking off a process while still in the early stages of analyzing the problem, and the results would often not be used, but the cost is minimal.

>> (2) Kick the warm standby on "the farm" into production mode, shut down
>> the instance, and then copy the instance directory.  This should be
>> relatively quick and safe, but has the down side of needing to restart
>> the warm standby from the latest base backup afterwards, if that is even
>> possible.  It seems like we might need to make a fresh base backup from
>> the (stopped) instance on the warm standby farm.
>
> I'm not sure exactly what you're saying here.  If you're saying what I
> think you're saying, then you're already constantly playing the WAL
> files as they arrive from off site.  If that's the case then this
> option seems quite attractive in terms of getting you back up and
> running fast.
>
> Since the standby would now become production, making a snapshot of
> the standby before making it production would mean that you can then
> use the snapshot elsewhere for the standby.
> If I understand what you're saying correctly.

I'm not talking about running production on the warm standby farm. Given the load of running all the other warm standby instances, I don't believe that performance would be acceptable.

What I'm talking about is creating the semaphore file which would kick the PostgreSQL instance for that county out of archive recovery mode. (Our script would see the semaphore file and exit with "file not found" when it was unable to immediately provide the requested file.) The database cluster state reported by pg_controldata switches from "in archive recovery" to "in production" at that point. We would then stop that PostgreSQL instance and scp it to the stand-in server. We would start that up and be in business again within a few minutes.

My biggest concern with this approach is how we would get back into a backed-up state. I don't want to burden the stand-in machine with a PITR base backup while users are working remotely and trying to make up for lost time. If I could somehow use the copy from the warm standby box to make a base backup, that would be ideal. That seems like it should be possible, since it is the starting point from which we would then be generating WAL files in production, but I'm not confident that I'd know what I was doing. I could wait for day-end and keep my fingers crossed that the stand-in didn't also fail, but that's not an entirely comfortable position, even with our transaction repository.

-Kevin
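(The cluster-state transition described above is easy to script around; a minimal helper that pulls the state line out of pg_controldata output, assuming the stock output format -- the data directory path in the usage comment is a placeholder:)

```shell
#!/bin/sh
# Extract "Database cluster state" from pg_controldata output.
# Reads the output on stdin, so it can be exercised without a live cluster.
cluster_state() {
    sed -n 's/^Database cluster state: *//p'
}

# Typical use (assumed path): confirm the instance has left recovery
# before stopping it and copying the directory to the stand-in box.
#   state=$(pg_controldata /var/pgsql/data | cluster_state)
#   [ "$state" = "in production" ] && echo "safe to stop and copy"
```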