Re: Fault Tolerant Postgresql (two machines, two postmasters, one disk array)
From | Andrew Sullivan |
---|---|
Subject | Re: Fault Tolerant Postgresql (two machines, two postmasters, one disk array) |
Date | |
Msg-id | 20070517143525.GJ6907@phlogiston.dyndns.org |
In response to | Re: Fault Tolerant Postgresql (two machines, two postmasters, one disk array) (John Gateley <gateley@jriver.com>) |
Responses |
Re: Fault Tolerant Postgresql (two machines, two postmasters, one disk array)
("John D. Burger" <john@mitre.org>)
Re: Fault Tolerant Postgresql (two machines, two postmasters, one disk array) (Ron Johnson <ron.l.johnson@cox.net>) |
List | pgsql-general |
On Mon, May 14, 2007 at 10:42:13AM -0500, John Gateley wrote:
> Thanks very much to all who responded, the replies were very helpful.

One thing I will mention, that seems not to have come out in a number of the replies: the details _really really_ count when you set up this sort of multi-machine hot failover arrangement.

The general idea is that you have two machines, and the "standby" machine notices when the "hot" machine disappears, and then mounts the disk on the standby and takes over for the (now failed) hot machine.

The problems come when you get a false detection of machine failure. Consider a case, for instance, where machine A gets overloaded, goes into swap madness, or has a billion runaway processes that cause it to stagger. In this case, A might not respond in time on the heartbeat monitor, and then the standby machine B thinks A has failed. But A doesn't know that, of course, because it is working as hard as it can just to stay up. Now, if B mounts the disk and starts the postmaster, but doesn't have a way to make _sure_ that A is completely disconnected from the disk, then it's entirely possible A will flush buffers out to the still-mounted data area. Poof! Instant data corruption.

People often dismiss these sorts of scenarios as unlikely, because of the timing issues involved. But you have to remember that, if you're building this kind of high-availability system, you've already built your individual servers to be very fault tolerant anyway. They have loads of extra capacity, ECC memory, multiple redundant data paths, RAID -- all the goodies. So you're talking about an already unlikely failure scenario. If you're going to the effort to get an "extra 9" of availability, then you have to think about not only how to ensure you get that availability, but the consequences of failure. In this case, the consequence of having two systems mount the same data area is extremely serious, and you have to be _absolutely sure_ that A is dead and disconnected from the disk when B mounts that disk. Anything else is just asking for your weekend to be ruined by a data recovery.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of saying November.
		--H.W. Fowler
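
The takeover rule described in the message can be sketched roughly as follows. This is only an illustration, not any real cluster manager's API: the `Standby` class, the `heartbeat_ok` and `fence_peer` callbacks, and the `missed_limit` threshold are all hypothetical names invented here. The point it tries to show is that a missed heartbeat alone never leads to mounting the disk; B takes over only after fencing of A has been positively confirmed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Standby:
    """Hypothetical standby node B deciding whether to take over from A."""
    heartbeat_ok: Callable[[], bool]  # assumed probe: True if A answered in time
    fence_peer: Callable[[], bool]    # assumed hook: True only if A is confirmed cut off from the disk
    missed_limit: int = 3             # consecutive misses before A is suspected dead
    missed: int = 0
    log: List[str] = field(default_factory=list)

    def tick(self) -> str:
        """Run one monitoring cycle and return the action taken."""
        if self.heartbeat_ok():
            self.missed = 0
            return "standby"
        self.missed += 1
        if self.missed < self.missed_limit:
            return "suspect"
        # Silence is not proof of death: an overloaded A may still be
        # alive and able to flush buffers to the shared data area.
        if not self.fence_peer():
            self.log.append("fencing unconfirmed; refusing to mount")
            return "refuse"
        self.log.append("peer fenced; safe to mount disk and start postmaster")
        return "takeover"
```

With a fencing hook that cannot confirm A is disconnected, the standby keeps refusing rather than mounting the disk, which trades some availability for not corrupting the data area. What `fence_peer` actually does (cutting power, revoking access to the array, and so on) is deliberately left out, since that is exactly where the details count.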