Re: Fully-automatic streaming replication failover when master dies?

From Dmitry Koterov
Subject Re: Fully-automatic streaming replication failover when master dies?
Date
Msg-id CA+CZih4zytO+ap1=_eOL15R5TAkA5wMz-YdtAUHn66vXxaxKqw@mail.gmail.com
In reply to Re: Fully-automatic streaming replication failover when master dies?  (Scott Marlowe <scott.marlowe@gmail.com>)
Responses Re: Fully-automatic streaming replication failover when master dies?
List pgsql-general
Complex in its implementation - maybe. Complex in its configuration and ideology - nope. 

> Are you running your cluster in synchronous mode across geographically diverse data centers?
Config option #1 ("allow replicas to re-bind to the second synchronous master if the first one fails, and allow the second master to run separately").

> How long do you wait for the master to come back before you fail over?
Config option #2 ("how many missed heartbeats cause the automatic failover process").
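
To make option #2 concrete, here is a minimal Python sketch of a missed-heartbeat threshold. It is only an illustration under assumptions: `FailoverPolicy`, `HeartbeatMonitor` and `max_missed_heartbeats` are hypothetical names, not existing PostgreSQL settings or APIs, and a real tool would also have to deal with fencing and split-brain.

```python
from dataclasses import dataclass


@dataclass
class FailoverPolicy:
    # Hypothetical knob mirroring "config option #2" above;
    # this is NOT a real PostgreSQL parameter.
    max_missed_heartbeats: int = 3


class HeartbeatMonitor:
    """Counts consecutive missed heartbeats from the master."""

    def __init__(self, policy: FailoverPolicy) -> None:
        self.policy = policy
        self.missed = 0

    def record(self, heartbeat_received: bool) -> bool:
        """Record one heartbeat interval; return True if failover should start."""
        if heartbeat_received:
            # Any successful heartbeat resets the counter.
            self.missed = 0
            return False
        self.missed += 1
        return self.missed >= self.policy.max_missed_heartbeats
```

A monitoring process would call `record()` once per heartbeat interval and start promoting a replica only when it returns `True`. How large the threshold should be is exactly the "a millisecond, a second, a minute" question, so it stays a per-site setting.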

There could be more options, of course. But none of this is rocket science; it's just not implemented yet, I suppose. I thought your point was "The difference is that in MongoDB automatic failover is simple, in PostgreSQL it is much more complex". I don't agree with that: I think the two tasks have more or less the same complexity. There is no "silver bullet" that avoids all data loss for either PostgreSQL or MongoDB (though both support synchronous multi-node commits, at the cost of a performance penalty: PostgreSQL via synchronous replication, MongoDB via write concern).

I mentioned MongoDB not to start a holy war, but because it has an excellent automatic failover mechanism that does not stem from its NoSQL nature and could in theory be implemented in any other database, including PostgreSQL.



On Sun, Jan 26, 2014 at 8:50 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Please don't top post in technical discussions.

On Sat, Jan 25, 2014 at 11:29 AM, Dmitry Koterov
<dmitry.koterov@gmail.com> wrote:
>
> On Friday, January 24, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>
>> On Thu, Jan 23, 2014 at 7:16 PM, Sameer Kumar <sameer.kumar@ashnik.com>
>> wrote:
>> >
>> >
>> > On Fri, Jan 24, 2014 at 1:38 AM, Susan Cassidy
>> > <susan.cassidy@decisionsciencescorp.com> wrote:
>> >>
>> >> pgpool-II may do what you want.  Lots of people use it.
>> >
>> >
>> > I don't think pgpool adds the lost node back on its own (once the node is
>> > live or available again). Plus, if you have 3-node replication, you need to
>> > have your own failover_command (as a shell script) which changes the master
>> > node for the 2nd secondary when one of the secondary servers is
>> > promoted to primary. I hope things will get easier with version 9.4 (I guess
>> > in 9.4 one won't have to rebuild a master node from backup; if the WAL files
>> > are available, it will just roll forward).
>> >
>> >> > for all the machines). At least MongoDB does the work well, and with
>> >> > almost
>> >> > zero configuration.
>> >> Mongo's data guarantees are, um, somewhat less robust than
>> >> PostgreSQL's.
>> >
>> >
>> > I don't think this has anything to do with data reliability or ACID
>> > property (if that is what you are referring to).
>> >
>> >>  Failover is easy if you don't have to be exactly right.
>> >
>> >
>> > IMHO that's not a fair point. PostgreSQL supports sync replication (as
>> > well as async); does that really complicate the failover process compared
>> > with async replication? I guess what he is asking for is automation of
>> > whatever features PostgreSQL already supports.
>>
>> No it's a fair point. When you go from "we promise to try and not lose
>> your data" to "we promise to not lose any of your data" the situation
>> is much different.
>>
>> There are many things to consider in the postgresql situation. Is it
>> more important to keep your application up and running, even if only
>> in read only mode? Is performance more important than data integrity?
>> How many nodes do you have? How many can auto-fail over before you
>> auto-fail over to the very last one? How do you rejoin failed nodes,
>> one at a time, all at once, by hand, automagically? And so on. There
>> are a LOT of questions to ask that mongo already decided for you, and
>> the decision was that if you lose some data that's OK as long as the
>> cluster stays up. With PostgreSQL the decision making process probably
>> has a big impact on how you answer these types of questions and how
>> you fail over.
>>
>> Add to that that most postgresql database servers are VERY robust,
>> with multi-lane RAID array controllers and / or sturdy SANs underneath
>> them, and their failure rates are very low, you run the risk of your
>> auto-failover causing as much of an outage as the server failing, since
>> most failovers are going to cause some short interruption in service.
>> It's not a simple push a button take a banana, one size fits all
>> problem and solution.

> Failover is mostly NOT about RAID or SAN robustness. It's about
> datacenter connectivity and network issues. If you lose one datacenter (it
> happens, and there is no aid for it), you should redirect all traffic to
> another DC ASAP and failover the master DB to it. When the disconnected DC
> is up again, it should recover from this situation.
>
> So +1 for the previous poster: PostgreSQL ACID and MongoDB non-ACID have
> absolutely no relevance to the failover problem.

If you'll bother reading what I wrote AGAIN, you'll notice my mention
of ACID etc. was more of an afterthought here. There are real questions
about data loss and recovery that matter when you are failing over.
Are you running your cluster in synchronous mode across geographically
diverse data centers? If not how long do you wait for the master to
come back before you fail over? A millisecond? A second? A minute? The
answer will likely be different for me than for you.

While ACID isn't the main or only reason for things being different,
it IS a valid reason because different people use PostgreSQL for
different things. If I'm running it as a session server, I treat it
one way, as a key-value store another, as a transactional database
handling monetary funds yet another. Your refusal to accept that
this is a complex issue with complex answers isn't helping you find
the right answer to your problem.

--
To understand recursion, one must first understand recursion.
