Re: Issues with Quorum Commit
From: Greg Smith
Subject: Re: Issues with Quorum Commit
Msg-id: 4CAF7935.7050001@2ndquadrant.com
In response to: Re: Issues with Quorum Commit (Markus Wanner <markus@bluegap.ch>)
Responses: Re: Issues with Quorum Commit (Markus Wanner <markus@bluegap.ch>)
List: pgsql-hackers
Markus Wanner wrote:
> ...and how do you make sure you are not marking your second standby as
> degraded just because it's currently lagging? Effectively degrading the
> utterly needed one, because your first standby has just bitten the dust?

People are going to monitor standby lag. If the lag approaches the known timeout, flashing yellow lights should go off well before things get that bad. And if you've set a reasonable, business-oriented timeout on how long you can stand to have the master held up waiting for a lagging standby, the right thing to do may very well be to cut that standby off. At some point people will want to stop waiting for a standby that is taking so long to commit that it interferes with the master's ability to operate normally. Such a master is already degraded, if your availability metrics include processing transactions in a timely manner.

> And how do you prevent the split brain situation in case the master dies
> shortly after these events, but fails to come up again immediately?

How is that a new problem? It's already possible to end up with a standby pair that has suffered through some bizarre failure chain such that it's not obvious which of the two systems has the most recent set of data on it. And that's not this project's problem to solve. Useful answers to the split-brain problem involve fencing implementations that normally drop to the hardware level, and clustering solutions with those features are already available for PostgreSQL to integrate with. Assuming you have to solve split brain in order to deliver a useful database replication component is excessively ambitious.

You seem to be under the assumption that a more complicated replication implementation here will make reaching a bad state impossible. I think that's optimistic, both in theory and in regards to how successful code gets built.
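The lag-versus-timeout policy described above can be sketched as simple threshold logic. This is a hypothetical illustration only: the function name, the classification labels, and the 50% warning fraction are my assumptions, not actual PostgreSQL settings or behavior.

```python
# Hypothetical sketch of the monitoring policy: warn while a synchronous
# standby's lag is approaching the configured timeout, and stop waiting
# for it (mark it degraded) once lag exceeds the timeout.
WARN_FRACTION = 0.5  # assumed: warn once lag passes half the timeout


def standby_status(lag_seconds, timeout_seconds):
    """Classify a standby by comparing its replication lag to the
    business-oriented timeout the master is willing to wait."""
    if lag_seconds >= timeout_seconds:
        return "degraded"  # stop holding up commits on the master
    if lag_seconds >= WARN_FRACTION * timeout_seconds:
        return "warning"   # flashing yellow lights, before it gets bad
    return "ok"
```

With a 30-second timeout, `standby_status(2, 30)` returns "ok", `standby_status(20, 30)` returns "warning", and `standby_status(35, 30)` returns "degraded".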
Here's the thing: the difficulty of testing to prove your code actually works is also proportional to that complexity. This project can choose to commit, and potentially ship, a simple solution with known limitations, and expect that people will fill the gap with existing add-on software for the clustering parts it doesn't handle: fencing, virtual IP address assignment, etc. All the while, we get useful testing feedback on the simple bottom layer, whose main purpose in life is to transport WAL data synchronously. Or we can argue in favor of adding the additional complexity on top first, so we end up with layers and layers of untested code. That path leads to situations where you're lucky to ship at all, and when you do, the result is difficult to support.

--
Greg Smith, 2ndQuadrant US
greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us