Обсуждение: PostgreSQL Synchronous Replication in production
Hello,
I have been testing the differences between asynchronous and synchronous hot standby streaming replication on PostgreSQL 9.2.4. There is some push towards synchronous replication, but I am finding some serious problems, and wonder how other people deal with them.
Action:
The Slave is unreachable (postgres is stopped, or machine is turned off)
Result:
Transactions stay open, and add up until connection is allowed between Master and Slave again.
My guess:
Create a script that detects when transactions are being opened, but not committed on slave. Alter postgresql.conf to comment out synchronous_standby_names and pg_ctl reload. This should boot the server back to asynchronous, more or less.
Caveats:
I haven't tested this yet, and I'm not sure what would happen to the transactions that started while synchronous replication was active. Any guesses?
The whole idea of my solution was to have no single point of failure. This seems to create two exclusive points of failure, each needing a completely separate reaction. My original proposal was asynchronous replication, with xlogs being stored on shared storage, and DRBD replication to prevent it from being a single point of failure. I have never seen it go over 15kbs behind in my setup, which still results in a very speedy failover. Is it just me, or is that seeming better than just synchronous replication?
Another caveat I found is that setting up slaves becomes more complicated. You have to set up the Master in asynchronous style, and then switch it to synchronous only when the timing is right. Otherwise the transactions will sit there until everything is ready.
Sorry for the onslaught of questions, I don't expect all of them satisfied. Please share any resolutions to these issues which you guys have discovered.
Regards,
Colin
On 06/06/13 11:20, Colin Sloss wrote: > > I have been testing the differences between asynchronous and synchronous > hot standby streaming replication on PostgreSQL 9.2.4. There is some > push towards synchronous replication, but I am finding some serious > problems, and wonder how other people deal with them. [snip] > The whole idea of my solution was to have no single point of failure. > This seems to create two exclusive points of failure, each needing a > completely separate reaction. Synchronous replication provides a higher level of guarantee for an individual transaction (it's safely[1] on at least two boxes now) at the cost of making the system as a whole more brittle. Your uptime as a "service" will inevitably be reduced since in the event of problems talking to the slave the master will *have* to delay/cancel new transactions. I have seen people suggest some sort of mode where the server drops back to asynch mode in the event of problems. I can't quite understand the use-case for that though - either you want synchronous replication or you don't. Mostly-synchronous is just asynchronous. Here's a few questions. How you answer them will decide whether you really want synchronous replication or not: 1. The link between servers encounters network congestion a. The whole system should slow down. Committed transactions should ALWAYS be on two geographically separate machines. b. An alert should be sent. If it's not sorted in 5 mins we'll get someone to look at it. 2. Adding more servers[2] to my replication should: a. Make the system as a whole slower[3] and reduce uptime but increase the safety of committed transactions b. Make the system as a whole faster and increase uptime There are cases where you want (a), but lots where you want (b) and monitor the replication lag. [1] For various values of "safely" of course [2] In the same mode - adding async slaves doesn't count [3] Assuming a reasonable write load of course. Read-only databases won't care. -- Richard Huxton Archonet Ltd
> really want synchronous replication or not:
> 1. The link between servers encounters network congestion
> a. The whole system should slow down.
> Committed transactions should ALWAYS be on
> two geographically separate machines.
> b. An alert should be sent.
> If it's not sorted in 5 mins we'll get someone to look at it.
> 2. Adding more servers[2] to my replication should:
> a. Make the system as a whole slower[3] and reduce uptime
> but increase the safety of committed transactions
> b. Make the system as a whole faster and increase uptime
>
> There are cases where you want (a), but lots where you want (b) and
> monitor the replication lag.
>
>
> [1] For various values of "safely" of course
> [2] In the same mode - adding async slaves doesn't count
> [3] Assuming a reasonable write load of course. Read-only databases
> won't care.
>
> --
> Richard Huxton
> Archonet Ltd
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
On 06/06/13 12:48, Colin S wrote: > Thanks for your answer. I find it very interesting that you say that > synchronous setups should always be in two geographically separate > locations. In this case they are on the same subnet. Adding the lag of > committing to two, geographically separate, databases is not feasible > for this OLTP application. Well, if they're in the same building(s) then your transactions are all at the same risk from fire/earthquake/godzilla etc. Might/might not be important to you. > I also like your point that "mostly synchronous is just asynchronous." > So, responding by switching to asynchronous as a response to slow-down > is asynchronous anyway. "Mostly synchronous" is like "a bit pregnant". > Any other comments, or examples, of when synchronous is worth > implementing would be greatly appreciated. Note that PostgreSQL's synchronous replication just guarantees that the commit has reached the transaction log on the slave. That doesn't mean the slave has replayed the transaction log and a query against the slave will show the transaction's results. So - it doesn't in itself guarantee that you can see issue read-only queries against either server indiscriminately. However, if you really, really need to know that a committed transaction is on two physically separate sets of disks then synchronous is what you want. If both sets of disks are in the same building then you might be able to achieve the same result by other (cheaper/simpler?) means. If you have a business e.g. selling books or train tickets or some such then you might decide it's better to have a simpler more robust setup from the point of view of providing continuous service to end-customers. In the (hopefully rare) event of a crash irreparably losing some transactions apologise to your customers and recompense them generously. For a system handling multi-million pound inter-bank transfers you might decide it's better to have the system not working at all rather than have an increased risk of a lost transaction. Of course in both cases you might well want a separate list/cache of pending/recently-committed transactions to check against in the event of a failure. I believe what you should do from an engineering approach is to treat it in a similar way to security. What do you want to protect against? Make a list of possible failures and what they mean to the business/project and then decide how much time/money to spend protecting against each one. -- Richard Huxton Archonet Ltd
Colin Sloss wrote: > I have been testing the differences between asynchronous and synchronous hot standby streaming > replication on PostgreSQL 9.2.4. There is some push towards synchronous replication, but I am finding > some serious problems, and wonder how other people deal with them. > > Action: > The Slave is unreachable (postgres is stopped, or machine is turned off) > > Result: > Transactions stay open, and add up until connection is allowed between Master and Slave again. > > My guess: > Create a script that detects when transactions are being opened, but not committed on slave. Alter > postgresql.conf to comment out synchronous_standby_names and pg_ctl reload. This should boot the > server back to asynchronous, more or less. > > Caveats: > I haven't tested this yet, and I'm not sure what would happen to the transactions that started while > synchronous replication was active. Any guesses? > > The whole idea of my solution was to have no single point of failure. This seems to create two > exclusive points of failure, each needing a completely separate reaction. My original proposal was > asynchronous replication, with xlogs being stored on shared storage, and DRBD replication to prevent > it from being a single point of failure. I have never seen it go over 15kbs behind in my setup, which > still results in a very speedy failover. Is it just me, or is that seeming better than just > synchronous replication? > > Another caveat I found is that setting up slaves becomes more complicated. You have to set up the > Master in asynchronous style, and then switch it to synchronous only when the timing is right. > Otherwise the transactions will sit there until everything is ready. > > Sorry for the onslaught of questions, I don't expect all of them satisfied. Please share any > resolutions to these issues which you guys have discovered. One simple guideline first: If you use synchronous replication, you'll have to have at least two standby servers or the overall availability of your system will suffer. The differences between synchronous and asynchronous replication are mostly: a) With synchronous replication you cannot lose a committed transaction during failover. b) Synchronous replication will slow down your system; the higher the network latency between the servers, the slower it will get. You should work on getting the requirements defined: - In the case of failover, can you afford to lose a few committed transactions? - Are you ready to pay the price for synchronous replication (second standby if you don't want availability to suffer, slower database system, more complicated setup and failover procedures)? For synchronous replication, keep the servers close together with a strong network inbetween. To protect against catastrophes (if that's a requirement), you should use another asynchronous standby in a distant location. Yours, Laurenz Albe
On Fri, Jun 7, 2013 at 3:22 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote: > Colin Sloss wrote: >> I have been testing the differences between asynchronous and synchronous hot standby streaming >> replication on PostgreSQL 9.2.4. There is some push towards synchronous replication, but I am finding >> some serious problems, and wonder how other people deal with them. >> >> Action: >> The Slave is unreachable (postgres is stopped, or machine is turned off) >> >> Result: >> Transactions stay open, and add up until connection is allowed between Master and Slave again. >> >> My guess: >> Create a script that detects when transactions are being opened, but not committed on slave. Alter >> postgresql.conf to comment out synchronous_standby_names and pg_ctl reload. This should boot the >> server back to asynchronous, more or less. >> >> Caveats: >> I haven't tested this yet, and I'm not sure what would happen to the transactions that started while >> synchronous replication was active. Any guesses? >> >> The whole idea of my solution was to have no single point of failure. This seems to create two >> exclusive points of failure, each needing a completely separate reaction. My original proposal was >> asynchronous replication, with xlogs being stored on shared storage, and DRBD replication to prevent >> it from being a single point of failure. I have never seen it go over 15kbs behind in my setup, which >> still results in a very speedy failover. Is it just me, or is that seeming better than just >> synchronous replication? >> >> Another caveat I found is that setting up slaves becomes more complicated. You have to set up the >> Master in asynchronous style, and then switch it to synchronous only when the timing is right. >> Otherwise the transactions will sit there until everything is ready. >> >> Sorry for the onslaught of questions, I don't expect all of them satisfied. Please share any >> resolutions to these issues which you guys have discovered. > > One simple guideline first: > If you use synchronous replication, you'll have to have at least two standby > servers or the overall availability of your system will suffer. > > The differences between synchronous and asynchronous replication are mostly: > a) With synchronous replication you cannot lose a committed transaction > during failover. > b) Synchronous replication will slow down your system; the higher the > network latency between the servers, the slower it will get. > > You should work on getting the requirements defined: > - In the case of failover, can you afford to lose a few committed transactions? > - Are you ready to pay the price for synchronous replication > (second standby if you don't want availability to suffer, slower > database system, more complicated setup and failover procedures)? > > For synchronous replication, keep the servers close together with > a strong network inbetween. > To protect against catastrophes (if that's a requirement), you should > use another asynchronous standby in a distant location. yeah -- well put. Synchronous replication exists because in some cases even the loss of a single transaction is unacceptable. So it's not really fair to compare vs asynchronous which is more of a 'best effort' system; it in no way guarantees that every transaction is recoverable. That small fudge allows for big optimizations in terms of process simplicity and performance. merlin
If you have a business e.g. selling books or train tickets or some such then you might decide it's better to have a simpler more robust setup from the point of view of providing continuous service to end-customers. In the (hopefully rare) event of a crash irreparably losing some transactions apologise to your customers and recompense them generously.