Обсуждение: PostgreSQL Synchronous Replication in production

Поиск
Список
Период
Сортировка

PostgreSQL Synchronous Replication in production

От
Colin Sloss
Дата:

Hello,


I have been testing the differences between asynchronous and synchronous hot standby streaming replication on PostgreSQL 9.2.4. There is some push towards synchronous replication, but I am finding some serious problems, and wonder how other people deal with them.


Action:

The Slave is unreachable (postgres is stopped, or machine is turned off)


Result: 

Transactions stay open, and add up until connection is allowed between Master and Slave again.


My guess:

Create a script that detects when transactions are being opened, but not committed on slave. Alter postgresql.conf to comment out synchronous_standby_names and pg_ctl reload. This should boot the server back to asynchronous, more or less.


Caveats:

I haven't tested this yet, and I'm not sure what would happen to the transactions that started while synchronous replication was active. Any guesses?


The whole idea of my solution was to have no single point of failure. This seems to create two exclusive points of failure, each needing a completely separate reaction. My original proposal was asynchronous replication, with xlogs being stored on shared storage, and DRBD replication to prevent it from being a single point of failure. I have never seen it go over 15kbs behind in my setup, which still results in a very speedy failover. Is it just me, or is that seeming better than just synchronous replication? 


Another caveat I found is that setting up slaves becomes more complicated. You have to set up the Master in asynchronous style, and then switch it to synchronous only when the timing is right. Otherwise the transactions will sit there until everything is ready.


Sorry for the onslaught of questions, I don't expect all of them satisfied. Please share any resolutions to these issues which you guys have discovered.


Regards,


Colin

Re: PostgreSQL Synchronous Replication in production

От
Richard Huxton
Дата:
On 06/06/13 11:20, Colin Sloss wrote:
>
> I have been testing the differences between asynchronous and synchronous
> hot standby streaming replication on PostgreSQL 9.2.4. There is some
> push towards synchronous replication, but I am finding some serious
> problems, and wonder how other people deal with them.
[snip]
> The whole idea of my solution was to have no single point of failure.
> This seems to create two exclusive points of failure, each needing a
> completely separate reaction.

Synchronous replication provides a higher level of guarantee for an
individual transaction (it's safely[1] on at least two boxes now) at the
cost of making the system as a whole more brittle.

Your uptime as a "service" will inevitably be reduced since in the event
of problems talking to the slave the master will *have* to delay/cancel
new transactions.

I have seen people suggest some sort of mode where the server drops back
to asynch mode in the event of problems. I can't quite understand the
use-case for that though - either you want synchronous replication or
you don't. Mostly-synchronous is just asynchronous.

Here's a few questions. How you answer them will decide whether you
really want synchronous replication or not:
1. The link between servers encounters network congestion
   a. The whole system should slow down.
      Committed transactions should ALWAYS be on
      two geographically separate machines.
   b. An alert should be sent.
      If it's not sorted in 5 mins we'll get someone to look at it.
2. Adding more servers[2] to my replication should:
   a. Make the system as a whole slower[3] and reduce uptime
      but increase the safety of committed transactions
   b. Make the system as a whole faster and increase uptime

There are cases where you want (a), but lots where you want (b) and
monitor the replication lag.


[1] For various values of "safely" of course
[2] In the same mode - adding async slaves doesn't count
[3] Assuming a reasonable write load of course. Read-only databases
won't care.

--
   Richard Huxton
   Archonet Ltd


Re: PostgreSQL Synchronous Replication in production

От
Colin S
Дата:

> Here's a few questions. How you answer them will decide whether you
> really want synchronous replication or not:
> 1. The link between servers encounters network congestion
> a. The whole system should slow down.
> Committed transactions should ALWAYS be on
> two geographically separate machines.
> b. An alert should be sent.
> If it's not sorted in 5 mins we'll get someone to look at it.
> 2. Adding more servers[2] to my replication should:
> a. Make the system as a whole slower[3] and reduce uptime
> but increase the safety of committed transactions
> b. Make the system as a whole faster and increase uptime
>
> There are cases where you want (a), but lots where you want (b) and
> monitor the replication lag.
>
>
> [1] For various values of "safely" of course
> [2] In the same mode - adding async slaves doesn't count
> [3] Assuming a reasonable write load of course. Read-only databases
> won't care.
>
> --
> Richard Huxton
> Archonet Ltd
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Hello,

Thanks for your answer. I find it very interesting that you say that synchronous setups should always be in two geographically separate locations. In this case they are on the same subnet. Adding the lag of committing to two, geographically separate, databases is not feasible for this OLTP application.

I also like your point that  "mostly synchronous is just asynchronous." So, responding by switching to asynchronous as a response to slow-down is asynchronous anyway.

Any other comments, or examples, of when synchronous is worth implementing would be greatly appreciated.

Regards,

Colin

Re: PostgreSQL Synchronous Replication in production

От
Richard Huxton
Дата:
On 06/06/13 12:48, Colin S wrote:
> Thanks for your answer. I find it very interesting that you say that
> synchronous setups should always be in two geographically separate
> locations. In this case they are on the same subnet. Adding the lag of
> committing to two, geographically separate, databases is not feasible
> for this OLTP application.

Well, if they're in the same building(s) then your transactions are all
at the same risk from fire/earthquake/godzilla etc. Might/might not be
important to you.

> I also like your point that "mostly synchronous is just asynchronous."
> So, responding by switching to asynchronous as a response to slow-down
> is asynchronous anyway.

"Mostly synchronous" is like "a bit pregnant".

> Any other comments, or examples, of when synchronous is worth
> implementing would be greatly appreciated.

Note that PostgreSQL's synchronous replication just guarantees that the
commit has reached the transaction log on the slave. That doesn't mean
the slave has replayed the transaction log and a query against the slave
will show the transaction's results. So - it doesn't in itself guarantee
that you can see issue read-only queries against either server
indiscriminately.

However, if you really, really need to know that a committed transaction
is on two physically separate sets of disks then synchronous is what you
want. If both sets of disks are in the same building then you might be
able to achieve the same result by other (cheaper/simpler?) means.

If you have a business e.g. selling books or train tickets or some such
then you might decide it's better to have a simpler more robust setup
from the point of view of providing continuous service to end-customers.
In the (hopefully rare) event of a crash irreparably losing some
transactions apologise to your customers and recompense them generously.

For a system handling multi-million pound inter-bank transfers you might
decide it's better to have the system not working at all rather than
have an increased risk of a lost transaction.

Of course in both cases you might well want a separate list/cache of
pending/recently-committed transactions to check against in the event of
a failure.

I believe what you should do from an engineering approach is to treat it
in a similar way to security. What do you want to protect against? Make
a list of possible failures and what they mean to the business/project
and then decide how much time/money to spend protecting against each one.

--
   Richard Huxton
   Archonet Ltd


Re: PostgreSQL Synchronous Replication in production

От
Albe Laurenz
Дата:
Colin Sloss wrote:
> I have been testing the differences between asynchronous and synchronous hot standby streaming
> replication on PostgreSQL 9.2.4. There is some push towards synchronous replication, but I am finding
> some serious problems, and wonder how other people deal with them.
> 
> Action:
> The Slave is unreachable (postgres is stopped, or machine is turned off)
> 
> Result:
> Transactions stay open, and add up until connection is allowed between Master and Slave again.
> 
> My guess:
> Create a script that detects when transactions are being opened, but not committed on slave. Alter
> postgresql.conf to comment out synchronous_standby_names and pg_ctl reload. This should boot the
> server back to asynchronous, more or less.
> 
> Caveats:
> I haven't tested this yet, and I'm not sure what would happen to the transactions that started while
> synchronous replication was active. Any guesses?
> 
> The whole idea of my solution was to have no single point of failure. This seems to create two
> exclusive points of failure, each needing a completely separate reaction. My original proposal was
> asynchronous replication, with xlogs being stored on shared storage, and DRBD replication to prevent
> it from being a single point of failure. I have never seen it go over 15kbs behind in my setup, which
> still results in a very speedy failover. Is it just me, or is that seeming better than just
> synchronous replication?
> 
> Another caveat I found is that setting up slaves becomes more complicated. You have to set up the
> Master in asynchronous style, and then switch it to synchronous only when the timing is right.
> Otherwise the transactions will sit there until everything is ready.
> 
> Sorry for the onslaught of questions, I don't expect all of them satisfied. Please share any
> resolutions to these issues which you guys have discovered.

One simple guideline first:
If you use synchronous replication, you'll have to have at least two standby
servers or the overall availability of your system will suffer.

The differences between synchronous and asynchronous replication are mostly:
a) With synchronous replication you cannot lose a committed transaction
   during failover.
b) Synchronous replication will slow down your system; the higher the
   network latency between the servers, the slower it will get.

You should work on getting the requirements defined:
- In the case of failover, can you afford to lose a few committed transactions?
- Are you ready to pay the price for synchronous replication
  (second standby if you don't want availability to suffer, slower
  database system, more complicated setup and failover procedures)?

For synchronous replication, keep the servers close together with
a strong network inbetween.
To protect against catastrophes (if that's a requirement), you should
use another asynchronous standby in a distant location.

Yours,
Laurenz Albe

Re: PostgreSQL Synchronous Replication in production

От
Merlin Moncure
Дата:
On Fri, Jun 7, 2013 at 3:22 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> Colin Sloss wrote:
>> I have been testing the differences between asynchronous and synchronous hot standby streaming
>> replication on PostgreSQL 9.2.4. There is some push towards synchronous replication, but I am finding
>> some serious problems, and wonder how other people deal with them.
>>
>> Action:
>> The Slave is unreachable (postgres is stopped, or machine is turned off)
>>
>> Result:
>> Transactions stay open, and add up until connection is allowed between Master and Slave again.
>>
>> My guess:
>> Create a script that detects when transactions are being opened, but not committed on slave. Alter
>> postgresql.conf to comment out synchronous_standby_names and pg_ctl reload. This should boot the
>> server back to asynchronous, more or less.
>>
>> Caveats:
>> I haven't tested this yet, and I'm not sure what would happen to the transactions that started while
>> synchronous replication was active. Any guesses?
>>
>> The whole idea of my solution was to have no single point of failure. This seems to create two
>> exclusive points of failure, each needing a completely separate reaction. My original proposal was
>> asynchronous replication, with xlogs being stored on shared storage, and DRBD replication to prevent
>> it from being a single point of failure. I have never seen it go over 15kbs behind in my setup, which
>> still results in a very speedy failover. Is it just me, or is that seeming better than just
>> synchronous replication?
>>
>> Another caveat I found is that setting up slaves becomes more complicated. You have to set up the
>> Master in asynchronous style, and then switch it to synchronous only when the timing is right.
>> Otherwise the transactions will sit there until everything is ready.
>>
>> Sorry for the onslaught of questions, I don't expect all of them satisfied. Please share any
>> resolutions to these issues which you guys have discovered.
>
> One simple guideline first:
> If you use synchronous replication, you'll have to have at least two standby
> servers or the overall availability of your system will suffer.
>
> The differences between synchronous and asynchronous replication are mostly:
> a) With synchronous replication you cannot lose a committed transaction
>    during failover.
> b) Synchronous replication will slow down your system; the higher the
>    network latency between the servers, the slower it will get.
>
> You should work on getting the requirements defined:
> - In the case of failover, can you afford to lose a few committed transactions?
> - Are you ready to pay the price for synchronous replication
>   (second standby if you don't want availability to suffer, slower
>   database system, more complicated setup and failover procedures)?
>
> For synchronous replication, keep the servers close together with
> a strong network inbetween.
> To protect against catastrophes (if that's a requirement), you should
> use another asynchronous standby in a distant location.

yeah -- well put.  Synchronous replication exists because in some
cases even the loss of a single transaction is unacceptable.  So it's
not really fair to compare vs asynchronous which is more of a 'best
effort' system; it in no way guarantees that every transaction is
recoverable.  That small fudge allows for big optimizations in terms
of process simplicity and performance.

merlin


Re: PostgreSQL Synchronous Replication in production

От
Jeff Janes
Дата:
On Thu, Jun 6, 2013 at 5:23 AM, Richard Huxton <dev@archonet.com> wrote:

If you have a business e.g. selling books or train tickets or some such then you might decide it's better to have a simpler more robust setup from the point of view of providing continuous service to end-customers. In the (hopefully rare) event of a crash irreparably losing some transactions apologise to your customers and recompense them generously.

Unfortunately you probably no longer know who to apologize to, what for, or how much to compensate them!

I guess when you reconcile your cc credits from the cc company recorded in their system to the sales in your recovered system, you will have evidence of the discrepancies.

Now I can't use an ecommerce without pondering all the ways something can go wrong, and how to minimize/address them.

Cheers,

Jeff