On Sun, Jun 16, 2019 at 7:30 PM Stephen Frost <sfrost@snowman.net> wrote:
Ok, so you want fewer checkpoints because you expect to fail over to a replica rather than recover the primary on a failure. If you're doing synchronous replication, then that certainly makes sense. If you aren't, then you're deciding that you're alright with losing some number of writes by failing over rather than recovering the primary, which can also be acceptable, but it's certainly much more questionable.
Yes, in our setup that's the case: a few lost transactions will have a negligible impact on the business.
I'm getting the feeling that your replicas are async, but it sounds like you'd be better off with having at least one sync replica, so that you can flip to it quickly.
They are indeed async; we traded durability for performance here because we can accept some lost transactions.
Alternatively, having a way to more easily make the primary stop accepting new writes, flush everything to the replicas, report that it's completed doing so, to allow you to promote a replica without losing anything, and *then* go through the process on the primary of doing a checkpoint, would be kind of nice.
I suppose that would require being able to demote a master to a slave at runtime. That would definitely be nice to have.
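
As a rough illustration of the "report that it's completed" step: assuming writes on the primary have already been stopped by some other means, one could compare the primary's position from pg_current_wal_lsn() with the replay_lsn column of pg_stat_replication (both available in PostgreSQL 10 and later) before promoting. The helper names below are hypothetical, and this is only a sketch of the comparison in Python, not an existing tool:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an integer byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of the WAL position.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def replica_caught_up(primary_lsn: str, replay_lsn: str) -> bool:
    """Return True when the replica has replayed at least up to primary_lsn.

    primary_lsn: value of pg_current_wal_lsn() taken on the primary
                 after writes were stopped (assumption).
    replay_lsn:  replay_lsn for that replica from pg_stat_replication.
    """
    return parse_lsn(replay_lsn) >= parse_lsn(primary_lsn)
```

Only once this returns True for the replica in question would a promotion lose nothing; with async replicas and writes still flowing, the check is inherently racy.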