Re: Logical replication and multimaster

From: Craig Ringer
Subject: Re: Logical replication and multimaster
Date:
Msg-id: CAMsr+YE9xgD_LoOm_LmSs9_MiuLgOay=LziWLFvGNN6xfKB-sA@mail.gmail.com
In reply to: Re: Logical replication and multimaster  (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers
On 3 December 2015 at 20:39, Simon Riggs <simon@2ndquadrant.com> wrote:
On 30 November 2015 at 17:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
But it looks like there is not much sense in having multiple network connections between one pair of nodes.
It seems better to have one connection between nodes but provide parallel execution of received transactions on the destination side. That also seems nontrivial, though. PostgreSQL now has some infrastructure for background workers, but there is still no abstraction of a worker pool and job queue that would provide a simple way to organize parallel execution of jobs. I wonder if somebody is working on this now, or should we try to propose our own solution?

There are definitely two clear places where additional help would be useful and welcome right now.

Three IMO, in that a re-usable, generic bgworker pool driven by shmem messaging would be quite handy. We'll want something like that when we have transaction interleaving.

I think Konstantin's design is a bit restrictive at the moment; at the least it needs to address sticky dispatch, and it almost certainly needs to be using dynamic bgworkers (and maybe dynamic shmem too) to be flexible. Some thought will be needed to make sure it doesn't rely on !EXEC_BACKEND stuff like passing pointers to fork()ed data from postmaster memory too. But the general idea sounds really useful, and we'll either need that or to use async libpq for concurrent apply.
 
1. Allowing logical decoding to have a "speculative pre-commit data" option, to allow some data to be made available via the decoding api, allowing data to be transferred prior to commit.

Petr, Andres and I tended to refer to that as interleaved transaction streaming. The idea being to send changes from multiple xacts mixed together in the stream, identified by an xid sent with each message, as we decode them from WAL. Currently we add them to a local reorder buffer and send them only in commit order after commit.

This moves responsibility for xact ordering (and buffering, if necessary) to the downstream. It introduces the possibility that concurrently replayed xacts could deadlock with each other and a few exciting things like that, too, but with the payoff that we can continue to apply small transactions in a timely manner even as we're streaming a big transaction like a COPY.

We could possibly enable interleaving right from the start of the xact, or only once it crosses a certain size threshold. For your purposes Konstantin you'd want to do it right from the start since latency is crucial for you. For pglogical we'd probably want to buffer them a bit and only start streaming if they got big.

This would allow us to reduce the delay that occurs at commit, especially for larger transactions or for smaller transactions with very low latency requirements. Some heuristic or user interface would be required to decide whether, and for which transactions, to make data available prior to commit.

I imagine we'd have a knob, either global or per-slot, that sets a threshold based on size in bytes of the buffered xact. With 0 allowed as "start immediately".
 
And we would need to send abort messages should the transactions not commit as expected. That would be a patch on logical decoding and is an essentially separate feature to anything currently being developed.

I agree that this is strongly desirable. It'd benefit anyone using logical decoding and would have wide applications.
  
2. Some mechanism/theory to decide when/if to allow parallel apply.

I'm not sure it's as much about allowing it as how to do it.
 
We already have working multi-master that has been contributed to PGDG, so contributing that won't gain us anything.

Namely BDR.
 
There is a lot of code and pglogical is the most useful piece of code to be carved off and reworked for submission.

Starting with the already-published output plugin, with the downstream to come around the release of 9.5.
 
Having a single network connection between nodes would increase efficiency but also increase replication latency, so it's not useful in all cases.

If we interleave messages I'm not sure it's too big a problem. Latency would only become an issue there if a big single row (big Datum contents) causes lots of small work to get stuck behind it.

IMO this is a separate issue to be dealt with later.

I think having some kind of message queue between nodes would also help, since there are many cases for which we want to transfer data, not just a replication data flow. For example, consensus on DDL, or MPP query traffic. But that is open to wider debate.

Logical decoding doesn't really define any network protocol at all. It's very flexible, and we can throw almost whatever we want down it. The pglogical_output protocol is extensible enough that we can just add additional messages when we need to, making them opt-in so we don't break clients that don't understand them.

I'm likely to need to do that soon for sequence-advance messages if I can get logical decoding of sequence advance working.

We might want a way to queue those messages at a particular LSN, so we can use them for replay barriers etc and ensure they're crash-safe. Like the generic WAL messages used in BDR and proposed for core. Is that what you're getting at? WAL messages would certainly be nice, but I think we can mostly if not entirely avoid the need for them if we have transaction interleaving and concurrent transaction support.

Somewhat related, I'd quite like to be able to send messages from downstream back to upstream, where they're passed to a hook on the logical decoding plugin. That'd eliminate the need to do a whole bunch of stuff that currently has to be done using direct libpq connections or a second decoding slot in the other direction. Basically send a CopyData packet in the other direction and have its payload passed to a new hook on output plugins.
--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
