Re: Multimaster

From Craig Ringer
Subject Re: Multimaster
Date
Msg-id CAMsr+YEuW7HbCwBQzoQJuPaMh8i0O7VKvcNuoy-_tgTw_OJDiA@mail.gmail.com
In response to Re: Multimaster  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: Multimaster  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-general
On 18 April 2016 at 16:28, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
I intend to make the same split in pglogical itself - a receiver and apply worker split. Though my intent is to have them communicate via a shared memory segment until/unless the apply worker gets too far behind and spills to disk.
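To make that concrete, the receiver side could look very roughly like this, using the in-core shm_mq API. This is a sketch only; CHANGE_QUEUE_SIZE and spill_change_to_disk() are made-up placeholders, not pglogical code.

#include "postgres.h"

#include "storage/dsm.h"
#include "storage/proc.h"
#include "storage/shm_mq.h"

/* hypothetical ring size; real sizing would be configurable */
#define CHANGE_QUEUE_SIZE   (16 * 1024 * 1024)

/* hypothetical fallback used once the apply worker lags too far */
extern void spill_change_to_disk(const char *data, Size len);

/*
 * Receiver side: create a dynamic shared memory segment holding a
 * single-reader/single-writer message queue for decoded changes.
 */
static shm_mq_handle *
create_change_queue(dsm_segment **segp)
{
    dsm_segment *seg = dsm_create(CHANGE_QUEUE_SIZE, 0);
    shm_mq     *mq = shm_mq_create(dsm_segment_address(seg),
                                   CHANGE_QUEUE_SIZE);

    shm_mq_set_sender(mq, MyProc);
    *segp = seg;
    return shm_mq_attach(mq, seg, NULL);
}

/*
 * Push one decoded change to the apply worker.  nowait = true lets the
 * receiver notice a full queue and switch to spilling to disk rather than
 * blocking on the upstream connection.
 */
static void
queue_change(shm_mq_handle *mqh, const char *data, Size len)
{
    shm_mq_result res = shm_mq_send(mqh, len, data, true);

    if (res == SHM_MQ_WOULD_BLOCK)
        spill_change_to_disk(data, len);
    else if (res == SHM_MQ_DETACHED)
        elog(ERROR, "apply worker exited unexpectedly");
}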


In the case of multimaster the "too far behind" scenario can never happen.

I disagree. In the case of tightly coupled synchronous multi-master it can't happen, sure. But that's hardly the only case of multi-master out there.

I expect you'll want the ability to weaken synchronous guarantees for some commits anyway, like we have with physical replication's synchronous_commit = remote_write, synchronous_commit = local, etc. In that case lag becomes relevant again.

You might also want to be able to spool a big tx to temporary storage even as you apply it, if you're running over a WAN or something. That way if you crash during apply you don't have to transfer the data over the WAN again. Like we do with physical replication, where we write the WAL to disk then replay from disk.

I agree that spilling to disk isn't needed for the simplest cases of synchronous logical MM. But it's far from useless.
 
It seems to me that the pglogical plugin is now becoming too universal, trying to address a lot of different issues and play different roles.

I'm not convinced. They're all closely related, overlapping, and require much of the same functionality. While some use cases don't need certain pieces of functionality, they can still be _useful_. Asynchronous MM replication doesn't need table mapping and transforms, for example ... except that in reality lots of the flexibility offered by replication sets, table mapping, etc is actually really handy in MM too.

We may well want to move much of that into core and have much thinner plugins, but the direction Andres, Robert etc are talking about seems to be more along the lines of a fully in-core logical replication subsystem. It'll need to (eventually) meet all these sorts of needs.

Before you start cutting or assuming you need something very separate I suggest taking a closer look at why each piece is there,  whether there's truly any significant performance impact, and whether it can be avoided without just cutting out the functionality entirely.

1. Asynchronous replication (including georeplication) - this is actually BDR.

Well, BDR is asynchronous MM. There's also the single-master case and related ones for non-overlapping multimaster where any given set of tables is only written on one node.
 
2. Logical backup: transfer data to a different database (including a new version of Postgres)

I think that's more HA than logical backup. Needs to be able to be synchronous or asynchronous, much like our current phys.rep.

Closely related but not quite the same is logical read replicas/standbys.
 
3. Change notification: there are many different subscribers which can be interested in receiving notifications about database changes.

Yep. I suspect we'll want a json output plugin for this, separate to pglogical etc, but we'll need to move a bunch of functionality from pglogical into core so it can be shared rather than duplicated.
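The plugin surface for that is small - the standard logical decoding callback registration, something like this hypothetical json_decoding skeleton (the callback bodies are placeholders, not a real plugin):

#include "postgres.h"

#include "fmgr.h"
#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

static void
json_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
                    bool is_init)
{
    opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;
}

static void
json_decode_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    /* could emit a {"begin": ...} document here */
}

static void
json_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                   Relation relation, ReorderBufferChange *change)
{
    OutputPluginPrepareWrite(ctx, true);
    /* a real plugin would serialise the old/new tuples as JSON here */
    appendStringInfoString(ctx->out, "{\"change\": \"...\"}");
    OutputPluginWrite(ctx, true);
}

static void
json_decode_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                   XLogRecPtr commit_lsn)
{
    /* could emit a {"commit": ...} document here */
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
    cb->startup_cb = json_decode_startup;
    cb->begin_cb = json_decode_begin;
    cb->change_cb = json_decode_change;
    cb->commit_cb = json_decode_commit;
    /* message_cb (new in 9.6) is where replicated DDL etc. would surface */
}

The work is all in the serialisation and filtering, which is exactly the stuff that should live in core rather than being duplicated per-plugin.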
 
4. Synchronous replication: multimaster

"Synchronous multimaster". Not all multimastrer is synchronous, not all synchronous replication is multimaster. 

We are not enforcing the order of commits as Galera does. Consistency is enforced by the DTM, which ensures that transactions at all nodes are given consistent snapshots and assigned the same CSNs. We also have a global deadlock detection algorithm which builds a global lock graph (but false positives are still possible, because this graph is built incrementally and so doesn't correspond to any single global snapshot).

OK, so you're relying on a GTM to determine safe, conflict-free apply orderings.

I'm ... curious ... about how you do that. Do you have a global lock manager too? How do you determine ordering for things that in a single-master case are addressed via unique b-tree indexes, not (just) heavyweight locking?
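To make the question concrete, I'm picturing the detection step as roughly a cycle search over a merged wait-for graph, something like the illustrative sketch below (WaitEdge and friends are invented names, not your code); the interesting part is how the edges get collected consistently across nodes.

#include <stdbool.h>
#include <stddef.h>

typedef struct WaitEdge
{
    int         waiter;         /* index of waiting global transaction */
    int         holder;         /* index of transaction it waits on */
} WaitEdge;

/*
 * DFS over the wait-for edges.  "visited" and "on_stack" are caller-provided
 * arrays of length ntxns, zero-initialised before the first call.
 */
static bool
dfs_has_cycle(int node, const WaitEdge *edges, size_t nedges,
              bool *on_stack, bool *visited)
{
    size_t      i;

    if (on_stack[node])
        return true;            /* back edge: waiter chain loops, deadlock */
    if (visited[node])
        return false;

    visited[node] = true;
    on_stack[node] = true;

    for (i = 0; i < nedges; i++)
        if (edges[i].waiter == node &&
            dfs_has_cycle(edges[i].holder, edges, nedges, on_stack, visited))
            return true;

    on_stack[node] = false;
    return false;
}

/* true if any transaction is part of a waits-for cycle */
bool
global_deadlock_exists(const WaitEdge *edges, size_t nedges,
                       bool *on_stack, bool *visited, int ntxns)
{
    int         node;

    for (node = 0; node < ntxns; node++)
        if (!visited[node] &&
            dfs_has_cycle(node, edges, nedges, on_stack, visited))
            return true;
    return false;
}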

Multimaster is just a particular (and the simplest) case of distributed transactions. What is specific to multimaster is that the same transaction has to be applied at all nodes and that selects can be executed at any node.

The specification of your symmetric, synchronous tightly-coupled multimaster design, yes. Which sounds like it's intended to be transparent or near-transparent multi-master clustering.
 

The only exception is recovery of a multimaster node. In this case we have to apply transactions in exactly the same order as they were applied at the original node when performing recovery. It is done by applying changes in recovery mode by pglogical_receiver itself.

I'm not sure I understand what you are saying here.

Sorry for being unclear.
I just said that normally transactions are applied concurrently by multiple workers and the DTM is used to enforce consistency.
But in the case of recovery (when some node has crashed and then reconnects to the cluster), we perform recovery of this node sequentially, by a single worker. In this case the DTM is not used (because the other nodes are far ahead), and to restore the same state of the node we need to apply changes in exactly the same order as at the source node. In this case the content of the target (recovered) node should be the same as that of the source node.

OK, that makes perfect sense.

Presumably in this case you could save a local snapshot of the DTM's knowledge of the correct apply ordering of those tx's as you apply, so when you crash you can consult that saved ordering information to still parallelize apply. Later.
 

 
We are now replicating DDL in a way similar to the one used in BDR: DDL statements are inserted into a special table and are replayed at the destination node as part of the transaction.
We also have an alternative implementation done by Artur Zakirov <a.zakirov@postgrespro.ru>.
A patch for custom WAL records was committed in 9.6, so we are going to switch to this approach.

How does that really improve anything over using a table?

It is a more straightforward approach, isn't it? You can either try to restore DDL from the low-level sequence of updates to the system catalog.
But that is difficult and not always possible.

Understatement of the century ;) 
 
Or you need to somehow add the original DDL statements to the log.

Actually you need to be able to add normalized statements to the xlog. The original DDL text isn't quite good enough due to issues with search_path among other things. Hence DDL deparse.
 
I agree that custom WAL records add no performance or functionality advantages over using a table.
This is why we still haven't switched to it. But IMHO the approach of inserting DDL (or any other user-defined information) into a special table looks like a hack.

Yeah, it is a hack. Logical WAL messages do provide a cleaner way to do it, though with the minor downside that they're opaque to the user, who can't see what DDL is being done / due to be done anymore. I'd rather do it with generic logical WAL messages in future, now that they're in core. 
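On the upstream that would boil down to something like the following instead of an insert into a queue table (a sketch; the "mm_ddl" prefix and the deparsed-JSON payload are made up for illustration):

#include "postgres.h"

#include "replication/message.h"

/*
 * Sketch: write captured, already-normalised DDL into WAL as a transactional
 * logical message instead of inserting it into a queue table.
 */
static void
queue_ddl_for_replication(const char *deparsed_ddl_json)
{
    /*
     * transactional = true: the message is decoded only if the surrounding
     * transaction commits, and in commit order, just like the queue table.
     */
    (void) LogLogicalMessage("mm_ddl",
                             deparsed_ddl_json,
                             strlen(deparsed_ddl_json) + 1,
                             true);
}

On the downstream the message surfaces in logical decoding via the output plugin's message_cb callback, and there's also pg_logical_emit_message() at the SQL level if you want to exercise the flow without writing C.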
 
Also, the pglogical plugin now contains a lot of code which performs mapping between the source and target database schemas, so it is assumed that they may be different.
But that is not true in the case of multimaster, and I do not want to pay an extra cost for functionality we do not need.

All it's really doing is mapping upstream to downstream tables by name, since the oids will be different.

Really?
Why then do you send all the table metadata (information about attributes) and handle invalidation messages?

Right, you meant columns, not tables.

See DESIGN.md.

We can't just use attno since column drops on one node will cause attno to differ even if the user-visible table schema is the same.

BDR solves this (now) by either initializing nodes from a physical pg_basebackup of another node, including dropped cols etc, or using pg_dump's binary upgrade mode to preserve dropped columns when bringing a node up from a logical copy.

That's not viable for general purpose logical replication like pglogical, so we send a table attribute mapping.
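Conceptually, the downstream side of that mapping is just a by-name lookup that skips dropped columns, along the lines of this simplified sketch (not the actual pglogical code):

#include "postgres.h"

#include "access/tupdesc.h"
#include "catalog/pg_attribute.h"

/*
 * Resolve an upstream column name to the local attribute number, skipping
 * dropped columns so that attno "holes" that differ between nodes don't
 * matter.  Returns -1 if the column doesn't exist locally.
 */
static int
find_local_attno(TupleDesc desc, const char *upstream_colname)
{
    int         i;

    for (i = 0; i < desc->natts; i++)
    {
        Form_pg_attribute att = desc->attrs[i];     /* 9.6-era accessor */

        if (att->attisdropped)
            continue;
        if (strcmp(NameStr(att->attname), upstream_colname) == 0)
            return att->attnum;
    }
    return -1;
}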

I agree that this can be avoided if the system can guarantee that the upstream and downstream tables have exactly the same structure, including dropped columns. It can only guarantee that when it has DDL replication and all DDL is either replicated or blocked from being run. That's the approach BDR tries to take, and it works, with problems. One of those problems you won't have, because it's caused by the need to sync up the otherwise asynchronous cluster so there are no outstanding committed-but-not-replayed changes for the old table structure on any node before we change the structure on all nodes. But others, such as coverage of DDL replication and problems with full table rewrites, you will have.

I think it would be reasonable for pglogical to offer the option of sending a minimal table metadata message that simply says that it expects the downstream to deal with the upstream attnos exactly as-is, either by having them exactly the same or managing its own translations. In this case column mapping etc can be omitted. Feel free to send a patch.
 
Multimaster really only needs to map local to remote OIDs. We do not need to provide any attribute mapping or handle catalog invalidations.

For synchronous tightly-coupled multi-master with a GTM and GLM that doesn't allow non-replicated DDL, yes, I agree.



--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
