Re: Multimaster

From Konstantin Knizhnik
Subject Re: Multimaster
Date
Msg-id 57149A92.7020605@postgrespro.ru
In response to Re: Multimaster (Craig Ringer <craig@2ndquadrant.com>)
Responses Re: Multimaster (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-general
Hi,
Thank you for your response.

On 17.04.2016 15:30, Craig Ringer wrote:
I intend to make the same split in pglogical its self - a receiver and apply worker split. Though my intent is to have them communicate via a shared memory segment until/unless the apply worker gets too far behind and spills to disk.


In the case of multimaster, the "too far behind" scenario can never happen. So here is yet another difference between the asynchronous and synchronous replication approaches. For asynchronous replication, a situation where the replica is far behind the master is quite normal and has to be addressed without blocking the master. For synchronous replication it is not possible, so all this "spill to disk" machinery just adds extra overhead.
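
To illustrate the point, here is a minimal sketch (plain pthreads, purely illustrative and not the pglogical code) of a bounded in-memory queue between a receiver and an apply worker: in the synchronous case the receiver simply blocks when the queue is full, so back-pressure replaces any spill-to-disk path.

/* Illustrative sketch only, not pglogical code: a bounded in-memory queue
 * between a receiver thread and an apply thread.  When the apply side lags,
 * the receiver blocks (back-pressure) instead of spilling to disk. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define QUEUE_SLOTS 4
#define MSG_LEN     64

typedef struct
{
    char            msgs[QUEUE_SLOTS][MSG_LEN];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_full, not_empty;
} XactQueue;

static XactQueue q = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
};

/* Receiver side: blocks (instead of spilling) when the queue is full. */
static void
enqueue_change(const char *change)
{
    pthread_mutex_lock(&q.lock);
    while (q.count == QUEUE_SLOTS)
        pthread_cond_wait(&q.not_full, &q.lock);    /* back-pressure */
    strncpy(q.msgs[q.tail], change, MSG_LEN - 1);
    q.msgs[q.tail][MSG_LEN - 1] = '\0';
    q.tail = (q.tail + 1) % QUEUE_SLOTS;
    q.count++;
    pthread_cond_signal(&q.not_empty);
    pthread_mutex_unlock(&q.lock);
}

/* Apply side: takes the next change off the queue. */
static void
dequeue_change(char *out)
{
    pthread_mutex_lock(&q.lock);
    while (q.count == 0)
        pthread_cond_wait(&q.not_empty, &q.lock);
    memcpy(out, q.msgs[q.head], MSG_LEN);
    q.head = (q.head + 1) % QUEUE_SLOTS;
    q.count--;
    pthread_cond_signal(&q.not_full);
    pthread_mutex_unlock(&q.lock);
}

static void *
apply_worker(void *arg)
{
    char change[MSG_LEN];

    (void) arg;
    for (int i = 0; i < 3; i++)
    {
        dequeue_change(change);
        printf("applied: %s\n", change);
    }
    return NULL;
}

int
main(void)
{
    pthread_t worker;

    pthread_create(&worker, NULL, apply_worker, NULL);
    enqueue_change("xact 1: INSERT ...");
    enqueue_change("xact 2: UPDATE ...");
    enqueue_change("xact 3: DELETE ...");
    pthread_join(worker, NULL);
    return 0;
}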

It seems to me that the pglogical plugin is now becoming too universal, trying to address a lot of different issues and play different roles.
Here are some use cases for logical replication which I see (I am quite sure that you know more):
1. Asynchronous replication (including geo-replication) - this is actually BDR.
2. Logical backup: transferring data to a different database (including a new version of Postgres).
3. Change notification: there are many different subscribers which can be interested in receiving notifications about database changes.
As far as I know, the new JDBC driver is going to use logical replication to receive update streams. It can also be used for update/invalidation of caches in ORMs.
4. Synchronous replication: multimaster

Any vacant worker from this pool can dequeue this work and process it.

How do you handle correctness of ordering though? A naïve approach will suffer from a variety of anomalies when subject to insert/delete/insert write patterns, among other things. You can also get lost updates, rows deleted upstream that don't get deleted downstream and various other exciting ordering issues.

At an absolute minimum you'd have to commit on the downstream in the same commit order as the upstream. This can deadlock. So when you get a deadlock you'd abort the xacts of the deadlocked worker and all xacts with later commit timestamps, then retry the lot.


We are not enforcing the order of commits as Galera does. Consistency is enforced by the DTM, which ensures that transactions at all nodes are given consistent snapshots and are assigned the same CSNs. We also have a global deadlock detection algorithm which builds a global lock graph (but false positives are still possible, because this graph is built incrementally and so it does not correspond to a single global snapshot).
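
To make this concrete, here is a minimal standalone sketch (not our actual DTM code; the transactions and edges are invented) of cycle detection over a merged global wait-for graph. Because the per-node fragments are collected at different moments, a cycle found this way may be a false positive, exactly as noted above.

/* Sketch only: once per-node lock graphs are merged into one global
 * wait-for graph, deadlock detection becomes a cycle search. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_XACTS 16

/* waits_for[a][b]: global transaction a waits for a lock held by b
 * (edges merged from the lock graphs of all nodes). */
static bool waits_for[MAX_XACTS][MAX_XACTS];

static bool visiting[MAX_XACTS];    /* on the current DFS path */
static bool done[MAX_XACTS];        /* fully explored, no cycle through it */

/* Depth-first search: reaching a transaction that is already on the
 * current path means the wait-for graph contains a cycle. */
static bool
has_cycle_from(int xact)
{
    if (visiting[xact])
        return true;
    if (done[xact])
        return false;

    visiting[xact] = true;
    for (int other = 0; other < MAX_XACTS; other++)
        if (waits_for[xact][other] && has_cycle_from(other))
            return true;
    visiting[xact] = false;
    done[xact] = true;
    return false;
}

static bool
global_deadlock_exists(void)
{
    memset(visiting, 0, sizeof(visiting));
    memset(done, 0, sizeof(done));
    for (int xact = 0; xact < MAX_XACTS; xact++)
        if (has_cycle_from(xact))
            return true;
    return false;
}

int
main(void)
{
    /* Example: xact 0 waits for 1 on node A, 1 waits for 2 on node B,
     * and 2 waits for 0 on node C -- invisible to any single node. */
    waits_for[0][1] = waits_for[1][2] = waits_for[2][0] = true;
    printf("deadlock: %s\n", global_deadlock_exists() ? "yes" : "no");
    return 0;
}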


BDR has enough trouble with this when applying transactions from multiple peer nodes. To a degree it just throws its hands up and gives up - in particular, it can't tell the difference between an insert/update conflict and an update/delete conflict. But that's between loosely coupled nodes where we explicitly document that some kinds of anomalies are permitted. I can't imagine it being OK to have an even more complex set of possible anomalies occur when simply replaying transactions from a single peer...

We should definitely perform more testing here, but right now we do not have any tests that cause synchronization anomalies.


It is certainly possible with this approach that the order of applying transactions is not the same at different nodes.

Well, it can produce downright wrong results, and the results even in a single-master case will be all over the place.

But it is not a problem if we have DTM.

How does that follow?

Multimaster is just a particular (and the simplest) case of distributed transactions. What is specific to multimaster is that the same transaction has to be applied at all nodes and that selects can be executed at any node. The goal of the DTM is to provide consistent execution of distributed transactions. If it is able to do this for arbitrary transactions, then it can definitely do it for multimaster.
I cannot give you a formal proof here that our DTM is able to solve all these problems. Certainly there may be bugs in the implementation,
and this is why we need to perform more testing. But we are not "reinventing the wheel": our DTM is based on existing approaches.

 
The only exception is recovery of a multimaster node. In this case we have to apply transactions in exactly the same order as they were applied at the original (source) node when performing recovery. It is done by applying changes in recovery mode by the pglogical_receiver itself.

I'm not sure I understand what you are saying here.

Sorry for the unclearness.
I just said that normally transactions are applied concurrently by multiple workers and the DTM is used to enforce consistency.
But in case of recovery (when some node has crashed and then reconnects to the cluster), we perform recovery of this node sequentially, by a single worker. In this case the DTM is not used (because the other nodes are far ahead), and to restore the same state of the node we need to apply changes in exactly the same order as at the source node. In this case the content of the target (recovered) node should be the same as that of the source node.
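
Schematically (the function names here are illustrative stubs, not our actual code), the dispatch decision looks like this:

#include <stdbool.h>
#include <stdio.h>

#define N_APPLY_WORKERS 4

typedef struct
{
    long        xid;        /* global transaction identifier */
    const char *changes;    /* decoded change stream (placeholder) */
} ReceivedXact;

/* Stubs standing in for the real apply machinery. */
static void
apply_inline(const ReceivedXact *x)
{
    printf("recovery: applying %ld in source commit order\n", x->xid);
}

static void
enqueue_to_worker(int worker, const ReceivedXact *x)
{
    printf("normal: handing %ld to apply worker %d\n", x->xid, worker);
}

/*
 * Normal mode: transactions go round-robin to the worker pool and the DTM
 * keeps the result consistent.  Recovery mode: the receiver applies them
 * itself, sequentially, preserving the source node's commit order, because
 * the DTM is not involved.
 */
static void
dispatch_xact(const ReceivedXact *x, bool in_recovery)
{
    static int next_worker = 0;

    if (in_recovery)
        apply_inline(x);
    else
    {
        enqueue_to_worker(next_worker, x);
        next_worker = (next_worker + 1) % N_APPLY_WORKERS;
    }
}

int
main(void)
{
    ReceivedXact a = {1001, "..."};
    ReceivedXact b = {1002, "..."};

    dispatch_xact(&a, false);   /* normal operation */
    dispatch_xact(&b, true);    /* node recovery */
    return 0;
}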

  
We also need 2PC support, but this code was sent to you by Stas, so I hope that at some point it will be included in the PostgreSQL core and the pglogical plugin.

I never got a response to my suggestion that testing of upstream DDL is needed for that. I want to see more on how you plan to handle DDL on the upstream side that changes the table structure and acquires strong locks. Especially when it's combined with row changes in the same prepared xacts. 

We are now replicating DDL in a way similar to the one used in BDR: DDL statements are inserted into a special table and are replayed at the destination node as part of the transaction.
We also have an alternative implementation done by Artur Zakirov <a.zakirov@postgrespro.ru>.
The patch for custom WAL records was committed in 9.6, so we are going to switch to this approach.

How does that really improve anything over using a table?

It is a more straightforward approach, isn't it? You can either try to restore the DDL from the low-level sequence of updates to the system catalog,
but that is difficult and not always possible, or you need to somehow add the original DDL statements to the log.
That can be done using some special table, or by storing this information directly in the log (if custom WAL records are supported).
Certainly, in the latter case the logical protocol should be extended to support playback of user-defined WAL records.
But it seems to be a universal mechanism which can be used not only for DDL.

I agree that custom WAL records add no performance or functionality advantages over using a table.
This is why we still haven't switched to them. But IMHO the approach of inserting DDL (or any other user-defined information) into a special table looks like a hack.
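
For illustration only, and assuming the logical decoding messages facility added in 9.6 (LogLogicalMessage / pg_logical_emit_message) is the custom-WAL mechanism in question, writing the DDL text directly into the log could look roughly like this (the prefix name is made up):

/* Hedged sketch: write the original DDL statement into WAL as a
 * transactional logical message, so that logical decoding hands it to the
 * output plugin in commit order, together with the row changes of the same
 * transaction.  The prefix is illustrative; error handling is omitted. */
#include "postgres.h"

#include "replication/message.h"

#define MM_DDL_PREFIX "multimaster_ddl"     /* made-up prefix */

static void
log_ddl_statement(const char *ddl)
{
    LogLogicalMessage(MM_DDL_PREFIX, ddl, strlen(ddl) + 1,
                      true /* transactional */);
}

On the receiving side the output plugin would then need the message callback (message_cb) added in 9.6 to forward such records to the subscriber, which is the kind of protocol extension mentioned above.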


This doesn't address what I asked above though, which is whether you have tried doing ALTER TABLE in a 2PC xact with your 2PC replication patch, especially one that also makes row changes. 

Well, recently I made an attempt to merge our code with the latest version of the pglogical plugin (our original implementation of multimaster was based on code partly taken from BDR), but finally I had to postpone most of the changes. My primary intention was to support metadata caching. But the presence of multiple apply workers makes it impossible to implement it in the same way as it is done in the pglogical plugin.

Not with a simplistic implementation of multiple workers that just round-robin process transactions, no. Your receiver will have to be smart enough to read the protocol stream and write the metadata changes to a separate stream all the workers read. Which is awkward.

I think you'll probably need your receiver to act as a metadata broker for the apply workers in the end.

Also, the pglogical plugin now contains a lot of code which performs mapping between source and target database schemas, so it is assumed that they may be different.
But that is not true in the case of multimaster, and I do not want to pay extra cost for functionality we do not need.

All it's really doing is mapping upstream to downstream tables by name, since the oids will be different.

Really?
Why then do you send all table metadata (information about attributes) and handle invalidation messages?
What is the purpose of the "mapping to local relation, filled as needed" fields in PGLogicalRelation if you are not going to perform such mapping?

Multimaster really does need to map remote OIDs to local ones. But we do not need to provide any attribute mapping or handle catalog invalidations.
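
For a symmetric multimaster schema the whole mapping reduces to a name lookup, roughly like this (the backend functions are real, the wrapper is only a sketch; a real apply worker would take a proper lock instead of NoLock):

/* Sketch: resolve a relation received from the upstream (identified by
 * schema and name in the protocol) to the local OID.  With identical
 * schemas on all nodes this is the only mapping multimaster needs. */
#include "postgres.h"

#include "catalog/namespace.h"
#include "nodes/makefuncs.h"

static Oid
resolve_remote_relation(const char *nspname, const char *relname)
{
    RangeVar   *rv = makeRangeVar(pstrdup(nspname), pstrdup(relname), -1);

    /* NoLock only for brevity; missing_ok = true returns InvalidOid. */
    return RangeVarGetRelid(rv, NoLock, true);
}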

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
