Re: [HACKERS] logical decoding of two-phase transactions

From: Craig Ringer
Subject: Re: [HACKERS] logical decoding of two-phase transactions
Msg-id: CAMsr+YHQzGxnR-peT4SbX2-xiG2uApJMTgZ4a3TiRBM6COyfqg@mail.gmail.com
In reply to: Re: [HACKERS] logical decoding of two-phase transactions  (Michael Paquier <michael.paquier@gmail.com>)
Responses: Re: [HACKERS] logical decoding of two-phase transactions  (Michael Paquier <michael.paquier@gmail.com>)
           Re: [HACKERS] logical decoding of two-phase transactions  (Stas Kelvich <s.kelvich@postgrespro.ru>)
List: pgsql-hackers


On 31 Jan. 2017 19:29, "Michael Paquier" <michael.paquier@gmail.com> wrote:
> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> > Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
> > wal_level >= logical I don't think that's the end of the world. But
> > since we already have almost everything we need in memory, why not
> > just stash the gid on ReorderBufferTXN?
>
> I have been through this thread... And to be honest, I have a hard
> time understanding for which purpose the information of a 2PC
> transaction is useful in the case of logical decoding.

TL;DR: this lets us decode the xact after prepare but before commit so decoding/replay outcomes can affect the commit-or-abort decision.


> The prepare and
> commit prepared have been received by a node which is at the root of
> the cluster tree, a node of the cluster at an upper level, or a
> client, being in charge of issuing all the prepare queries, and then
> issue the commit prepared to finish the transaction across a cluster.
> In short, even if you do logical decoding from the root node, or the
> one at a higher level, you would care just about the fact that it has
> been committed.

That's where you've misunderstood - it isn't committed yet. The point of this change is to allow us to do logical decoding at the PREPARE TRANSACTION point. The xact is not yet committed or rolled back.

This allows the results of logical decoding - or, more interestingly, the results of replay on another node, to another app, or whatever - to influence the commit or rollback decision.

Stas wants this for a conflict-free, logical, semi-synchronous multi-master replication solution. At PREPARE TRANSACTION time we replay the xact to the other nodes, each of which applies it, does its own PREPARE TRANSACTION, then replies to confirm it has successfully prepared the xact. When all nodes confirm the xact is prepared it is safe for the origin node to COMMIT PREPARED. The other nodes then see that the first node has committed and they commit too.

Alternatively, if any node replies "could not replay xact" or "could not prepare xact", the origin node knows to ROLLBACK PREPARED. All the other nodes see that and roll back too.
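
The commit-or-rollback rule described above can be sketched roughly as follows. This is a toy in-memory simulation in Python, not PostgreSQL code: `Peer`, `replay_and_prepare`, `finish` and `decide` are names made up for illustration, and in a real system the exchange would go through logical decoding and replication connections.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical stand-in for a downstream node (not a PostgreSQL API)."""
    name: str
    can_apply: bool = True                       # pretend outcome of replaying the xact
    prepared: set = field(default_factory=set)   # gids this peer currently has prepared
    committed: set = field(default_factory=set)  # gids this peer has committed

    def replay_and_prepare(self, gid: str) -> bool:
        # Stands in for: apply the decoded changes, then PREPARE TRANSACTION 'gid'.
        if self.can_apply:
            self.prepared.add(gid)
        return self.can_apply

    def finish(self, gid: str, commit: bool) -> None:
        # Stands in for: COMMIT PREPARED 'gid' or ROLLBACK PREPARED 'gid'.
        if gid in self.prepared:
            self.prepared.discard(gid)
            if commit:
                self.committed.add(gid)


def decide(gid: str, peers: list) -> str:
    """Origin-side rule: commit only if every peer confirmed its prepare."""
    # Build the full list first so every peer is asked, even after a failure.
    replies = [p.replay_and_prepare(gid) for p in peers]
    commit = all(replies)
    for p in peers:
        p.finish(gid, commit)
    return "COMMIT PREPARED" if commit else "ROLLBACK PREPARED"
```

With two peers that can both apply the xact, `decide` returns "COMMIT PREPARED"; if any one peer has `can_apply=False`, it returns "ROLLBACK PREPARED" and no peer ends up with the xact committed.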

This makes it possible to be much more confident that what's replicated is exactly the same on all nodes, with no after-the-fact MM conflict resolution that apps must be aware of to function correctly.

To really make it rock solid you also have to send the old and new values of a row, or have row versions, or send old row hashes. Something I also want to have, but we can mostly get that already with REPLICA IDENTITY FULL.
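
To illustrate the "old row hashes" idea only: a receiving node could compare a hash of its local row against a hash of the row as the origin saw it before the change, and refuse to prepare on a mismatch. The sketch below is an assumption-laden toy in Python. The helper names and the JSON-based encoding are invented for the example, and nothing here is an existing PostgreSQL feature.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Stable hash of a row; the JSON encoding is an arbitrary choice for this sketch."""
    encoded = json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def can_apply_update(local_row: dict, old_row_hash: str) -> bool:
    """True if the local row still matches what the origin saw before its update."""
    return row_hash(local_row) == old_row_hash
```

A node for which `can_apply_update` is false would be one that replies "could not replay xact", so the origin rolls back instead of letting the nodes diverge.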

It is of interest to me because schema changes in MM logical replication are more challenging, awkward and restrictive without it. Optimistic conflict resolution doesn't work well for schema changes, and once the conflicting schema changes are committed on different nodes there is no going back. So you need your async system to have a global locking model for schema changes to stop conflicts arising. Or expect the user not to do anything silly / misunderstand anything and know all the relevant system limitations and requirements... which we all know works just great in practice.

You also need a way to ensure that schema changes don't render committed-but-not-yet-replayed row changes from other peers nonsensical. The safest way is a barrier where all row changes committed on any node before committing the schema change on the origin node must be fully replayed on every other node, making an async MM system temporarily sync single master (and requiring all nodes to be up and reachable). Otherwise you need a way to figure out how to conflict-resolve incoming rows with missing columns / added columns / changed types / renamed tables etc, which is no fun and nearly impossible in the general case.
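
For concreteness, the barrier condition just described could be checked along these lines. This is a hypothetical Python sketch: plain integers stand in for WAL positions, and `barrier_reached` and its two dictionaries are invented for illustration rather than taken from any existing system.

```python
def barrier_reached(committed_upto: dict, replayed_upto: dict) -> bool:
    """Check whether every node has replayed every other node's pre-barrier changes.

    committed_upto[src]      = last position committed on src before the schema change
    replayed_upto[dst][src]  = how far dst has replayed changes originating on src
    """
    for src, target in committed_upto.items():
        for dst, progress in replayed_upto.items():
            if dst == src:
                continue  # a node does not replay its own changes
            # Missing progress entries count as "nothing replayed yet".
            if progress.get(src, 0) < target:
                return False
    return True
```

Only once this returns true for all nodes would it be safe to commit the schema change on the origin, which is why every node has to be up and reachable for the barrier to complete.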

2PC decoding lets us avoid all this mess by sending all nodes the proposed schema change and waiting until they all confirm successful prepare before committing it. It can also be used to solve the row compatibility problems with some more lazy inter-node chat in logical WAL messages.

I think the purpose of having the GID available to the decoding output plugin at PREPARE TRANSACTION time is that it can co-operate with a global transaction manager that way. Each node can tell the GTM "I'm ready to commit [X]". It is IMO not crucial since you can otherwise use a (node-id, xid) tuple, but it'd be nice for coordinating with external systems, simplifying inter-node chatter, integrating logical decoding into bigger systems with external transaction coordinators/arbitrators etc. It seems pretty silly _not_ to have it really.
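
A small Python sketch of the two keying options mentioned above, purely for illustration. `txn_key` and `ToyGTM` are hypothetical names and this is not the interface of any real transaction manager. It assumes the (node-id, xid) pair always refers to the origin node and its xid, so that every node reports under the same key.

```python
from typing import Optional, Tuple, Union

TxnKey = Union[str, Tuple[str, int]]


def txn_key(origin_node_id: str, origin_xid: int, gid: Optional[str] = None) -> TxnKey:
    """Prefer the GID when the output plugin can see it; else fall back to (node-id, xid)."""
    return gid if gid is not None else (origin_node_id, origin_xid)


class ToyGTM:
    """Hypothetical global transaction manager collecting 'ready to commit' reports."""

    def __init__(self, expected_nodes):
        self.expected = set(expected_nodes)
        self.ready = {}  # txn key -> set of nodes that have reported ready

    def report_ready(self, key: TxnKey, node: str) -> bool:
        # Returns True once every expected node has reported ready for this key.
        self.ready.setdefault(key, set()).add(node)
        return self.ready[key] == self.expected
```

Either key works for the bookkeeping; the GID has the advantage that an external coordinator already knows it, whereas the (node-id, xid) pair has to be communicated to it separately.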

Personally I don't think lack of access to the GID justifies blocking 2PC logical decoding. It can be added separately. But it'd be nice to have especially if it's cheap.
