Re: Replication Node Identifiers and crashsafe Apply Progress

From:           Steve Singer
Subject:        Re: Replication Node Identifiers and crashsafe Apply Progress
Date:
Msg-id:         BLU0-SMTP40787261A4CCD7A35F1028DCE70@phx.gbl
In response to: Replication Node Identifiers and crashsafe Apply Progress (Andres Freund <andres@2ndquadrant.com>)
Responses:      Re: Replication Node Identifiers and crashsafe Apply Progress (Andres Freund <andres@2ndquadrant.com>)
List:           pgsql-hackers
On 11/14/2013 12:26 PM, Andres Freund wrote:
> Hello,
>
> As you know, the reason we are working on changeset extraction is that
> we want to build logical unidirectional and bidirectional replication
> on top. To use changeset extraction effectively, I think one set of
> related features on top of it is very useful:
>
> When extracting changes using the changeset extraction patchset (latest
> version at [1]), the START_LOGICAL_REPLICATION command is used to
> stream changes from a source system. Once started it will continue to
> send changes as long as the connection is up or until it is aborted.
> For obvious performance reasons it will *not* wait for an ACK for each
> transaction commit it streams out. Instead it relies on the receiver,
> exactly as in physical replication, sending feedback messages
> containing the LSN up to which data has safely been received.
> That means something like the following is frequent:
>
> walsender: => COMMIT 0/10000000
> walsender: => COMMIT 0/10000200
> walsender: => COMMIT 0/10000400
> walsender: => COMMIT 0/10000600
> receiver:  <= ACKNOWLEDGE 0/10000270
> walsender: => COMMIT 0/10000800
>
> This is possible and important for performance: the server has streamed
> out more changes than it has received confirmation for.
>
> So, when the replication connection goes down, e.g. because the
> receiving side has crashed, we need to tell the server from where to
> start. Every position between the last ACKed LSN and the end of WAL is
> legal. The receiver can then ask the source to start replication from
> the last replayed commit using START_LOGICAL_REPLICATION 'slot_name'
> '0/10000600', which would re-stream all the changes in the transaction
> that committed at 0/10000600 and all that follow.
>
> But for that the receiving side needs to know up to where changes have
> been applied.
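To make the restart rule concrete: the receiver only has to remember the commit LSN of the last transaction it fully applied, and that value is always a legal restart point (it lies between the last ACK and the end of WAL). A minimal sketch; the struct and function names below are made up for illustration, not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical receiver-side state: what it has confirmed to the sender
 * versus what it has actually applied locally. */
typedef struct ReceiverState
{
    XLogRecPtr last_acked;    /* last LSN fed back to the walsender */
    XLogRecPtr last_replayed; /* commit LSN of the last applied transaction */
} ReceiverState;

/* Called after each replayed commit. */
static void
receiver_commit_applied(ReceiverState *st, XLogRecPtr commit_lsn)
{
    st->last_replayed = commit_lsn;
}

/* The position to hand to START_LOGICAL_REPLICATION after a reconnect:
 * the last replayed commit, which is at or past the last ACK. */
static XLogRecPtr
receiver_restart_lsn(const ReceiverState *st)
{
    return st->last_replayed;
}
```

Transactions that were streamed but not yet applied are simply re-sent after the restart, which is why no per-commit ACK is needed.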
> One relatively easy solution for that is that the receiving side does
> something like:
>
>     UPDATE replication_progress SET lsn = '0/10000600' WHERE source_id = ...;
>
> before the end of every replayed transaction. But that obviously will
> quickly cause bloat.

I don't see how this is going to cause any more bloat than what
trigger-based Slony does today with sl_confirm, and I don't hear a lot
of complaints about that being a big problem. This might be because
Slony doesn't do a commit on the replica for every transaction but
groups the transactions together; logical Slony will behave the same
way, where we would only commit on SYNC transactions.

> Our solution to that is that a replaying process can tell the backend
> that it is currently doing so and set up three variables for every
> transaction:
> 1) an identifier for the source database
> 2) the LSN at which the replayed transaction has committed remotely
> 3) the time at which the replayed transaction has committed remotely
>
> When the transaction then commits, the commit record will set the
> XACT_CONTAINS_ORIGIN flag in ->xinfo and will add that data to the end
> of the commit record. During crash recovery the startup process will
> remember the newest LSN for each remote database in shared memory.
>
> This way, after a crash, restart, or disconnect, the replay process can
> look into shared memory, check how far it has already replayed, and
> restart seamlessly, with minimal effort.
>
> We previously discussed the topic and some were very averse to using
> any sort of numeric node identifiers across systems and suggested that
> those should only be used internally. So what the attached patch does
> is to add a new shared system catalog called
> 'pg_replication_identifier' (suggestions for a better name welcome)
> which translates a number of identifying traits into a numeric
> identifier.
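For readers following along, the crash-recovery half of that design can be sketched roughly as follows. The names, the fixed-size array, and the redo hook are illustrative assumptions, not the patch's actual shared-memory layout:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t XLogRecPtr;
typedef uint16_t RepNodeId;

#define MAX_REPLICATION_NODES 64

/* Stand-in for the per-origin progress kept in shared memory. */
static XLogRecPtr replay_progress[MAX_REPLICATION_NODES];

/* Redo one commit record that carried origin data (XACT_CONTAINS_ORIGIN):
 * keep the newest remote commit LSN seen for that origin. */
static void
redo_commit_with_origin(RepNodeId origin, XLogRecPtr remote_commit_lsn)
{
    if (remote_commit_lsn > replay_progress[origin])
        replay_progress[origin] = remote_commit_lsn;
}

/* After recovery, a replay process asks how far it already got, instead of
 * maintaining a bloat-prone progress table of its own. */
static XLogRecPtr
remote_commit_for_node(RepNodeId origin)
{
    return replay_progress[origin];
}
```

The point of piggybacking on the commit record is that the progress update is exactly as crash-safe as the commit itself, with no extra heap traffic.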
> The set of identifiers currently is:
> * the sysid of the remote system, combined with the remote TLI
> * the oid of the local database
> * the oid of the remote database
> * an optional name
> but that's just what we needed in our multimaster prototype, and not
> what I necessarily think is correct.
>
> The added API (which surely needs some work; I am not particularly
> happy with the naming of the functions, for one) basically consists of
> two parts:
>
> 1) functions to query/create replication identifiers:
> * GetReplicationIdentifier(identifying traits) - searches for a numeric
>   replication identifier
> * CreateReplicationIdentifier(identifying traits) - creates a numeric
>   replication identifier
> * GetReplicationInfoByIdentifier(numeric identifier) - returns the
>   identifying traits
>
> 2) functions to query/manipulate replication progress:
> * AdvanceReplicationIdentifier(node, local_lsn, remote_lsn)
> * XLogRecPtr RemoteCommitFromReplicationIdentifier(node)
>
> Internally the code also maintains some on-disk data which is updated
> during checkpoints to store the replication progress; otherwise it
> would vanish if we shut down gracefully ;).
>
> The attached code also integrates with the "commit timestamp" module
> that Alvaro submitted ([2]). Every time a remote transaction is
> committed we store in it a) the remote commit's timestamp and b) the
> origin node id. That makes it relatively easy to build multimaster
> systems with conflict resolution on top, since whenever there is a
> conflict the originating node and originating commit timestamp for a
> row can be queried efficiently.
>
> Having information about the origin of a change/transaction allows
> implementing complex replication topologies, since the information is
> available to changeset extraction output plugins.
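As a rough illustration of how the identifier half of that API behaves, here is a toy in-memory version of the lookup/create split. The real functions work against the pg_replication_identifier catalog; the struct layout and sizes here are guesses for the sake of a runnable sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t RepNodeId;
#define InvalidRepNodeId 0

/* Guessed shape of the identifying traits from the proposal. */
typedef struct ReplicationIdentifier
{
    uint64_t remotesysid;
    uint32_t remotetli;
    uint32_t remotedb;
    uint32_t localdb;
    char     riname[64];
} ReplicationIdentifier;

#define MAX_IDENTIFIERS 32
static ReplicationIdentifier identifiers[MAX_IDENTIFIERS];
static RepNodeId next_node_id = 1;   /* id 0 is reserved as invalid */

/* Look up an existing mapping; InvalidRepNodeId when none exists. */
static RepNodeId
GetReplicationIdentifier(uint64_t sysid, uint32_t tli, uint32_t remotedb,
                         const char *riname, uint32_t localdb)
{
    for (RepNodeId i = 1; i < next_node_id; i++)
    {
        ReplicationIdentifier *ri = &identifiers[i];

        if (ri->remotesysid == sysid && ri->remotetli == tli &&
            ri->remotedb == remotedb && ri->localdb == localdb &&
            strcmp(ri->riname, riname) == 0)
            return i;
    }
    return InvalidRepNodeId;
}

/* Assign the next free local node id to a new set of traits. */
static RepNodeId
CreateReplicationIdentifier(uint64_t sysid, uint32_t tli, uint32_t remotedb,
                            const char *riname, uint32_t localdb)
{
    RepNodeId id = next_node_id++;
    ReplicationIdentifier *ri = &identifiers[id];

    ri->remotesysid = sysid;
    ri->remotetli = tli;
    ri->remotedb = remotedb;
    ri->localdb = localdb;
    strncpy(ri->riname, riname, sizeof(ri->riname) - 1);
    return id;
}
```

A caller would first try GetReplicationIdentifier() and fall back to CreateReplicationIdentifier() when it returns InvalidRepNodeId; the resulting small id is only meaningful on the local system.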
> It allows writing plugins that:
> * decode all changes, independent of the system on which they were
>   originally executed by the user
> * decode changes generated locally, but none from remote systems
> * pick and choose between those, say only decode those that the
>   receiving system isn't replicating from itself
>
> Questions are:
> * Which identifying traits do we want to use to identify nodes?
> * How do we want to manipulate replication identifiers? Currently they
>   can only be manipulated using C functions, which is fine for some
>   users, but probably not for others?
> * Do we want to allow setting (remote_lsn, remote_timestamp,
>   remote_node_id) via SQL? Currently the remote_node_id can be set as
>   a GUC, but the others can't. They should probably be settable via a
>   function that can be called instead of GUCs?

A way of advancing the replication pointer via SQL would be nice;
otherwise I'll just have to write my own C function that I will invoke
via SQL (which isn't hard, but everyone would need to do the same).

> * Suggestions for better names!
> * Would slony et al need something on top to use this?

In the Slony world we identify nodes with a 32-bit integer. I think the
idea is that I'm going to have to pass arguments into

+extern RepNodeId GetReplicationIdentifier(uint64 remotesysid, Oid remotetli,
+                                          Oid remotedb, Name riname,
+                                          Oid rilocaldb);

to map my Slony concept of a node id to a 16-bit "node id" that is only
useful on the local system. In addition to a Slony 32-bit node id I
have a conninfo that I can use to contact that node. I think the slon
would need to connect to the remote node with that conninfo (which it
does anyway), get the remote oids, and then use the Slony node_id
converted to a string as the "riname" value. I would then have to
invoke AdvanceReplicationIdentifier at some point before I issue the
commit.
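The name mapping itself would be trivial; something like this hypothetical helper (the function is mine, not part of the patch), with the remote oids fetched over the node's conninfo and the result passed as "riname":

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Render a 32-bit Slony node id as the string to use for the proposed
 * "riname" identifying trait.  Illustrative only. */
static void
slony_node_to_riname(uint32_t slony_node_id, char *riname, size_t len)
{
    snprintf(riname, len, "%u", slony_node_id);
}
```

The local 16-bit node id that comes back would then be used for the AdvanceReplicationIdentifier call before each commit.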
What does building up the node_id key from (sysid, tlid, remote_dbid,
local_dbid, name) get us over just mapping from an arbitrary name field
to a 16-bit node_id? I agree with the other comments on the thread that
letting the replication system figure out its own unique naming is
better. If we were going to come up with a schema, then I am also not
sure that using the remote TLI as part of the node key is a good idea.

> Todo:
> * cleanup/naming
> * Set-returning function to see the replication progress
> * remove old checkpoint files
>
> Note that this only applies a) on top of the changeset extraction code
> and b) the commit timestamp code. The 'replication-identifiers' git
> branch ([3]) contains all of it integrated together.
>
> Comments welcome!
>
> Greetings,
>
> Andres Freund
>
> [1] http://archives.postgresql.org/message-id/20131114134647.GA26172%40alap2.anarazel.de
> [2] http://archives.postgresql.org/message-id/20131022221600.GE4987%40eldon.alvh.no-ip.org
> [3] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/replication-identifiers