Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From: Andres Freund
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date:
Msg-id: 201206201115.51305.andres@2ndquadrant.com
In reply to: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Wednesday, June 20, 2012 02:35:59 AM Robert Haas wrote:
> > On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> >> On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >>> Btw, what do you mean with "conflating" the stream? I don't really see
> >>> that being proposed.
> >> 
> >> It seems to me that you are intent on using the WAL stream as the
> >> logical change stream.  I think that's a bad design.  Instead, you
> >> should extract changes from WAL and then ship them around in a format
> >> that is specific to logical replication.
> > 
> > Yeah, that seems worth elaborating on.
> > 
> > What has been said several times is that it's pretty necessary to
> > capture the logical changes into WAL.  That seems pretty needful, in
> > order that the replication data gets fsync()ed avidly, and so that we
> > don't add in the race condition of needing to fsync() something *else*
> > almost exactly as avidly as is the case for WAL today.
> 
> Check.
> 
> > But it's undesirable to pull *all* the bulk of contents of WAL around
> > if it's only part of the data that is going to get applied.  On a
> > "physical streaming" replica, any logical data that gets captured will
> > be useless.  And on a "logical replica," the "physical" bits of WAL
> > will be useless.
> > 
> > What I *want* you to mean is that there would be:
> > a) WAL readers that pull the "physical bits", and
> > b) WAL readers that just pull "logical bits."
> > 
> > I expect it would be fine to have a tool that pulls LCRs out of WAL to
> > prepare that to be sent to remote locations.  Is that what you have in
> > mind?
> Yes.  I think it should be possible to generate LCRs from WAL, but I
> think that the on-the-wire format for LCRs should be different from
> the WAL format.  Trying to use the same format for both things seems
> like an unpleasant straightjacket.  This discussion illustrates why:
> we're talking about consuming scarce bit-space in WAL records for a
> feature that only a tiny minority of users will use, and it's still
> not really enough bit space.  That stinks.  If LCR transmission is a
> separate protocol, this problem can be engineered away at a higher
> level.
As I said before, I definitely agree that we want to have a separate transport 
format once we have decoding nailed down. We still need to ship wal around if 
the decoding happens in a different instance, but *after* that it can be 
shipped in something more convenient/appropriate.

> Suppose we have three servers, A, B, and C, that are doing
> multi-master replication in a loop.  A sends LCRs to B, B sends them
> to C, and C sends them back to A.  Obviously, we need to make sure
> that each server applies each set of changes just once, but it
> suffices to have enough information in WAL to distinguish between
> replication transactions and non-replication transactions - that is,
> one bit.  So suppose a change is made on server A.  A generates LCRs
> from WAL, and tags each LCR with node_id = A.  It then sends those
> LCRs to B.  B applies them, flagging the apply transaction in WAL as a
> replication transaction, AND ALSO sends the LCRs to C.  The LCR
> generator on B sees the WAL from apply, but because it's flagged as a
> replication transaction, it does not generate LCRs.  So C receives
> LCRs from B just once, without any need for the node_id to be known
> in WAL.  C can now also apply those LCRs (again flagging the apply
> transaction as replication) and it can also skip sending them to A,
> because it sees that they originated at A.
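The one-bit scheme described above can be sketched roughly as follows (a hypothetical Python illustration; the record shape and field names are made up for the example, they are not from the patch):

```python
# Sketch of the one-bit scheme: the LCR generator decodes WAL but skips
# transactions whose apply was flagged as replication, so each change is
# turned into an LCR exactly once, on the node where it originated.

def generate_lcrs(wal_txns):
    """Emit one LCR per locally originated transaction."""
    return [
        {"node_id": txn["node"], "change": txn["change"]}
        for txn in wal_txns
        if not txn["is_replication_apply"]
    ]

# WAL on B contains a local change plus the apply of A's change.
wal_on_b = [
    {"node": "B", "change": "local insert", "is_replication_apply": False},
    {"node": "B", "change": "apply of A's insert", "is_replication_apply": True},
]
lcrs = generate_lcrs(wal_on_b)  # only the local insert becomes an LCR
```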
One bit is fine if you have only very simple replication topologies. Once you 
think about globally distributed databases it's a bit different. You describe 
some of that below, but just to reiterate: 
Imagine having 6 nodes, 3 on each of two continents (ABC in North America, DEF 
in Europe). You may only want to have a full intercontinental interconnect 
between two of those (say A and D). If you only have one bit to represent the 
origin that's not going to work, because on A you won't be able to discern the 
changes from B and C from those originating on DEF.
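To make that concrete, here is a hypothetical sketch (cluster names and record shape invented for illustration) of the decision A has to make on the intercontinental link, which a single boolean flag cannot express:

```python
# With only an "is replicated" bit, A cannot tell changes originating on
# B or C (which D still needs) from changes that already came over the
# link from D's side (which DEF has already seen).  A full origin id
# makes the forwarding decision trivial.

LOCAL_CLUSTER = {"A", "B", "C"}   # North American nodes
REMOTE_CLUSTER = {"D", "E", "F"}  # European nodes

def ship_to_d(change):
    """A forwards a change across the intercontinental link only if it
    originated in the local cluster."""
    return change["origin"] in LOCAL_CLUSTER

changes = [
    {"origin": "B", "data": "row1"},  # must cross the link
    {"origin": "E", "data": "row2"},  # already known on DEF, skip
]
shipped = [c for c in changes if ship_to_d(c)]
```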

Another topology which is interesting is circular replication (i.e. changes 
get shipped A->B, B->C, C->A), which is a sensible topology if you only have a 
low change rate and a relatively high number of nodes, because you don't need 
the full combinatorial number of connections.
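The ring can be sketched like this (hypothetical illustration; node names and the `RING` map are invented): each node ships only to its single successor, and forwarding stops once the change would return to its origin, so three links replace the full mesh.

```python
# Circular replication: each node has exactly one outgoing link.
RING = {"A": "B", "B": "C", "C": "A"}

def propagate(origin):
    """Return, in order, the nodes that apply a change originating on
    `origin`.  The origin id is what lets the loop terminate."""
    applied_on = []
    node = RING[origin]
    while node != origin:  # stop before shipping back to the origin
        applied_on.append(node)
        node = RING[node]
    return applied_on
```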

The origin_ids still only have to be meaningful in the local context though. As 
described before, in the communication between the different nodes you can 
simply replace the 16bit node id with some fancy UUID or such, and do the 
reverse when replaying LCRs.
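That translation step could look something like this (a hypothetical sketch: the catalog, hostnames, and function names are all invented; WAL keeps the compact 16bit id, the wire format carries the UUID):

```python
import uuid

# Per-node catalog mapping compact local origin ids to global identities.
LOCAL_TO_GLOBAL = {
    1: uuid.uuid5(uuid.NAMESPACE_DNS, "node-a.example.com"),
    2: uuid.uuid5(uuid.NAMESPACE_DNS, "node-b.example.com"),
}
GLOBAL_TO_LOCAL = {g: l for l, g in LOCAL_TO_GLOBAL.items()}

def encode_origin(local_id):
    """Swap the 16bit WAL origin id for a UUID before shipping an LCR."""
    return LOCAL_TO_GLOBAL[local_id]

def decode_origin(global_id):
    """Map the wire UUID back to a local origin id when replaying."""
    return GLOBAL_TO_LOCAL[global_id]
```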

> Now suppose we have a more complex topology.  Suppose we have a
> cluster of four servers A .. D which, for improved tolerance against
> network outages, are all connected pairwise.  Normally all the links
> are up, so each server sends all the LCRs it generates directly to all
> other servers.  But how do we prevent cycles?  A generates a change
> and sends it to B, C, and D.  B then sees that the change came from A
> so it sends it to C and D.  C, receiving that change, sees that came
> from A via B, so it sends it to D again, whereupon D, which got it
> from C and knows that the origin is A, sends it to B, who will then
> send it right back over to D.  Obviously, we've got an infinite loop
> here, so this topology will not work.  However, there are several
> obvious ways to patch it by changing the LCR protocol.  Most
> obviously, and somewhat stupidly, you could add a TTL. A bit smarter,
> you could have each LCR carry a LIST of node_ids that it had already
> visited, refusing to send it to any node it had already been to,
> instead of a single node_id.  Smarter still, you could send
> handshaking messages around the cluster so that each node can build up
> a spanning tree and prefix each LCR it sends with the list of
> additional nodes to which the recipient must deliver it.  So,
> normally, A would send a message to each of B, C, and D destined only
> for that node; but if the A-C link went down, A would choose either B
> or D and send each LCR to that node destined for that node *and C*;
> then, that node would forward the message on to C.  Or perhaps you think this is too
> complex and not worth supporting anyway, and that might be true, but
> the point is that if you insist that all of the identifying
> information must be carried in WAL, you've pretty much ruled it out,
> because we are not going to put TTL fields, or lists of node IDs, or
> lists of destinations, in WAL.  But there is no reason they can't be
> attached to LCRs, which is where they are actually needed.
Most of those topologies are possible if you have the ability to retain the 
information about where a change originated. All the more complex information 
like the list of nodes you want to apply changes from and such doesn't belong 
in the wal.
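To illustrate the split: routing state like Robert's visited list would live only in the LCR wire format, while WAL records just the origin. A hypothetical sketch (field names and topology invented):

```python
# Cycle prevention at the LCR protocol level: the LCR carries a visited
# set; nothing beyond the origin ever needs to be recorded in WAL.

def forward_targets(lcr, here, peers):
    """Which peers this hop should ship the LCR to, avoiding cycles."""
    lcr["visited"].add(here)
    return [
        p for p in peers
        if p not in lcr["visited"] and p != lcr["origin"]
    ]

# B received a change that originated on A and decides where to send it.
lcr = {"origin": "A", "visited": {"A"}, "change": "insert"}
targets = forward_targets(lcr, "B", ["A", "C", "D"])
```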

> > Or are you feeling that the "logical bits" shouldn't get
> > captured in WAL altogether, so we need to fsync() them into a
> > different stream of files?
> No, that would be ungood.
Agreed.

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

