Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From Robert Haas
Subject Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date
Msg-id CA+TgmoZqdFjapxPXSiDLC=1WE-h=Hr7y0bGsckqfTpc1k+A68Q@mail.gmail.com
In reply to Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node  (Christopher Browne <cbbrowne@gmail.com>)
Responses Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node  (Simon Riggs <simon@2ndQuadrant.com>)
Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Btw, what do you mean with "conflating" the stream? I don't really see that
>>> being proposed.
>>
>> It seems to me that you are intent on using the WAL stream as the
>> logical change stream.  I think that's a bad design.  Instead, you
>> should extract changes from WAL and then ship them around in a format
>> that is specific to logical replication.
>
> Yeah, that seems worth elaborating on.
>
> What has been said several times is that it's pretty necessary to
> capture the logical changes into WAL.  That seems pretty needful, in
> order that the replication data gets fsync()ed avidly, and so that we
> don't add in the race condition of needing to fsync() something *else*
> almost exactly as avidly as is the case for WAL today.

Check.

> But it's undesirable to pull *all* the bulk of contents of WAL around
> if it's only part of the data that is going to get applied.  On a
> "physical streaming" replica, any logical data that gets captured will
> be useless.  And on a "logical replica," the "physical" bits of WAL
> will be useless.
>
> What I *want* you to mean is that there would be:
> a) WAL readers that pull the "physical bits", and
> b) WAL readers that just pull "logical bits."
>
> I expect it would be fine to have a tool that pulls LCRs out of WAL to
> prepare that to be sent to remote locations.  Is that what you have in
> mind?

Yes.  I think it should be possible to generate LCRs from WAL, but I
think that the on-the-wire format for LCRs should be different from
the WAL format.  Trying to use the same format for both things seems
like an unpleasant straitjacket.  This discussion illustrates why:
we're talking about consuming scarce bit-space in WAL records for a
feature that only a tiny minority of users will use, and it's still
not really enough bit space.  That stinks.  If LCR transmission is a
separate protocol, this problem can be engineered away at a higher
level.

Suppose we have three servers, A, B, and C, that are doing
multi-master replication in a loop.  A sends LCRs to B, B sends them
to C, and C sends them back to A.  Obviously, we need to make sure
that each server applies each set of changes just once, but it
suffices to have enough information in WAL to distinguish between
replication transactions and non-replication transactions - that is,
one bit.  So suppose a change is made on server A.  A generates LCRs
from WAL, and tags each LCR with node_id = A.  It then sends those
LCRs to B.  B applies them, flagging the apply transaction in WAL as a
replication transaction, AND ALSO sends the LCRs to C.  The LCR
generator on B sees the WAL from apply, but because it's flagged as a
replication transaction, it does not generate LCRs.  So C receives
LCRs from B just once, without any need for the node_id to be known
in WAL.  C can now also apply those LCRs (again flagging the apply
transaction as replication) and it can also skip sending them to A,
because it sees that they originated at A.
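
The ring walk-through above can be sketched as a small simulation.  This
is only an illustration under stated assumptions, not PostgreSQL code:
the Node and LCR classes, their fields, and the idea of modelling WAL as
a list of (change, is_replication) pairs are all hypothetical.  The only
things taken from the text are the one replication bit in WAL, the
node_id tag carried on the LCR, and the rule that a node does not
forward an LCR to the server it originated from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LCR:
    origin: str  # node_id of the server where the change was first made
    change: str


class Node:
    def __init__(self, name):
        self.name = name
        self.wal = []        # (change, is_replication) pairs; the flag is the one bit in WAL
        self.wal_pos = 0     # how far this node's LCR generator has read
        self.applied = []    # changes applied on this node, in order
        self.downstream = None

    def write(self, change):
        """A local, non-replication transaction."""
        self.wal.append((change, False))
        self.applied.append(change)
        self.pump()

    def pump(self):
        """LCR generator: turn unread WAL into LCRs, skipping replication records."""
        while self.wal_pos < len(self.wal):
            change, is_replication = self.wal[self.wal_pos]
            self.wal_pos += 1
            if not is_replication:
                self.downstream.receive(LCR(self.name, change))

    def receive(self, lcr):
        # Apply, flagging the apply transaction in WAL as replication.
        self.wal.append((lcr.change, True))
        self.applied.append(lcr.change)
        # The generator sees the apply WAL but emits nothing for it.
        self.pump()
        # Forward the received LCR unless the next hop is its origin.
        if self.downstream.name != lcr.origin:
            self.downstream.receive(lcr)


def make_ring(names):
    nodes = [Node(n) for n in names]
    for i, node in enumerate(nodes):
        node.downstream = nodes[(i + 1) % len(nodes)]
    return nodes


a, b, c = make_ring(["A", "B", "C"])
a.write("x=1")
b.write("y=2")
```

In this sketch the node_id lives only on the LCR object; the simulated
WAL never stores it, yet each node ends up applying each change once.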

Now suppose we have a more complex topology.  Suppose we have a
cluster of four servers A .. D which, for improved tolerance against
network outages, are all connected pairwise.  Normally all the links
are up, so each server sends all the LCRs it generates directly to all
other servers.  But how do we prevent cycles?  A generates a change
and sends it to B, C, and D.  B then sees that the change came from A
so it sends it to C and D.  C, receiving that change, sees that it came
from A via B, so it sends it to D again, whereupon D, which got it
from C and knows that the origin is A, sends it to B, who will then
send it right back over to D.  Obviously, we've got an infinite loop
here, so this topology will not work.  However, there are several
obvious ways to patch it by changing the LCR protocol.  Most
obviously, and somewhat stupidly, you could add a TTL.  A bit smarter,
you could have each LCR carry a LIST of node_ids that it had already
visited, refusing to send it to any node it had already been to,
instead of a single node_id.  Smarter still, you could send
handshaking messages around the cluster so that each node can build up
a spanning tree and prefix each LCR it sends with the list of
additional nodes to which the recipient must deliver it.  So,
normally, A would send a message to each of B, C, and D destined only
for that node; but if the A-C link went down, A would choose either B
or D and send each LCR to that node destined for that node *and C*;
then, that node would forward the message to C.  Or perhaps you think
this is too complex and not worth supporting anyway, and that might be
true, but
the point is that if you insist that all of the identifying
information must be carried in WAL, you've pretty much ruled it out,
because we are not going to put TTL fields, or lists of node IDs, or
lists of destinations, in WAL.  But there is no reason they can't be
attached to LCRs, which is where they are actually needed.
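
As a rough sketch of the second option above (each LCR carries the list
of node_ids it has already visited), the following simulation floods one
change through a four-node mesh.  The flood function, the LCR class, and
the dictionary-of-sets link map are hypothetical names chosen for
illustration; nothing here reflects an actual PostgreSQL protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LCR:
    change: str
    visited: tuple  # node_ids this LCR has already been to, origin first


def flood(links, origin, change):
    """Send one change through the mesh; return every (receiver, visited) delivery."""
    deliveries = []
    queue = [(origin, LCR(change, (origin,)))]
    while queue:
        node, lcr = queue.pop(0)
        for peer in sorted(links[node]):
            if peer in lcr.visited:
                continue  # refuse to send to a node the LCR has already been to
            forwarded = LCR(lcr.change, lcr.visited + (peer,))
            deliveries.append((peer, forwarded.visited))
            queue.append((peer, forwarded))
    return deliveries


nodes = "ABCD"
full_mesh = {n: {m for m in nodes if m != n} for n in nodes}
deliveries = flood(full_mesh, "A", "x=1")

# Same cluster with the A-C link down: C is still reached via B or D.
degraded = {n: set(peers) for n, peers in full_mesh.items()}
degraded["A"].discard("C")
degraded["C"].discard("A")
degraded_deliveries = flood(degraded, "A", "x=1")
```

Because the visited list grows by one node on every hop and can never
repeat a node, forwarding stops on its own.  In this sketch each node
still receives the same change over several paths (15 deliveries in the
fully connected case), so a receiver would need some way to apply it
only once; that is the kind of detail that can live in the LCR protocol
rather than in WAL.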

> Or are you feeling that the "logical bits" shouldn't get
> captured in WAL altogether, so we need to fsync() them into a
> different stream of files?

No, that would be ungood.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

