Discussion: 32/64-bit transaction IDs?


32/64-bit transaction IDs?

From
"Ed L."
Date:
I need a guaranteed ordering of transaction IDs as they are queued for
asynchronous replication.  I'd use the 32-bit internal transaction ID from
GetTransactionId(), but 32 bits will not suffice for the logic I'd like to
use.  A 64-bit ID would work well enough.

Further, I am uncertain those 32-bit transaction IDs represent a correctly
ordered sequence of transactions.

Do the internal transaction IDs provide the correct transaction order?

Are 64-bit transaction IDs available?

If not, what other tricks are there to get guaranteed ordering?

Ed


Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> I need a guaranteed ordering of transaction IDs as they are queued for
> asynchronous replication.  I'd use the 32-bit internal transaction ID from
> GetTransactionId(), but 32 bits will not suffice for the logic I'd like to
> use.  A 64-bit ID would work well enough.

Uh ... your replication logic cannot ensure to reach closure within two
billion transactions?  Something's pretty out of joint there.

> Further, I am uncertain those 32-bit transaction IDs represent a correctly
> ordered sequence of transactions.

XIDs are issued in transaction start order, not transaction commit
order.  Dunno if that's what's concerning you.

            regards, tom lane

Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Thursday March 20 2003 11:12, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > I need a guaranteed ordering of transaction IDs as they are queued for
> > asynchronous replication.  I'd use the 32-bit internal transaction ID
> > from GetTransactionId(), but 32 bits will not suffice for the logic I'd
> > like to use.  A 64-bit ID would work well enough.
>
> Uh ... your replication logic cannot ensure to reach closure within two
> billion transactions?  Something's pretty out of joint there.

Uh...er...well... It just simplifies and robustifies a little logic if I can
do a clean-sweep delete of replicated transactions based on the xid being
less than some marker value.  Something like

    DELETE
    FROM replication_queue
    WHERE xid < (SELECT MIN(last_mirrored_xid) FROM slave);

where slave is a relation of slaves to keep track of where they are in the
replication process.  When transaction IDs wrap around, that logic breaks
down for the new xids of 1, 2, 3, ... , and more accounting is required.
In our case, under considerable load, it probably wouldn't wrap for a year
or three, but it'd be nice to avoid that little timebomb if it's easy.
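
(In principle the comparison could be made wraparound-aware the same
modulo-2^31 way the backend's TransactionIdPrecedes() does it; an untested
sketch:

    create function xid_precedes(int8, int8) returns boolean as '
        -- treat the difference as a signed 32-bit quantity:
        -- $1 precedes $2 iff ($1 - $2) mod 2^32 lands in the upper half
        select ((($1 - $2) % 4294967296 + 4294967296) % 4294967296)
               >= 2147483648;
    ' language sql;

but that is exactly the sort of extra accounting I'd rather avoid.)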

> > Further, I am uncertain those 32-bit transaction IDs represent a
> > correctly ordered sequence of transactions.
>
> XIDs are issued in transaction start order, not transaction commit
> order.  Dunno if that's what's concerning you.

Yes, I'm interested in any reliable ordering of transaction commits for the
purpose of serial asynchronous replication.  Is that possible?

Ed

Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
>> XIDs are issued in transaction start order, not transaction commit
>> order.  Dunno if that's what's concerning you.

> Yes, I'm interested in any reliable ordering of transaction commits for the
> purpose of serial asynchronous replication.  Is that possible?

The order in which commit records are inserted into the WAL log would
work (and if you just need an ID number, the commit records' WAL-log
addresses are 64 bits and would serve nicely).  But you'll have to add
code to the backend to get at that.

            regards, tom lane

Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Friday March 21 2003 11:53, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> >> XIDs are issued in transaction start order, not transaction commit
> >> order.  Dunno if that's what's concerning you.
> >
> > Yes, I'm interested in any reliable ordering of transaction commits for
> > the purpose of serial asynchronous replication.  Is that possible?
>
> The order in which commit records are inserted into the WAL log would
> work (and if you just need an ID number, the commit records' WAL-log
> addresses are 64 bits and would serve nicely).  But you'll have to add
> code to the backend to get at that.

A 64-bit serial ID number would suffice.  I could spend a little time on
that.  Any starter hints on files/functions/structure of such an addition?
Something akin to GetTransactionId()?

Ed


Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
Ed Loehr <ed@LoehrTech.com> writes:
> A 64-bit serial ID number would suffice.  I could spend a little time on
> that.  Any starter hints on files/functions/structure of such an addition?
> Something akin to GetTransactionId()?

You'd have to capture the WAL "recptr" returned to
RecordTransactionCommit by XLogInsert.  What you do with it afterwards,
I have no idea.

            regards, tom lane

Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Friday March 21 2003 2:51, you wrote:
> Ed Loehr <ed@LoehrTech.com> writes:
> > A 64-bit serial ID number would suffice.  I could spend a little time
> > on that.  Any starter hints on files/functions/structure of such an
> > addition? Something akin to GetTransactionId()?
>
> You'd have to capture the WAL "recptr" returned to
> RecordTransactionCommit by XLogInsert.  What you do with it afterwards,
> I have no idea.

Well, I'm trying to capture a 64-bit pseudo-transaction ID, just like the
WAL record number, but to do it within a C trigger so I can queue it into
another table and have all-or-none semantics.  Am I correct in assuming the
XLogInsert() call is made after the transaction is guaranteed to complete?
If so, wouldn't this be beyond the triggered function's reach?

Ed


Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> On Friday March 21 2003 2:51, you wrote:
>> You'd have to capture the WAL "recptr" returned to
>> RecordTransactionCommit by XLogInsert.  What you do with it afterwards,
>> I have no idea.

> Well, I'm trying to capture a 64-bit pseudo-transaction ID, just like the
> WAL record number, but to do it within a C trigger so I can queue it into
> another table and have all-or-none semantics.  Am I correct in assuming the
> XLogInsert() call is made after the transaction is guaranteed to complete?
> If so, wouldn't this be beyond the triggered function's reach?

It's certainly out of reach of anything executed within the transaction,
since by definition the commit record is only written after the
transaction is done.  It seems to me to be a contradiction in terms to
expect within-transaction actions to have any information about commit
order of their transaction.

            regards, tom lane


Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Friday March 21 2003 10:07, Tom Lane wrote:
> >
> > Well, I'm trying to capture a 64-bit pseudo-transaction ID, just like
> > the WAL record number, but to do it within a C trigger so I can queue
> > it into another table and have all-or-none semantics.  Am I correct in
> > assuming the XLogInsert() call is made after the transaction is
> > guaranteed to complete? If so, wouldn't this be beyond the triggered
> > function's reach?
>
> It's certainly out of reach of anything executed within the transaction,
> since by definition the commit record is only written after the
> transaction is done.  It seems to me to be a contradiction in terms to
> expect within-transaction actions to have any information about commit
> order of their transaction.

I see your point.  Maybe it's not possible to get perfect ordering from any
information available within a transaction?

Using the transaction ID for ordering seems problematic given the
variability of transaction lifetimes, not to mention the 32-bit issue.  I
wonder if it'd be advisable to make WAL data available in a (system?)
table, maybe mapping the transaction ID to the WAL record number?  Just
looking for some way to make the true commit order visible to a SQL query
in order to leverage existing replication code ...

Ed


Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Friday March 21 2003 11:12, Ed L. wrote:
>
> Using the transaction ID for ordering seems problematic given the
> variability of transaction lifetimes, not to mention the 32-bit issue.  I
> wonder if it'd be advisable to make WAL data available in a (system?)
> table, maybe mapping the transaction ID to the WAL record number?  Just
> looking for some way to make the true commit order visible to a SQL query
> in order to leverage existing replication code ...

This is admittedly a half-baked idea from someone with little knowledge of
the pg code, but more specifically, I'm imagining a system table that looks
like this (I know this is not how system tables are specified):

    create table pg_xlog (
        xid      int4 unique not null,
        recno    int8 unique not null
    );
    -- recno = (XLogRecPtr.xlogid << 32) + XLogRecPtr.xrecoff

This would map transaction IDs to WAL log record numbers.  It seems
straight-forward to get the right data into this table.  But of course,
this table would get a gazillion inserts, and the appropriate logic for
clearing/truncating and the potential performance and memory impacts are
unclear to me.  Still, it does have the appeal of allowing a combination of
a trigger and SQL to reliably determine the transaction order, which would
seem to allow asynchronous trigger-based replication schemes to get the
right replay order.

Any other ideas as to how to get the guaranteed correct transaction order
via triggers?

Ed


Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
>     create table pg_xlog (
>         xid      int4 unique not null,
>         recno    int8 unique not null
>     );
>     -- recno = (XLogRecPtr.xlogid << 32) + XLogRecPtr.xrecoff

> This would map transaction IDs to WAL log record numbers.  It seems
> straight-forward to get the right data into this table.

... except that once a transaction has committed, it can't go around
making more database entries.  You can only modify the DB within a
transaction.

            regards, tom lane


Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Saturday March 22 2003 8:44, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> >     create table pg_xlog (
> >         xid      int4 unique not null,
> >         recno    int8 unique not null
> >     );
> >     -- recno = (XLogRecPtr.xlogid << 32) + XLogRecPtr.xrecoff
> >
> > This would map transaction IDs to WAL log record numbers.  It seems
> > straight-forward to get the right data into this table.
>
> ... except that once a transaction has committed, it can't go around
> making more database entries.  You can only modify the DB within a
> transaction.

There may be other good reasons not to do it, but it appears it does not all
need to be done within the trigger:

1.  Within the transaction, a user trigger records the transaction ID into a
table along with tuple data via GetCurrentTransactionId() just as dbmirror
and rserv do now, essentially like

    insert into replication_queue (xid, data, ...)

2.  In xact.c, RecordTransactionCommit() inserts the (xid,recno) pair into
pg_xlog just after XLogInsert() for the commit.  It doesn't matter if it's
within reach of the user's transaction-constrained trigger code or not
because it's done by the system.

3.  An asynchronous slave replicator comes along later and matches the xid
to the replication tuples, as in

    select q.data
    from pg_xlog l, replication_queue q
    where l.xid = q.xid
    order by l.recno;

...and gets a guaranteed ordering.

I'm also wondering if my concerns are unfounded in regard to the impact of
variable transaction lifetimes on the xid.  Maybe MVCC takes care of
any concerns from variable transaction lifetimes, so that the xid, even
though it comes from the beginning of the transaction, is a reliable
ordering?

Ed


Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Saturday March 22 2003 9:15, Ed L. wrote:
> On Saturday March 22 2003 8:44, Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > >     create table pg_xlog (
> > >         xid      int4 unique not null,
> > >         recno    int8 unique not null
> > >     );
> > >     -- recno = (XLogRecPtr.xlogid << 32) + XLogRecPtr.xrecoff
> > >
> > > This would map transaction IDs to WAL log record numbers.  It seems
> > > straight-forward to get the right data into this table.
> >
> > ... except that once a transaction has committed, it can't go around
> > making more database entries.  You can only modify the DB within a
> > transaction.
>
...
>
> 2.  In xact.c, RecordTransactionCommit() inserts the (xid,recno) pair
> into pg_xlog just after XLogInsert() for the commit.  It doesn't matter
> if it's within reach of the user's transaction-constrained trigger code
> or not because it's done by the system.

Am I correct in assuming the system can modify system tables outside a
transaction?


Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> Am I correct in assuming the system can modify system tables outside a
> transaction?

No, it can't.  *Any* DB change has to be done within a transaction.

            regards, tom lane


Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Saturday March 22 2003 9:26, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > Am I correct in assuming the system can modify system tables outside a
> > transaction?
>
> No, it can't.  *Any* DB change has to be done within a transaction.

D'oh.  Horse...dead.

You didn't say, but do you think the xid is a reliable replication replay
ordering even though it comes at the start of the transaction?

Ed


Thousands INSERT going slow ...

From
Hervé Piedvache
Date:
Hi,

I'm just testing insertion of about 600 000 records into 3 tables.

Just making a big text file with 3 inserts each time (for my 3 tables), like
insert into xx (yy) values ('data'); so I have 3 x 600 000 inserts in the
file.

Table N°2 has a reference to Table N°1 via the primary key ...
It's not a transaction ... I have only a primary key on each of the 3 tables ...
The tables were empty at the beginning and the whole database was vacuum
full analyzed before the test.
Same result on PostgreSQL 7.2.3 and 7.3.2 ...

So I get about 1000 inserts done in 10 seconds at the beginning of the file,
then after one hour I get 25 seconds for 1000 inserts ... and it's going
slower and slower ...

Before making a text file ... I had done the same thing with a Perl script
using DBI to make the 3 inserts in a transaction (of the 600 000 records x 3),
and it was really slower (I stopped the script after 10 hours and 819 000
inserts done).  And same thing, it was going slower and slower ... it was
about 2000 inserts per 10 seconds for the first 10 minutes ... then after 10
hours I was at 40 inserts per 10 seconds ...

I'm using for my tests a server with dual Pentium III 1.8 GHz - 2 GB of RAM
and Linux 2.4.18 ... the memory does not move ... the swap was never used ...

Why is PostgreSQL going slower?
Do you have any idea for a better and quicker insert method?

Thanks in advance for your help :o)

regards,
--
Hervé


Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> You didn't say, but do you think the xid is a reliable replication replay
> ordering even though it comes at the start of the transaction?

Dunno.  What requirements have you really got?

            regards, tom lane


Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Saturday March 22 2003 9:55, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > You didn't say, but do you think the xid is a reliable replication
> > replay ordering even though it comes at the start of the transaction?
>
> Dunno.  What requirements have you really got?

Again, the context is asynchronous trigger-based master-slave replication a la
contrib/dbmirror.  In this context, AFAICS, the primary requirement is to
queue updates, inserts, and deletes on a master db for later SQL retrieval
and subsequent serial "replay" into a slave in the "right" order.  The
master-queued tuple changes should be groupable as transactions so that any
replication can enforce all-or-none semantics on the slave, though normally
it should never hit a snag since the master and slave are initialized as
identical copies.  The queue should be ordered the "right" way for serial
replay into a slave, whatever that "right" way is in order to maintain
consistency with the master.  I assume triggers will have to be disabled
during replay on the slave to avoid time-sensitive side-effects.
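
For the trigger-disabling part, I assume the pg_dump-style hack would do;
a sketch (mytable is hypothetical):

    -- turn triggers off on the slave table during replay
    update pg_class set reltriggers = 0 where relname = 'mytable';
    -- ... replay the batch ...
    -- then restore the real trigger count
    update pg_class set reltriggers = (
        select count(*) from pg_trigger
        where pg_trigger.tgrelid = pg_class.oid
    ) where relname = 'mytable';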

DBMirror basically already does all of this, except for disabling triggers
and my uncertainty about the ordering issue.

Ed


Re: Thousands INSERT going slow ...

From
Tom Lane
Date:
Hervé Piedvache <herve@elma.fr> writes:
> Table N°2 has a reference to Table N°1 via the primary key ...
> It's not a transaction ... I have only a primary key on each of the 3 tables ...
> The tables were empty at the beginning and the whole database was vacuum
> full analyzed before the test.
> Same result on PostgreSQL 7.2.3 and 7.3.2 ...

> So I get about 1000 inserts done in 10 seconds at the beginning of the file,
> then after one hour I get 25 seconds for 1000 inserts ... and it's going
> slower and slower ...

You probably should *not* have vacuum analyzed while the tables were
empty.  The planner generated plans for the foreign-key checks based on
the assumption that the tables are tiny --- so it's probably using
seqscans to make the checks.  The actual inserts won't slow down
materially as the tables fill, but seqscan foreign-key checks will.

You could vacuum now to update the stats, but that won't affect
the foreign-key plans that the inserting process has already cached.
You'll have to stop the inserting process and start a new one in any
case.

Another possibility is to start again with freshly created tables and no
initial vacuum.  If the planner has no stats it should default to
indexscans.

            regards, tom lane


Re: Thousands INSERT going slow ...

From
Doug McNaught
Date:
Hervé Piedvache <herve@elma.fr> writes:

> Hi,
>
> I'm just testing insertion of about 600 000 records into 3 tables.
>
> Just making a big text file with 3 inserts each time (for my 3 tables), like
> insert into xx (yy) values ('data'); so I have 3 x 600 000 inserts in the
> file.
>
> Table N°2 has a reference to Table N°1 via the primary key ...
> It's not a transaction ... I have only a primary key on each of the 3 tables ...

You're getting killed by transaction overhead.  Make it so you do 1000
or so (or even more, big transactions don't hurt) inserts in a single
transaction and you'll see much better performance.
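
That is, something like:

    begin;
    insert into xx (yy) values ('data1');
    insert into xx (yy) values ('data2');
    -- ... a thousand or so inserts ...
    commit;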

Also (under 7.3) run VACUUM during the insertion process--it may help.

You could also disable the foreign key triggers during the run if
you're sure the data is consistent.

-Doug


Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> On Saturday March 22 2003 9:55, Tom Lane wrote:
>> Dunno.  What requirements have you really got?

> Again, the context is asynchronous trigger-based master-slave replication a la
> contrib/dbmirror.

AFAICS that technique does not need to worry about transaction ordering,
as long as updates are sent in the order they are inserted into the
pending-updates table.  When two transactions update the same row, one
is going to be forced to wait till the other commits; so the waiter's
report will appear second.  For nonconflicting updates you might not
apply them in the same order they were physically made at the master,
but I can't see how it matters.

(I'm not entirely certain, but I think this works only if the reports
are made by AFTER triggers.  However, you'd have to do that anyway,
since a BEFORE trigger can't be certain it's the last BEFORE trigger.
Sending a report from a BEFORE trigger might send a report containing
a row value that's not what ultimately gets inserted, due to mods made
by later BEFORE triggers.)

However, this may just move the problem somewhere else --- how will you
be sure that you transmit entries in the update table in the correct
order?

            regards, tom lane


Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Saturday March 22 2003 11:05, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> >
> > Again, the context is asynchronous trigger-based master-slave
> > replication a la contrib/dbmirror.
>
> AFAICS that technique does not need to worry about transaction ordering,
> as long as updates are sent in the order they are inserted into the
> pending-updates table... [snip]
>
> However, this may just move the problem somewhere else --- how will you
> be sure that you transmit entries in the update table in the correct
> order?

That is my question.  The way dbmirror does it now is by including the xid
and using a sequence object, essentially like

insert into queue (xid, tuple_id, data, ...)
    values (GetCurrentTransactionId(), nextval('tuplesequence'), ...);

Actually, dbmirror does it in a couple of steps, but I think this is the
basic idea.  So the queued tuple changes are groupable (and thus
replayable) by xid and order of queueing.  Then the slave gets them in the
hopefully-correct order via

    select ...
    from queue q
    where q.xid is in some appropriate range
    order by xid asc, tuple_id asc;

Ed


Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> Actually, dbmirror does it in a couple of steps, but I think this is the
> basic idea.  So the queued tuple changes are groupable (and thus
> replayable) by xid and order of queueing.  Then the slave gets them in the
> hopefully-correct order via
>     select ...
>     from queue q
>     where q.xid is in some appropriate range
>     order by xid asc, tuple_id asc;

I think this last part is wrong.  It shouldn't be using the xid as part
of the ordering, only the sequence value.

            regards, tom lane


Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Saturday March 22 2003 11:29, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > Actually, dbmirror does it in a couple of steps, but I think this is
> > the basic idea.  So the queued tuple changes are groupable (and thus
> > replayable) by xid and order of queueing.  Then the slave gets them in
> > the hopefully-correct order via
> >     select ...
> >     from queue q
> >     where q.xid is in some appropriate range
> >     order by xid asc, tuple_id asc;
>
> I think this last part is wrong.  It shouldn't be using the xid as part
> of the ordering, only the sequence value.

Why not?  How would you replay them on the slave in the same transaction
groupings and order that occurred on the master?


Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> On Saturday March 22 2003 11:29, Tom Lane wrote:
>> I think this last part is wrong.  It shouldn't be using the xid as part
>> of the ordering, only the sequence value.

> Why not?  How would you replay them on the slave in the same transaction
> groupings and order that occurred on the master?

Why not is easy:

    begin xact 1;
    update tuple X;
    ...

                    begin xact 2;
                    update tuple Y;
                    commit;

    ...
    update tuple Y;
    commit;

(Note that this is only possible in read-committed mode, else xact 1
wouldn't be allowed to update tuple Y like this.)  Here, you must
replay xact 1's update of Y after xact 2's update of Y, else you'll
get the wrong final state on the slave.  On the other hand, it really
does not matter whether you replay the tuple X update before or after
you replay xact 2, because xact 2 didn't touch tuple X.

If the existing DBmirror code is sorting as you say, then it will fail
in this scenario --- unless it always manages to execute a propagation
step in between the commits of xacts 2 and 1, which doesn't seem very
trustworthy.

What I'm envisioning is that you should just send updates in the order
of their insertion sequence numbers and *not* try to force them into
transactional grouping.  In the above scenario, depending on when
propagation runs it might send X update, Y update, second Y update
all in one batch; or it might send Y update in one batch and then X
update and second Y update in a later batch.  Both of these are legal
and consistent orderings.  The only ordering constraint is that the
second Y update must be applied second --- but that is certain as
long as we sort the contents of a batch by insertion order.  (The
pending-update report for the second Y update cannot have been made
until after xact 2 commits, because the MVCC semantics will force
xact 1 to wait for 2 to commit before it can update Y.)

Note that all of a transaction's updates will become visible in the
pending-update table simultaneously when it commits, so (as long as
you grab batches in single SELECTs, or with a serializable transaction)
there's no problems with partial transactions being applied by a batch.
A batch will contain all the updates of some specific set of
transactions, namely all those that committed since the prior batch
for that slave.  But you can't order the updates done by a batch in xid
order, they have to be applied in insertion order.
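
Concretely, with made-up sequence numbers and a hypothetical
pending_updates(seqid, xid, data) table, the scenario above leaves:

    --   seqid | xid | data
    --   ------+-----+------------------
    --       1 |  1  | xact 1: update X
    --       2 |  2  | xact 2: update Y
    --       3 |  1  | xact 1: update Y
    select data from pending_updates order by seqid;

Applying the rows in that order gives the right final state no matter how
they are split across batches.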

            regards, tom lane


Re: 32/64-bit transaction IDs?

From
"Ed L."
Date:
On Saturday March 22 2003 12:00, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > On Saturday March 22 2003 11:29, Tom Lane wrote:
> >> I think this last part is wrong.  It shouldn't be using the xid as
> >> part of the ordering, only the sequence value.
> >
> > Why not?  How would you replay them on the slave in the same
> > transaction groupings and order that occurred on the master?
>
> Why not is easy:
>
>     begin xact 1;
>     update tuple X;
>     ...
>                     begin xact 2;
>                     update tuple Y;
>                     commit;
>     ...
>     update tuple Y;
>     commit;
>
> (Note that this is only possible in read-committed mode, else xact 1
> wouldn't be allowed to update tuple Y like this.)  Here, you must
> replay xact 1's update of Y after xact 2's update of Y, else you'll
> get the wrong final state on the slave.  On the other hand, it really
> does not matter whether you replay the tuple X update before or after
> you replay xact 2, because xact 2 didn't touch tuple X.
>
> If the existing DBmirror code is sorting as you say, then it will fail
> in this scenario --- unless it always manages to execute a propagation
> step in between the commits of xacts 2 and 1, which doesn't seem very
> trustworthy.

Well, I'm not absolutely certain, but I think this problem may indeed exist
in dbmirror.  If I'm reading it correctly, dbmirror basically has the
following:

    create table xact_queue (xid int, seqid serial, ...);
    create table tuple_queue (seqid int, data, ...);

The dbmirror trigger does this:

myXid = GetCurrentTransactionId();
insert into xact_queue (xid, seqid) values (myXid, nextval('seqid_seq'));
insert into tuple_queue (seqid, data, ...) values (currval('seqid_seq'), ...);

The slave then grabs all queued xids in order of the max seqid within each
transaction.  Essentially,

    SELECT xid, MAX(seqid)
    FROM xact_queue
    GROUP BY xid
    ORDER BY MAX(seqid);

In your scenario it would order them xact1, then xact2, since xact 1's
update of Y would have the max seqid.  For each xact, it replays the tuples
for that xact in seqid order.

    SELECT t.seqid, t.data, ...
    FROM tuple_queue t, xact_queue x
    WHERE t.seqid = x.seqid
        AND x.xid = $XID
    ORDER BY t.seqid;

So the actual replay order would be

    xact1: update X
    xact1: update Y
    xact2: update Y

leading to slave inconsistency.

> What I'm envisioning is that you should just send updates in the order
> of their insertion sequence numbers and *not* try to force them into
> transactional grouping. ...

Very good.  Makes perfect sense to me now.  That also apparently obviates
the need for 64-bit transaction IDs, since the sequence can be a BIGINT.
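E.g., a sketch:

    create table replication_queue (
        xid    integer,
        seqid  bigserial primary key,  -- 64-bit insertion order
        data   text                    -- placeholder for the trigger payload
    );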

Thanks,
Ed


Re: 32/64-bit transaction IDs?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> Well, I'm not absolutely certain, but I think this problem may indeed exist
> in dbmirror.

Sounds that way to me too.  Can anyone try the experiment?  I don't have
dbmirror set up here ...

            regards, tom lane


Re: 32/64-bit transaction IDs?

From
Andrew Sullivan
Date:
On Sat, Mar 22, 2003 at 02:00:16PM -0500, Tom Lane wrote:

[stuff about how xid is not a good guide to what committed first]

FWIW, I think this is why rserv applies current-state snapshots, and
doesn't try to replay everything.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

From
"Ed L."
Date:
On Saturday March 22 2003 12:00, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > On Saturday March 22 2003 11:29, Tom Lane wrote:
> >> I think this last part is wrong.  It shouldn't be using the xid as
> >> part of the ordering, only the sequence value.
> >
> > Why not?  How would you replay them on the slave in the same
> > transaction groupings and order that occurred on the master?
>
> Why not is easy:
>
>     begin xact 1;
>     update tuple X;
>     ...
>
>                     begin xact 2;
>                     update tuple Y;
>                     commit;
>
>     ...
>     update tuple Y;
>     commit;
>
> (Note that this is only possible in read-committed mode, else xact 1
> wouldn't be allowed to update tuple Y like this.)  Here, you must
> replay xact 1's update of Y after xact 2's update of Y, else you'll
> get the wrong final state on the slave.  On the other hand, it really
> does not matter whether you replay the tuple X update before or after
> you replay xact 2, because xact 2 didn't touch tuple X.
>
> If the existing DBmirror code is sorting as you say, then it will fail
> in this scenario --- unless it always manages to execute a propagation
> step in between the commits of xacts 2 and 1, which doesn't seem very
> trustworthy.
>
> What I'm envisioning is that you should just send updates in the order
> of their insertion sequence numbers and *not* try to force them into
> transactional grouping.  In the above scenario, depending on when
> propagation runs it might send X update, Y update, second Y update
> all in one batch; or it might send Y update in one batch and then X
> update and second Y update in a later batch.  Both of these are legal
> and consistent orderings.  The only ordering constraint is that the
> second Y update must be applied second --- but that is certain as
> long as we sort the contents of a batch by insertion order.  (The
> pending-update report for the second Y update cannot have been made
> until after xact 2 commits, because the MVCC semantics will force
> xact 1 to wait for 2 to commit before it can update Y.)
>
> Note that all of a transaction's updates will become visible in the
> pending-update table simultaneously when it commits, so (as long as
> you grab batches in single SELECTs, or with a serializable transaction)
> there's no problems with partial transactions being applied by a batch.
> A batch will contain all the updates of some specific set of
> transactions, namely all those that committed since the prior batch
> for that slave.  But you can't order the updates done by a batch in xid
> order, they have to be applied in insertion order.

If you grab everything in the queue with a single SELECT, this works.
Depending on the queue length, that's not always practical, and as noted
above, committed batches could result in partial transactions on the slave.
So the riddle is how to get a consistent but batchable replication order.

Suppose you have a replication queue, with dbmirror-like trigger-based
insertions, that looks something like this:

    create table replication_queue (
        xid    integer,
        seqid    serial primary key,
        data...
    );

Here, 'xid' is the transaction ID, 'seqid' is the queue insertion order, and
'data' has all the info needed to replicate the update.  Every row
update/delete/insert results in a row in this queue.

Suppose further you wish to limit the number of updates replicated in a
particular cycle.  For example, suppose you have a million changes queued
for replication on the master, but for obvious reasons you don't want to
select a million rows to replicate all at once.  Suppose you also don't
want to grab them one tuple or one transaction at a time, preferring to
avoid hammering the master.  Rather, you want to grab them in batches of no
more than N transactions, replicate them all to a slave and commit on the
slave, take a breather, repeating until the slave is caught up.  And during
each breather, you want to have committed only complete transactions so
that any slave clients see consistent data.

My question:  Is there an ordering algorithm that would make a consistent
but limited-batch-size replication possible?

The algorithm I'm considering right now is the following:

    select xid, max(seqid) as "max_seqid"
    into temp replication_order
    from replication_queue
    group by xid
    order by max(seqid)
    limit N;

Then, to get the actual queue replication order,

    select q.xid, q.seqid, q.data
    from replication_queue q, replication_order o
    where q.xid = o.xid
    order by o.max_seqid, q.seqid;

[This is a batched variation of dbmirror's original algorithm.]
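
Once a batch has committed on the slave, the replayed transactions could
presumably be swept from the master with something like:

    delete from replication_queue
    where xid in (select xid from replication_order);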

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> On Saturday March 22 2003 12:00, Tom Lane wrote:
>> Note that all of a transaction's updates will become visible in the
>> pending-update table simultaneously when it commits, so (as long as
>> you grab batches in single SELECTs, or with a serializable transaction)
>> there's no problems with partial transactions being applied by a batch.

> If you grab everything in the queue with a single SELECT, this works.
> Depending on the queue length, that's not always practical, and as noted
> above, committed batches could result in partial transactions on the slave.
> So the riddle is how to get a consistent but batchable replication order.

You don't have to do anything special if you pull the contents of a
batch in a single serializable transaction.  I see no reason to think
that using a serializable transaction is "hammering the master"; so
you are asking for a solution to a non-problem.

            regards, tom lane


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

From
"Ed L."
Date:
On Thursday April 10 2003 4:11, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > On Saturday March 22 2003 12:00, Tom Lane wrote:
> >> Note that all of a transaction's updates will become visible in the
> >> pending-update table simultaneously when it commits, so (as long as
> >> you grab batches in single SELECTs, or with a serializable
> >> transaction) there's no problems with partial transactions being
> >> applied by a batch.
> >
> > If you grab everything in the queue with a single SELECT, this works.
> > Depending on the queue length, that's not always practical, and as
> > noted above, committed batches could result in partial transactions on
> > the slave. So the riddle is how to get a consistent but batchable
> > replication order.
>
> You don't have to do anything special if you pull the contents of a
> batch in a single serializable transaction.  I see no reason to think
> that using a serializable transaction is "hammering the master"; so
> you are asking for a solution to a non-problem.

I don't think so.  Can you imagine a replication queue big enough that
someone might not want to process it entirely in one transaction?  I sure
can.  Consider the following sequence:

1.  Replication queueing is begun for a particular slave on the master.  A
dump is taken with which to initialize the slave.

2.  A sufficient number of updates are queued such that the total amount of
data exceeds the amount one wants to process before giving the master, or
the slave, or the network, a break.  This could easily happen in our case
if there were delays in getting the slave set up for whatever reason.

3.  The slave is finally set up for replication.  By this time, the queue is
bigger than we want to process in one round.

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

From
"Ed L."
Date:
On Thursday April 10 2003 4:44, Ed L. wrote:
> On Thursday April 10 2003 4:11, Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > > On Saturday March 22 2003 12:00, Tom Lane wrote:
> > >> Note that all of a transaction's updates will become visible in the
> > >> pending-update table simultaneously when it commits, so (as long as
> > >> you grab batches in single SELECTs, or with a serializable
> > >> transaction) there's no problems with partial transactions being
> > >> applied by a batch.
> > >
> > > If you grab everything in the queue with a single SELECT, this works.
> > > Depending on the queue length, that's not always practical, and as
> > > noted above, committed batches could result in partial transactions
> > > on the slave. So the riddle is how to get a consistent but batchable
> > > replication order.
> >
> > You don't have to do anything special if you pull the contents of a
> > batch in a single serializable transaction.  I see no reason to think
> > that using a serializable transaction is "hammering the master"; so
> > you are asking for a solution to a non-problem.
>
> I don't think so.  Can you imagine a replication queue big enough that
> someone might not want to process it entirely in one transaction?  I sure
> can.  Consider the following sequence:
>
> 1.  Replication queueing is begun for a particular slave on the master.
> A dump is taken with which to initialize the slave.
>
> 2.  A sufficient number of updates are queued such that the total amount
> of data exceeds the amount one wants to process before giving the master,
> or the slave, or the network, a break.  This could easily happen in our
> case if there were delays in getting the slave set up for whatever reason.

By "sufficient number of updates", I really mean a large enough number of
transactions.  Imagine a queue with 100,000 transactions that would
require a number of hours or days to replicate straight through.

Ed

>
> 3.  The slave is finally set up for replication.  By this time, the queue
> is bigger than we want to process in one round.
>
> Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> I don't think so.  Can you imagine a replication queue big enough that
> someone might not want to process it entirely in one transaction?

No, I can't.  The bigger the queue is, the further behind you are, and
the more you need to catch up; twiddling your thumbs for awhile gets
progressively less attractive.

Also, AFAIR from prior discussion, the *slave* side doesn't need to
commit the whole batch in one transaction.  I can't recall if this
could expose transaction intermediate states on the slave, but if
you're that far behind you'd best not be having any live clients
querying the slave anyway.

In any case you can throttle the load by sleeping between selects while
holding the transaction open.  I think your concern is predicated on
Oracle-ish assumptions about the cost of holding open a transaction.
The only such cost in PG is that it interferes with VACUUM cleaning
out old tuples --- which I'm not sure VACUUM could do anyway for stuff
that still hasn't propagated to a slave.

            regards, tom lane


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

From
"Ed L."
Date:
On Thursday April 10 2003 5:26, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > I don't think so.  Can you imagine a replication queue big enough
> > that someone might not want to process it entirely in one transaction?
>
> No, I can't.  The bigger the queue is, the further behind you are, and
> the more you need to catch up; twiddling your thumbs for awhile gets
> progressively less attractive.

Well, if an arbitrarily large queue of transactions doesn't persuade you,
you are not going to be persuaded.   If (a) the catch-up rate is fast
enough, (b) the master load periodically ebbs enough, as it does at various
predictably slow times, (c) you care about lessening traffic between master
and slave, and/or (d) you care about lessening constancy of the replication
load on the master, having the replicator periodically back-off is very
attractive.

> Also, AFAIR from prior discussion, the *slave* side doesn't need to
> commit the whole batch in one transaction.  I can't recall if this
> could expose transaction intermediate states on the slave, but if
> you're that far behind you'd best not be having any live clients
> querying the slave anyway.

Exposing intermediate transaction states is precisely the issue and the
reason for my original question.  Your apparent presumption of the lack of
value of querying a slave that's running significantly behind is a false
blanket assumption.  Of course it depends on the situation and the nature
of the data.  I can think of a number of past instances where some
considerable lagtime in the data propagation was just fine, but
inconsistency was not.  If you aren't replicating to the slave and
committing in one big all-inclusive batch, then there needs to be some care
to commit in transaction units if you don't want to offer room for
inconsistent views to slave clients.  We left the conversation a while back
thinking there was no need for anything other than a serial replay by
insertion order, but a batching requirement seems to change that, which is
why I posed the original question.

> In any case you can throttle the load by sleeping between selects while
> holding the transaction open.  I think your concern is predicated on
> Oracle-ish assumptions about the cost of holding open a transaction.
> The only such cost in PG is that it interferes with VACUUM cleaning
> out old tuples --- which I'm not sure VACUUM could do anyway for stuff
> that still hasn't propagated to a slave.

I'd have thought there might be other problems resulting from holding a
transaction open for hours, like the progress you lose if the master or
slave or the connection between goes down before the mondo commit.  Maybe
that kind of thing just never happens.

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

From
Richard Huxton
Date:
On Friday 11 Apr 2003 3:03 am, Ed L. wrote:
> On Thursday April 10 2003 5:26, Tom Lane wrote:
[snip]
> > Also, AFAIR from prior discussion, the *slave* side doesn't need to
> > commit the whole batch in one transaction.  I can't recall if this
> > could expose transaction intermediate states on the slave, but if
> > you're that far behind you'd best not be having any live clients
> > querying the slave anyway.
>
> Exposing intermediate transaction states is precisely the issue and the
> reason for my original question.  Your apparent presumption of the lack of
> value of querying a slave that's running significantly behind is a false
> blanket assumption.  Of course it depends on the situation and the nature
> of the data.  I can think of a number of past instances where some
> considerable lagtime in the data propagation was just fine, but
> inconsistency was not.  If you aren't replicating to the slave and
> committing in one big all-inclusive batch, then there needs to be some care
> to commit in transaction units if you don't want to offer room for
> inconsistent views to slave clients.

Surely the batching should be happening at the "master" end? That's the only
place with the context to mark a "determinate state". Batch things as fine as
you like at that end, and just have the slave process multiple batches if it
can/wants to.

--
  Richard Huxton


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit

From
Jan Wieck
Date:
Tom Lane wrote:
>
> "Ed L." <pgsql@bluepolka.net> writes:
> > I don't think so.  Can you imagine a replication queue big enough that
> > someone might not want to process it entirely in one transaction?
>
> No, I can't.  The bigger the queue is, the further behind you are, and
> the more you need to catch up; twiddling your thumbs for awhile gets
> progressively less attractive.

That is absolutely sure in an asynchronous multi-master situation, where
"twiddling" only leads to conflicts ... not making your situation any
easier.

But in a pure master slave situation? There I can imagine this.

What I cannot imagine is why one would want to try to make batches any
other size than the original transaction. Committing smaller "chunks" of
the master's transactions at the slave side would allow a client there to
see an inconsistent snapshot - that is bad (tm). Committing bigger
groups carries the risk that the slave runs out of resources that the
master didn't need - not any better.

> Also, AFAIR from prior discussion, the *slave* side doesn't need to
> commit the whole batch in one transaction.  I can't recall if this
> could expose transaction intermediate states on the slave, but if
> you're that far behind you'd best not be having any live clients
> querying the slave anyway.

I'm not aware of any "COMMIT BUT HIDE AND RESUME;"

There are scenarios where your online processes have clear priority over
the master to multislave replication. Think of a load balanced multisite
search engine ... does it really matter that all the sites stay as in sync
as possible? Do people expect Google to be up to date on a per-minute
basis?


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit

From
Dennis Gearon
Date:
I'm not really very cognizant of these slave/master things. I know I will have
to be someday.

But the case I heard in the conversation that seemed most appropriate to
batching was when a slave starts from a clean slate. It should not be
accessible until it has caught up, true. But catching up a blank slave
would most definitely hammer on a master, I think.

Jan Wieck wrote:
> Tom Lane wrote:
>
>>"Ed L." <pgsql@bluepolka.net> writes:
>>
>>>I don't think so.  Can you imagine a replication queue big enough that
>>>someone might not want to process it entirely in one transaction?
>>
>>No, I can't.  The bigger the queue is, the further behind you are, and
>>the more you need to catch up; twiddling your thumbs for awhile gets
>>progressively less attractive.
>
>
> That is absolutely sure in an asynchronous multi-master situation, where
> "twiddling" only leads to conflicts ... not making your situation any
> easier.
>
> But in a pure master slave situation? There I can imagine this.
>
> What I cannot imagine is why one would want to try to make batches any
> other size than the original transaction. Committing smaller "chunks" of
> the master's transactions at the slave side would allow a client there to
> see an inconsistent snapshot - that is bad (tm). Committing bigger
> groups carries the risk that the slave runs out of resources that the
> master didn't need - not any better.
>
>
>>Also, AFAIR from prior discussion, the *slave* side doesn't need to
>>commit the whole batch in one transaction.  I can't recall if this
>>could expose transaction intermediate states on the slave, but if
>>you're that far behind you'd best not be having any live clients
>>querying the slave anyway.
>
>
> I'm not aware of any "COMMIT BUT HIDE AND RESUME;"
>
> There are scenarios where your online processes have clear priority over
> the master to multislave replication. Think of a load balanced multisite
> search engine ... does it really matter that all the sites stay as in sync
> as possible? Do people expect Google to be up to date on a per-minute
> basis?
>
>
> Jan
>


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit

From
"Ed L."
Date:
On Friday April 11 2003 5:48, Jan Wieck wrote:
> Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > > I don't think so.  Can you imagine a replication queue big enough
> > > that someone might not want to process it entirely in one
> > > transaction?
> >
> > No, I can't.  The bigger the queue is, the further behind you are, and
> > the more you need to catch up; twiddling your thumbs for awhile gets
> > progressively less attractive.
>
> That is absolutely sure in an asynchronous multi-master situation, where
> "twiddling" only leads to conflicts ... not making your situation any
> easier.
>
> But in a pure master slave situation? There I can imagine this.

The context of my question is strictly master slave.

> What I cannot imagine is why one would want to try to make batches any
> other size than the original transaction. Committing smaller "chunks" of
> > the master's transactions at the slave side would allow a client there to
> see an inconsistent snapshot - that is bad (tm). Committing bigger
> > groups carries the risk that the slave runs out of resources that the
> master didn't need - not any better.

To what slave resources are you referring?

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

From
"Ed L."
Date:
On Friday April 11 2003 2:38, Richard Huxton wrote:
> Surely the batching should be happening at the "master" end? That's the
> only place with the context to mark a "determinate state". Batch things
> as fine as you like at that end, and just have the slave process multiple
> batches if it can/wants to.

Huh?  I don't understand this comment.  How to retrieve properly ordered but
limited batches of transactions and process them on the slave is the
question.

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit

From
Jan Wieck
Date:
"Ed L." wrote:
>
> On Friday April 11 2003 5:48, Jan Wieck wrote:
> > Tom Lane wrote:
> > > "Ed L." <pgsql@bluepolka.net> writes:
> > > > I don't think so.  Can you imagine a replication queue big enough
> > > > that someone might not want to process it entirely in one
> > > > transaction?
> > >
> > > No, I can't.  The bigger the queue is, the further behind you are, and
> > > the more you need to catch up; twiddling your thumbs for awhile gets
> > > progressively less attractive.
> >
> > That is absolutely sure in an asynchronous multi-master situation, where
> > "twiddling" only leads to conflicts ... not making your situation any
> > easier.
> >
> > But in a pure master slave situation? There I can imagine this.
>
> The context of my question is strictly master slave.
>
> > What I cannot imagine is why one would want to try to make batches any
> > other size than the original transaction. Committing smaller "chunks" of
> > the master's transactions at the slave side would allow a client there to
> > see an inconsistent snapshot - that is bad (tm). Committing bigger
> > groups carries the risk that the slave runs out of resources that the
> > master didn't need - not any better.
>
> To what slave resources are you referring?

Clearly a bug, but we had memory leaks that clear up at transaction end.

One of the "leaks" we still have: Constraint trigger queue.

Be sure, you don't want to find out that you have that kind of problem
by losing slave after slave because your workload got worse ;-)


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit

From
"Ed L."
Date:
On Friday April 11 2003 10:08, Jan Wieck wrote:
> Clearly a bug, but we had memory leaks that clear up at transaction end.

That seems like yet another reason for constraining the size of a batch of
transactions.

> One of the "leaks" we still have: Constraint trigger queue.

What is that about?  Or if you don't want to re-explain, what would I search
for in the archive?

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit

From
Jan Wieck
Date:
"Ed L." wrote:
>
> On Friday April 11 2003 10:08, Jan Wieck wrote:
> > Clearly a bug, but we had memory leaks that clear up at transaction end.
>
> That seems like yet another reason for constraining the size of a batch of
> transactions.

Er ... what? I said:

What I cannot imagine is why one would want to try to make batches any
other size than the original transaction.

"the original transaction" - singular!!! Not a couple, few, maybe some,
part, fraction or anything in between, above or below. Exactly ONE.

>
> > One of the "leaks" we still have: Constraint trigger queue.
>
> What is that about?  Or if you don't want to re-explain, what would I search
> for in the archive?

If you have a deferred referential integrity constraint defined (one of
the reasons why half of a transaction cannot work at all), where does
the backend remember the ctid's and other information for the triggers
to call at commit time?
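
E.g., given

    create table parent (id int primary key);
    create table child (
        pid int references parent (id)
            deferrable initially deferred
    );

every inserted child row queues a constraint trigger event that is not
fired - and not freed - until COMMIT.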


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit

From
"Ed L."
Date:
On Friday April 11 2003 1:07, Jan Wieck wrote:
> "Ed L." wrote:
> > On Friday April 11 2003 10:08, Jan Wieck wrote:
> > > Clearly a bug, but we had memory leaks that clear up at transaction
> > > end.
> >
> > That seems like yet another reason for constraining the size of a batch
> > of transactions.
>
> What I cannot imagine is why one would want to try to make batches any
> other size than the original transaction.

Again, if there are a large number of transactions queued on the master,
possibly comprising an even larger number of actual row changes, a
batch size of 1 transaction means, in this context, the master will be hit
once for every single transaction to retrieve the queued rows corresponding
to that transaction.  A batch size of, say, 1000 transactions, on the other
hand, will save 999 queries.  Granted, you may not share my concern about
the impact of that on master or network.

> "the original transaction" - singular!!! Not a couple, few, maybe some,
> part, fraction or anything in between, above or below. Exactly ONE.

Repeated exclamation points, pedantry, and all-caps shouting duly noted.

Now, suppose there are in fact memory leaks that only clear up at
transaction end.  Suppose the memory leak increases with the size of the
transaction (which nobody asserted but I assume may be true).  Suppose one
wishes to avoid a per-transaction hit on the master during replication.
Suppose one needs to process an overly large replication queue in batches.
Then, I claim, one may quite reasonably
wish to retrieve replication tuples from the master in batches containing
rows constituting more than one transaction.  All complete transactions,
yes, but more than one.

> If you have a deferred referential integrity constraint defined (one of
> the reasons why half of a transaction cannot work at all), where does
> the backend remember the ctid's and other information for the triggers
> to call at commit time?

Again, I have no interest in partial transactions, only groups of multiple
transactions.

===

For those of you who do see the validity in batching multiple transactions,
my question is restated here:

My question:  Is there an ordering algorithm that would make a consistent
but limited batchsize replication possible?  I propose one below.

Suppose you have a replication queue on the master, with dbmirror-like
trigger-based insertions, that looks something like this:

        create table replication_queue (
                xid     integer,
                seqid   serial primary key,
                data...
        );

Here, 'xid' is the transaction ID, 'seqid' is the queue insertion order, and
'data' has all the info needed to replicate the update.  Every row
update/delete/insert results in a row in this queue, and a transaction may
consist of one to many rows in the queue.
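
For concreteness, the per-table triggers would be along these lines (a
simplified plpgsql sketch using present-day syntax, not dbmirror's actual C
trigger; txid_current() stands in for the internal transaction ID):

        create function replication_log() returns trigger as $$
        begin
                if tg_op = 'DELETE' then
                        -- one queue row per modified row, tagged with the xid
                        insert into replication_queue (xid, data)
                        values (txid_current()::integer, old::text);
                        return old;
                end if;
                insert into replication_queue (xid, data)
                values (txid_current()::integer, new::text);
                return new;
        end;
        $$ language plpgsql;

        create trigger t1_replication
                after insert or update or delete on t1
                for each row execute procedure replication_log();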

Suppose further you wish to limit the number of updates replicated in a
particular cycle.  For example, suppose you have a million changes queued
for replication on the master, but for obvious reasons you don't want to
select a million rows to replicate all at once.  Suppose you also don't
want to grab them one tuple or one transaction at a time, preferring to
avoid hammering the master.  Rather, you want to grab them in batches of no
more than N transactions, replicate them all to a slave and commit on the
slave, take a breather, repeating until the slave is caught up.  And during
each breather, you want to have committed only complete transactions so
that any slave clients see consistent data.

The algorithm I'm considering right now is the following:

        select xid, max(seqid) as "max_seqid"
        into temp replication_order
        from replication_queue
        group by xid
        order by max(seqid)
        limit N;

Then, to get the actual queue replication order,

        select q.xid, q.seqid, q.data
        from replication_queue q, replication_order o
        where q.xid = o.xid
        order by o.max_seqid, q.seqid;

[This is a batched variation of dbmirror's original algorithm.]

So, replication is done by transaction groupings, in ascending order
according to the maximum seqid in each transaction.  I'm hoping someone can
poke holes in this algorithm if they exist.
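
To make the grouping concrete, suppose the queue holds these four rows
(hypothetical data):

        xid  | seqid
        -----+------
         900 |    1
         902 |    2
         900 |    3
         901 |    4

With N = 2, replication_order comes back as (902, 2) then (900, 3), so the
batch replays seqid 2 (all of xid 902), then seqids 1 and 3 (all of xid
900); xid 901 waits for the next batch.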

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
Jan Wieck
Дата:
"Ed L." wrote:
>
> On Friday April 11 2003 1:07, Jan Wieck wrote:
>
> > "the original transaction" - singular!!! Not a couple, few, maybe some,
> > part, fraction or anything in between, above or below. Exactly ONE.
>
> Repeated exclamation points, pedantry, and all-caps shouting duly noted.

That's not nice, it's so seldom that I have a chance to do that and you
won't even let me. ;-(

I apologize. Wasn't meant that way.

> Again, I have no interest in partial transactions, only groups of multiple
> transactions.

Okay, you want to lower the impact on the master by slurping the effects
of multiple transactions (across all tables they hit) in one
transaction.

Now you can do two basic things.

A) You apply those changes in the order you read them out of the master
on the slave. This requires that you do it all in one big transaction on
the slave side and that you can set all foreign key constraints to
deferred, getting you into the deferred trigger queue overflow risk
mentioned.

B) You read all the changes across all tables, but regroup them into
their correct order and original transaction boundaries for playback on
the slaves. Now you need some staging area where you collect and sort it
all, running the risk of running out of disk space before you even begin the
first transaction on the slave side.

B2) You read all the changes across all tables simultaneously via
cursors. Worst case you need as many cursors as you have tables and it's
a bit complicated to keep track of all the group changes all over the
place, but you can rebuild the transaction groups on the fly. Certainly
doable, but I wouldn't want to maintain that software.


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
elein
Дата:
If the items going to the slave are ordered and complete transactions,
does it matter whether or not they are applied as complete transactions
on the slave in groups or one at a time?

It is clear that incomplete or partial transactions should not be queued.

What I'm not sure about is the overlapping transactions where the
second completes before the first.

begin (xact 1)
    select ... for update ...
begin (xact 2)
    update ...
end (xact 2)

end (xact 1)

In this case, on the slave, xact2 will be committed before xact 1 is begun.
This seems OK to me, but I'd like some confirmation to understand this.
Or is this the wrong question?

Thanks,
elein@varlena.com

On Thursday 10 April 2003 19:03, Ed L. wrote:
> On Thursday April 10 2003 5:26, Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > > I don't think so.  Can you imagine a replication queue big enough that
> > > someone might not want to process it entirely in one transaction?
> >
> > No, I can't.  The bigger the queue is, the further behind you are, and
> > the more you need to catch up; twiddling your thumbs for awhile gets
> > progressively less attractive.
>
> Well, if an arbitrarily large queue of transactions doesn't persuade you,
> you are not going to be persuaded.   If (a) the catch-up rate is fast
> enough, (b) the master load periodically ebbs enough, as it does at various
> predictably slow times, (c) you care about lessening traffic between master
> and slave, and/or (d) you care about lessening constancy of the replication
> load on the master, having the replicator periodically back off is very
> attractive.
>
> > Also, AFAIR from prior discussion, the *slave* side doesn't need to
> > commit the whole batch in one transaction.  I can't recall if this
> > could expose transaction intermediate states on the slave, but if
> > you're that far behind you'd best not be having any live clients
> > querying the slave anyway.
>
> Exposing intermediate transaction states is precisely the issue and the
> reason for my original question.  Your apparent presumption of the lack of
> value of querying a slave that's running significantly behind is a false
> blanket assumption.  Of course it depends on the situation and the nature
> of the data.  I can think of a number of past instances where some
> considerable lagtime in the data propagation was just fine, but
> inconsistency was not.  If you aren't replicating to the slave and
> committing in one big all-inclusive batch, then there needs to be some care
> to commit in transaction units if you don't want to offer room for
> inconsistent views to slave clients.  We left the conversation a while back
> thinking there was no need for anything other than a serial replay by
> insertion order, but a batching requirement seems to change that, which is
> why I posed the original question.
>
> > In any case you can throttle the load by sleeping between selects while
> > holding the transaction open.  I think your concern is predicated on
> > Oracle-ish assumptions about the cost of holding open a transaction.
> > The only such cost in PG is that it interferes with VACUUM cleaning
> > out old tuples --- which I'm not sure VACUUM could do anyway for stuff
> > that still hasn't propagated to a slave.
>
> I'd have thought there might be other problems resulting from holding a
> transaction open for hours, like the progress you lose if the master or
> slave or the connection between them goes down before the mondo commit.  Maybe
> that kind of thing just never happens.
>
> Ed
>
>

--
----------------------------------------------------------------------------------------
elein@varlena.com     Database Consulting     www.varlena.com
              I have always depended on the [QA] of strangers.


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
"Ed L."
Дата:
On Friday April 11 2003 4:34, Jan Wieck wrote:
> >
> > Repeated exclamation points, pedantry, and all-caps shouting duly
> > noted.
>
> I apologize. Wasn't meant that way.

Apology accepted, thanks.

> > Again, I have no interest in partial transactions, only groups of
> > multiple transactions.
>
> Okay, you want to lower the impact on the master by slurping the effects
> of multiple transactions (across all tables they hit) in one
> transaction.

Your phrase "across all the tables they hit" makes me wonder if you're
talking about dbmirror-style replication.  Replication, in this modified
dbmirror context, is essentially reading one table on the master, and
replaying it on the slave, not counting a couple of bookkeeping tables.

During normal master activity, when table T1 has a row updated, a trigger
inserts a new row into the queue table Q with enough information for a
replicator to update the same row in table T1 on the slave.  When table T2
gets an insert or delete, the same queue table Q again gets a new row via
the trigger.  Then, during master-to-slave replication, the replicator
comes along and reads rows from Q, not all tables.  The replicator does not
read any user tables besides Q and a bookkeeping table or two.

> Now you can do two basic things.
>
> A) You apply those changes in the order you read them out of the master
> on the slave. This requires that you do it all in one big transaction on
> the slave side and that you can set all foreign key constraints to
> deferred, getting you into the deferred trigger queue overflow risk
> mentioned.

Well, this leads back to my original question.  If it is possible to order
them in transaction groupings, then intermediate commits are possible and
one seems less likely to slam into the deferred trigger queue issue.  The
ordering under examination would allow you to replay them in transaction
order, committing every M transactions with consistency intact, and even
resetting the connection if you wish.  But is it a correct ordering?  I
think so, but I dunno.

> B) You read all the changes across all tables, but regroup them into
> their correct order and original transaction boundaries for playback on
> the slaves. Now you need some staging area where you collect and sort it
> all running the risk to run out of disk space before you even begin the
> first transaction on the slave side.

If the ordering I seek exists via the algorithm I've described, I don't
think you need to stage the entire queue at once.  The overhead I'd incur
is, I think, about 56 bytes per transaction at minimum (int xid, bigint
max_seqid, int slaveid, plus 40 bytes of Pg row overhead), not counting my
own audit/debug columns.  If I add my 40 bytes of debug/audit column bloat,
that's 96 bytes per row in the ordering table.  Since we only need N
transactions, the staging area only needs N rows of at most 96 bytes each.
Even in the bloated-row case, a batch size of 100,000 transactions would
require only about 10MB, and I'd expect to use a much smaller batch size.
And there's no need to keep the batch ordering around once it has been
replicated.

ISTM, the key is whether or not one can group and replay according to
transaction groups, with the groups ordered by max seqid within each
transaction.

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
"Ed L."
Дата:
On Friday April 11 2003 4:44, elein wrote:
> If the items going to the slave are ordered and complete transactions,
> does it matter whether or not they are applied as complete transactions
> on the slave in groups or one at a time?

I don't think it really matters as long as the order is correct and the
commits occur on the transactional boundaries.

> What I'm not sure about is the overlapping transactions where the
> second completes before the first.
>
> begin (xact 1)
>     select ... for update ...
> begin (xact 2)
>     update ...
> end (xact 2)
>
> end (xact 1)
>
> In this case, on the slave, xact2 will be commited before xact 1 is
> begun. This seems OK to me, but I'd like some confirmation to understand
> this. Or is this the wrong question?

IMO, this is exactly what makes the analysis a little tricky for people like
me who don't regularly live in the world of transaction minutiae.

If xact2 tries to update a row already updated by xact1, xact2 will block
until xact1 commits or rolls back, and cannot end before xact1 due to MVCC.
If xact1 tries to update a row already updated by xact2, xact1 will block
until xact2 is resolved.

My question is, given all the ways in which these orderings can occur, which
I think simplify to about 3 in the case of 2 transactions updating the
same row, is there a fool-proof ordering for replaying in batches?
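
For instance (a hypothetical timeline; table and values made up):

        -- xact1: begin; update t set c = 1 where id = 7;  -- queues seqid 10
        -- xact2: begin; update t set c = 2 where id = 7;  -- blocks on xact1's row lock
        -- xact1: commit;                                  -- xact2 unblocks, queues seqid 11
        -- xact2: commit;
        --
        -- Here seqid order and max(seqid) order both agree with commit order.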

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
"Ed L."
Дата:
On Friday April 11 2003 5:29, Ed L. wrote:
> queue issue.  The ordering under examination would allow you to replay
> them in transaction order, committing every M transactions with

A more accurate phrase:  "The ordering under examination would allow you to
replay them in transaction *groupings*, ordered by insertion order within
each transaction, committing every M transactions..."

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
Stephan Szabo
Дата:
On Fri, 11 Apr 2003, Ed L. wrote:

> For those of you who do see the validity in batching multiple transactions,
> my question is restated here:
>
> My question:  Is there an ordering algorithm that would make a consistent
> but limited batchsize replication possible?  I propose one below.
>
> Suppose you have a replication queue on the master, with dbmirror-like
> trigger-based insertions, that looks something like this:
>
>         create table replication_queue (
>                 xid     integer,
>                 seqid   serial primary key,
>                 data...
>         );
>
> Here, 'xid' is the transaction ID, 'seqid' is the queue insertion order, and
> 'data' has all the info needed to replicate the update.  Every row
> update/delete/insert results in a row in this queue, and a transaction may
> consist of one to many rows in the queue.

> The algorithm I'm considering right now is the following:
>
>         select xid, max(seqid) as "max_seqid"
>         into temp replication_order
>         from replication_queue
>         group by xid
>         order by max(seqid)
>         limit N;
>
> Then, to get the actual queue replication order,
>
>         select q.xid, q.seqid, q.data
>         from replication_queue q, replication_order o
>         where q.xid = o.xid
>         order by o.max_seqid, q.seqid;
>
> [This is a batched variation of dbmirror's original algorithm.]
>
> So, replication is done by transaction groupings, in ascending order
> according to the maximum seqid in each transaction.  I'm hoping someone can
> poke holes in this algorithm if they exist.

Does it matter if transactions that do not affect each other are committed
on the slave in a different order than they were on the master?  I don't
think that's guaranteed by the above (unless the inserts into
replication_queue were deferred to transaction end), but I'm also
uncertain if it's a constraint you're concerned with.


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
"Ed L."
Дата:
On Friday April 11 2003 6:08, Stephan Szabo wrote:
> On Fri, 11 Apr 2003, Ed L. wrote:
> > For those of you who do see the validity in batching multiple
> > transactions, my question is restated here:
> >
> > My question:  Is there an ordering algorithm that would make a
> > consistent but limited batchsize replication possible?  I propose one
> > below.
> >
> > Suppose you have a replication queue on the master, with dbmirror-like
> > trigger-based insertions, that looks something like this:
> >
> >         create table replication_queue (
> >                 xid     integer,
> >                 seqid   serial primary key,
> >                 data...
> >         );
> >
> > Here, 'xid' is the transaction ID, 'seqid' is the queue insertion
> > order, and 'data' has all the info needed to replicate the update.
> > Every row update/delete/insert results in a row in this queue, and a
> > transaction may consist of one to many rows in the queue.
> >
> > The algorithm I'm considering right now is the following:
> >
> >         select xid, max(seqid) as "max_seqid"
> >         into temp replication_order
> >         from replication_queue
> >         group by xid
> >         order by max(seqid)
> >         limit N;
> >
> > Then, to get the actual queue replication order,
> >
> >         select q.xid, q.seqid, q.data
> >         from replication_queue q, replication_order o
> >         where q.xid = o.xid
> >         order by o.max_seqid, q.seqid;
> >
> > [This is a batched variation of dbmirror's original algorithm.]
> >
> > So, replication is done by transaction groupings, in ascending order
> > according to the maximum seqid in each transaction.  I'm hoping someone
> > can poke holes in this algorithm if they exist.
>
> Does it matter if transactions that do not affect each other are
> committed on the slave in a different order than they were on the master?
>  I don't think that's guaranteed by the above (unless the inserts into
> replication_queue were deferred to transaction end), but I'm also
> uncertain if it's a constraint you're concerned with.

I appreciate your pointing that out.  It is pretty undesirable for data to
appear on the slave in an order different from the one in which it appears
on the master.  I guess that's another downside to batching.  I'm not sure
this approach can do any better than approximating the order since there is
no knowledge of the commit order.

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
Jan Wieck
Дата:
"Ed L." wrote:
> If the ordering I seek exists via the algorithm I've described, I don't

The right ordering is commit order, because that is the order which, played
back serially (without possible lock conflicts), guarantees that the same
tuples are hit.

Unfortunately I know of no sure way to tell the commit order. The
max(seqid) per transaction in ascending order would be a good
approximation at least.
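
For example (a hypothetical timeline):

        -- T1: update ...    -- queues seqid 10
        -- T2: update ...    -- queues seqid 11 (different row, no conflict)
        -- T2: commit;       -- T2 commits first
        -- T1: (idle for a while)
        -- T1: commit;       -- T1 commits second
        --
        -- max(seqid) orders T1 (10) before T2 (11), the reverse of commit order.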


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
"Jim C. Nasby"
Дата:
On Fri, Apr 11, 2003 at 07:32:15PM -0600, Ed L. wrote:
> I appreciate your pointing that out.  It is pretty undesirable for data to
> appear on the slave in an order different from the one in which it appears
> on the master.  I guess that's another downside to batching.  I'm not sure
> this approach can do any better than approximating the order since there is
> no knowledge of the commit order.

I know this will probably require more work than you'd like, but it
seems like it might be very useful/important for the replication queue
to have definitive information about when commits occur.

BTW, I don't know how this would apply to pgsql, but both DB2 and Oracle
handle replication via the transaction logs. AFAICT they don't keep
separate replication tables or anything; they just ship whole
transaction logs off to the slaves (a bit of a simplification, but my
point is that all the data the slaves get is in the form of the
transaction logs that are normally kept by the master anyway).
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
"Jim C. Nasby"
Дата:
On Fri, Apr 11, 2003 at 02:18:41PM -0600, Ed L. wrote:
> Suppose further you wish to limit the number of updates replicated in a
> particular cycle.  For example, suppose you have a million changes queued
> for replication on the master, but for obvious reasons you don't want to
> select a million rows to replicate all at once.  Suppose you also don't
> want to grab them one tuple or one transaction at a time, preferring to
> avoid hammering the master.  Rather, you want to grab them in batches of no
> more than N transactions, replicate them all to a slave and commit on the
> slave, take a breather, repeating until the slave is caught up.  And during
> each breather, you want to have committed only complete transactions so
> that any slave clients see consistent data.

I know I'm a bit ignorant of how dbmirror works, but why do the select
from the master and the actual commits on the slave(s) have to occupy a
single transaction? I would expect that there was some kind of
book-keeping tables on both the master and the slaves to keep track of
how far along mirroring/replication was; the master would need to know
what it could remove out of the replication table (table Q as it's
called elsewhere in this thread), and the slave needs to know where it
left off. If this is the case, isn't it acceptable to select a chunk of
data out of the queue table on the master, and run through it on the
slave, with the slave committing as it pleases? Of course every time the
slave commits it will have to update the book-keeping tables, but that
seems to be a given... or is this exactly what you're worried about
hammering the master with?
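
Something like the following is what I have in mind (purely hypothetical
names and shapes):

        -- on the master: per-slave high-water mark, so old queue rows can be purged
        create table replicated_to (
                slaveid         integer primary key,
                last_seqid      bigint not null default 0
        );

        -- on each slave: where that slave's replay left off
        create table replay_status (
                last_seqid      bigint not null
        );
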
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
"Ed L."
Дата:
On Saturday April 12 2003 9:54, Jim C. Nasby wrote:
> On Fri, Apr 11, 2003 at 07:32:15PM -0600, Ed L. wrote:
> > I appreciate your pointing that out.  It is pretty undesirable for data
> > to appear on the slave in an order different from the one in which it
> > appears on the master.  I guess that's another downside to batching.
> > I'm not sure this approach can do any better than approximating the
> > order since there is no knowledge of the commit order.
>
> I know this will probably require more work than you'd like, but it
> seems like it might be very useful/important for the replication queue
> to have definitive information about when commits occur.

Yes, it's just hard or impossible to get commit data.  There is work in
progress on log-based replication.

> BTW, I don't know how this would apply to pgsql, but both DB2 and Oracle
> handle replication via the transaction logs. AFAICT they don't keep
> seperate replication tables or anything; they just ship whole
> transaction logs off to the slaves (a bit of a simplification, but my
> point is that all the data the slaves get is in the form of the
> transaction logs that are normally kept by the master anyway).

Yes, it sounds like such an approach might be more robust in some ways.  It's
just not ready, and it may be a release or two before it shows up, last I
heard.  Dbmirror, in my view, is reasonably close, free and open, and ready
to go until something better comes along.

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
"Ed L."
Дата:
On Saturday April 12 2003 9:55, Jim C. Nasby wrote:
> On Fri, Apr 11, 2003 at 02:18:41PM -0600, Ed L. wrote:
> > Suppose further you wish to limit the number of updates replicated in a
> > particular cycle.  For example, suppose you have a million changes
> > queued for replication on the master, but for obvious reasons you don't
> > want to select a million rows to replicate all at once.  Suppose you
> > also don't want to grab them one tuple or one transaction at a time,
> > preferring to avoid hammering the master.  Rather, you want to grab
> > them in batches of no more than N transactions, replicate them all to a
> > slave and commit on the slave, take a breather, repeating until the
> > slave is caught up.  And during each breather, you want to have
> > committed only complete transactions so that any slave clients see
> > consistent data.
>
> I know I'm a bit ignorant of how dbmirror works, but why do the select
> from the master and the actual commits on the slave(s) have to occupy a
> single transaction? I would expect that there was some kind of
> book-keeping tables on both the master and the slaves to keep track of
> how far along mirroring/replication was; the master would need to know
> what it could remove out of the replication table (table Q as it's
> called elsewhere in this thread), and the slave needs to know where it
> left off. If this is the case, isn't it acceptable to select a chunk of
> data out of the queue table on the master, and run through it on the
> slave, with the slave committing as it pleases? Of course every time the
> slave commits it will have to update the book-keeping tables, but that
> seems to be a given... or is this exactly what you're worried about
> hammering the master with?

You are correct, IMO.  The select on the master and the commits on the slave
do not and cannot occupy a single transaction (since they're on different
dbs).  They may be independent, AFAIC, as long as the ordering is "close enough".
Slave may commit every 5 transactions while replaying a batch of 1000
transactions.  If there is slave and master bookkeeping, that seems to work
ok.

My interest in batching is that if you have many, many transactions in the
queue, it's a bit of hammering and traffic to go back and update the
bookkeeping tables and fetch the next transaction if you do it for every
transaction.  So mostly, it's just seeing whether I can minimize the load on
the master and the traffic on the net.
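
In sketch form, with M = 5 and a hypothetical replay_status bookkeeping
table on the slave:

        begin;
        -- ... replay every queue row for the next 5 xids, in batch order ...
        update replay_status set last_seqid = 1234;  -- hypothetical high-water mark
        commit;  -- a transaction boundary: slave clients see only consistent data
        -- take a breather, then repeat with the next 5 xids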

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
Martijn van Oosterhout
Дата:
On Fri, Apr 11, 2003 at 05:33:00PM -0600, Ed L. wrote:
> On Friday April 11 2003 4:44, elein wrote:
> > In this case, on the slave, xact2 will be committed before xact 1 is
> > begun. This seems OK to me, but I'd like some confirmation to understand
> > this. Or is this the wrong question?
>
> IMO, this is exactly what makes the analysis a little tricky for people like
> me who don't regularly live in the world of transaction minutiae.
>
> If xact2 tries to update a row already updated by xact1, xact2 will block
> until xact1 commits or rolls back, and cannot end before xact1 due to MVCC.
> If xact1 tries to update a row already updated by xact2, xact1 will block
> until xact2 is resolved.
>
> My question is, given all the ways in which these orderings can occur, which
> I think simplify to about 3 in the case of 2 transactions updating the
> same row, is there a fool-proof ordering for replaying in batches?

Well, it seems to me that you're trying to simulate the results of READ
COMMITTED mode in SERIALISABLE mode, which would seem to be impossible
in the general case.

You'd have to find some kind of save points where there are no outstanding
write transactions and use those to divide the batches.
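
A rough sketch of detecting such a point, assuming a pg_stat_activity view
that exposes each backend's write xid (an assumption on my part, not
something the tools in this thread rely on):

        select count(*) = 0 as quiesced
        from pg_stat_activity
        where backend_xid is not null;  -- no outstanding write transactions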

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> "the West won the world not by the superiority of its ideas or values or
> religion but rather by its superiority in applying organized violence.
> Westerners often forget this fact, non-Westerners never do."
>   - Samuel P. Huntington


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
Steven Singer
Дата:
On Fri, 11 Apr 2003, Ed L. wrote:

>
> Again, if there are a large number of transactions queued on the master,
> possibly comprising an even larger number of actual row changes, a batch
> size of 1 transaction means, in this context, that the master will be hit
> once for every single transaction to retrieve the queued rows corresponding
> to that transaction.  A batch size of, say, 1000 transactions, on the other
> hand, will save 999 queries.  Granted, you may not share my concern about
> the impact of that on the master or the network.

How you query things on the master and when you commit things on the slave
aren't all that related.  As far as I can tell, your batch version of the
query using the temporary table will give you the same ordering of the
updates as the current dbmirror implementation.  As you're reading the
result of the second query (the one that supplies the data for the SQL
statements to send to the slave), you can decide when to do a commit on the
slave.  Reading the discussion, it sounds like you're talking about sending
this batch of 1000 transactions to the slave in 1 transaction.  Why not
issue a commit on the slave every time the xid changes?  From the slave's
point of view the behaviour is the same as in the current dbmirror (the
original transactions are preserved, with an approximation of the ordering
based on the last update/insert in each transaction).

I still don't see why you would want to do only 1 commit per group on the
slave, though.


>
> The algorithm I'm considering right now is the following:
>
>         select xid, max(seqid) as "max_seqid"
>         into temp replication_order
>         from replication_queue
>         group by xid
>         order by max(seqid)
>         limit N;
>
> Then, to get the actual queue replication order,
>
>         select q.xid, q.seqid, q.data
>         from replication_queue q, replication_order o
>         where q.xid = o.xid
>         order by o.max_seqid, q.seqid;
>
> [This is a batched variation of dbmirror's original algorithm.]
>
> So, replication is done by transaction groupings, in ascending order
> according to the maximum seqid in each transaction.  I'm hoping someone can
> poke holes in this algorithm if they exist.
>
> Ed
>

--
Steven Singer                                       ssinger@navtechinc.com
Dispatch Systems                            Phone:  519-747-1170 ext 282
Navtech Systems Support Inc.                AFTN:   CYYZXNSX SITA: YYZNSCR
Waterloo, Ontario                           ARINC:  YKFNSCR


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
Steven Singer
Дата:
>
> Does it matter if transactions that do not affect each other are committed
> on the slave in a different order than they were on the master?  I don't
> think that's guaranteed by the above (unless the inserts into
> replication_queue were deferred to transaction end), but I'm also
> uncertain if it's a constraint you're concerned with.


Well, it all depends on what you mean by "do not affect each other".  Do
2 transactions not affect each other if they touch different rows or tables,
or if they deal with data that is unrelated from the point of view of an
application querying the database?

Consider the following transactions on the master:
T1: Insert DeptId=1 INTO Department
T2: Insert EmpId=1,DeptId=1 INTO Employee

Those 2 transactions update different tables, but if T2 is sent to slaves
before T1 then A) you might have foreign key violations, and B) even if you
don't have foreign keys, someone querying the slave will see a database
that never existed (at any point in time) on the master.

But if you were to ensure that T1 and T2 are sent to the slave in the same
transaction (and you deferred any constraint checks to the end of your
combined transaction), you wouldn't have this problem; but if the two
transactions were to update the same row, then the order would be
important and you would still need your code to deal with that case.


--
Steven Singer                                       ssinger@navtechinc.com
Dispatch Systems                            Phone:  519-747-1170 ext 282
Navtech Systems Support Inc.                AFTN:   CYYZXNSX SITA: YYZNSCR
Waterloo, Ontario                           ARINC:  YKFNSCR


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
Andrew Sullivan
Дата:
On Fri, Apr 11, 2003 at 06:34:53PM -0400, Jan Wieck wrote:
>
> A) You apply those changes in the order you read them out of the master
> on the slave. This requires that you do it all in one big transaction on

> B) You read all the changes across all tables, but regroup them into
> their correct order and original transaction boundaries for playback on

> B2) You read all the changes across all tables simultaneously via
> cursors. Worst case you need as many cursors as you have tables and it's

What I am confused about is why one needs to apply now-superseded
transactions on the slave at all.  Don't you just want a
(serializable, mind) snapshot of the master to be applied to the
slave?

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
"Ed L."
Дата:
On Monday April 14 2003 12:20, Andrew Sullivan wrote:
> On Fri, Apr 11, 2003 at 06:34:53PM -0400, Jan Wieck wrote:
> > A) You apply those changes in the order you read them out of the master
> > on the slave. This requires that you do it all in one big transaction
> > on
> >
> > B) You read all the changes across all tables, but regroup them into
> > their correct order and original transaction boundaries for playback on
> >
> > B2) You read all the changes across all tables simultaneously via
> > cursors. Worst case you need as many cursors as you have tables and
> > it's
>
> What I am confused about is why one needs to apply now-superseded
> transactions on the slave at all.  Don't you just want a
> (serializable, mind) snapshot of the master to be applied to the
> slave?

I'd say, yes, if the process of creating such a snapshot is not overly
intensive or lengthy.  IMO, this is one potentially significant drawback of
the dbmirror approach in general.  The upside to dbmirror is that it's
pretty straightforward, works pretty well for certain situations, it's
open source, and it's free.

I know rserv/erServer are reported to use the snapshot approach.  But rserv
didn't work at all for me without mods and looks very much like an
abandoned prototype for eRServer.  eRServer, its successor, was done in part
by Vadim M., who, if I'm not mistaken, did an excellent job with MVCC.  But,
at least as of Feb 28, 2003, eRServer was $10,000 minimum, closed source,
doesn't replicate DDL either, doesn't release trial versions, and has no
plans to support Redhat 8.0.  For us, that was more than enough incentive
to investigate the alternatives.

I think that'd be a great improvement for dbmirror, along with DDL
replication.

Opinions on a better *currently available* option?

Ed


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
Dennis Gearon
Дата:
I saw somewhere on this thread that there is no way to tell when
something is committed.  That really doesn't make sense, since postgres
is an MVCC system.  Doesn't the versioning automatically supply commit
times?

"Ed L." wrote:
>
> On Monday April 14 2003 12:20, Andrew Sullivan wrote:
> > On Fri, Apr 11, 2003 at 06:34:53PM -0400, Jan Wieck wrote:
> > > A) You apply those changes in the order you read them out of the master
> > > on the slave. This requires that you do it all in one big transaction
> > > on
> > >
> > > B) You read all the changes across all tables, but regroup them into
> > > their correct order and original transaction boundaries for playback on
> > >
> > > B2) You read all the changes across all tables simultaneously via
> > > cursors. Worst case you need as many cursors as you have tables and
> > > it's
> >
> > What I am confused about is why one needs to apply now-superseded
> > transactions on the slave at all.  Don't you just want a
> > (serializable, mind) snapshot of the master to be applied to the
> > slave?
>
> I'd say, yes, if the process of creating such a snapshot is not overly
> intensive or lengthy.  IMO, this is one potentially significant drawback of
> the dbmirror approach in general.  The upside to dbmirror is that it's
> pretty straightforward, works pretty well for certain situations, it's
> open source, and it's free.
>
> I know rserv/erServer are reported to use the snapshot approach.  But rserv
> didn't work at all for me without mods and looks very much like an
> abandoned prototype for eRServer.  ERServer, its successor, is done in part
> by Vadim M. who if I'm not mistaken did an excellent job with MVCC.  But,
> at least as of Feb 28, 2003, eRServer was $10,000 minimum, closed source,
> doesn't replicate DDL either, doesn't release trial versions, and has no
> plans to support Redhat 8.0.  For us, that was more than enough incentive
> to investigate the alternatives.
>
> I think that'd be a great improvement for dbmirror, along with DDL
> replication.
>
> Opinions on a better *currently available* option?
>
> Ed
>


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
Andrew Sullivan
Дата:
On Mon, Apr 14, 2003 at 01:01:37PM -0600, Ed L. wrote:

> Opinions on a better *currently available* option?

Sorry, no.  (And I'm afraid I can't help suggest what you could do to
help the dbmirror code towards the snapshot approach, either, since I
participated in the design of the current eRServer code.  :-(  It was
my understanding at the time that it would eventually become free
software, but I am not sure what the timetable is, and I don't own
the code.)

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)

От
"Ed L."
Дата:
On Monday April 14 2003 1:16, Dennis Gearon wrote:
> I saw somewhere on this thread that there is no way to tell when
> something is committed. that really doesn't make sense, since postgres
> is a MVCC system. Doesn't the versioning automatically supply commit
> times?

I believe the context of those comments was regarding what information was
(not) available from within a transaction.  Of course, commit ordering is
available in the WAL logs.

Ed