Discussion: 32/64-bit transaction IDs?
I need a guaranteed ordering of transaction IDs as they are queued for asynchronous replication. I'd use the 32-bit internal transaction ID from GetTransactionId(), but 32 bits will not suffice for the logic I'd like to use. A 64-bit ID would work well enough. Further, I am uncertain those 32-bit transaction IDs represent a correctly ordered sequence of transactions. Do the internal transaction IDs provide the correct transaction order? Are 64-bit transaction IDs available? If not, what other tricks are there to get guaranteed ordering? Ed
"Ed L." <pgsql@bluepolka.net> writes: > I need a guaranteed ordering of transaction IDs as they are queued for > asynchronous replication. I'd use the 32-bit internal transaction ID from > GetTransactionId(), but 32 bits will not suffice for the logic I'd like to > use. A 64-bit ID would work well enough. Uh ... your replication logic cannot ensure to reach closure within two billion transactions? Something's pretty out of joint there. > Further, I am uncertain those 32-bit transaction IDs represent a correctly > ordered sequence of transactions. XIDs are issued in transaction start order, not transaction commit order. Dunno if that's what's concerning you. regards, tom lane
On Thursday March 20 2003 11:12, Tom Lane wrote: > "Ed L." <pgsql@bluepolka.net> writes: > > I need a guaranteed ordering of transaction IDs as they are queued for > > asynchronous replication. I'd use the 32-bit internal transaction ID > > from GetTransactionId(), but 32 bits will not suffice for the logic I'd > > like to use. A 64-bit ID would work well enough. > > Uh ... your replication logic cannot ensure to reach closure within two > billion transactions? Something's pretty out of joint there. Uh...er...well... It just simplifies and robustifies a little logic if I can do a clean-sweep delete of replicated transactions based on the xid being less than some marker value. Something like DELETE FROM replication_queue WHERE xid < (SELECT MIN(last_mirrored_xid) FROM slave); where slave is a relation of slaves to keep track of where they are in the replication process. When transaction IDs wrap around, that logic breaks down for the new xids of 1, 2, 3, ..., and more accounting is required. In our case, under considerable load, it probably wouldn't wrap for a year or three, but it'd be nice to avoid that little timebomb if it's easy. > > Further, I am uncertain those 32-bit transaction IDs represent a > > correctly ordered sequence of transactions. > > XIDs are issued in transaction start order, not transaction commit > order. Dunno if that's what's concerning you. Yes, I'm interested in any reliable ordering of transaction commits for the purpose of serial asynchronous replication. Is that possible? Ed
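[A minimal sketch (plain Python, hypothetical values) of why the clean-sweep delete breaks once 32-bit xids wrap around:]

```python
# Hypothetical sketch: once 32-bit xids wrap, "xid < marker" stops being
# a safe clean-sweep condition for already-replicated rows.
WRAP = 2**32

def clean_sweep(queue_xids, marker):
    # mimics: DELETE FROM replication_queue WHERE xid < marker
    # returns the rows that survive the delete
    return [x for x in queue_xids if x >= marker]

# Before wraparound: markers and xids grow together, so the delete is safe.
assert clean_sweep([100, 200, 300], marker=250) == [300]

# After wraparound, brand-new xids restart near zero, compare "less than"
# the marker, and would be swept away before ever being replicated.
post_wrap_xids = [1, 2, 3]  # new, not-yet-replicated transactions
assert clean_sweep(post_wrap_xids, marker=WRAP - 10) == []  # all wrongly deleted
```

A 64-bit counter pushes the same failure out far enough to be irrelevant in practice.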
"Ed L." <pgsql@bluepolka.net> writes: >> XIDs are issued in transaction start order, not transaction commit >> order. Dunno if that's what's concerning you. > Yes, I'm interested in any reliable ordering of transaction commits for the > purpose of serial asynchronous replication. Is that possible? The order in which commit records are inserted into the WAL log would work (and if you just need an ID number, the commit records' WAL-log addresses are 64 bits and would serve nicely). But you'll have to add code to the backend to get at that. regards, tom lane
On Friday March 21 2003 11:53, Tom Lane wrote: > "Ed L." <pgsql@bluepolka.net> writes: > >> XIDs are issued in transaction start order, not transaction commit > >> order. Dunno if that's what's concerning you. > > > > Yes, I'm interested in any reliable ordering of transaction commits for > > the purpose of serial asynchronous replication. Is that possible? > > The order in which commit records are inserted into the WAL log would > work (and if you just need an ID number, the commit records' WAL-log > addresses are 64 bits and would serve nicely). But you'll have to add > code to the backend to get at that. A 64-bit serial ID number would suffice. I could spend a little time on that. Any starter hints on files/functions/structure of such an addition? Something akin to GetTransactionId()? Ed
Ed Loehr <ed@LoehrTech.com> writes: > A 64-bit serial ID number would suffice. I could spend a little time on > that. Any starter hints on files/functions/structure of such an addition? > Something akin to GetTransactionId()? You'd have to capture the WAL "recptr" returned to RecordTransactionCommit by XLogInsert. What you do with it afterwards, I have no idea. regards, tom lane
On Friday March 21 2003 2:51, you wrote: > Ed Loehr <ed@LoehrTech.com> writes: > > A 64-bit serial ID number would suffice. I could spend a little time > > on that. Any starter hints on files/functions/structure of such an > > addition? Something akin to GetTransactionId()? > > You'd have to capture the WAL "recptr" returned to > RecordTransactionCommit by XLogInsert. What you do with it afterwards, > I have no idea. Well, I'm trying to capture a 64-bit pseudo-transaction ID, just like the WAL record number, but to do it within a C trigger so I can queue it into another table and have all-or-none semantics. Am I correct in assuming the XLogInsert() call is made after the transaction is guaranteed to complete? If so, wouldn't this be beyond the triggered function's reach? Ed
"Ed L." <pgsql@bluepolka.net> writes: > On Friday March 21 2003 2:51, you wrote: >> You'd have to capture the WAL "recptr" returned to >> RecordTransactionCommit by XLogInsert. What you do with it afterwards, >> I have no idea. > Well, I'm trying to capture a 64-bit pseudo-transaction ID, just like the > WAL record number, but to do it within a C trigger so I can queue it into > another table and have all-or-none semantics. Am I correct in assuming the > XLogInsert() call is made after the transaction is guaranteed to complete? > If so, wouldn't this be beyond the triggered function's reach? It's certainly out of reach of anything executed within the transaction, since by definition the commit record is only written after the transaction is done. It seems to me to be a contradiction in terms to expect within-transaction actions to have any information about commit order of their transaction. regards, tom lane
On Friday March 21 2003 10:07, Tom Lane wrote: > > > > Well, I'm trying to capture a 64-bit pseudo-transaction ID, just like > > the WAL record number, but to do it within a C trigger so I can queue > > it into another table and have all-or-none semantics. Am I correct in > > assuming the XLogInsert() call is made after the transaction is > > guaranteed to complete? If so, wouldn't this be beyond the triggered > > function's reach? > > It's certainly out of reach of anything executed within the transaction, > since by definition the commit record is only written after the > transaction is done. It seems to me to be a contradiction in terms to > expect within-transaction actions to have any information about commit > order of their transaction. I see your point. Maybe it's not possible to get perfect ordering from any information available within a transaction? Using the transaction ID for ordering seems problematic given the variability of transaction lifetimes, not to mention the 32-bit issue. I wonder if it'd be advisable to make WAL data available in a (system?) table, maybe mapping the transaction ID to the WAL record number? Just looking for some way to make the true commit order visible to a SQL query in order to leverage existing replication code ... Ed
On Friday March 21 2003 11:12, Ed L. wrote: > > Using the transaction ID for ordering seems problematic given the > variability of transaction lifetimes, not to mention the 32-bit issue. I > wonder if it'd be advisable to make WAL data available in a (system?) > table, maybe mapping the transaction ID to the WAL record number? Just > looking for some way to make the true commit order visible to a SQL query > in order to leverage existing replication code ... This is admittedly a half-baked idea from someone with little knowledge of the pg code, but more specifically, I'm imagining a system table that looks like this (I know this is not how system tables are specified): create table pg_xlog ( xid int4 unique not null, recno int8 unique not null ); -- recno = (XLogRecPtr.xlogid << 32) + XLogRecPtr.xrecoff This would map transaction IDs to WAL log record numbers. It seems straightforward to get the right data into this table. But of course, this table would get a gazillion inserts, and the appropriate logic for clearing/truncating and the potential performance and memory impacts are unclear to me. Still, it does have the appeal of allowing a combination of a trigger and SQL to reliably determine the transaction order, which would seem to allow asynchronous trigger-based replication schemes to get the right replay order. Any other ideas as to how to get the guaranteed correct transaction order via triggers? Ed
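[The 64-bit record number in the comment above can be sketched directly; the field names follow the XLogRecPtr struct mentioned in the thread, and the key property for replication is that the packed value sorts in WAL insertion order:]

```python
def recno(xlogid: int, xrecoff: int) -> int:
    # Pack the 32-bit WAL log id into the high half and the 32-bit
    # offset into the low half, yielding one sortable 64-bit key.
    assert 0 <= xlogid < 2**32 and 0 <= xrecoff < 2**32
    return (xlogid << 32) + xrecoff

# Later WAL positions always compare greater, even across log-file boundaries.
assert recno(0, 0xFFFFFFFF) < recno(1, 0)
assert recno(3, 10) < recno(3, 11)
```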
"Ed L." <pgsql@bluepolka.net> writes: > create table pg_xlog ( > xid int4 unique not null, > recno int8 unique not null > ); > -- recno = (XLogRecPtr.xlogid << 32) + XLogRecPtr.xrecoff > This would map transaction IDs to WAL log record numbers. It seems > straightforward to get the right data into this table. ... except that once a transaction has committed, it can't go around making more database entries. You can only modify the DB within a transaction. regards, tom lane
On Saturday March 22 2003 8:44, Tom Lane wrote: > "Ed L." <pgsql@bluepolka.net> writes: > > create table pg_xlog ( > > xid int4 unique not null, > > recno int8 unique not null > > ); > > -- recno = (XLogRecPtr.xlogid << 32) + XLogRecPtr.xrecoff > > > > This would map transaction IDs to WAL log record numbers. It seems > > straightforward to get the right data into this table. > > ... except that once a transaction has committed, it can't go around > making more database entries. You can only modify the DB within a > transaction. There may be other good reasons not to do it, but it appears it does not all need to be done within the trigger: 1. Within the transaction, a user trigger records the transaction ID into a table along with tuple data via GetCurrentTransactionId() just as dbmirror and rserv do now, essentially like insert into replication_queue (xid, data, ...) 2. In xact.c, RecordTransactionCommit() inserts the (xid, recno) pair into pg_xlog just after XLogInsert() for the commit. It doesn't matter if it's within reach of the user's transaction-constrained trigger code or not because it's done by the system. 3. An asynchronous slave replicator comes along later and matches the xid to the replication tuples, as in select ... from pg_xlog l, replication_queue q where l.xid = q.xid order by l.recno; ...and gets a guaranteed ordering. I'm also wondering if my concerns are unfounded in regard to the impact of variable transaction lifetimes on the xid. Maybe MVCC takes care of any concerns from variable transaction lifetimes so that the xid, even though it comes from the beginning of the transaction, is a reliable ordering? Ed
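[Step 3's join can be mimicked in a few lines; the data here is hypothetical, but it shows how a commit-time record number turns start-ordered xids into commit order:]

```python
def commit_order(pg_xlog_rows, queue_rows):
    # pg_xlog_rows: (xid, recno) pairs written at commit time.
    # queue_rows:   (xid, data) pairs written by the trigger.
    # Equivalent to: SELECT ... WHERE l.xid = q.xid ORDER BY l.recno
    recno = dict(pg_xlog_rows)
    return sorted(queue_rows, key=lambda row: recno[row[0]])

# xid 7 started first but committed last (larger recno), so it replays last.
commits = [(7, 500), (9, 120)]
queue = [(7, "update A"), (9, "update B")]
assert commit_order(commits, queue) == [(9, "update B"), (7, "update A")]
```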
On Saturday March 22 2003 9:15, Ed L. wrote: > On Saturday March 22 2003 8:44, Tom Lane wrote: > > "Ed L." <pgsql@bluepolka.net> writes: > > > create table pg_xlog ( > > > xid int4 unique not null, > > > recno int8 unique not null > > > ); > > > -- recno = (XLogRecPtr.xlogid << 32) + XLogRecPtr.xrecoff > > > > > > This would map transaction IDs to WAL log record numbers. It seems > > > straightforward to get the right data into this table. > > > > ... except that once a transaction has committed, it can't go around > > making more database entries. You can only modify the DB within a > > transaction. > ... > > 2. In xact.c, RecordTransactionCommit() inserts the (xid, recno) pair > into pg_xlog just after XLogInsert() for the commit. It doesn't matter > if it's within reach of the user's transaction-constrained trigger code > or not because it's done by the system. Am I correct in assuming the system can modify system tables outside a transaction?
"Ed L." <pgsql@bluepolka.net> writes: > Am I correct in assuming the system can modify system tables outside a > transaction? No, it can't. *Any* DB change has to be done within a transaction. regards, tom lane
On Saturday March 22 2003 9:26, Tom Lane wrote: > "Ed L." <pgsql@bluepolka.net> writes: > > Am I correct in assuming the system can modify system tables outside a > > transaction? > > No, it can't. *Any* DB change has to be done within a transaction. D'oh. Horse...dead. You didn't say, but do you think the xid is a reliable replication replay ordering even though it comes at the start of the transaction? Ed
Hi, I'm just testing insertion of about 600 000 records inside 3 tables. Just making a big text file with 3 inserts each time (for my 3 tables) like insert into xx (yy) values ('data'); so I have 3 x 600 000 inserts inside the file. Table N°2 has a reference to Table N°1 with the primary key ... It's not a transaction ... I have only a primary key on each of the 3 tables ... The tables were empty at the beginning and all the database vacuum full analysed before the test. Same result on PostgreSQL 7.2.3 and 7.3.2 ... So I get about 1000 inserts done in 10 seconds at the beginning of the file, then after one hour I get 25 seconds for 1000 inserts ... and it's going slower and slower ... Before making a text file ... I had done the same thing with a Perl script with DBI to make the 3 inserts in a transaction (of the 600 000 records x3), and it was really slower (I stopped the script after 10 hours and 819 000 inserts done.) And same thing, it was going slower and slower ... it was about 2000 inserts per 10 seconds for the first 10 minutes ... then after 10 hours I was at 40 inserts per 10 seconds ... I'm using for my tests a server with Bi-Pentium III 1.8 GHz - 2Mb of RAM and Linux 2.4.18 ... the memory does not move ... the swap was never used ... Why is PostgreSQL going slower? Do you have any idea to have a better and quicker insert method? Thanks in advance for your help :o) regards, -- Hervé
"Ed L." <pgsql@bluepolka.net> writes: > You didn't say, but do you think the xid is a reliable replication replay > ordering even though it comes at the start of the transaction? Dunno. What requirements have you really got? regards, tom lane
On Saturday March 22 2003 9:55, Tom Lane wrote: > "Ed L." <pgsql@bluepolka.net> writes: > > You didn't say, but do you think the xid is a reliable replication > > replay ordering even though it comes at the start of the transaction? > > Dunno. What requirements have you really got? Again, the context is asynchronous trigger-based master-slave replication a la contrib/dbmirror. In this context, AFAICS, the primary requirement is to queue updates, inserts, and deletes on a master db for later SQL retrieval and subsequent serial "replay" into a slave in the "right" order. The master-queued tuple changes should be groupable as transactions so that replication can enforce all-or-none semantics on the slave, though normally it should never hit a snag since the master and slave are initialized as identical copies. The queue should be ordered the "right" way for serial replay into a slave, whatever that "right" way is, in order to maintain consistency with the master. I assume triggers will have to be disabled during replay on the slave to avoid time-sensitive side-effects. DBMirror basically already does all of this, except for disabling triggers and my uncertainty about the ordering issue. Ed
Hervé Piedvache <herve@elma.fr> writes: > Table N°2 has a reference to Table N°1 with the primary key ... > It's not a transaction ... I have only a primary key on each of the 3 tables ... > The tables were empty at the beginning and all the database vacuum full > analysed before the test. > Same result on PostgreSQL 7.2.3 and 7.3.2 ... > So I get about 1000 inserts done in 10 seconds at the beginning of the file, > then after one hour I get 25 seconds for 1000 inserts ... and it's going > slower and slower ... You probably should *not* have vacuum analyzed while the tables were empty. The planner generated plans for the foreign-key checks based on the assumption that the tables are tiny --- so it's probably using seqscans to make the checks. The actual inserts won't slow down materially as the tables fill, but seqscan foreign-key checks will. You could vacuum now to update the stats, but that won't affect the foreign-key plans that the inserting process has already cached. You'll have to stop the inserting process and start a new one in any case. Another possibility is to start again with freshly created tables and no initial vacuum. If the planner has no stats it should default to indexscans. regards, tom lane
Hervé Piedvache <herve@elma.fr> writes: > Hi, > > I'm just testing insertion of about 600 000 records inside 3 tables. > > Just making a big text file with 3 inserts each time (for my 3 tables) like > insert into xx (yy) values ('data'); so I have 3 x 600 000 inserts inside the > file. > > Table N°2 has a reference to Table N°1 with the primary key ... > It's not a transaction ... I have only a primary key on each of the 3 tables ... You're getting killed by transaction overhead. Make it so you do 1000 or so (or even more, big transactions don't hurt) inserts in a single transaction and you'll see much better performance. Also (under 7.3) run VACUUM during the insertion process--it may help. You could also disable the foreign key triggers during the run if you're sure the data is consistent. -Doug
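[Doug's suggestion of grouping inserts into large transactions can be sketched as a simple preprocessing step over the generated SQL file; the batch size and statement text here are illustrative:]

```python
def wrap_in_transactions(statements, batch_size=1000):
    # Surround each run of batch_size statements with BEGIN/COMMIT, so the
    # server pays per-transaction overhead once per batch, not once per insert.
    out = []
    for i in range(0, len(statements), batch_size):
        out.append("BEGIN;")
        out.extend(statements[i:i + batch_size])
        out.append("COMMIT;")
    return out

stmts = ["insert into xx (yy) values ('data');"] * 2500
wrapped = wrap_in_transactions(stmts)
# 2500 statements -> 3 batches (1000 + 1000 + 500)
assert wrapped.count("BEGIN;") == 3 and wrapped.count("COMMIT;") == 3
```

(For a bulk load of this size, COPY would likely be faster still.)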
"Ed L." <pgsql@bluepolka.net> writes: > On Saturday March 22 2003 9:55, Tom Lane wrote: >> Dunno. What requirements have you really got? > Again, the context is asynchronous trigger-based master-slave replication a la > contrib/dbmirror. AFAICS that technique does not need to worry about transaction ordering, as long as updates are sent in the order they are inserted into the pending-updates table. When two transactions update the same row, one is going to be forced to wait till the other commits; so the waiter's report will appear second. For nonconflicting updates you might not apply them in the same order they were physically made at the master, but I can't see how it matters. (I'm not entirely certain, but I think this works only if the reports are made by AFTER triggers. However, you'd have to do that anyway, since a BEFORE trigger can't be certain it's the last BEFORE trigger. Sending a report from a BEFORE trigger might send a report containing a row value that's not what ultimately gets inserted, due to mods made by later BEFORE triggers.) However, this may just move the problem somewhere else --- how will you be sure that you transmit entries in the update table in the correct order? regards, tom lane
On Saturday March 22 2003 11:05, Tom Lane wrote: > "Ed L." <pgsql@bluepolka.net> writes: > > > > Again, the context is asynchronous trigger-based master-slave > > replication a la contrib/dbmirror. > > AFAICS that technique does not need to worry about transaction ordering, > as long as updates are sent in the order they are inserted into the > pending-updates table... [snip] > > However, this may just move the problem somewhere else --- how will you > be sure that you transmit entries in the update table in the correct > order? That is my question. How it is done now in dbmirror is by including the xid and using a sequence object, essentially similar to insert into queue (xid, tuple_id, data, ...) values (GetCurrentTransactionId(), nextval('tuplesequence'), ...); Actually, dbmirror does it in a couple of steps, but I think this is the basic idea. So the queued tuple changes are groupable (and thus replayable) by xid and order of queueing. Then the slave gets them in the hopefully-correct order via select ... from queue q where q.xid is in some appropriate range order by xid asc, tuple_id asc; Ed
"Ed L." <pgsql@bluepolka.net> writes: > Actually, dbmirror does it in a couple of steps, but I think this is the > basic idea. So the queued tuple changes are groupable (and thus > replayable) by xid and order of queueing. Then the slave gets them in the > hopefully-correct order via > select ... > from queue q > where q.xid is in some appropriate range > order by xid asc, tuple_id asc; I think this last part is wrong. It shouldn't be using the xid as part of the ordering, only the sequence value. regards, tom lane
On Saturday March 22 2003 11:29, Tom Lane wrote: > "Ed L." <pgsql@bluepolka.net> writes: > > Actually, dbmirror does it in a couple of steps, but I think this is > > the basic idea. So the queued tuple changes are groupable (and thus > > replayable) by xid and order of queueing. Then the slave gets them in > > the hopefully-correct order via > > select ... > > from queue q > > where q.xid is in some appropriate range > > order by xid asc, tuple_id asc; > > I think this last part is wrong. It shouldn't be using the xid as part > of the ordering, only the sequence value. Why not? How would you replay them on the slave in the same transaction groupings and order that occurred on the master?
"Ed L." <pgsql@bluepolka.net> writes: > On Saturday March 22 2003 11:29, Tom Lane wrote: >> I think this last part is wrong. It shouldn't be using the xid as part >> of the ordering, only the sequence value. > Why not? How would you replay them on the slave in the same transaction > groupings and order that occurred on the master? Why not is easy: begin xact 1; update tuple X; ... begin xact 2; update tuple Y; commit; ... update tuple Y; commit; (Note that this is only possible in read-committed mode, else xact 1 wouldn't be allowed to update tuple Y like this.) Here, you must replay xact 1's update of Y after xact 2's update of Y, else you'll get the wrong final state on the slave. On the other hand, it really does not matter whether you replay the tuple X update before or after you replay xact 2, because xact 2 didn't touch tuple X. If the existing DBmirror code is sorting as you say, then it will fail in this scenario --- unless it always manages to execute a propagation step in between the commits of xacts 2 and 1, which doesn't seem very trustworthy. What I'm envisioning is that you should just send updates in the order of their insertion sequence numbers and *not* try to force them into transactional grouping. In the above scenario, depending on when propagation runs it might send X update, Y update, second Y update all in one batch; or it might send Y update in one batch and then X update and second Y update in a later batch. Both of these are legal and consistent orderings. The only ordering constraint is that the second Y update must be applied second --- but that is certain as long as we sort the contents of a batch by insertion order. (The pending-update report for the second Y update cannot have been made until after xact 2 commits, because the MVCC semantics will force xact 1 to wait for 2 to commit before it can update Y.) 
Note that all of a transaction's updates will become visible in the pending-update table simultaneously when it commits, so (as long as you grab batches in single SELECTs, or with a serializable transaction) there's no problems with partial transactions being applied by a batch. A batch will contain all the updates of some specific set of transactions, namely all those that committed since the prior batch for that slave. But you can't order the updates done by a batch in xid order, they have to be applied in insertion order. regards, tom lane
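[Tom's scenario is easy to simulate (plain Python, illustrative row data): replaying strictly in insertion order reproduces the master's final state, while replaying all of xact 1's updates before xact 2's does not:]

```python
# Queue rows as (xid, seqid, key, value); seqid is the insertion order.
queue = [
    (1, 1, "X", "X-by-xact1"),  # xact 1 updates X
    (2, 2, "Y", "Y-by-xact2"),  # xact 2 updates Y and commits first
    (1, 3, "Y", "Y-by-xact1"),  # xact 1 updates Y (after xact 2's commit)
]

def replay(rows):
    state = {}
    for _xid, _seq, key, value in rows:
        state[key] = value  # last writer wins, as on the slave
    return state

# Insertion order yields the master's state: xact 1's Y update lands last.
assert replay(sorted(queue, key=lambda r: r[1]))["Y"] == "Y-by-xact1"

# Grouping so that all of xact 1 replays before xact 2 gives the wrong Y.
grouped = sorted(queue, key=lambda r: (r[0], r[1]))
assert replay(grouped)["Y"] == "Y-by-xact2"
```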
On Saturday March 22 2003 12:00, Tom Lane wrote: > "Ed L." <pgsql@bluepolka.net> writes: > > On Saturday March 22 2003 11:29, Tom Lane wrote: > >> I think this last part is wrong. It shouldn't be using the xid as > >> part of the ordering, only the sequence value. > > > > Why not? How would you replay them on the slave in the same > > transaction groupings and order that occurred on the master? > > Why not is easy: > > begin xact 1; > update tuple X; > ... > begin xact 2; > update tuple Y; > commit; > ... > update tuple Y; > commit; > > (Note that this is only possible in read-committed mode, else xact 1 > wouldn't be allowed to update tuple Y like this.) Here, you must > replay xact 1's update of Y after xact 2's update of Y, else you'll > get the wrong final state on the slave. On the other hand, it really > does not matter whether you replay the tuple X update before or after > you replay xact 2, because xact 2 didn't touch tuple X. > > If the existing DBmirror code is sorting as you say, then it will fail > in this scenario --- unless it always manages to execute a propagation > step in between the commits of xacts 2 and 1, which doesn't seem very > trustworthy. Well, I'm not absolutely certain, but I think this problem may indeed exist in dbmirror. If I'm reading it correctly, dbmirror basically has the following: create table xact_queue (xid int, seqid serial, ...); create table tuple_queue (seqid int, data, ...); The dbmirror trigger does this: myXid = GetCurrentTransactionId(); insert into xact_queue (xid, seqid) values (myXid, nextval('seqid_seq')); insert into tuple_queue (seqid, data, ...) values (currval('seqid_seq'), ...); The slave then grabs all queued xids in order of the max seqid within each transaction. Essentially, SELECT xid, MAX(seqid) FROM xact_queue GROUP BY xid ORDER BY MAX(seqid); In your scenario it would order them xact1, then xact2, since xact 1's update of Y would have the max seqid. For each xact, it replays the tuples for that xact in seqid order.
SELECT t.seqid, t.data, ... FROM tuple_queue t, xact_queue x WHERE t.seqid = x.seqid AND x.xid = $XID ORDER BY t.seqid; So the actual replay order would be xact1: update X xact1: update Y xact2: update Y leading to slave inconsistency. > What I'm envisioning is that you should just send updates in the order > of their insertion sequence numbers and *not* try to force them into > transactional grouping. ... Very good. Makes perfect sense to me now. That also apparently obviates the need for 64-bit transaction IDs since the sequence can be a BIGINT. Thanks, Ed
"Ed L." <pgsql@bluepolka.net> writes: > Well, I'm not absolutely certain, but I think this problem may indeed exist > in dbmirror. Sounds that way to me too. Can anyone try the experiment? I don't have dbmirror set up here ... regards, tom lane
On Sat, Mar 22, 2003 at 02:00:16PM -0500, Tom Lane wrote: [stuff about how xid is not a good guide to what committed first] FWIW, I think this is why rserv applies current-state snapshots, and doesn't try to replay everything. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Saturday March 22 2003 12:00, Tom Lane wrote: > "Ed L." <pgsql@bluepolka.net> writes: > > On Saturday March 22 2003 11:29, Tom Lane wrote: > >> I think this last part is wrong. It shouldn't be using the xid as > >> part of the ordering, only the sequence value. > > > > Why not? How would you replay them on the slave in the same > > transaction groupings and order that occurred on the master? > > Why not is easy: > > begin xact 1; > update tuple X; > ... > > begin xact 2; > update tuple Y; > commit; > > ... > update tuple Y; > commit; > > (Note that this is only possible in read-committed mode, else xact 1 > wouldn't be allowed to update tuple Y like this.) Here, you must > replay xact 1's update of Y after xact 2's update of Y, else you'll > get the wrong final state on the slave. On the other hand, it really > does not matter whether you replay the tuple X update before or after > you replay xact 2, because xact 2 didn't touch tuple X. > > If the existing DBmirror code is sorting as you say, then it will fail > in this scenario --- unless it always manages to execute a propagation > step in between the commits of xacts 2 and 1, which doesn't seem very > trustworthy. > > What I'm envisioning is that you should just send updates in the order > of their insertion sequence numbers and *not* try to force them into > transactional grouping. In the above scenario, depending on when > propagation runs it might send X update, Y update, second Y update > all in one batch; or it might send Y update in one batch and then X > update and second Y update in a later batch. Both of these are legal > and consistent orderings. The only ordering constraint is that the > second Y update must be applied second --- but that is certain as > long as we sort the contents of a batch by insertion order. 
> (The pending-update report for the second Y update cannot have been made > until after xact 2 commits, because the MVCC semantics will force > xact 1 to wait for 2 to commit before it can update Y.) > > Note that all of a transaction's updates will become visible in the > pending-update table simultaneously when it commits, so (as long as > you grab batches in single SELECTs, or with a serializable transaction) > there's no problems with partial transactions being applied by a batch. > A batch will contain all the updates of some specific set of > transactions, namely all those that committed since the prior batch > for that slave. But you can't order the updates done by a batch in xid > order, they have to be applied in insertion order. If you grab everything in the queue with a single SELECT, this works. Depending on the queue length, that's not always practical, and as noted above, committed batches could result in partial transactions on the slave. So the riddle is how to get a consistent but batchable replication order. Suppose you have a replication queue, with dbmirror-like trigger-based insertions, that looks something like this: create table replication_queue ( xid integer, seqid serial primary key, data... ); Here, 'xid' is the transaction ID, 'seqid' is the queue insertion order, and 'data' has all the info needed to replicate the update. Every row update/delete/insert results in a row in this queue. Suppose further you wish to limit the number of updates replicated in a particular cycle. For example, suppose you have a million changes queued for replication on the master, but for obvious reasons you don't want to select a million rows to replicate all at once. Suppose you also don't want to grab them one tuple or one transaction at a time, preferring to avoid hammering the master.
Rather, you want to grab them in batches of no more than N transactions, replicate them all to a slave and commit on the slave, take a breather, repeating until the slave is caught up. And during each breather, you want to have committed only complete transactions so that any slave clients see consistent data. My question: Is there an ordering algorithm that would make consistent replication possible with a limited batch size? The algorithm I'm considering right now is the following: select xid, max(seqid) as "max_seqid" into temp replication_order from replication_queue group by xid order by max(seqid) limit N; Then, to get the actual queue replication order, select q.xid, q.seqid, q.data from replication_queue q, replication_order o where q.xid = o.xid order by o.max_seqid, q.seqid; [This is a batched variation of dbmirror's original algorithm.] Ed
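[The two queries above can be mimicked in memory to see what order they actually produce; the row data is illustrative:]

```python
def batched_replication_order(queue, n):
    # queue: (xid, seqid, data) rows. Mimics the temp-table step:
    #   SELECT xid, MAX(seqid) ... GROUP BY xid ORDER BY MAX(seqid) LIMIT n
    max_seqid = {}
    for xid, seqid, _data in queue:
        max_seqid[xid] = max(max_seqid.get(xid, 0), seqid)
    chosen = set(sorted(max_seqid, key=max_seqid.get)[:n])
    # Mimics the join step: ... ORDER BY o.max_seqid, q.seqid
    rows = [r for r in queue if r[0] in chosen]
    return sorted(rows, key=lambda r: (max_seqid[r[0]], r[1]))

queue = [(1, 1, "a"), (2, 2, "b"), (1, 3, "c"), (3, 4, "d")]
batch = batched_replication_order(queue, n=2)
# xact 2 (max seqid 2) sorts before xact 1 (max seqid 3); xact 3 waits
# for the next batch.  Replay order by seqid: 2, then 1, then 3.
assert [r[1] for r in batch] == [2, 1, 3]
```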
"Ed L." <pgsql@bluepolka.net> writes: > On Saturday March 22 2003 12:00, Tom Lane wrote: >> Note that all of a transaction's updates will become visible in the >> pending-update table simultaneously when it commits, so (as long as >> you grab batches in single SELECTs, or with a serializable transaction) >> there's no problems with partial transactions being applied by a batch. > If you grab everything in the queue with a single SELECT, this works. > Depending on the queue length, that's not always practical, and as noted > above, committed batches could result in partial transactions on the slave. > So the riddle is how to get a consistent but batchable replication order. You don't have to do anything special if you pull the contents of a batch in a single serializable transaction. I see no reason to think that using a serializable transaction is "hammering the master"; so you are asking for a solution to a non-problem. regards, tom lane
On Thursday April 10 2003 4:11, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > On Saturday March 22 2003 12:00, Tom Lane wrote:
> >> Note that all of a transaction's updates will become visible in the
> >> pending-update table simultaneously when it commits, so (as long as
> >> you grab batches in single SELECTs, or with a serializable
> >> transaction) there's no problems with partial transactions being
> >> applied by a batch.
> >
> > If you grab everything in the queue with a single SELECT, this works.
> > Depending on the queue length, that's not always practical, and as
> > noted above, committed batches could result in partial transactions on
> > the slave.  So the riddle is how to get a consistent but batchable
> > replication order.
>
> You don't have to do anything special if you pull the contents of a
> batch in a single serializable transaction.  I see no reason to think
> that using a serializable transaction is "hammering the master"; so
> you are asking for a solution to a non-problem.

I don't think so.  Can you imagine a replication queue big enough that someone might not want to process it entirely in one transaction?  I sure can.  Consider the following sequence:

1.  Replication queueing is begun for a particular slave on the master.  A dump is taken with which to initialize the slave.

2.  A sufficient number of updates are queued such that the total amount of data exceeds the amount one wants to process before giving the master, or the slave, or the network, a break.  This could easily happen in our case if there were delays in getting the slave set up for whatever reason.

3.  The slave is finally set up for replication.  By this time, the queue is bigger than we want to process in one round.

Ed
On Thursday April 10 2003 4:44, Ed L. wrote:
> On Thursday April 10 2003 4:11, Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > > On Saturday March 22 2003 12:00, Tom Lane wrote:
> > >> Note that all of a transaction's updates will become visible in the
> > >> pending-update table simultaneously when it commits, so (as long as
> > >> you grab batches in single SELECTs, or with a serializable
> > >> transaction) there's no problems with partial transactions being
> > >> applied by a batch.
> > >
> > > If you grab everything in the queue with a single SELECT, this works.
> > > Depending on the queue length, that's not always practical, and as
> > > noted above, committed batches could result in partial transactions
> > > on the slave.  So the riddle is how to get a consistent but batchable
> > > replication order.
> >
> > You don't have to do anything special if you pull the contents of a
> > batch in a single serializable transaction.  I see no reason to think
> > that using a serializable transaction is "hammering the master"; so
> > you are asking for a solution to a non-problem.
>
> I don't think so.  Can you imagine a replication queue big enough that
> someone might not want to process it entirely in one transaction?  I sure
> can.  Consider the following sequence:
>
> 1.  Replication queueing is begun for a particular slave on the master.
> A dump is taken with which to initialize the slave.
>
> 2.  A sufficient number of updates are queued such that the total amount
> of data exceeds the amount one wants to process before giving the master,
> or the slave, or the network, a break.  This could easily happen in our
> case if there were delays in getting the slave set up for whatever reason.

By "sufficient number of updates", I really mean a large enough number of transactions.  Imagine a queue with 100,000 transactions that would require a number of hours or days to replicate straight through.

Ed

> 3.  The slave is finally set up for replication.  By this time, the queue
> is bigger than we want to process in one round.
>
> Ed
"Ed L." <pgsql@bluepolka.net> writes: > I don't think so. Can you imagine a replication queue big enough to that > someone might not want to process it entirely in one transaction? No, I can't. The bigger the queue is, the further behind you are, and the more you need to catch up; twiddling your thumbs for awhile gets progressively less attractive. Also, AFAIR from prior discussion, the *slave* side doesn't need to commit the whole batch in one transaction. I can't recall if this could expose transaction intermediate states on the slave, but if you're that far behind you'd best not be having any live clients querying the slave anyway. In any case you can throttle the load by sleeping between selects while holding the transaction open. I think your concern is predicated on Oracle-ish assumptions about the cost of holding open a transaction. The only such cost in PG is that it interferes with VACUUM cleaning out old tuples --- which I'm not sure VACUUM could do anyway for stuff that still hasn't propagated to a slave. regards, tom lane
On Thursday April 10 2003 5:26, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > I don't think so.  Can you imagine a replication queue big enough
> > that someone might not want to process it entirely in one transaction?
>
> No, I can't.  The bigger the queue is, the further behind you are, and
> the more you need to catch up; twiddling your thumbs for awhile gets
> progressively less attractive.

Well, if an arbitrarily large queue of transactions doesn't persuade you, you are not going to be persuaded.  If (a) the catch-up rate is fast enough, (b) the master load periodically ebbs enough, as it does at various predictably slow times, (c) you care about lessening traffic between master and slave, and/or (d) you care about lessening constancy of the replication load on the master, having the replicator periodically back off is very attractive.

> Also, AFAIR from prior discussion, the *slave* side doesn't need to
> commit the whole batch in one transaction.  I can't recall if this
> could expose transaction intermediate states on the slave, but if
> you're that far behind you'd best not be having any live clients
> querying the slave anyway.

Exposing intermediate transaction states is precisely the issue and the reason for my original question.  Your apparent presumption of the lack of value of querying a slave that's running significantly behind is a false blanket assumption.  Of course it depends on the situation and the nature of the data.  I can think of a number of past instances where some considerable lagtime in the data propagation was just fine, but inconsistency was not.  If you aren't replicating to the slave and committing in one big all-inclusive batch, then there needs to be some care to commit in transaction units if you don't want to offer room for inconsistent views to slave clients.
We left the conversation a while back thinking there was no need for anything other than a serial replay by insertion order, but a batching requirement seems to change that, which is why I posed the original question.

> In any case you can throttle the load by sleeping between selects while
> holding the transaction open.  I think your concern is predicated on
> Oracle-ish assumptions about the cost of holding open a transaction.
> The only such cost in PG is that it interferes with VACUUM cleaning
> out old tuples --- which I'm not sure VACUUM could do anyway for stuff
> that still hasn't propagated to a slave.

I'd have thought there might be other problems resulting from holding a transaction open for hours, like the progress you lose if the master or slave or the connection between them goes down before the mondo commit.  Maybe that kind of thing just never happens.

Ed
Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)
From:
Richard Huxton
Date:
On Friday 11 Apr 2003 3:03 am, Ed L. wrote:
> On Thursday April 10 2003 5:26, Tom Lane wrote:
[snip]
> > Also, AFAIR from prior discussion, the *slave* side doesn't need to
> > commit the whole batch in one transaction.  I can't recall if this
> > could expose transaction intermediate states on the slave, but if
> > you're that far behind you'd best not be having any live clients
> > querying the slave anyway.
>
> Exposing intermediate transaction states is precisely the issue and the
> reason for my original question.  Your apparent presumption of the lack of
> value of querying a slave that's running significantly behind is a false
> blanket assumption.  Of course it depends on the situation and the nature
> of the data.  I can think of a number of past instances where some
> considerable lagtime in the data propagation was just fine, but
> inconsistency was not.  If you aren't replicating to the slave and
> committing in one big all-inclusive batch, then there needs to be some care
> to commit in transaction units if you don't want to offer room for
> inconsistent views to slave clients.

Surely the batching should be happening at the "master" end?  That's the only place with the context to mark a "determinate state".  Batch things as fine as you like at that end, and just have the slave process multiple batches if it can/wants to.

-- 
Richard Huxton
Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > I don't think so.  Can you imagine a replication queue big enough that
> > someone might not want to process it entirely in one transaction?
>
> No, I can't.  The bigger the queue is, the further behind you are, and
> the more you need to catch up; twiddling your thumbs for awhile gets
> progressively less attractive.

That is absolutely sure in an asynchronous multi-master situation, where "twiddling" only leads to conflicts ... not making your situation any easier.

But in a pure master slave situation?  There I can imagine this.

What I cannot imagine is why one would want to try to make batches any other size than the original transaction.  Committing smaller "chunks" of the master's transactions at the slave side would allow a client there to see an inconsistent snapshot - that is bad (tm).  Committing bigger groups contains the risk that the slave runs out of resources that the master didn't need - not any better.

> Also, AFAIR from prior discussion, the *slave* side doesn't need to
> commit the whole batch in one transaction.  I can't recall if this
> could expose transaction intermediate states on the slave, but if
> you're that far behind you'd best not be having any live clients
> querying the slave anyway.

I'm not aware of any "COMMIT BUT HIDE AND RESUME;"

There are scenarios where your online processes have clear priority over the master to multislave replication.  Think of a load balanced multisite search engine ... does it really matter that all the sites stay as closely in sync as possible?  Do people expect Google to be up to date on a per-minute basis?

Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
I'm not really very cognizant of these slave/master things.  I know I will have to be someday.  But the case I heard in the conversation that seemed most appropriate to batching was when a slave starts from a clean slate.  It should not be accessible until it has caught up, true.  But catching up a blank slave would most definitely hammer on a master, I think.

Jan Wieck wrote:
> Tom Lane wrote:
>
>> "Ed L." <pgsql@bluepolka.net> writes:
>>
>>> I don't think so.  Can you imagine a replication queue big enough that
>>> someone might not want to process it entirely in one transaction?
>>
>> No, I can't.  The bigger the queue is, the further behind you are, and
>> the more you need to catch up; twiddling your thumbs for awhile gets
>> progressively less attractive.
>
> That is absolutely sure in an asynchronous multi-master situation, where
> "twiddling" only leads to conflicts ... not making your situation any
> easier.
>
> But in a pure master slave situation?  There I can imagine this.
>
> What I cannot imagine is why one would want to try to make batches any
> other size than the original transaction.  Committing smaller "chunks" of
> the master's transactions at the slave side would allow a client there to
> see an inconsistent snapshot - that is bad (tm).  Committing bigger
> groups contains the risk that the slave runs out of resources that the
> master didn't need - not any better.
>
>> Also, AFAIR from prior discussion, the *slave* side doesn't need to
>> commit the whole batch in one transaction.  I can't recall if this
>> could expose transaction intermediate states on the slave, but if
>> you're that far behind you'd best not be having any live clients
>> querying the slave anyway.
>
> I'm not aware of any "COMMIT BUT HIDE AND RESUME;"
>
> There are scenarios where your online processes have clear priority over
> the master to multislave replication.  Think of a load balanced multisite
> search engine ...
> does it really matter that all the sites stay as closely in sync
> as possible?  Do people expect Google to be up to date on a per-minute
> basis?
>
> Jan
On Friday April 11 2003 5:48, Jan Wieck wrote:
> Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > > I don't think so.  Can you imagine a replication queue big enough
> > > that someone might not want to process it entirely in one
> > > transaction?
> >
> > No, I can't.  The bigger the queue is, the further behind you are, and
> > the more you need to catch up; twiddling your thumbs for awhile gets
> > progressively less attractive.
>
> That is absolutely sure in an asynchronous multi-master situation, where
> "twiddling" only leads to conflicts ... not making your situation any
> easier.
>
> But in a pure master slave situation?  There I can imagine this.

The context of my question is strictly master slave.

> What I cannot imagine is why one would want to try to make batches any
> other size than the original transaction.  Committing smaller "chunks" of
> the master's transactions at the slave side would allow a client there to
> see an inconsistent snapshot - that is bad (tm).  Committing bigger
> groups contains the risk that the slave runs out of resources that the
> master didn't need - not any better.

To what slave resources are you referring?

Ed
On Friday April 11 2003 2:38, Richard Huxton wrote: > Surely the batching should be happening at the "master" end? That's the > only place with the context to mark a "determinate state". Batch things > as fine as you like at that end, and just have the slave process multiple > batches if it can/wants to. Huh? I don't understand this comment. How to retrieve properly ordered but limited batches of transactions and process them on the slave is the question. Ed
"Ed L." wrote: > > On Friday April 11 2003 5:48, Jan Wieck wrote: > > Tom Lane wrote: > > > "Ed L." <pgsql@bluepolka.net> writes: > > > > I don't think so. Can you imagine a replication queue big enough to > > > > that someone might not want to process it entirely in one > > > > transaction? > > > > > > No, I can't. The bigger the queue is, the further behind you are, and > > > the more you need to catch up; twiddling your thumbs for awhile gets > > > progressively less attractive. > > > > That is absolutely sure in an asynchronous multi-master situation, where > > "twiddling" only leads to conflicts ... not making your situation any > > easier. > > > > But in a pure master slave situation? There I can imagine this. > > The context of my question is strictly master slave. > > > What I cannot imagine is why one would want to try to make batches any > > other size than the original transaction. Committing smaller "chunks" of > > the masters transactions at the slave side would allow a client there to > > see an inconsistent snapshot - that is bad (tm). Committing bigger > > groups contains the risk that the slave run's out of resources that the > > master didn't need - not any better. > > To what slave resources are you referring? Clearly a bug, but we had memory leaks that clear up at transaction end. One of the "leaks" we still have: Constraint trigger queue. Be sure, you don't want to find out that you have that kind of problem by loosing slave by slave because your workload got worse ;-) Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Friday April 11 2003 10:08, Jan Wieck wrote: > Clearly a bug, but we had memory leaks that clear up at transaction end. That seems like yet another reason for constraining the size of a batch of transactions. > One of the "leaks" we still have: Constraint trigger queue. What is that about? Or if you don't want to re-explain, what would I search for in the archive? Ed
"Ed L." wrote: > > On Friday April 11 2003 10:08, Jan Wieck wrote: > > Clearly a bug, but we had memory leaks that clear up at transaction end. > > That seems like yet another reason for constraining the size of a batch of > transactions. Er ... what? I said: What I cannot imagine is why one would want to try to make batches any other size than the original transaction. "the original transaction" - singular!!! Not a couple, few, maybe some, part, fraction or anything in between, above or below. Exactly ONE. > > > One of the "leaks" we still have: Constraint trigger queue. > > What is that about? Or if you don't want to re-explain, what would I search > for in the archive? If you have a deferred referential integrity constraint defined (one of the reasons why half of a transaction cannot work at all), where does the backend remember the ctid's and other information for the triggers to call at commit time? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Friday April 11 2003 1:07, Jan Wieck wrote: > "Ed L." wrote: > > On Friday April 11 2003 10:08, Jan Wieck wrote: > > > Clearly a bug, but we had memory leaks that clear up at transaction > > > end. > > > > That seems like yet another reason for constraining the size of a batch > > of transactions. > > What I cannot imagine is why one would want to try to make batches any > other size than the original transaction. Again, if there are a large number of transactions queued on the master, possibly constituted by an even much larger number of actual row changes, a batch size of 1 transaction means, in this context, the master will be hit once for every single transaction to retrieve the queued rows corresponding to that transaction. A batchsize of, say 1000 transactions, on the other hand, will save 999 queries. Granted, you may not share my concern about the impact of that on master or network. > "the original transaction" - singular!!! Not a couple, few, maybe some, > part, fraction or anything in between, above or below. Exactly ONE. Repeated exclamation points, pedanticism, and all caps shouting duly noted. Now, suppose there are in fact memory leaks that only clear up at transaction end. Suppose the memory leak increases with the size of the transaction (which nobody asserted but I assume may be true). Suppose one wishes to avoid a per-transaction hit on the master during replication. Suppose one needs to process a replication queue in batches due to an overly large replication queue. Then, I claim, one may quite reasonably wish to retrieve replication tuples from the master in batches containing rows constituting more than one transaction. All complete transactions, yes, but more than one. > If you have a deferred referential integrity constraint defined (one of > the reasons why half of a transaction cannot work at all), where does > the backend remember the ctid's and other information for the triggers > to call at commit time? 
Again, I have no interest in partial transactions, only groups of multiple transactions.

===

For those of you who do see the validity in batching multiple transactions, my question is restated here:

My question:  Is there an ordering algorithm that would make a consistent but limited batchsize replication possible?  I propose one below.

Suppose you have a replication queue on the master, with dbmirror-like trigger-based insertions, that looks something like this:

    create table replication_queue (
        xid integer,
        seqid serial primary key,
        data...
    );

Here, 'xid' is the transaction ID, 'seqid' is the queue insertion order, and 'data' has all the info needed to replicate the update.  Every row update/delete/insert results in a row in this queue, and a transaction may consist of one to many rows in the queue.

Suppose further you wish to limit the number of updates replicated in a particular cycle.  For example, suppose you have a million changes queued for replication on the master, but for obvious reasons you don't want to select a million rows to replicate all at once.  Suppose you also don't want to grab them one tuple or one transaction at a time, preferring to avoid hammering the master.  Rather, you want to grab them in batches of no more than N transactions, replicate them all to a slave and commit on the slave, take a breather, repeating until the slave is caught up.  And during each breather, you want to have committed only complete transactions so that any slave clients see consistent data.

The algorithm I'm considering right now is the following:

    select xid, max(seqid) as "max_seqid"
    into temp replication_order
    from replication_queue
    group by xid
    order by max(seqid)
    limit N;

Then, to get the actual queue replication order,

    select q.xid, q.seqid, q.data
    from replication_queue q, replication_order o
    where q.xid = o.xid
    order by o.max_seqid, q.seqid;

[This is a batched variation of dbmirror's original algorithm.]
So, replication is done by transaction groupings, in ascending order according to the maximum seqid in each transaction. I'm hoping someone can poke holes in this algorithm if they exist. Ed
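[One consequence of this ordering, if it holds, is that the slave-side replayer never needs to see partial transactions: the batch arrives ordered by each xid's max(seqid) and then by seqid, so every transaction's rows are contiguous and the replayer can commit on any group boundary.  A small sketch in plain Python; the row tuples are made-up illustrations, and the actual row application and commit handling would be whatever the real replayer does:]

```python
from itertools import groupby

def split_into_transactions(batch_rows):
    """Split an ordered batch of (xid, seqid, data) rows into whole-transaction
    groups.  Because the batch is ordered by each xid's max(seqid) and then by
    seqid, every xid's rows are contiguous, so a simple groupby suffices."""
    return [list(group) for _, group in groupby(batch_rows, key=lambda row: row[0])]

# A batch as produced by the two-step SELECT above (hypothetical values).
batch = [(7, 1, "a"), (7, 3, "c"), (8, 2, "b"), (8, 4, "d")]
groups = split_into_transactions(batch)
# Two complete transactions: all of xact 7's rows, then all of xact 8's.
# The slave can commit after every group, or after every M groups, without
# ever exposing a partially applied transaction.
```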
"Ed L." wrote: > > On Friday April 11 2003 1:07, Jan Wieck wrote: > > > "the original transaction" - singular!!! Not a couple, few, maybe some, > > part, fraction or anything in between, above or below. Exactly ONE. > > Repeated exclamation points, pedanticism, and all caps shouting duly noted. That's not nice, it's so seldom that I have a chance to do that and you won't even let me. ;-( I apologize. Wasn't meant that way. > Again, I have no interest in partial transactions, only groups of multiple > transactions. Okay, you want to lower the impact on the master by slurping the effects of multiple transactions (across all tables they hit) in one transaction. Now you can do two basic things. A) You apply those changes in the order you read them out of the master on the slave. This requires that you do it all in one big transaction on the slave side and that you can set all foreign key constraints to deferred, getting you into the deferred trigger queue overflow risk mentioned. B) You read all the changes across all tables, but regroup them into their correct order and original transaction boundaries for playback on the slaves. Now you need some staging area where you collect and sort it all running the risk to run out of disk space before you even begin the first transaction on the slave side. B2) You read all the changes across all tables simultaneously via cursors. Worst case you need as many cursors as you have tables and it's a bit complicated to keep track of all the group changes all over the place, but you can rebuild the transaction groups on the fly. Certainly doable, but I wouldn't want to maintain that software. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
If the items going to the slave are ordered and complete transactions, does it matter whether or not they are applied as complete transactions on the slave in groups or one at a time?  It is clear that incomplete or partial transactions should not be queued.

What I'm not sure about is the overlapping transactions where the second completes before the first.

    begin (xact 1)
    select ... for update ...
        begin (xact 2)
        update ...
        end (xact 2)
    end (xact 1)

In this case, on the slave, xact 2 will be committed before xact 1 is begun.  This seems OK to me, but I'd like some confirmation to understand this.  Or is this the wrong question?

Thanks,
elein@varlena.com

On Thursday 10 April 2003 19:03, Ed L. wrote:
> On Thursday April 10 2003 5:26, Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > > I don't think so.  Can you imagine a replication queue big enough
> > > that someone might not want to process it entirely in one transaction?
> >
> > No, I can't.  The bigger the queue is, the further behind you are, and
> > the more you need to catch up; twiddling your thumbs for awhile gets
> > progressively less attractive.
>
> Well, if an arbitrarily large queue of transactions doesn't persuade you,
> you are not going to be persuaded.  If (a) the catch-up rate is fast
> enough, (b) the master load periodically ebbs enough, as it does at various
> predictably slow times, (c) you care about lessening traffic between master
> and slave, and/or (d) you care about lessening constancy of the replication
> load on the master, having the replicator periodically back off is very
> attractive.
>
> > Also, AFAIR from prior discussion, the *slave* side doesn't need to
> > commit the whole batch in one transaction.  I can't recall if this
> > could expose transaction intermediate states on the slave, but if
> > you're that far behind you'd best not be having any live clients
> > querying the slave anyway.
> > Exposing intermediate transaction states is precisely the issue and the > reason for my original question. Your apparent presumption of the lack of > value of querying a slave that's running significantly behind is a false > blanket assumption. Of course it depends on the situation and the nature > of the data. I can think of a number of past instances where some > considerable lagtime in the data propagation was just fine, but > inconsistency was not. If you aren't replicating to the slave and > committing in one big all-inclusive batch, then there needs to be some care > to commit in transaction units if you don't want to offer room for > inconsistent views to slave clients. We left the conversation a while back > thinking there was no need for anything other than a serial replay by > insertion order, but a batching requirement seems to change that, which is > why I posed the original question. > > > In any case you can throttle the load by sleeping between selects while > > holding the transaction open. I think your concern is predicated on > > Oracle-ish assumptions about the cost of holding open a transaction. > > The only such cost in PG is that it interferes with VACUUM cleaning > > out old tuples --- which I'm not sure VACUUM could do anyway for stuff > > that still hasn't propagated to a slave. > > I'd have thought there might be other problems resulting from holding a > transaction open for hours, like the progress you lose if the master or > slave or the connection between goes down before the mondo commit. Maybe > that kind of thing just never happens. 
>
> Ed

-- 
elein@varlena.com      Database Consulting      www.varlena.com
I have always depended on the [QA] of strangers.
On Friday April 11 2003 4:34, Jan Wieck wrote: > > > > Repeated exclamation points, pedanticism, and all caps shouting duly > > noted. > > I apologize. Wasn't meant that way. Apology accepted, thanks. > > Again, I have no interest in partial transactions, only groups of > > multiple transactions. > > Okay, you want to lower the impact on the master by slurping the effects > of multiple transactions (across all tables they hit) in one > transaction. Your phrase "across all the tables they hit" makes me wonder if you're talking about dbmirror-style replication. Replication, in this modified dbmirror context, is essentially reading one table on the master, and replaying it on the slave, not counting a couple of bookkeeping tables. During normal master activity, when table T1 has a row updated, a trigger inserts a new row into the queue table Q with enough information for a replicator to update the same row in table T1 on the slave. When table T2 gets an insert or delete, the same queue table Q again gets a new row via the trigger. Then, during master-to-slave replication, the replicator comes along and reads rows from Q, not all tables. The replicator does not read any user tables besides Q and a bookkeeping table or two. > Now you can do two basic things. > > A) You apply those changes in the order you read them out of the master > on the slave. This requires that you do it all in one big transaction on > the slave side and that you can set all foreign key constraints to > deferred, getting you into the deferred trigger queue overflow risk > mentioned. Well, this leads back to my original question. If it is possible to order them in transaction groupings, then intermediate commits are possible and one seems less likely to slam into the deferred trigger queue issue. The ordering under examination would allow you to replay them in transaction order, committing every M transactions with consistency intact, and even resetting the connection if you wish. 
But is it a correct ordering?  I think so, but I dunno.

> B) You read all the changes across all tables, but regroup them into
> their correct order and original transaction boundaries for playback on
> the slaves.  Now you need some staging area where you collect and sort it
> all, running the risk of running out of disk space before you even begin
> the first transaction on the slave side.

If the ordering I seek exists via the algorithm I've described, I don't think you need to stage the entire queue at once.  The overhead I'd need is, I think, minimal: about 56 bytes per transaction (int xid, bigint max_seqid, int slaveid, plus 40 Pg overhead bytes), not counting my own audit/debug columns.  If I add my 40 bytes of debug/audit column bloat, that's 96 bytes/row in ordering.  Since we only need N transactions, the staging area only needs to have N rows of 96 bytes or less.  Even in the bloated-row case, a batchsize of 100,000 transactions would only require 10MB, and I'd expect to use a much smaller batchsize.  And there's no need to keep the batch ordering around once it has been replicated.

ISTM, the key is whether or not one can group and replay according to transaction groups, with the groups ordered by max seqid within each transaction.

Ed
On Friday April 11 2003 4:44, elein wrote: > If the items going to the slave are ordered and complete transactions, > does it matter whether or not they are applied as complete transactions > on the slave in groups or one at a time? I don't think it really matters as long as the order is correct and the commits occur on the transactional boundaries. > What I'm not sure about is the overlapping transactions where the > second completes before the first. > > begin (xact 1) > select ... for update ... > begin (xact 2) > update ... > end (xact 2) > > end (xact 1) > > In this case, on the slave, xact2 will be committed before xact 1 is > begun. This seems OK to me, but I'd like some confirmation to understand > this. Or is this the wrong question? IMO, this is exactly what makes the analysis a little tricky for people like me who don't regularly live in the world of transaction minutiae. If xact2 tries to update a row already updated by xact1, xact2 will block until xact1 commits or rolls back, and cannot end before xact1 due to MVCC. If xact1 tries to update a row already updated by xact2, xact1 will block until xact2 is resolved. My question is, given all the ways in which these orderings can occur, which I think simplifies to about 3 in the case of 2 transactions updating the same row, is there a fool-proof ordering for replaying in batches? Ed
On Friday April 11 2003 5:29, Ed L. wrote: > queue issue. The ordering under examination would allow you to replay > them in transaction order, committing every M transactions with A more accurate phrase: "The ordering under examination would allow you to replay them in transaction *groupings*, ordered by insertion order within each transaction, committing every M transactions..." Ed
On Fri, 11 Apr 2003, Ed L. wrote: > For those of you who do see the validity in batching multiple transactions, > my question is restated here: > > My question: Is there an ordering algorithm that would make a consistent > but limited batchsize replication possible? I propose one below. > > Suppose you have a replication queue on the master, with dbmirror-like > trigger-based insertions, that looks something like this: > > create table replication_queue ( > xid integer, > seqid serial primary key, > data... > ); > > Here, 'xid' is the transaction ID, 'seqid' is the queue insertion order, and > 'data' has all the info needed to replicate the update. Every row > update/delete/insert results in a row in this queue, and a transaction may > consist of one to many rows in the queue. > The algorithm I'm considering right now is the following: > > select xid, max(seqid) as "max_seqid" > into temp replication_order > from replication_queue > group by xid > order by max(seqid) > limit N; > > Then, to get the actual queue replication order, > > select q.xid, q.seqid, q.data > from replication_queue q, replication_order o > where q.xid = o.xid > order by o.max_seqid, q.seqid; > > [This is a batched variation of dbmirror's original algorithm.] > > So, replication is done by transaction groupings, in ascending order > according to the maximum seqid in each transaction. I'm hoping someone can > poke holes in this algorithm if they exist. Does it matter if transactions that do not affect each other are committed on the slave in a different order than they were on the master? I don't think that's guaranteed by the above (unless the inserts into replication_queue were deferred to transaction end), but I'm also uncertain if it's a constraint you're concerned with.
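To make the proposed ordering concrete, here is a small in-memory sketch of the two queries above (Python standing in for the SQL; the table and column names follow the message, everything else is illustrative):

```python
def batch_replication_order(queue, n):
    """queue: (xid, seqid, data) rows; returns the next batch in replay order.

    Mirrors the two queries above: group by xid, order the groups by each
    transaction's max(seqid), keep n groups, then emit rows ordered by
    (group's max_seqid, seqid).
    """
    # SELECT xid, max(seqid) ... GROUP BY xid ORDER BY max(seqid) LIMIT n
    max_seqid = {}
    for xid, seqid, _ in queue:
        max_seqid[xid] = max(max_seqid.get(xid, 0), seqid)
    chosen = sorted(max_seqid, key=max_seqid.get)[:n]
    order = {xid: max_seqid[xid] for xid in chosen}
    # ... JOIN replication_order ... ORDER BY o.max_seqid, q.seqid
    rows = [r for r in queue if r[0] in order]
    rows.sort(key=lambda r: (order[r[0]], r[1]))
    return rows

# xid 10 spans seqids 1 and 3; xid 11 is seqid 2 and finished writing first.
queue = [(10, 1, "a"), (11, 2, "b"), (10, 3, "c")]
print(batch_replication_order(queue, 2))
# [(11, 2, 'b'), (10, 1, 'a'), (10, 3, 'c')] -- xid 11 replays whole, then xid 10
```

Note that a LIMIT of 1 would take only xid 11, leaving xid 10 intact in the queue for the next batch: transaction groupings are never split across batches.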
On Friday April 11 2003 6:08, Stephan Szabo wrote: > On Fri, 11 Apr 2003, Ed L. wrote: > > For those of you who do see the validity in batching multiple > > transactions, my question is restated here: > > > > My question: Is there an ordering algorithm that would make a > > consistent but limited batchsize replication possible? I propose one > > below. > > > > Suppose you have a replication queue on the master, with dbmirror-like > > trigger-based insertions, that looks something like this: > > > > create table replication_queue ( > > xid integer, > > seqid serial primary key, > > data... > > ); > > > > Here, 'xid' is the transaction ID, 'seqid' is the queue insertion > > order, and 'data' has all the info needed to replicate the update. > > Every row update/delete/insert results in a row in this queue, and a > > transaction may consist of one to many rows in the queue. > > > > The algorithm I'm considering right now is the following: > > > > select xid, max(seqid) as "max_seqid" > > into temp replication_order > > from replication_queue > > group by xid > > order by max(seqid) > > limit N; > > > > Then, to get the actual queue replication order, > > > > select q.xid, q.seqid, q.data > > from replication_queue q, replication_order o > > where q.xid = o.xid > > order by o.max_seqid, q.seqid; > > > > [This is a batched variation of dbmirror's original algorithm.] > > > > So, replication is done by transaction groupings, in ascending order > > according to the maximum seqid in each transaction. I'm hoping someone > > can poke holes in this algorithm if they exist. > > Does it matter if transactions that do not affect each other are > committed on the slave in a different order than they were on the master? > I don't think that's guaranteed by the above (unless the inserts into > replication_queue were deferred to transaction end), but I'm also > uncertain if it's a constraint you're concerned with. I appreciate your pointing that out.
It is pretty undesirable for data to appear on the slave in an order different from the one in which it appears on the master. I guess that's another downside to batching. I'm not sure this approach can do any better than approximating the order since there is no knowledge of the commit order. Ed
"Ed L." wrote: > If the ordering I seek exists via the algorithm I've described, I don't The right ordering is commit order, because that is the order that, played back serialized (without possible lock conflicts), guarantees the same tuples are hit. Unfortunately I know of no sure way to tell the commit order. The max(seqid) per transaction in ascending order would be a good approximation at least. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
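Jan's "good approximation" caveat can be seen in a tiny timeline (illustrative Python; a divergence like this can only arise when the two transactions touch disjoint rows, since conflicting updates block each other as discussed earlier in the thread):

```python
# T1 writes seqids 1 and 3 and commits immediately; T2 writes seqid 2 and
# then idles before committing.  Commit order is T1, T2 -- but ordering by
# max(seqid) replays T2 (max 2) ahead of T1 (max 3).
events = [
    ("T1", "write", 1),
    ("T2", "write", 2),
    ("T1", "write", 3),
    ("T1", "commit", None),   # T1 commits first ...
    ("T2", "commit", None),   # ... T2 commits later
]

commit_order = [t for t, kind, _ in events if kind == "commit"]
max_seq = {t: s for t, kind, s in events if kind == "write"}  # last write wins
replay_order = sorted(max_seq, key=max_seq.get)

print(commit_order)   # ['T1', 'T2'] -- what actually happened on the master
print(replay_order)   # ['T2', 'T1'] -- what max(seqid) ordering replays
```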
On Fri, Apr 11, 2003 at 07:32:15PM -0600, Ed L. wrote: > I appreciate your pointing that out. It is pretty undesirable for data to > appear on the slave in an order different from the one in which it appears > on the master. I guess that's another downside to batching. I'm not sure > this approach can do any better than approximating the order since there is > no knowledge of the commit order. I know this will probably require more work than you'd like, but it seems like it might be very useful/important for the replication queue to have definitive information about when commits occur. BTW, I don't know how this would apply to pgsql, but both DB2 and Oracle handle replication via the transaction logs. AFAICT they don't keep separate replication tables or anything; they just ship whole transaction logs off to the slaves (a bit of a simplification, but my point is that all the data the slaves get is in the form of the transaction logs that are normally kept by the master anyway). -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Fri, Apr 11, 2003 at 02:18:41PM -0600, Ed L. wrote: > Suppose further you wish to limit the number of updates replicated in a > particular cycle. For example, suppose you have a million changes queued > for replication on the master, but for obvious reasons you don't want to > select a million rows to replicate all at once. Suppose you also don't > want to grab them one tuple or one transaction at a time, preferring to > avoid hammering the master. Rather, you want to grab them in batches of no > more than N transactions, replicate them all to a slave and commit on the > slave, take a breather, repeating until the slave is caught up. And during > each breather, you want to have committed only complete transactions so > that any slave clients see consistent data. I know I'm a bit ignorant to how dbmirror works, but why do the select from the master and the actual commits on the slave(s) have to occupy a single transaction? I would expect that there was some kind of book-keeping tables on both the master and the slaves to keep track of how far along mirroring/replication was; the master would need to know what it could remove out of the replication table (table Q as it's called elsewhere in this thread), and the slave needs to know where it left off. If this is the case, isn't it acceptable to select a chunk of data out of the queue table on the master, and run through it on the slave, with the slave committing as it pleases? Of course every time the slave commits it will have to update the book-keeping tables, but that seems to be a given... or is this exactly what you're worried about hammering the master with? -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Saturday April 12 2003 9:54, Jim C. Nasby wrote: > On Fri, Apr 11, 2003 at 07:32:15PM -0600, Ed L. wrote: > > I appreciate your pointing that out. It is pretty undesirable for data > > to appear on the slave in an order different from the one in which it > > appears on the master. I guess that's another downside to batching. > > I'm not sure this approach can do any better than approximating the > > order since there is no knowledge of the commit order. > > I know this will probably require more work than you'd like, but it > seems like it might be very useful/important for the replication queue > to have definitive information about when commits occur. Yes, just hard/impossible to get commit data. There is work in progress on a log-based replication. > BTW, I don't know how this would apply to pgsql, but both DB2 and Oracle > handle replication via the transaction logs. AFAICT they don't keep > separate replication tables or anything; they just ship whole > transaction logs off to the slaves (a bit of a simplification, but my > point is that all the data the slaves get is in the form of the > transaction logs that are normally kept by the master anyway). Yes, sounds like such an approach might be more robust in some ways. It's just not ready, and it may be a release or two before it shows up, last I heard. Dbmirror, in my view, is reasonably close, free and open, and ready to go until something better comes along. Ed
On Saturday April 12 2003 9:55, Jim C. Nasby wrote: > On Fri, Apr 11, 2003 at 02:18:41PM -0600, Ed L. wrote: > > Suppose further you wish to limit the number of updates replicated in a > > particular cycle. For example, suppose you have a million changes > > queued for replication on the master, but for obvious reasons you don't > > want to select a million rows to replicate all at once. Suppose you > > also don't want to grab them one tuple or one transaction at a time, > > preferring to avoid hammering the master. Rather, you want to grab > > them in batches of no more than N transactions, replicate them all to a > > slave and commit on the slave, take a breather, repeating until the > > slave is caught up. And during each breather, you want to have > > committed only complete transactions so that any slave clients see > > consistent data. > > I know I'm a bit ignorant to how dbmirror works, but why do the select > from the master and the actual commits on the slave(s) have to occupy a > single transaction? I would expect that there was some kind of > book-keeping tables on both the master and the slaves to keep track of > how far along mirroring/replication was; the master would need to know > what it could remove out of the replication table (table Q as it's > called elsewhere in this thread), and the slave needs to know where it > left off. If this is the case, isn't it acceptable to select a chunk of > data out of the queue table on the master, and run through it on the > slave, with the slave committing as it pleases? Of course every time the > slave commits it will have to update the book-keeping tables, but that > seems to be a given... or is this exactly what you're worried about > hammering the master with? You are correct, IMO. Master transaction and slave transaction do not and cannot occupy the same transaction (since they're on different dbs). They may be independent, AFAIC, as long as the ordering is "close enough". 
Slave may commit every 5 transactions while replaying a batch of 1000 transactions. If there is slave and master bookkeeping, that seems to work ok. My interest in batching is that if you have many, many transactions in the queue, it's a bit of hammering and traffic to go back and update the bookkeeping tables and get the next transaction if you do it on every transaction. So mostly, it's just seeing if I can minimize the load on the master and traffic on the net. Ed
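A minimal sketch of that slave-side replay loop (hypothetical helper names, Python standing in for the replicator code), committing only on transaction boundaries, once every k completed transactions:

```python
def replay_batch(rows, k, apply_row, slave_commit, record_progress):
    """rows: (xid, seqid, data) tuples already in replay order.

    apply_row, slave_commit and record_progress are stand-ins for the real
    replicator actions: execute the change on the slave, COMMIT, and update
    the slave-side bookkeeping table.
    """
    last_xid, completed = None, 0
    for xid, seqid, data in rows:
        if last_xid is not None and xid != last_xid:
            completed += 1                  # crossed a transaction boundary
            if completed == k:              # commit every k whole transactions
                record_progress(last_xid)   # bookkeeping: last replayed xid
                slave_commit()
                completed = 0
        apply_row(data)
        last_xid = xid
    if last_xid is not None:                # flush the final partial group
        record_progress(last_xid)
        slave_commit()
```

Because commits land only on xid boundaries, slave clients always see whole transactions, while the bookkeeping tables are touched once per k transactions instead of once per transaction.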
Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)
From: Martijn van Oosterhout
On Fri, Apr 11, 2003 at 05:33:00PM -0600, Ed L. wrote: > On Friday April 11 2003 4:44, elein wrote: > > In this case, on the slave, xact2 will be committed before xact 1 is > > begun. This seems OK to me, but I'd like some confirmation to understand > > this. Or is this the wrong question? > > IMO, this is exactly what makes the analysis a little tricky for people like > me who don't regularly live in the world of transaction minutiae. > > If xact2 tries to update a row already updated by xact1, xact2 will block > until xact1 commits or rolls back, and cannot end before xact1 due to MVCC. > If xact1 tries to update a row already updated by xact2, xact1 will block > until xact2 is resolved. > > My question is, given all the ways in which these orderings can occur, which > I think simplifies to about 3 in the case of 2 transactions updating the > same row, is there a fool-proof ordering for replaying in batches? Well, it seems to me that you're trying to simulate the results of READ COMMITTED mode in SERIALISABLE mode, which would seem to me to be impossible (in the general case). You'd have to get some kind of save points where there are no outstanding writable commits and use that to divide. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > "the West won the world not by the superiority of its ideas or values or > religion but rather by its superiority in applying organized violence. > Westerners often forget this fact, non-Westerners never do." > - Samuel P. Huntington
On Fri, 11 Apr 2003, Ed L. wrote: > > Again, if there are a large number of transactions queued on the master, > possibly constituted by an even much larger number of actual row changes, a > batch size of 1 transaction means, in this context, the master will be hit > once for every single transaction to retrieve the queued rows corresponding > to that transaction. A batchsize of, say 1000 transactions, on the other > hand, will save 999 queries. Granted, you may not share my concern about > the impact of that on master or network. How you query things on the master and when you commit things on the slave aren't all that related. As far as I can tell, your batch version of the query using the temporary table will give you the same ordering of the updates as the current dbmirror implementation. As you're reading the result of the second query (that gives data to make the SQL statements to send to the slave) you can decide when to do a commit on the slave. Reading the discussion, it sounds like you're talking about sending this batch of 1000 transactions to the slave in 1 transaction. Why not issue a commit on the slave every time the xid changes? From the slave's point of view the behaviour is the same as in the current dbmirror (the original transactions are preserved with an approximation of the ordering based on the last update/insert in each transaction). I still don't see why you would want to only do 1 commit per group on the slave though. > > The algorithm I'm considering right now is the following: > > select xid, max(seqid) as "max_seqid" > into temp replication_order > from replication_queue > group by xid > order by max(seqid) > limit N; > > Then, to get the actual queue replication order, > > select q.xid, q.seqid, q.data > from replication_queue q, replication_order o > where q.xid = o.xid > order by o.max_seqid, q.seqid; > > [This is a batched variation of dbmirror's original algorithm.]
> > So, replication is done by transaction groupings, in ascending order > according to the maximum seqid in each transaction. I'm hoping someone can > poke holes in this algorithm if they exist. > > Ed > -- Steven Singer ssinger@navtechinc.com Dispatch Systems Phone: 519-747-1170 ext 282 Navtech Systems Support Inc. AFTN: CYYZXNSX SITA: YYZNSCR Waterloo, Ontario ARINC: YKFNSCR
> > Does it matter if transactions that do not affect each other are committed > on the slave in a different order than they were on the master? I don't > think that's guaranteed by the above (unless the inserts into > replication_queue were deferred to transaction end), but I'm also > uncertain if it's a constraint you're concerned with. Well, it all depends on what you mean by "do not affect each other". Do 2 transactions not affect each other if they touch different rows or tables, or deal with data that is unrelated from the point of view of an application querying the database? Consider the following transactions on the master: T1: Insert DeptId=1 INTO Department T2: Insert EmpId=1,DeptId=1 INTO Employee Those 2 transactions update different tables, but if T2 is sent to slaves before T1 then A) You might have foreign key violations and B) Even if you don't have foreign keys, someone querying the slave will see a database that never existed (at any point in time) on the master. But if you were to ensure that T1 and T2 are sent to the slave in the same transaction (and you deferred any constraint checks to the end of your combined transaction) you wouldn't have this problem; but if the two transactions were to update the same row then the order would be important and you would still need your code to deal with this case. -- Steven Singer ssinger@navtechinc.com Dispatch Systems Phone: 519-747-1170 ext 282 Navtech Systems Support Inc. AFTN: CYYZXNSX SITA: YYZNSCR Waterloo, Ontario ARINC: YKFNSCR
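Steven's Department/Employee point can be checked mechanically (illustrative Python; the table names come from his example, the replay harness is hypothetical):

```python
def replay(transactions):
    """Replay transaction groups in the given order; count the foreign-key
    violations an Employee row would cause if its Department isn't there yet."""
    departments, violations = set(), 0
    for txn in transactions:
        for table, dept_id in txn:
            if table == "Department":
                departments.add(dept_id)
            elif dept_id not in departments:   # Employee referencing DeptId
                violations += 1
    return violations

T1 = [("Department", 1)]   # Insert DeptId=1 INTO Department
T2 = [("Employee", 1)]     # Insert EmpId=1,DeptId=1 INTO Employee

print(replay([T1, T2]))  # 0 -- the master's order is consistent
print(replay([T2, T1]))  # 1 -- reversed order references a missing department
```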
On Fri, Apr 11, 2003 at 06:34:53PM -0400, Jan Wieck wrote: > > A) You apply those changes in the order you read them out of the master > on the slave. This requires that you do it all in one big transaction on > B) You read all the changes across all tables, but regroup them into > their correct order and original transaction boundaries for playback on > B2) You read all the changes across all tables simultaneously via > cursors. Worst case you need as many cursors as you have tables and it's What I am confused about is why one needs to apply now-superseded transactions on the slave at all. Don't you just want a (serializable, mind) snapshot of the master to be applied to the slave? A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Monday April 14 2003 12:20, Andrew Sullivan wrote: > On Fri, Apr 11, 2003 at 06:34:53PM -0400, Jan Wieck wrote: > > A) You apply those changes in the order you read them out of the master > > on the slave. This requires that you do it all in one big transaction > > on > > > > B) You read all the changes across all tables, but regroup them into > > their correct order and original transaction boundaries for playback on > > > > B2) You read all the changes across all tables simultaneously via > > cursors. Worst case you need as many cursors as you have tables and > > it's > > What I am confused about is why one needs to apply now-superseded > transactions on the slave at all. Don't you just want a > (serializable, mind) snapshot of the master to be applied to the > slave? I'd say, yes, if the process of creating such a snapshot is not overly intensive or lengthy. IMO, this is one potentially significant drawback of the dbmirror approach in general. The upside to dbmirror is that it's pretty straightforward, works pretty well for certain situations, it's open source, and it's free. I know rserv/erServer are reported to use the snapshot approach. But rserv didn't work at all for me without mods and looks very much like an abandoned prototype for eRServer. ERServer, its successor, is done in part by Vadim M., who, if I'm not mistaken, did an excellent job with MVCC. But, at least as of Feb 28, 2003, eRServer was $10,000 minimum, closed source, doesn't replicate DDL either, doesn't release trial versions, and has no plans to support Redhat 8.0. For us, that was more than enough incentive to investigate the alternatives. I think the snapshot approach would be a great improvement for dbmirror, along with DDL replication. Opinions on a better *currently available* option? Ed
I saw somewhere on this thread that there is no way to tell when something is committed. That really doesn't make sense, since postgres is an MVCC system. Doesn't the versioning automatically supply commit times? "Ed L." wrote: > > On Monday April 14 2003 12:20, Andrew Sullivan wrote: > > On Fri, Apr 11, 2003 at 06:34:53PM -0400, Jan Wieck wrote: > > > A) You apply those changes in the order you read them out of the master > > > on the slave. This requires that you do it all in one big transaction > > > on > > > > > > B) You read all the changes across all tables, but regroup them into > > > their correct order and original transaction boundaries for playback on > > > > > > B2) You read all the changes across all tables simultaneously via > > > cursors. Worst case you need as many cursors as you have tables and > > > it's > > > > What I am confused about is why one needs to apply now-superseded > > transactions on the slave at all. Don't you just want a > > (serializable, mind) snapshot of the master to be applied to the > > slave? > > I'd say, yes, if the process of creating such a snapshot is not overly > intensive or lengthy. IMO, this is one potentially significant drawback of > the dbmirror approach in general. The upside to dbmirror is that its > pretty straight-forward, works pretty well for certain situations, it's > open source, and it's free. > > I know rserv/erServer are reported to use the snapshot approach. But rserv > didn't work at all for me without mods and looks very much like an > abandoned prototype for eRServer. ERServer, its successor, is done in part > by Vadim M. who if I'm not mistaken did an excellent job with MVCC. But, > at least as of Feb 28, 2003, eRServer was $10,000 minimum, closed source, > doesn't replicate DDL either, doesn't release trial versions, and has no > plans to support Redhat 8.0. For us, that was more than enough incentive > to investigate the alternatives.
> > I think that'd be a great improvement for dbmirror, along with DDL > replication. > > Opinions on a better *currently available* option? > > Ed
On Mon, Apr 14, 2003 at 01:01:37PM -0600, Ed L. wrote: > Opinions on a better *currently available* option? Sorry, no. (And I'm afraid I can't help suggest what you could do to help the dbmirror code towards the snapshot approach, either, since I participated in the design of the current eRServer code. :-( It was my understanding at the time that it would eventually become free software, but I am not sure what the timetable is, and I don't own the code.) A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Monday April 14 2003 1:16, Dennis Gearon wrote: > I saw somewhere on this thread that there is no way to tell when > something is committed. that really doesn't make sense, since postgres > is a MVCC system. Doesn't the versioning automatically supply commit > times? I believe the context of those comments was regarding what information was (not) available from within a transaction. Of course, commit ordering is available in the WAL logs. Ed