Index: doc/src/sgml/config.sgml =================================================================== RCS file: /projects/cvsroot/pgsql/doc/src/sgml/config.sgml,v retrieving revision 1.119 diff -c -r1.119 config.sgml *** doc/src/sgml/config.sgml 2 Apr 2007 15:27:02 -0000 1.119 --- doc/src/sgml/config.sgml 5 Apr 2007 21:51:18 -0000 *************** *** 1314,1372 **** Settings ! ! fsync configuration parameter - fsync (boolean) ! If this parameter is on, the PostgreSQL server ! will try to make sure that updates are physically written to ! disk, by issuing fsync() system calls or various ! equivalent methods (see ). ! This ensures that the database cluster can recover to a ! consistent state after an operating system or hardware crash. ! ! ! ! However, using fsync results in a ! performance penalty: when a transaction is committed, ! PostgreSQL must wait for the ! operating system to flush the write-ahead log to disk. When ! fsync is disabled, the operating system is ! allowed to do its best in buffering, ordering, and delaying ! writes. This can result in significantly improved performance. ! However, if the system crashes, the results of the last few ! committed transactions might be lost in part or whole. In the ! worst case, unrecoverable data corruption might occur. ! (Crashes of the database software itself are not ! a risk factor here. Only an operating-system-level crash ! creates a risk of corruption.) ! ! ! ! Due to the risks involved, there is no universally correct ! setting for fsync. Some administrators ! always disable fsync, while others only ! turn it off during initial bulk data loads, where there is a clear ! restart point if something goes wrong. Others ! always leave fsync enabled. The default is ! to enable fsync, for maximum reliability. ! If you trust your operating system, your hardware, and your ! utility company (or your battery backup), you can consider ! disabling fsync. ! This parameter can only be set in the postgresql.conf ! 
file or on the server command line. ! If you turn this parameter off, also consider turning off ! . ! wal_sync_method (string) --- 1314,1344 ---- Settings ! ! wal_buffers (integer) ! wal_buffers configuration parameter ! The amount of memory used in shared memory for WAL data. The ! default is 64 kilobytes (64kB). The setting need only ! be large enough to hold the amount of WAL data generated by one ! typical transaction, since the data is written out to disk at ! every transaction commit. This parameter can only be set at server ! start. ! Increasing this parameter might cause PostgreSQL ! to request more System V shared ! memory than your operating system's default configuration ! allows. See for information on how to ! adjust those parameters, if necessary. ! wal_sync_method (string) *************** *** 1445,1451 **** Turning this parameter off speeds normal operation, but might lead to a corrupt database after an operating system crash or power failure. The risks are similar to turning off ! fsync, though smaller. It might be safe to turn off this parameter if you have hardware (such as a battery-backed disk controller) or file-system software that reduces the risk of partial page writes to an acceptably low level (e.g., ReiserFS 4). --- 1417,1423 ---- Turning this parameter off speeds normal operation, but might lead to a corrupt database after an operating system crash or power failure. The risks are similar to turning off ! transaction_guarantee. It might be safe to turn off this parameter if you have hardware (such as a battery-backed disk controller) or file-system software that reduces the risk of partial page writes to an acceptably low level (e.g., ReiserFS 4). *************** *** 1465,1495 **** ! ! wal_buffers (integer) ! wal_buffers configuration parameter ! The amount of memory used in shared memory for WAL data. The ! default is 64 kilobytes (64kB). The setting need only ! be large enough to hold the amount of WAL data generated by one ! 
typical transaction, since the data is written out to disk at ! every transaction commit. This parameter can only be set at server ! start. ! Increasing this parameter might cause PostgreSQL ! to request more System V shared ! memory than your operating system's default configuration ! allows. See for information on how to ! adjust those parameters, if necessary. ! commit_delay (integer) --- 1437,1546 ---- ! ! wal_writer_delay configuration parameter + fsync (integer) ! If this parameter is greater than zero, the PostgreSQL ! server will start a separate server process called the ! WAL writer, whose sole function is to issue writes ! of dirty WAL buffers. This enables functionality ! that is new in PostgreSQL 8.3, that replaces ! and supersedes the previous fsync parameter. ! (see ). ! ! ! ! The WAL Writer will flush the write-ahead log to disk every ! wal_writer_delay milliseconds. Typical ! settings would be in the range 50ms - 250ms, though the ! allowed range is from 0 - 1000ms. The default value is 0, ! meaning this process is disabled by default. Note that on many ! systems, the effective resolution of sleep delays is 10 ! milliseconds; setting wal_writer_delay to a value ! that is not a multiple of 10 might have the same results as ! setting it to the next higher multiple of 10. This parameter ! can only be set in the postgresql.conf file or on ! the server command line. ! ! ! ! ! ! ! transaction_guarantee configuration parameter ! ! transaction_guarantee (boolean) ! ! ! By default, PostgreSQL ! will try to make sure that updates are physically written to ! disk, by issuing fsync() system calls or various ! equivalent methods (see ). ! This ensures that the database cluster can recover to a ! consistent state after an operating system or hardware crash. ! However, using transaction_guarantee results ! in a performance penalty: when a transaction is committed, ! PostgreSQL must wait for the ! operating system to flush the write-ahead log to disk. When ! 
transaction_guarantee is disabled, the user's ! process can start the next transaction. This can result in ! significantly improved performance for single or multiple sessions ! executing reasonably short write transactions. ! However, if the system crashes, the results of the last few ! committed transactions will very likely be lost in part or whole. + + + The data loss from using this parameter is the number of + unguaranteed transactions that were committed within the last + wal_writer_delay milliseconds. The data loss + is both certain and predictable. Unguaranteed transactions + that have not been written to WAL files will definitely be lost; + there is no maybe. Only those affected transactions will be lost + and the rest of the system will be in a safe, consistent state. + This parameter can only be disabled when wal_writer_delay + is set to a value higher than zero. + + + + It is safe to use a mix of transactions with + transaction_guarantee on and off. Only the + transaction_guarantee = off transactions will be + at risk. In no circumstances will the + transaction_guarantee = on transactions be at risk. + Any changes made by an unguaranteed transaction may be readable + later by guaranteed transactions, but the guaranteed commit will + also always flush the commit of the unguaranteed transaction - + so guaranteed transactions live up to their name. + The parameter affects transaction commits only, not aborts. + It also has no effect on most utility commands such as VACUUM FULL + and other commands that will not run inside a transaction block. + Any transaction that causes files to be deleted will always + be a guaranteed transaction. + + + + This parameter can be set in postgresql.conf, though it + is better specified only for those users or sessions to which + the potential data loss is acceptable. 
General disabling of this + parameter is not recommended unless you've explained to the system + owner the full implications of their decision to use this feature. + + + + There is no legal meaning to the phrase guaranteed + and the terms of the PostgreSQL licence remain unchanged. + ! commit_delay (integer) Index: doc/src/sgml/wal.sgml =================================================================== RCS file: /projects/cvsroot/pgsql/doc/src/sgml/wal.sgml,v retrieving revision 1.43 diff -c -r1.43 wal.sgml *** doc/src/sgml/wal.sgml 31 Jan 2007 20:56:19 -0000 1.43 --- doc/src/sgml/wal.sgml 5 Apr 2007 21:51:19 -0000 *************** *** 267,273 **** performing a LogFlush. This delay allows other server processes to add their commit records to the log so as to have all of them flushed with a single log sync. No sleep will occur if ! is not enabled, nor if fewer than other sessions are currently in active transactions; this avoids sleeping when it's unlikely that any other session will commit soon. --- 267,273 ---- performing a LogFlush. This delay allows other server processes to add their commit records to the log so as to have all of them flushed with a single log sync. No sleep will occur if ! is not enabled, nor if fewer than other sessions are currently in active transactions; this avoids sleeping when it's unlikely that any other session will commit soon. Index: doc/src/sgml/ref/postgres-ref.sgml =================================================================== RCS file: /projects/cvsroot/pgsql/doc/src/sgml/ref/postgres-ref.sgml,v retrieving revision 1.50 diff -c -r1.50 postgres-ref.sgml *** doc/src/sgml/ref/postgres-ref.sgml 16 Feb 2007 02:10:07 -0000 1.50 --- doc/src/sgml/ref/postgres-ref.sgml 5 Apr 2007 21:51:21 -0000 *************** *** 183,189 **** Disables fsync calls for improved performance, at the risk of data corruption in the event of a system crash. Specifying this option is equivalent to ! disabling the configuration parameter. 
Read the detailed documentation before using this! --- 183,189 ---- Disables fsync calls for improved performance, at the risk of data corruption in the event of a system crash. Specifying this option is equivalent to ! disabling the configuration parameter. Read the detailed documentation before using this! Index: src/backend/access/transam/clog.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/clog.c,v retrieving revision 1.42 diff -c -r1.42 clog.c *** src/backend/access/transam/clog.c 5 Jan 2007 22:19:23 -0000 1.42 --- src/backend/access/transam/clog.c 5 Apr 2007 21:51:21 -0000 *************** *** 79,85 **** * for most uses; TransactionLogUpdate() in transam.c is the intended caller. */ void ! TransactionIdSetStatus(TransactionId xid, XidStatus status) { int pageno = TransactionIdToPage(xid); int byteno = TransactionIdToByte(xid); --- 79,85 ---- * for most uses; TransactionLogUpdate() in transam.c is the intended caller. */ void ! TransactionIdSetStatus(TransactionId xid, XidStatus status, XLogRecPtr lsn) { int pageno = TransactionIdToPage(xid); int byteno = TransactionIdToByte(xid); *************** *** 94,99 **** --- 94,112 ---- LWLockAcquire(CLogControlLock, LW_EXCLUSIVE); + /* + * SimpleLruReadPage() calls SlruSelectLRUPage() which + * never returns until I/O has finished on a page. All I/O + * starts by holding Control lock, so this next call never + * returns until we have completed all I/O on the block. + * This assumption is important because unguaranteed + * transaction commits must *never* reach disk until + * XLogFlush() confirms flush. Allowing a page write + * concurrently with writing to the page might allow the + * committed status to reach disk ahead of a flush, so + * for unguaranteed transactions it is important that we + * never allow this to occur. Got that? 
+ */ slotno = SimpleLruReadPage(ClogCtl, pageno, xid); byteptr = ClogCtl->shared->page_buffer[slotno] + byteno; *************** *** 110,115 **** --- 123,138 ---- ClogCtl->shared->page_dirty[slotno] = true; + /* + * Update the page LSN if the transaction completion LSN is higher. + * lsn will be invalid when supplied during InRecovery processing, + * so we don't need to do anything special to avoid LSN updates + * during recovery. After recovery completes the next clog change + * will set the LSN correctly. + */ + if (XLByteLT(ClogCtl->shared->page_lsn[slotno], lsn)) + ClogCtl->shared->page_lsn[slotno] = lsn; + LWLockRelease(CLogControlLock); } *************** *** 157,162 **** --- 180,187 ---- ClogCtl->PagePrecedes = CLOGPagePrecedes; SimpleLruInit(ClogCtl, "CLOG Ctl", NUM_CLOG_BUFFERS, CLogControlLock, "pg_clog"); + ClogCtl->do_wal_flush = true; + } /* Index: src/backend/access/transam/multixact.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/multixact.c,v retrieving revision 1.23 diff -c -r1.23 multixact.c *** src/backend/access/transam/multixact.c 5 Jan 2007 22:19:23 -0000 1.23 --- src/backend/access/transam/multixact.c 5 Apr 2007 21:51:23 -0000 *************** *** 418,423 **** --- 418,486 ---- } /* + * MultiXactIdIsFlushed + * Returns whether a MultiXactId is "flushed". + * + * We return false if at least one member of the given MultiXactId is not yet + * flushed. Note that a "true" result is certain not to change, + * because it is not legal to add members to an existing MultiXactId. 
+ */ + bool + MultiXactIdIsFlushed(MultiXactId multi) + { + TransactionId *members; + int nmembers; + int i; + + debug_elog3(DEBUG2, "IsFlushed %u?", multi); + + nmembers = GetMultiXactIdMembers(multi, &members); + + if (nmembers < 0) + { + debug_elog2(DEBUG2, "IsFlushed: no members"); + return true; + } + + /* + * Checking for myself is cheap compared to looking in shared memory, + * so first do the equivalent of MultiXactIdIsCurrent(). This is not + * needed for correctness, it's just a fast path. + */ + for (i = 0; i < nmembers; i++) + { + if (TransactionIdIsCurrentTransactionId(members[i])) + { + debug_elog3(DEBUG2, "IsFlushed: I (%d) am running! So not flushed", i); + pfree(members); + return false; + } + } + + /* + * This could be made faster by having another entry point in procarray.c, + * walking the flushed array only once for all the members. But in most + * cases nmembers should be small enough that it doesn't much matter. + */ + for (i = 0; i < nmembers; i++) + { + if (!TransactionIdIsFlushed(members[i])) + { + debug_elog4(DEBUG2, "IsFlushed: member %d (%u) is not flushed", + i, members[i]); + pfree(members); + return false; + } + } + + pfree(members); + + debug_elog3(DEBUG2, "IsFlushed: %u is flushed", multi); + + return true; + } + + /* * MultiXactIdIsCurrent * Returns true if the current transaction is a member of the MultiXactId. 
* Index: src/backend/access/transam/slru.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/slru.c,v retrieving revision 1.40 diff -c -r1.40 slru.c *** src/backend/access/transam/slru.c 5 Jan 2007 22:19:23 -0000 1.40 --- src/backend/access/transam/slru.c 5 Apr 2007 21:51:24 -0000 *************** *** 161,166 **** --- 161,167 ---- sz += MAXALIGN(nslots * sizeof(char *)); /* page_buffer[] */ sz += MAXALIGN(nslots * sizeof(SlruPageStatus)); /* page_status[] */ sz += MAXALIGN(nslots * sizeof(bool)); /* page_dirty[] */ + sz += MAXALIGN(nslots * sizeof(XLogRecPtr));/* page_lsn[] */ sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */ sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */ sz += MAXALIGN(nslots * sizeof(LWLockId)); /* buffer_locks[] */ *************** *** 206,211 **** --- 207,214 ---- offset += MAXALIGN(nslots * sizeof(SlruPageStatus)); shared->page_dirty = (bool *) (ptr + offset); offset += MAXALIGN(nslots * sizeof(bool)); + shared->page_lsn = (XLogRecPtr *) (ptr + offset); + offset += MAXALIGN(nslots * sizeof(XLogRecPtr)); shared->page_number = (int *) (ptr + offset); offset += MAXALIGN(nslots * sizeof(int)); shared->page_lru_count = (int *) (ptr + offset); *************** *** 219,224 **** --- 222,229 ---- shared->page_buffer[slotno] = ptr; shared->page_status[slotno] = SLRU_PAGE_EMPTY; shared->page_dirty[slotno] = false; + shared->page_lsn[slotno].xlogid = 0; + shared->page_lsn[slotno].xrecoff = 0; shared->page_lru_count[slotno] = 0; shared->buffer_locks[slotno] = LWLockAssign(); ptr += BLCKSZ; *************** *** 232,238 **** * assume caller set PagePrecedes. */ ctl->shared = shared; ! ctl->do_fsync = true; /* default behavior */ StrNCpy(ctl->Dir, subdir, sizeof(ctl->Dir)); } --- 237,244 ---- * assume caller set PagePrecedes. */ ctl->shared = shared; ! ctl->do_fsync = true; /* default behavior */ ! 
ctl->do_wal_flush = false; /* default behavior */ StrNCpy(ctl->Dir, subdir, sizeof(ctl->Dir)); } *************** *** 620,625 **** --- 626,643 ---- int offset = rpageno * BLCKSZ; char path[MAXPGPATH]; int fd = -1; + XLogRecPtr lsn = shared->page_lsn[slotno]; + + /* + * Honour the write-WAL-before-data guarantee if we care about the + * integrity of the slru page being protected across a crash. This will + * return almost immediately except in rare cases where we have + * unguaranteed transactions not yet flushed, because normal commits + * do an XLogFlush before updating clog. This is the same step as we do + * during FlushBuffer() in the main shared buffer manager. + */ + if (ctl->do_wal_flush && !XLogRecPtrIsInvalid(lsn)) + XLogFlush(lsn); /* * During a Flush, we may already have the desired file open. Index: src/backend/access/transam/transam.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/transam.c,v retrieving revision 1.69 diff -c -r1.69 transam.c *** src/backend/access/transam/transam.c 5 Jan 2007 22:19:23 -0000 1.69 --- src/backend/access/transam/transam.c 5 Apr 2007 21:51:25 -0000 *************** *** 27,33 **** static XidStatus TransactionLogFetch(TransactionId transactionId); static void TransactionLogUpdate(TransactionId transactionId, ! XidStatus status); /* ---------------- * Single-item cache for results of TransactionLogFetch. --- 27,33 ---- static XidStatus TransactionLogFetch(TransactionId transactionId); static void TransactionLogUpdate(TransactionId transactionId, ! XidStatus status, XLogRecPtr lsn); /* ---------------- * Single-item cache for results of TransactionLogFetch. *************** *** 97,108 **** */ static void TransactionLogUpdate(TransactionId transactionId, /* trans id to update */ ! XidStatus status) /* new trans status */ { /* * update the commit log */ ! 
TransactionIdSetStatus(transactionId, status); } /* --- 97,109 ---- */ static void TransactionLogUpdate(TransactionId transactionId, /* trans id to update */ ! XidStatus status, /* new trans status */ ! XLogRecPtr lsn) /* lsn of transaction completion */ { /* * update the commit log */ ! TransactionIdSetStatus(transactionId, status, lsn); } /* *************** *** 112,125 **** * Don't depend on this being atomic; it's not. */ static void ! TransactionLogMultiUpdate(int nxids, TransactionId *xids, XidStatus status) { int i; Assert(nxids != 0); for (i = 0; i < nxids; i++) ! TransactionIdSetStatus(xids[i], status); } /* ---------------------------------------------------------------- --- 113,126 ---- * Don't depend on this being atomic; it's not. */ static void ! TransactionLogMultiUpdate(int nxids, TransactionId *xids, XidStatus status, XLogRecPtr lsn) { int i; Assert(nxids != 0); for (i = 0; i < nxids; i++) ! TransactionIdSetStatus(xids[i], status, lsn); } /* ---------------------------------------------------------------- *************** *** 267,275 **** * Assumes transaction identifier is valid. */ void ! TransactionIdCommit(TransactionId transactionId) { ! TransactionLogUpdate(transactionId, TRANSACTION_STATUS_COMMITTED); } /* --- 268,276 ---- * Assumes transaction identifier is valid. */ void ! TransactionIdCommit(TransactionId transactionId, XLogRecPtr lsn) { ! TransactionLogUpdate(transactionId, TRANSACTION_STATUS_COMMITTED, lsn); } /* *************** *** 280,288 **** * Assumes transaction identifier is valid. */ void ! TransactionIdAbort(TransactionId transactionId) { ! TransactionLogUpdate(transactionId, TRANSACTION_STATUS_ABORTED); } /* --- 281,289 ---- * Assumes transaction identifier is valid. */ void ! TransactionIdAbort(TransactionId transactionId, XLogRecPtr lsn) { ! TransactionLogUpdate(transactionId, TRANSACTION_STATUS_ABORTED, lsn); } /* *************** *** 293,299 **** void TransactionIdSubCommit(TransactionId transactionId) { ! 
TransactionLogUpdate(transactionId, TRANSACTION_STATUS_SUB_COMMITTED); } /* --- 294,302 ---- void TransactionIdSubCommit(TransactionId transactionId) { ! XLogRecPtr lsn = {0,0}; /* Invalid XLogRecPtr */ ! ! TransactionLogUpdate(transactionId, TRANSACTION_STATUS_SUB_COMMITTED, lsn); } /* *************** *** 306,315 **** * TransactionIdDidCommit. */ void ! TransactionIdCommitTree(int nxids, TransactionId *xids) { if (nxids > 0) ! TransactionLogMultiUpdate(nxids, xids, TRANSACTION_STATUS_COMMITTED); } /* --- 309,318 ---- * TransactionIdDidCommit. */ void ! TransactionIdCommitTree(int nxids, TransactionId *xids, XLogRecPtr lsn) { if (nxids > 0) ! TransactionLogMultiUpdate(nxids, xids, TRANSACTION_STATUS_COMMITTED, lsn); } /* *************** *** 320,329 **** * will consider all the xacts as not-yet-committed anyway. */ void ! TransactionIdAbortTree(int nxids, TransactionId *xids) { if (nxids > 0) ! TransactionLogMultiUpdate(nxids, xids, TRANSACTION_STATUS_ABORTED); } /* --- 323,332 ---- * will consider all the xacts as not-yet-committed anyway. */ void ! TransactionIdAbortTree(int nxids, TransactionId *xids, XLogRecPtr lsn) { if (nxids > 0) ! TransactionLogMultiUpdate(nxids, xids, TRANSACTION_STATUS_ABORTED, lsn); } /* Index: src/backend/access/transam/twophase.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/twophase.c,v retrieving revision 1.29 diff -c -r1.29 twophase.c *** src/backend/access/transam/twophase.c 3 Apr 2007 16:34:35 -0000 1.29 --- src/backend/access/transam/twophase.c 5 Apr 2007 21:51:26 -0000 *************** *** 1711,1719 **** XLogFlush(recptr); /* Mark the transaction committed in pg_clog */ ! TransactionIdCommit(xid); /* to avoid race conditions, the parent must commit first */ ! TransactionIdCommitTree(nchildren, children); /* Checkpoint can proceed now */ MyProc->inCommit = false; --- 1711,1719 ---- XLogFlush(recptr); /* Mark the transaction committed in pg_clog */ ! 
TransactionIdCommit(xid, recptr); /* to avoid race conditions, the parent must commit first */ ! TransactionIdCommitTree(nchildren, children, recptr); /* Checkpoint can proceed now */ MyProc->inCommit = false; *************** *** 1790,1797 **** * Mark the transaction aborted in clog. This is not absolutely necessary * but we may as well do it while we are here. */ ! TransactionIdAbort(xid); ! TransactionIdAbortTree(nchildren, children); END_CRIT_SECTION(); } --- 1790,1797 ---- * Mark the transaction aborted in clog. This is not absolutely necessary * but we may as well do it while we are here. */ ! TransactionIdAbort(xid, recptr); ! TransactionIdAbortTree(nchildren, children, recptr); END_CRIT_SECTION(); } Index: src/backend/access/transam/xact.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xact.c,v retrieving revision 1.239 diff -c -r1.239 xact.c *** src/backend/access/transam/xact.c 3 Apr 2007 16:34:35 -0000 1.239 --- src/backend/access/transam/xact.c 5 Apr 2007 21:51:31 -0000 *************** *** 36,41 **** --- 36,42 ---- #include "pgstat.h" #include "storage/fd.h" #include "storage/lmgr.h" + #include "storage/pmsignal.h" #include "storage/procarray.h" #include "storage/smgr.h" #include "utils/combocid.h" *************** *** 58,63 **** --- 59,68 ---- int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ + bool DefaultXactCommitGuarantee = true; /* USERSET GUC: what user wants */ + static bool XactCommitGuarantee = true; /* the xact guarantee for This Xid */ + bool trace_commit = false; + bool trace_bg_flush = true; /* * transaction states - transaction state from server perspective *************** *** 203,208 **** --- 208,308 ---- static SubXactCallbackItem *SubXact_callbacks = NULL; + /* + * DeferredFsyncCache (DFC) is a shared-memory array where we keep track + * of the transactions for which deferred fsync 
has been requested. + * The array is divided into chunks, each of which fits within 1-2 + * cache lines so that both changes and lookups can be made quickly. + * A chunk has more than one dfc slot within it, with each dfc slot + * holding details about one deferred transaction. + * + * When we access a chunk we loop through all dfc slots in the chunk, + * designed so that loop will be unrolled. + * When we flush the DFC, we don't bother to remove transactions from it. + * When we insert new transactions we simply overwrite expired slots, + * so the bookkeeping never requires the lock to be held for any length + * of time. + * + * When a chunk is nearly full we signal the WALWriter to wake up and + * flush the DFC. When a chunk is full we flush the DFC while holding + * the lock. + * + * The DFC is striped so that consecutive transactions aren't in the same + * chunk, nor will transactions from the same backend always hit the + * same spot in the cache. + */ + #define DFC_XACTS_PER_CHUNK 8 + #define MAX_DFC_CHUNKS 128 + + #define MAX_DFC_XACTS 1024 + #define TransactionIdToDFCChunk(xid) ((int)((xid) % (TransactionId) MAX_DFC_CHUNKS)) + #define DFCChunkToDFCSlot(chunk) ((chunk) * DFC_XACTS_PER_CHUNK) + + /* Deferred Fsync tuning parameters: */ + #define DFC_SIGNAL_WALWRITER_THRESHOLD 6 + #define BUSY_NUM_XACTS_THRESHOLD 16 + + /* + * The DFC tracks the LSN and xmin of deferred transactions. + * + * - lsn refers to xlog pointers + * + * - xmin refers to the oldest known TransactionIds. When we + * flush a transaction we know that all transactions prior + * to the RecentGlobalXmin seen by that backend will also + * be known flushed. So by keeping track of the latest + * RecentGlobalXmin we can have a TransactionId to test + * known flushed state against. + * + * Pointers behave similarly to the WAL buffer because both + * xmin and lsn continually advance, so that the request point + * is always ahead of or the same as the flush point. 
+ * When we make a new request we advance the request point. + * When we flush we advance the flush point. + */ + typedef struct DeferredFsyncTransactionData + { + TransactionId xid; + XLogRecPtr lsn; + char padding[4]; + } DeferredFsyncXactData; /* 16 bytes */ + + typedef struct + { + XLogRecPtr request_lsn; + XLogRecPtr flushed_lsn; + + TransactionId request_xmin; + TransactionId flushed_xmin; + + DeferredFsyncXactData dfccache[MAX_DFC_XACTS]; + + /* auto-tuning info */ + int numNewDeferredCommits; + + /* trace info */ + int numFlushes; + + } DeferredFsyncShmemStruct; + + struct + { + /* copies of global tuning info */ + int numNewDeferredCommits; + + /* number of xacts sharing this hash bucket */ + int numValid; + + /* copies of global trace info */ + int numFlushes; + + /* local trace info */ + int flush_test_exit_local; + int flush_test_exit_search; + } trace_dfc; + + static DeferredFsyncShmemStruct *dfc; + static TransactionId RecentFlushedXmin = InvalidTransactionId; /* local function prototypes */ static void AssignSubTransactionId(TransactionState s); *************** *** 244,249 **** --- 344,356 ---- static const char *BlockStateAsString(TBlockState blockState); static const char *TransStateAsString(TransState state); + static void TransactionDeferFsync(TransactionId xid, XLogRecPtr deferLSN); + + static void reset_trace_dfc(void); + static void dfc_trace_chunk(int slot, TransactionId xid, XLogRecPtr deferLSN); + static void dfc_trace_commit(XLogRecPtr recptr); + static void get_trace_dfc(void); + /* ---------------------------------------------------------------- * transaction state accessors *************** *** 794,814 **** if (MyXactMadeXLogEntry) { /* ! * Sleep before flush! So we can flush more than one commit ! * records per single fsync. (The idea is some other backend may ! * do the XLogFlush while we're sleeping. This needs work still, ! * because on most Unixen, the minimum select() delay is 10msec or ! * more, which is way too long.) ! * ! 
* We do not sleep if enableFsync is not turned on, nor if there ! * are fewer than CommitSiblings other backends with active ! * transactions. ! */ ! if (CommitDelay > 0 && enableFsync && ! CountActiveBackends() >= CommitSiblings) ! pg_usleep(CommitDelay); ! XLogFlush(recptr); } /* --- 901,934 ---- if (MyXactMadeXLogEntry) { /* ! * If we have chosen to use unguaranteed transactions and we're ! * not doing cleanup of any rels, then we can defer fsync. ! * The WAL writer acts to minimise the window of data loss, ! * and we rely on it to flush WAL soon, but not precisely now. ! */ ! if (trace_commit) ! reset_trace_dfc(); ! if (XactCommitGuarantee || nrels > 0) ! { ! /* ! * Sleep before flush! So we can flush more than one commit ! * records per single fsync. (The idea is some other backend may ! * do the XLogFlush while we're sleeping. This needs work still, ! * because on most Unixen, the minimum select() delay is 10msec or ! * more, which is way too long.) ! * ! * We do not sleep if enableFsync is not turned on, nor if there ! * are fewer than CommitSiblings other backends with active ! * transactions. ! */ ! if (CommitDelay > 0 && enableFsync && ! CountActiveBackends() >= CommitSiblings) ! pg_usleep(CommitDelay); ! XLogFlush(recptr); ! } ! else ! TransactionDeferFsync(xid, recptr); } /* *************** *** 819,836 **** * emitted an XLOG record for our commit, and so in the event of a * crash the clog update might be lost. This is okay because no one * else will ever care whether we committed. */ if (madeTCentries || MyXactMadeTempRelUpdate) { ! TransactionIdCommit(xid); /* to avoid race conditions, the parent must commit first */ ! TransactionIdCommitTree(nchildren, children); } /* Checkpoint can proceed now */ MyProc->inCommit = false; END_CRIT_SECTION(); } /* Break the chain of back-links in the XLOG records I output */ --- 939,962 ---- * emitted an XLOG record for our commit, and so in the event of a * crash the clog update might be lost. 
This is okay because no one * else will ever care whether we committed. + * + * The recptr here refers to the last xlog entry by this transaction + * so is the correct value to use for setting the clog. */ if (madeTCentries || MyXactMadeTempRelUpdate) { ! TransactionIdCommit(xid, recptr); /* to avoid race conditions, the parent must commit first */ ! TransactionIdCommitTree(nchildren, children, recptr); } /* Checkpoint can proceed now */ MyProc->inCommit = false; END_CRIT_SECTION(); + + if (trace_commit && madeTCentries && WALWriterActive()) + dfc_trace_commit(recptr); } /* Break the chain of back-links in the XLOG records I output */ *************** *** 1013,1018 **** --- 1139,1145 ---- if (MyLastRecPtr.xrecoff != 0 || MyXactMadeTempRelUpdate || nrels > 0) { TransactionId xid = GetCurrentTransactionId(); + XLogRecPtr recptr; /* * Catch the scenario where we aborted partway through *************** *** 1040,1046 **** XLogRecData rdata[3]; int lastrdata = 0; xl_xact_abort xlrec; - XLogRecPtr recptr; xlrec.xtime = time(NULL); xlrec.nrels = nrels; --- 1167,1172 ---- *************** *** 1074,1079 **** --- 1200,1207 ---- if (nrels > 0) XLogFlush(recptr); } + else + recptr = MyLastRecPtr; /* * Mark the transaction aborted in clog. This is not absolutely *************** *** 1084,1091 **** * subtransactions to aborted state from the point of view of * concurrent TransactionIdDidAbort calls. */ ! TransactionIdAbort(xid); ! TransactionIdAbortTree(nchildren, children); END_CRIT_SECTION(); } --- 1212,1219 ---- * subtransactions to aborted state from the point of view of * concurrent TransactionIdDidAbort calls. */ ! TransactionIdAbort(xid, recptr); ! 
TransactionIdAbortTree(nchildren, children, recptr); END_CRIT_SECTION(); } *************** *** 1207,1212 **** --- 1335,1342 ---- */ if (MyLastRecPtr.xrecoff != 0 || MyXactMadeTempRelUpdate || nrels > 0) { + XLogRecPtr recptr; + START_CRIT_SECTION(); /* *************** *** 1218,1224 **** XLogRecData rdata[3]; int lastrdata = 0; xl_xact_abort xlrec; - XLogRecPtr recptr; xlrec.xtime = time(NULL); xlrec.nrels = nrels; --- 1348,1353 ---- *************** *** 1252,1265 **** if (nrels > 0) XLogFlush(recptr); } /* * Mark the transaction aborted in clog. This is not absolutely * necessary but XactLockTableWait makes use of it to avoid waiting * for already-aborted subtransactions. */ ! TransactionIdAbort(xid); ! TransactionIdAbortTree(nchildren, children); END_CRIT_SECTION(); } --- 1381,1396 ---- if (nrels > 0) XLogFlush(recptr); } + else + recptr = MyLastRecPtr; /* * Mark the transaction aborted in clog. This is not absolutely * necessary but XactLockTableWait makes use of it to avoid waiting * for already-aborted subtransactions. */ ! TransactionIdAbort(xid, recptr); ! TransactionIdAbortTree(nchildren, children, recptr); END_CRIT_SECTION(); } *************** *** 1389,1394 **** --- 1520,1526 ---- FreeXactSnapshot(); XactIsoLevel = DefaultXactIsoLevel; XactReadOnly = DefaultXactReadOnly; + SetXactCommitGuarantee(true); /* * reinitialize within-transaction counters *************** *** 4094,4099 **** --- 4226,4237 ---- return "UNRECOGNIZED"; } + void + SetXactCommitGuarantee(bool RequestedXactCommitGuarantee) + { + XactCommitGuarantee = RequestedXactCommitGuarantee; + } + /* * xactGetCommittedChildren * *************** *** 4132,4137 **** --- 4270,4279 ---- /* * XLOG support routines + * + * LSN supplied for clog changes is invalid, so that we avoid + * WAL flushes while we are rebuilding clog. After recovery + * completes the next clog change will set the LSN correctly. */ static void *************** *** 4140,4151 **** TransactionId *sub_xids; TransactionId max_xid; int i; ! 
TransactionIdCommit(xid); /* Mark committed subtransactions as committed */ sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]); ! TransactionIdCommitTree(xlrec->nsubxacts, sub_xids); /* Make sure nextXid is beyond any XID mentioned in the record */ max_xid = xid; --- 4282,4294 ---- TransactionId *sub_xids; TransactionId max_xid; int i; + XLogRecPtr lsn = {0,0}; /* Invalid XLogRecPtr */ ! TransactionIdCommit(xid, lsn); /* Mark committed subtransactions as committed */ sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]); ! TransactionIdCommitTree(xlrec->nsubxacts, sub_xids, lsn); /* Make sure nextXid is beyond any XID mentioned in the record */ max_xid = xid; *************** *** 4175,4186 **** TransactionId *sub_xids; TransactionId max_xid; int i; ! TransactionIdAbort(xid); /* Mark subtransactions as aborted */ sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]); ! TransactionIdAbortTree(xlrec->nsubxacts, sub_xids); /* Make sure nextXid is beyond any XID mentioned in the record */ max_xid = xid; --- 4318,4330 ---- TransactionId *sub_xids; TransactionId max_xid; int i; + XLogRecPtr lsn = {0,0}; /* Invalid XLogRecPtr */ ! TransactionIdAbort(xid, lsn); /* Mark subtransactions as aborted */ sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]); ! 
TransactionIdAbortTree(xlrec->nsubxacts, sub_xids, lsn); /* Make sure nextXid is beyond any XID mentioned in the record */ max_xid = xid; *************** *** 4347,4349 **** --- 4491,4849 ---- else appendStringInfo(buf, "UNKNOWN"); } + + + /* + * Initialize the deferred fsync cache at server start + */ + void + DeferredFsyncShmemInit(void) + { + bool found; + + dfc = ShmemInitStruct("Deferred Fsync Cache", + DeferredFsyncShmemSize(), + &found); + + if (dfc == NULL) + ereport(FATAL, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("insufficient shared memory for deferred fsync cache"))); + + if (found) + return; + + MemSet(dfc, 0, DeferredFsyncShmemSize()); + } + + /* + * Estimate amount of shmem space needed for deferred fsync cache + */ + Size + DeferredFsyncShmemSize(void) + { + return sizeof(DeferredFsyncShmemStruct); + } + + /* + * TransactionDeferFsync() + * + * Register that an fsync will be needed in the future for this xact, + * stores also the LSN of the commit record in xlog, so we know + * where to flush to in order to make this commit safe. + * + * Guaranteed transactions need not register here. 
+ */ + static void + TransactionDeferFsync(TransactionId deferXid, XLogRecPtr deferLSN) + { + int chunk = TransactionIdToDFCChunk(deferXid); + int slot = DFCChunkToDFCSlot(chunk); + bool signalWALWriter = false; + int numValid = 0; + bool found = false; + bool may_retry = true; /* allow a single flush-and-retry */ + + LWLockAcquire(DeferredFsyncLock, LW_EXCLUSIVE); + + /* + * Set the global highest deferLSN and advance the request xmin + */ + if (XLByteLT(dfc->request_lsn, deferLSN)) + dfc->request_lsn = deferLSN; + + if (TransactionIdPrecedes(dfc->request_xmin, RecentGlobalXmin)) + dfc->request_xmin = RecentGlobalXmin; + + /* + * Now look for a place to record this deferred transaction + */ + for (;;) + { + for (numValid = 0; numValid < DFC_XACTS_PER_CHUNK; numValid++) + { + /* + * If we find an out-of-date entry, overwrite it + */ + if (XLByteLE(dfc->dfccache[slot + numValid].lsn, dfc->flushed_lsn)) + { + dfc->dfccache[slot + numValid].xid = deferXid; + dfc->dfccache[slot + numValid].lsn = deferLSN; + + /* Keep track of how busy we are */ + dfc->numNewDeferredCommits++; + + get_trace_dfc(); + + found = true; + + break; + } + } + + /* + * If we couldn't find anywhere to store this deferXid, + * then we need to flush while holding the lock, + * then loop back around for another attempt. Only + * allow ourselves to retry once though. 
+ */ + if (numValid >= DFC_XACTS_PER_CHUNK && may_retry) + { + FlushAnyDeferredFsyncXacts(false, true); + may_retry = false; + } + else + { + if (numValid > DFC_SIGNAL_WALWRITER_THRESHOLD) + signalWALWriter = true; + break; + } + } + + if (!found) + dfc_trace_chunk(slot, deferXid, deferLSN); + + LWLockRelease(DeferredFsyncLock); + + trace_dfc.numValid = numValid; + + if (!found) + { + dfc_trace_commit(deferLSN); + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_RESOURCES), + errmsg("unable to locate slot in deferred transaction cache for TransactionId=%u LSN=%X/%X", + deferXid, deferLSN.xlogid, deferLSN.xrecoff))); + } + + if (signalWALWriter) + SendPostmasterSignal(PMSIGNAL_WAKEN_WALWRITER); + } + + /* + * FlushAnyDeferredFsyncXacts() + * + * Gets the current high-water mark LSN and then flushes xlog + * + * Doesn't confirm that all deferred fsync transactions have been flushed, + * unless called with DeferredFsyncLock already held. + */ + void + FlushAnyDeferredFsyncXacts(bool loop_if_busy, bool have_lock) + { + XLogRecPtr FlushLSN = {0,0}; /* InvalidXLogRecPtr */ + TransactionId FlushXmin = 0; + int num_xacts_since_last_flush; + int num_xacts_while_flushing; + int num_flushes = 0; + + /* Make sure we never loop when we have the lock */ + Assert(!(loop_if_busy && have_lock)); + + for (;;) + { + /* + * Get the current request points, then reset the + * counter so we can see how busy we are after we flush. + */ + if (!have_lock) + LWLockAcquire(DeferredFsyncLock, LW_EXCLUSIVE); + + if (!XLByteEQ(dfc->flushed_lsn, dfc->request_lsn)) + { + FlushLSN = dfc->request_lsn; + FlushXmin = dfc->request_xmin; + + num_flushes = dfc->numFlushes; + } + + num_xacts_since_last_flush = dfc->numNewDeferredCommits; + dfc->numNewDeferredCommits = 0; + + if (!have_lock) + LWLockRelease(DeferredFsyncLock); + + if (!XLogRecPtrIsInvalid(FlushLSN)) + XLogFlush(FlushLSN); + + /* + * Get the number of transactions added while we've been flushing. 
+ * Decide whether to keep flushing if we are busy enough. + * Move the known-flushed-xmin forwards + */ + if (!have_lock) + LWLockAcquire(DeferredFsyncLock, LW_EXCLUSIVE); + + /* + * If this new FlushLSN is higher than the flushed_lsn + * then update that also, unless someone already did it + */ + if (XLByteLT(dfc->flushed_lsn, FlushLSN)) + dfc->flushed_lsn = FlushLSN; + + /* Move the known flushed pointer forwards, unless already done */ + if (TransactionIdPrecedes(dfc->flushed_xmin, FlushXmin)) + dfc->flushed_xmin = FlushXmin; + + num_xacts_while_flushing = dfc->numNewDeferredCommits; + dfc->numFlushes++; + + if (!have_lock) + LWLockRelease(DeferredFsyncLock); + + if (!loop_if_busy || num_xacts_while_flushing < BUSY_NUM_XACTS_THRESHOLD) + break; + } + + /* + * Only report the background flush if it did something... otherwise we get + * floods of messages for no purpose. We still report the background flush + * even if XLogFlush() had already occurred because of another backend + */ + if (trace_bg_flush && num_flushes > 0 && !have_lock) + ereport(LOG, + (errmsg("background flush: lsn=%X/%X xmin=%d flushId=%d commits=%d (while flushing=%d)", + FlushLSN.xlogid, FlushLSN.xrecoff, + FlushXmin, + num_flushes, + num_xacts_since_last_flush, + num_xacts_while_flushing))); + } + + /* + * TransactionIdIsFlushed -- has transaction commit been flushed? + * + * Since no guaranteed transactions are stored in the DFC this + * should always return true for guaranteed ("normal") xacts. + * Deferred fsync transactions will be placed in the cache by + * TransactionDeferFsync() though may be expired by + * FlushAnyDeferredFsyncXacts(). 
+ */ + bool + TransactionIdIsFlushed(TransactionId xid) + { + bool result = true; + TransactionId topxid = SubTransGetTopmostTransaction(xid); + int chunk; + int slot; + int i; + + /* + * If xid is already locally known-flushed then exit quickly + * without grabbing the lock + */ + if (TransactionIdPrecedes(xid, RecentFlushedXmin)) + { + trace_dfc.flush_test_exit_local++; + return true; + } + + chunk = TransactionIdToDFCChunk(topxid); + slot = DFCChunkToDFCSlot(chunk); + + LWLockAcquire(DeferredFsyncLock, LW_SHARED); + + /* Update local state - not worth effort to recheck */ + RecentFlushedXmin = dfc->flushed_xmin; + + /* + * Search through the chunk looking for the xid, if we find + * it, check whether its lsn is flushed yet or not + */ + result = true; + for (i = 0; i < DFC_XACTS_PER_CHUNK; i++) + { + if (TransactionIdEquals(dfc->dfccache[slot + i].xid,xid)) + { + if (!XLByteLT(dfc->dfccache[slot + i].lsn, dfc->flushed_lsn)) + { + result = false; + break; + } + } + } + + LWLockRelease(DeferredFsyncLock); + trace_dfc.flush_test_exit_search++; + + /* + * If we couldn't find xid then it must have been either flushed + * and then subsequently overwritten, or it was never a + * deferred transaction at all. + */ + return result; + } + + /* + * Trace support functions for Deferred Fsync Cache + */ + + /* + * reset_trace_dfc() + * + * reset any trace information in this backend, prior to commit + */ + static void + reset_trace_dfc(void) + { + trace_dfc.numValid = 0; + } + + /* + * get_trace_dfc() + * + * Get trace information to allow this commit to be traced later. 
+ * use with DeferredFsyncLock held, then use dfc_trace_commit() + */ + static void + get_trace_dfc(void) + { + trace_dfc.numFlushes = dfc->numFlushes; + trace_dfc.numNewDeferredCommits = dfc->numNewDeferredCommits; + } + + /* + * dfc_trace_commit() + * + * log commit trace information, for use with DeferredFsyncLock not-held + */ + static void + dfc_trace_commit(XLogRecPtr recptr) + { + if (XactCommitGuarantee) + ereport(LOG, + (errmsg(" safe commit: lsn %X/%X", + recptr.xlogid, recptr.xrecoff))); + else + ereport(LOG, + (errmsg("unsafe commit: lsn %X/%X slots=%d nFlushes=%d nCommits=%d flushTest=%d/%d", + recptr.xlogid, recptr.xrecoff, + trace_dfc.numValid, + trace_dfc.numFlushes, + trace_dfc.numNewDeferredCommits, + trace_dfc.flush_test_exit_local, + trace_dfc.flush_test_exit_search))); + } + + /* + * dfc_trace_chunk() + * + * internal diagnostic or pre-error tracing, use with DeferredFsyncLock held + */ + static void + dfc_trace_chunk(int slot, TransactionId xid, XLogRecPtr deferLSN) + { + int i; + + for (i = 0; i < DFC_XACTS_PER_CHUNK; i++) + { + ereport(LOG, + (errmsg("dfc chunk %d: TransactionId=%u LSN=%X/%X %s", + slot, + dfc->dfccache[slot + i].xid, + dfc->dfccache[slot + i].lsn.xlogid, + dfc->dfccache[slot + i].lsn.xrecoff, + (XLByteLE(dfc->dfccache[slot + i].lsn, dfc->flushed_lsn) ? "flushed" : "current")))); + } + } Index: src/backend/access/transam/xlog.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v retrieving revision 1.267 diff -c -r1.267 xlog.c *** src/backend/access/transam/xlog.c 3 Apr 2007 16:34:35 -0000 1.267 --- src/backend/access/transam/xlog.c 5 Apr 2007 21:51:37 -0000 *************** *** 5393,5398 **** --- 5393,5409 ---- checkPoint.ThisTimeLineID = ThisTimeLineID; checkPoint.time = time(NULL); + /* + * Now confirm that all unguaranteed transactions are written to WAL + * before we proceed further. 
This may require WALWriteLock and possibly + * WALInsertLock if we need to flush. + */ + if (WALWriterActive()) + { + LWLockAcquire(DeferredFsyncLock, LW_EXCLUSIVE); + FlushAnyDeferredFsyncXacts(false, true); + } + /* * We must hold WALInsertLock while examining insert state to determine * the checkpoint REDO pointer. *************** *** 5428,5433 **** --- 5439,5446 ---- ControlFile->checkPointCopy.redo.xrecoff) { LWLockRelease(WALInsertLock); + if (WALWriterActive()) + LWLockRelease(DeferredFsyncLock); LWLockRelease(CheckpointLock); END_CRIT_SECTION(); return; *************** *** 5476,5481 **** --- 5489,5496 ---- * while we are flushing disk buffers. */ LWLockRelease(WALInsertLock); + if (WALWriterActive()) + LWLockRelease(DeferredFsyncLock); if (!shutdown) ereport(DEBUG2, Index: src/backend/commands/vacuum.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/commands/vacuum.c,v retrieving revision 1.349 diff -c -r1.349 vacuum.c *** src/backend/commands/vacuum.c 14 Mar 2007 18:48:55 -0000 1.349 --- src/backend/commands/vacuum.c 5 Apr 2007 21:51:41 -0000 *************** *** 1275,1280 **** --- 1275,1289 ---- */ vacpage = (VacPage) palloc(sizeof(VacPageData) + MaxOffsetNumber * sizeof(OffsetNumber)); + /* + * VACUUM FULL assumes that all tuple states are well-known prior to moving + * tuples around. see comment "known dead" in repair_frag(). So before + * we perform this initial scan of the heap we must ensure there are + * no unflushed deferred transactions with changes against this table. 
+ */ + if (WALWriterActive()) + FlushAnyDeferredFsyncXacts(false, false); + for (blkno = 0; blkno < nblocks; blkno++) { Page page, Index: src/backend/postmaster/Makefile =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/postmaster/Makefile,v retrieving revision 1.22 diff -c -r1.22 Makefile *** src/backend/postmaster/Makefile 20 Jan 2007 17:16:12 -0000 1.22 --- src/backend/postmaster/Makefile 5 Apr 2007 21:51:41 -0000 *************** *** 12,18 **** top_builddir = ../../.. include $(top_builddir)/src/Makefile.global ! OBJS = bgwriter.o autovacuum.o pgarch.o pgstat.o postmaster.o syslogger.o \ fork_process.o all: SUBSYS.o --- 12,18 ---- top_builddir = ../../.. include $(top_builddir)/src/Makefile.global ! OBJS = bgwriter.o walwriter.o autovacuum.o pgarch.o pgstat.o postmaster.o syslogger.o \ fork_process.o all: SUBSYS.o Index: src/backend/postmaster/postmaster.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/postmaster/postmaster.c,v retrieving revision 1.527 diff -c -r1.527 postmaster.c *** src/backend/postmaster/postmaster.c 22 Mar 2007 19:53:30 -0000 1.527 --- src/backend/postmaster/postmaster.c 5 Apr 2007 21:51:45 -0000 *************** *** 107,112 **** --- 107,113 ---- #include "postmaster/pgarch.h" #include "postmaster/postmaster.h" #include "postmaster/syslogger.h" + #include "postmaster/walwriter.h" #include "storage/fd.h" #include "storage/ipc.h" #include "storage/pg_shmem.h" *************** *** 201,206 **** --- 202,208 ---- /* PIDs of special child processes; 0 when not running */ static pid_t StartupPID = 0, BgWriterPID = 0, + WALWriterPID = 0, AutoVacPID = 0, PgArchPID = 0, PgStatPID = 0; *************** *** 907,913 **** * CAUTION: when changing this list, check for side-effects on the signal * handling setup of child processes. See tcop/postgres.c, * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/autovacuum.c, ! 
* postmaster/pgarch.c, postmaster/pgstat.c, and postmaster/syslogger.c. */ pqinitmask(); PG_SETMASK(&BlockSig); --- 909,916 ---- * CAUTION: when changing this list, check for side-effects on the signal * handling setup of child processes. See tcop/postgres.c, * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/autovacuum.c, ! * postmaster/pgarch.c, postmaster/pgstat.c, postmaster/syslogger.c ! * and postmaster/walwriter.c */ pqinitmask(); PG_SETMASK(&BlockSig); *************** *** 1250,1255 **** --- 1253,1263 ---- start_autovac_launcher = false; /* signal successfully processed */ } + /* If we have lost the WAL writer, try to start a new one */ + if (WALWriterActive() && WALWriterPID == 0 && + StartupPID == 0 && !FatalError && Shutdown == NoShutdown) + WALWriterPID = StartWALWriter(); + /* If we have lost the archiver, try to start a new one */ if (XLogArchivingActive() && PgArchPID == 0 && StartupPID == 0 && !FatalError && Shutdown == NoShutdown) *************** *** 1822,1827 **** --- 1830,1837 ---- signal_child(BgWriterPID, SIGHUP); if (AutoVacPID != 0) signal_child(AutoVacPID, SIGHUP); + if (WALWriterPID != 0) + signal_child(WALWriterPID, SIGHUP); if (PgArchPID != 0) signal_child(PgArchPID, SIGHUP); if (SysLoggerPID != 0) *************** *** 1891,1896 **** --- 1901,1909 ---- /* And tell it to shut down */ if (BgWriterPID != 0) signal_child(BgWriterPID, SIGUSR2); + /* Tell WALWriter to shut down too; nothing left for it to do */ + if (WALWriterPID != 0) + signal_child(WALWriterPID, SIGQUIT); /* Tell pgarch to shut down too; nothing left for it to do */ if (PgArchPID != 0) signal_child(PgArchPID, SIGQUIT); *************** *** 1950,1955 **** --- 1963,1971 ---- /* And tell it to shut down */ if (BgWriterPID != 0) signal_child(BgWriterPID, SIGUSR2); + /* Tell WALWriter to shut down too; nothing left for it to do */ + if (WALWriterPID != 0) + signal_child(WALWriterPID, SIGQUIT); /* Tell pgarch to shut down too; nothing left for it to do */ if (PgArchPID != 0) 
signal_child(PgArchPID, SIGQUIT); *************** *** 1978,1983 **** --- 1994,2001 ---- signal_child(StartupPID, SIGQUIT); if (BgWriterPID != 0) signal_child(BgWriterPID, SIGQUIT); + if (WALWriterPID != 0) + signal_child(WALWriterPID, SIGQUIT); if (AutoVacPID != 0) signal_child(AutoVacPID, SIGQUIT); if (PgArchPID != 0) *************** *** 2079,2086 **** /* * Go to shutdown mode if a shutdown request was pending. ! * Otherwise, try to start the archiver, stats collector and ! * autovacuum launcher. */ if (Shutdown > NoShutdown && BgWriterPID != 0) signal_child(BgWriterPID, SIGUSR2); --- 2097,2104 ---- /* * Go to shutdown mode if a shutdown request was pending. ! * Otherwise, try to start the archiver, stats collector, ! * autovacuum launcher and WALWriter. */ if (Shutdown > NoShutdown && BgWriterPID != 0) signal_child(BgWriterPID, SIGUSR2); *************** *** 2090,2095 **** --- 2108,2115 ---- PgArchPID = pgarch_start(); if (PgStatPID == 0) PgStatPID = pgstat_start(); + if (WALWriterPID == 0) + WALWriterPID = StartWALWriter(); if (AutoVacuumingActive() && AutoVacPID == 0) AutoVacPID = StartAutoVacLauncher(); *************** *** 2150,2155 **** --- 2170,2189 ---- } /* + * Was it the WALWriter? Normal exit can be ignored; we'll + * start a new one at the next iteration of the postmaster's main loop, + * if necessary. Any other exit condition is treated as a crash. + */ + if (WALWriterPID != 0 && pid == WALWriterPID) + { + WALWriterPID = 0; + if (!EXIT_STATUS_0(exitstatus)) + HandleChildCrash(pid, exitstatus, + _("WALWriter process")); + continue; + } + + /* * Was it the autovacuum launcher? Normal exit can be ignored; we'll * start a new one at the next iteration of the postmaster's main loop, * if necessary. Any other exit condition is treated as a crash. 
*************** *** 2245,2250 **** --- 2279,2287 ---- /* And tell it to shut down */ if (BgWriterPID != 0) signal_child(BgWriterPID, SIGUSR2); + /* Tell WALWriter to shut down too; nothing left for it to do */ + if (WALWriterPID != 0) + signal_child(WALWriterPID, SIGQUIT); /* Tell pgarch to shut down too; nothing left for it to do */ if (PgArchPID != 0) signal_child(PgArchPID, SIGQUIT); *************** *** 2396,2401 **** --- 2433,2449 ---- signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT)); } + /* Force a power-cycle of the WALWriter process too */ + /* (Shouldn't be necessary, but just for luck) */ + if (WALWriterPID != 0 && !FatalError) + { + ereport(DEBUG2, + (errmsg_internal("sending %s to process %d", + "SIGQUIT", + (int) WALWriterPID))); + signal_child(WALWriterPID, SIGQUIT); + } + /* Force a power-cycle of the pgarch process too */ /* (Shouldn't be necessary, but just for luck) */ if (PgArchPID != 0 && !FatalError) *************** *** 3488,3493 **** --- 3536,3558 ---- AutoVacWorkerMain(argc - 2, argv + 2); proc_exit(0); } + if (strcmp(argv[1], "--forkwalwriter") == 0) + { + /* Close the postmaster's sockets */ + ClosePostmasterPorts(false); + + /* Restore basic shared memory pointers */ + InitShmemAccess(UsedShmemSegAddr); + + /* Need a PGPROC to run CreateSharedMemoryAndSemaphores */ + InitProcess(); + + /* Attach process to shared data structures */ + CreateSharedMemoryAndSemaphores(false, 0); + + WALWriterMain(argc, argv); + proc_exit(0); + } if (strcmp(argv[1], "--forkarch") == 0) { /* Close the postmaster's sockets */ *************** *** 3582,3587 **** --- 3647,3661 ---- signal_child(PgArchPID, SIGUSR1); } + if (CheckPostmasterSignal(PMSIGNAL_WAKEN_WALWRITER) && + WALWriterPID != 0 && Shutdown == NoShutdown) + { + /* + * Send SIGUSR1 to WALWriter process, to wake it up and begin fsyncing WAL + */ + signal_child(WALWriterPID, SIGUSR1); + } + if (CheckPostmasterSignal(PMSIGNAL_ROTATE_LOGFILE) && SysLoggerPID != 0) { Index: 
src/backend/storage/ipc/ipci.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/storage/ipc/ipci.c,v retrieving revision 1.91 diff -c -r1.91 ipci.c *** src/backend/storage/ipc/ipci.c 15 Feb 2007 23:23:23 -0000 1.91 --- src/backend/storage/ipc/ipci.c 5 Apr 2007 21:51:45 -0000 *************** *** 19,24 **** --- 19,25 ---- #include "access/nbtree.h" #include "access/subtrans.h" #include "access/twophase.h" + #include "access/xact.h" #include "miscadmin.h" #include "pgstat.h" #include "postmaster/autovacuum.h" *************** *** 101,106 **** --- 102,108 ---- size = add_size(size, ProcGlobalShmemSize()); size = add_size(size, XLOGShmemSize()); size = add_size(size, CLOGShmemSize()); + size = add_size(size, DeferredFsyncShmemSize()); size = add_size(size, SUBTRANSShmemSize()); size = add_size(size, TwoPhaseShmemSize()); size = add_size(size, MultiXactShmemSize()); *************** *** 177,182 **** --- 179,185 ---- */ XLOGShmemInit(); CLOGShmemInit(); + DeferredFsyncShmemInit(); SUBTRANSShmemInit(); TwoPhaseShmemInit(); MultiXactShmemInit(); Index: src/backend/tcop/postgres.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/tcop/postgres.c,v retrieving revision 1.530 diff -c -r1.530 postgres.c *** src/backend/tcop/postgres.c 29 Mar 2007 19:10:10 -0000 1.530 --- src/backend/tcop/postgres.c 5 Apr 2007 21:51:49 -0000 *************** *** 2266,2271 **** --- 2266,2273 ---- ereport(DEBUG3, (errmsg_internal("CommitTransactionCommand"))); + SetXactCommitGuarantee(DefaultXactCommitGuarantee); + CommitTransactionCommand(); #ifdef MEMORY_CONTEXT_CHECKING Index: src/backend/utils/misc/guc.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/utils/misc/guc.c,v retrieving revision 1.383 diff -c -r1.383 guc.c *** src/backend/utils/misc/guc.c 19 Mar 2007 23:38:30 -0000 1.383 --- 
src/backend/utils/misc/guc.c 5 Apr 2007 21:51:55 -0000 *************** *** 53,58 **** --- 53,59 ---- #include "postmaster/bgwriter.h" #include "postmaster/postmaster.h" #include "postmaster/syslogger.h" + #include "postmaster/walwriter.h" #include "storage/fd.h" #include "storage/freespace.h" #include "tcop/tcopprot.h" *************** *** 102,107 **** --- 103,111 ---- extern int CommitSiblings; extern char *default_tablespace; extern bool fullPageWrites; + extern bool trace_commit; + extern bool trace_bg_flush; + #ifdef TRACE_SORT extern bool trace_sort; *************** *** 149,154 **** --- 153,159 ---- static bool assign_stage_log_stats(bool newval, bool doit, GucSource source); static bool assign_log_stats(bool newval, bool doit, GucSource source); static bool assign_transaction_read_only(bool newval, bool doit, GucSource source); + static bool assign_transaction_guarantee(bool newval, bool doit, GucSource source); static const char *assign_canonical_path(const char *newval, bool doit, GucSource source); static const char *assign_backslash_quote(const char *newval, bool doit, GucSource source); static const char *assign_timezone_abbreviations(const char *newval, bool doit, GucSource source); *************** *** 317,322 **** --- 322,329 ---- gettext_noop("Write-Ahead Log"), /* WAL_SETTINGS */ gettext_noop("Write-Ahead Log / Settings"), + /* WAL_COMMITS */ + gettext_noop("Write-Ahead Log / Commit Behavior"), /* WAL_CHECKPOINTS */ gettext_noop("Write-Ahead Log / Checkpoints"), /* QUERY_TUNING */ *************** *** 573,578 **** --- 580,601 ---- false, NULL, NULL }, { + {"trace_commit", PGC_SIGHUP, DEVELOPER_OPTIONS, + gettext_noop("Shows details of commits, for use with transaction_guarantee."), + NULL + }, + &trace_commit, + false, NULL, NULL + }, + { + {"trace_bg_flush", PGC_SIGHUP, DEVELOPER_OPTIONS, + gettext_noop("Shows details of WAL Writer, for use with transaction_guarantee."), + NULL + }, + &trace_bg_flush, + true, NULL, NULL + }, + { {"log_connections", 
PGC_BACKEND, LOGGING_WHAT, gettext_noop("Logs each successful connection."), NULL *************** *** 883,888 **** --- 906,919 ---- true, assign_phony_autocommit, NULL }, { + {"transaction_guarantee", PGC_USERSET, WAL_COMMITS, + gettext_noop("Sets the default of wait-for-commit."), + NULL + }, + &DefaultXactCommitGuarantee, + true, assign_transaction_guarantee, NULL + }, + { {"default_transaction_read_only", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the default read-only status of new transactions."), NULL *************** *** 1165,1171 **** NULL }, &ReservedBackends, ! 3, 0, INT_MAX / 4, NULL, NULL }, { --- 1196,1202 ---- NULL }, &ReservedBackends, ! 5, 0, INT_MAX / 4, NULL, NULL }, { *************** *** 1457,1463 **** }, { ! {"commit_delay", PGC_USERSET, WAL_CHECKPOINTS, gettext_noop("Sets the delay in microseconds between transaction commit and " "flushing WAL to disk."), NULL --- 1488,1494 ---- }, { ! {"commit_delay", PGC_USERSET, WAL_COMMITS, gettext_noop("Sets the delay in microseconds between transaction commit and " "flushing WAL to disk."), NULL *************** *** 1467,1473 **** }, { ! {"commit_siblings", PGC_USERSET, WAL_CHECKPOINTS, gettext_noop("Sets the minimum concurrent open transactions before performing " "commit_delay."), NULL --- 1498,1504 ---- }, { ! {"commit_siblings", PGC_USERSET, WAL_COMMITS, gettext_noop("Sets the minimum concurrent open transactions before performing " "commit_delay."), NULL *************** *** 1477,1482 **** --- 1508,1523 ---- }, { + {"wal_writer_delay", PGC_SIGHUP, WAL_COMMITS, + gettext_noop("Sets the delay in milliseconds between regular flushing of WAL " + "to disk by the WALWriter."), + NULL, + GUC_UNIT_MS + }, + &WALWriterDelay, + 0, 0, 1000, NULL, NULL + }, + { {"extra_float_digits", PGC_USERSET, CLIENT_CONN_LOCALE, gettext_noop("Sets the number of digits displayed for floating-point values."), gettext_noop("This affects real, double precision, and geometric data types. 
" *************** *** 6472,6477 **** --- 6513,6537 ---- return true; } + static bool + assign_transaction_guarantee(bool newval, bool doit, GucSource source) + { + /* + * Transaction guarantee can only be disabled if the + * WALWriter has been activated. This is important since it allows + * us to place a sensible time limit on the extent of the data loss + * window for deferred fsync transactions. + */ + if (newval == false && !WALWriterActive()) + { + if (source >= PGC_S_INTERACTIVE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("cannot set transaction guarantee when server wal_writer_delay = 0"))); + } + return true; + } + static const char * assign_canonical_path(const char *newval, bool doit, GucSource source) { Index: src/backend/utils/misc/postgresql.conf.sample =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/utils/misc/postgresql.conf.sample,v retrieving revision 1.213 diff -c -r1.213 postgresql.conf.sample *** src/backend/utils/misc/postgresql.conf.sample 19 Mar 2007 23:38:30 -0000 1.213 --- src/backend/utils/misc/postgresql.conf.sample 5 Apr 2007 21:51:56 -0000 *************** *** 150,156 **** # - Settings - ! #fsync = on # turns forced synchronization on or off #wal_sync_method = fsync # the default is the first option # supported by the operating system: # open_datasync --- 150,156 ---- # - Settings - ! #wal_buffers = 64kB # min 32kB #wal_sync_method = fsync # the default is the first option # supported by the operating system: # open_datasync *************** *** 159,169 **** # fsync_writethrough # open_sync #full_page_writes = on # recover from partial page writes - #wal_buffers = 64kB # min 32kB # (change requires restart) ! 
#commit_delay = 0 # range 0-100000, in microseconds #commit_siblings = 5 # range 1-1000 # - Checkpoints - #checkpoint_segments = 3 # in logfile segments, min 1, 16MB each --- 159,173 ---- # fsync_writethrough # open_sync #full_page_writes = on # recover from partial page writes # (change requires restart) ! ! #wal_writer_delay = 0 # range 0-1000, in milliseconds ! #transaction_guarantee = on # default: immediate fsync at commit ! ! #commit_delay = 0 # range 0-100000, in microseconds #commit_siblings = 5 # range 1-1000 + # - Checkpoints - #checkpoint_segments = 3 # in logfile segments, min 1, 16MB each Index: src/backend/utils/time/tqual.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/utils/time/tqual.c,v retrieving revision 1.102 diff -c -r1.102 tqual.c *** src/backend/utils/time/tqual.c 25 Mar 2007 19:45:14 -0000 1.102 --- src/backend/utils/time/tqual.c 5 Apr 2007 21:51:57 -0000 *************** *** 78,83 **** --- 78,85 ---- /* local functions */ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot); + static void HeapTupleSetVisibilityInfo(HeapTupleHeader tuple, + Buffer buffer, SetTupleVisibilityAction action, uint16 infomask); /* *************** *** 122,133 **** { if (TransactionIdDidCommit(xvac)) { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); } } else if (tuple->t_infomask & HEAP_MOVED_IN) --- 124,133 ---- { if (TransactionIdDidCommit(xvac)) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return false; } ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); } } else if (tuple->t_infomask & HEAP_MOVED_IN) *************** *** 139,152 **** if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! 
SetBufferCommitInfoNeedsSave(buffer); ! } else { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } } --- 139,148 ---- if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); else { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return false; } } *************** *** 164,171 **** /* deleting subtransaction aborted? */ if (TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple))) { ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return true; } --- 160,166 ---- /* deleting subtransaction aborted? */ if (TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple))) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX_SUBTRANS, HEAP_XMAX_INVALID); return true; } *************** *** 176,190 **** else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple))) return false; else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { /* it must have aborted or crashed */ ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } } --- 171,181 ---- else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple))) return false; else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_COMMITTED); else { /* it must have aborted or crashed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_INVALID); return false; } } *************** *** 221,228 **** if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) { /* it must have aborted or crashed */ ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! 
SetBufferCommitInfoNeedsSave(buffer); return true; } --- 212,218 ---- if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) { /* it must have aborted or crashed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_INVALID); return true; } *************** *** 230,242 **** if (tuple->t_infomask & HEAP_IS_LOCKED) { ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return true; } ! tuple->t_infomask |= HEAP_XMAX_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); return false; } --- 220,230 ---- if (tuple->t_infomask & HEAP_IS_LOCKED) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_INVALID); return true; } ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_COMMITTED); return false; } *************** *** 299,310 **** { if (TransactionIdDidCommit(xvac)) { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); } } else if (tuple->t_infomask & HEAP_MOVED_IN) --- 287,296 ---- { if (TransactionIdDidCommit(xvac)) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return false; } ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); } } else if (tuple->t_infomask & HEAP_MOVED_IN) *************** *** 316,329 **** if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } } --- 302,311 ---- if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); else { ! 
HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return false; } } *************** *** 344,351 **** /* deleting subtransaction aborted? */ if (TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple))) { ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return true; } --- 326,332 ---- /* deleting subtransaction aborted? */ if (TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple))) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX_SUBTRANS, HEAP_XMAX_INVALID); return true; } *************** *** 359,373 **** else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple))) return false; else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { /* it must have aborted or crashed */ ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } } --- 340,350 ---- else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple))) return false; else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_COMMITTED); else { /* it must have aborted or crashed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_INVALID); return false; } } *************** *** 407,414 **** if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) { /* it must have aborted or crashed */ ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return true; } --- 384,390 ---- if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) { /* it must have aborted or crashed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_INVALID); return true; } *************** *** 416,428 **** if (tuple->t_infomask & HEAP_IS_LOCKED) { ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return true; } ! tuple->t_infomask |= HEAP_XMAX_COMMITTED; ! 
SetBufferCommitInfoNeedsSave(buffer); return false; } --- 392,402 ---- if (tuple->t_infomask & HEAP_IS_LOCKED) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_INVALID); return true; } ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_COMMITTED); return false; } *************** *** 469,480 **** { if (TransactionIdDidCommit(xvac)) { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); } } else if (tuple->t_infomask & HEAP_MOVED_IN) --- 443,452 ---- { if (TransactionIdDidCommit(xvac)) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return false; } ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); } } else if (tuple->t_infomask & HEAP_MOVED_IN) *************** *** 486,499 **** if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } } --- 458,467 ---- if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); else { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return false; } } *************** *** 550,561 **** { if (TransactionIdDidCommit(xvac)) { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return HeapTupleInvisible; } ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); } } else if (tuple->t_infomask & HEAP_MOVED_IN) --- 518,527 ---- { if (TransactionIdDidCommit(xvac)) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return HeapTupleInvisible; } ! 
HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); } } else if (tuple->t_infomask & HEAP_MOVED_IN) *************** *** 567,580 **** if (TransactionIdIsInProgress(xvac)) return HeapTupleInvisible; if (TransactionIdDidCommit(xvac)) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return HeapTupleInvisible; } } --- 533,542 ---- if (TransactionIdIsInProgress(xvac)) return HeapTupleInvisible; if (TransactionIdDidCommit(xvac)) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); else { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return HeapTupleInvisible; } } *************** *** 595,602 **** /* deleting subtransaction aborted? */ if (TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple))) { ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return HeapTupleMayBeUpdated; } --- 557,563 ---- /* deleting subtransaction aborted? */ if (TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple))) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX_SUBTRANS, HEAP_XMAX_INVALID); return HeapTupleMayBeUpdated; } *************** *** 610,624 **** else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple))) return HeapTupleInvisible; else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { /* it must have aborted or crashed */ ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return HeapTupleInvisible; } } --- 571,581 ---- else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple))) return HeapTupleInvisible; else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_COMMITTED); else { /* it must have aborted or crashed */ ! 
HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_INVALID); return HeapTupleInvisible; } } *************** *** 642,649 **** if (MultiXactIdIsRunning(HeapTupleHeaderGetXmax(tuple))) return HeapTupleBeingUpdated; ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return HeapTupleMayBeUpdated; } --- 599,605 ---- if (MultiXactIdIsRunning(HeapTupleHeaderGetXmax(tuple))) return HeapTupleBeingUpdated; ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMULTI, HEAP_XMAX_INVALID); return HeapTupleMayBeUpdated; } *************** *** 663,670 **** if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) { /* it must have aborted or crashed */ ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return HeapTupleMayBeUpdated; } --- 619,625 ---- if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) { /* it must have aborted or crashed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_INVALID); return HeapTupleMayBeUpdated; } *************** *** 672,684 **** if (tuple->t_infomask & HEAP_IS_LOCKED) { ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return HeapTupleMayBeUpdated; } ! tuple->t_infomask |= HEAP_XMAX_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); return HeapTupleUpdated; /* updated by other */ } --- 627,637 ---- if (tuple->t_infomask & HEAP_IS_LOCKED) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_INVALID); return HeapTupleMayBeUpdated; } ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_COMMITTED); return HeapTupleUpdated; /* updated by other */ } *************** *** 723,734 **** { if (TransactionIdDidCommit(xvac)) { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! 
SetBufferCommitInfoNeedsSave(buffer); } } else if (tuple->t_infomask & HEAP_MOVED_IN) --- 676,685 ---- { if (TransactionIdDidCommit(xvac)) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return false; } ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); } } else if (tuple->t_infomask & HEAP_MOVED_IN) *************** *** 740,753 **** if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } } --- 691,700 ---- if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); else { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return false; } } *************** *** 765,772 **** /* deleting subtransaction aborted? */ if (TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple))) { ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return true; } --- 712,718 ---- /* deleting subtransaction aborted? */ if (TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple))) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX_SUBTRANS, HEAP_XMAX_INVALID); return true; } *************** *** 781,795 **** return true; /* in insertion by other */ } else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { /* it must have aborted or crashed */ ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } } --- 727,737 ---- return true; /* in insertion by other */ } else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! 
HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_COMMITTED); else { /* it must have aborted or crashed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_INVALID); return false; } } *************** *** 829,836 **** if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) { /* it must have aborted or crashed */ ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return true; } --- 771,777 ---- if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) { /* it must have aborted or crashed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_INVALID); return true; } *************** *** 838,850 **** if (tuple->t_infomask & HEAP_IS_LOCKED) { ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return true; } ! tuple->t_infomask |= HEAP_XMAX_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); return false; /* updated by other */ } --- 779,789 ---- if (tuple->t_infomask & HEAP_IS_LOCKED) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_INVALID); return true; } ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_COMMITTED); return false; /* updated by other */ } *************** *** 888,899 **** { if (TransactionIdDidCommit(xvac)) { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); } } else if (tuple->t_infomask & HEAP_MOVED_IN) --- 827,836 ---- { if (TransactionIdDidCommit(xvac)) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return false; } ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); } } else if (tuple->t_infomask & HEAP_MOVED_IN) *************** *** 905,918 **** if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! 
} else { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } } --- 842,851 ---- if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); else { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return false; } } *************** *** 934,941 **** /* FIXME -- is this correct w.r.t. the cmax of the tuple? */ if (TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple))) { ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return true; } --- 867,873 ---- /* FIXME -- is this correct w.r.t. the cmax of the tuple? */ if (TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple))) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX_SUBTRANS, HEAP_XMAX_INVALID); return true; } *************** *** 949,963 **** else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple))) return false; else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { /* it must have aborted or crashed */ ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return false; } } --- 881,891 ---- else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple))) return false; else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_COMMITTED); else { /* it must have aborted or crashed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_INVALID); return false; } } *************** *** 998,1011 **** if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) { /* it must have aborted or crashed */ ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return true; } /* xmax transaction committed */ ! tuple->t_infomask |= HEAP_XMAX_COMMITTED; ! 
SetBufferCommitInfoNeedsSave(buffer); } /* --- 926,937 ---- if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) { /* it must have aborted or crashed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_INVALID); return true; } /* xmax transaction committed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_COMMITTED); } /* *************** *** 1054,1065 **** return HEAPTUPLE_DELETE_IN_PROGRESS; if (TransactionIdDidCommit(xvac)) { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return HEAPTUPLE_DEAD; } ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); } else if (tuple->t_infomask & HEAP_MOVED_IN) { --- 980,989 ---- return HEAPTUPLE_DELETE_IN_PROGRESS; if (TransactionIdDidCommit(xvac)) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return HEAPTUPLE_DEAD; } ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); } else if (tuple->t_infomask & HEAP_MOVED_IN) { *************** *** 1071,1083 **** return HEAPTUPLE_INSERT_IN_PROGRESS; if (TransactionIdDidCommit(xvac)) { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return HEAPTUPLE_DEAD; } } --- 995,1005 ---- return HEAPTUPLE_INSERT_IN_PROGRESS; if (TransactionIdDidCommit(xvac)) { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_COMMITTED); ! } else { ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XVAC, HEAP_XMIN_INVALID); return HEAPTUPLE_DEAD; } } *************** *** 1091,1107 **** return HEAPTUPLE_DELETE_IN_PROGRESS; } else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! { ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { /* * Not in Progress, Not Committed, so either Aborted or crashed */ ! tuple->t_infomask |= HEAP_XMIN_INVALID; ! 
SetBufferCommitInfoNeedsSave(buffer); return HEAPTUPLE_DEAD; } /* Should only get here if we set XMIN_COMMITTED */ --- 1013,1025 ---- return HEAPTUPLE_DELETE_IN_PROGRESS; } else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple))) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_COMMITTED); else { /* * Not in Progress, Not Committed, so either Aborted or crashed */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMIN, HEAP_XMIN_INVALID); return HEAPTUPLE_DEAD; } /* Should only get here if we set XMIN_COMMITTED */ *************** *** 1143,1150 **** * We know that xmax did lock the tuple, but it did not and will * never actually update it. */ ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); } return HEAPTUPLE_LIVE; } --- 1061,1067 ---- * We know that xmax did lock the tuple, but it did not and will * never actually update it. */ ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMULTI, HEAP_XMAX_INVALID); } return HEAPTUPLE_LIVE; } *************** *** 1161,1177 **** if (TransactionIdIsInProgress(HeapTupleHeaderGetXmax(tuple))) return HEAPTUPLE_DELETE_IN_PROGRESS; else if (TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) ! { ! tuple->t_infomask |= HEAP_XMAX_COMMITTED; ! SetBufferCommitInfoNeedsSave(buffer); ! } else { /* * Not in Progress, Not Committed, so either Aborted or crashed */ ! tuple->t_infomask |= HEAP_XMAX_INVALID; ! SetBufferCommitInfoNeedsSave(buffer); return HEAPTUPLE_LIVE; } /* Should only get here if we set XMAX_COMMITTED */ --- 1078,1090 ---- if (TransactionIdIsInProgress(HeapTupleHeaderGetXmax(tuple))) return HEAPTUPLE_DELETE_IN_PROGRESS; else if (TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple))) ! HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_COMMITTED); else { /* * Not in Progress, Not Committed, so either Aborted or crashed */ ! 
HeapTupleSetVisibilityInfo(tuple, buffer, TUPLE_XMAX, HEAP_XMAX_INVALID); return HEAPTUPLE_LIVE; } /* Should only get here if we set XMAX_COMMITTED */ *************** *** 1205,1210 **** --- 1118,1185 ---- return HEAPTUPLE_DEAD; } + /* + * HeapTupleSetVisibilityInfo() + * + * Set the visibility info on a tuple if that is allowable at this + * point in time. + * + * We're able to set this info when we are looking at one of our own + * transaction's aborted subtransactions, or when we are examining + * the xvac field, since a VACUUM FULL is always a guaranteed transaction. + * + * Otherwise we can only set visibility information for a tuple when + * the transaction commit has been flushed, which may not yet be the + * case for unguaranteed transactions - so we check. Note that if we + * do have to check then we have already confirmed that the + * TransactionId is not in progress (see comments in this file header). + * No need to check Aborts, since those are never deferred. + * + * For Multitransactions we won't be able to mark them until all + * transactions that were part of them have been flushed. 
+ */ + static void + HeapTupleSetVisibilityInfo(HeapTupleHeader tuple, + Buffer buffer, SetTupleVisibilityAction action, uint16 infomask) + { + if (WALWriterActive()) + { + switch (action) + { + case TUPLE_XMIN: + if (infomask == HEAP_XMIN_COMMITTED && + !TransactionIdIsFlushed(HeapTupleHeaderGetXmin(tuple))) + return; + break; + + case TUPLE_XMAX: + if (infomask == HEAP_XMAX_COMMITTED && + !TransactionIdIsFlushed(HeapTupleHeaderGetXmax(tuple))) + return; + break; + + /* Multitransactions are always xmax */ + case TUPLE_XMULTI: + if (!MultiXactIdIsFlushed(HeapTupleHeaderGetXmax(tuple))) + return; + break; + + case TUPLE_XVAC: + case TUPLE_XMAX_SUBTRANS: + break; + + default: + elog(ERROR, "invalid action for HeapTupleSetVisibilityInfo"); + break; + } + } + + /* + * We're allowed to set the info bits and mark the buffer dirty + */ + tuple->t_infomask |= infomask; + SetBufferCommitInfoNeedsSave(buffer); + } /* * GetTransactionSnapshot Index: src/include/access/clog.h =================================================================== RCS file: /projects/cvsroot/pgsql/src/include/access/clog.h,v retrieving revision 1.19 diff -c -r1.19 clog.h *** src/include/access/clog.h 5 Jan 2007 22:19:50 -0000 1.19 --- src/include/access/clog.h 5 Apr 2007 21:51:57 -0000 *************** *** 32,38 **** #define NUM_CLOG_BUFFERS 8 ! extern void TransactionIdSetStatus(TransactionId xid, XidStatus status); extern XidStatus TransactionIdGetStatus(TransactionId xid); extern Size CLOGShmemSize(void); --- 32,38 ---- #define NUM_CLOG_BUFFERS 8 ! 
extern void TransactionIdSetStatus(TransactionId xid, XidStatus status, XLogRecPtr lsn); extern XidStatus TransactionIdGetStatus(TransactionId xid); extern Size CLOGShmemSize(void); Index: src/include/access/multixact.h =================================================================== RCS file: /projects/cvsroot/pgsql/src/include/access/multixact.h,v retrieving revision 1.12 diff -c -r1.12 multixact.h *** src/include/access/multixact.h 5 Jan 2007 22:19:51 -0000 1.12 --- src/include/access/multixact.h 5 Apr 2007 21:51:58 -0000 *************** *** 45,50 **** --- 45,51 ---- extern MultiXactId MultiXactIdCreate(TransactionId xid1, TransactionId xid2); extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid); extern bool MultiXactIdIsRunning(MultiXactId multi); + extern bool MultiXactIdIsFlushed(MultiXactId multi); extern bool MultiXactIdIsCurrent(MultiXactId multi); extern void MultiXactIdWait(MultiXactId multi); extern bool ConditionalMultiXactIdWait(MultiXactId multi); Index: src/include/access/slru.h =================================================================== RCS file: /projects/cvsroot/pgsql/src/include/access/slru.h,v retrieving revision 1.20 diff -c -r1.20 slru.h *** src/include/access/slru.h 5 Jan 2007 22:19:51 -0000 1.20 --- src/include/access/slru.h 5 Apr 2007 21:51:58 -0000 *************** *** 14,19 **** --- 14,20 ---- #define SLRU_H #include "storage/lwlock.h" + #include "access/xlogdefs.h" /* *************** *** 47,52 **** --- 48,54 ---- char **page_buffer; SlruPageStatus *page_status; bool *page_dirty; + XLogRecPtr *page_lsn; /* only set if do_wal_flush is true */ int *page_number; int *page_lru_count; LWLockId *buffer_locks; *************** *** 74,92 **** /* * SlruCtlData is an unshared structure that points to the active information ! * in shared memory. */ typedef struct SlruCtlData { SlruShared shared; /* ! * This flag tells whether to fsync writes (true for pg_clog, false for ! * pg_subtrans). 
 */ bool do_fsync; /* * Decide which of two page numbers is "older" for truncation purposes. We * need to use comparison of TransactionIds here in order to do the right * thing with wraparound XID arithmetic. --- 76,101 ---- /* * SlruCtlData is an unshared structure that points to the active information ! * in shared memory. Just so it's clear: this information is accessible even ! * when you do not hold the Control lock for the slru. */ typedef struct SlruCtlData { SlruShared shared; /* ! * This flag tells whether to fsync writes ! * (true for pg_clog and multitrans, false for pg_subtrans). */ bool do_fsync; /* + * This flag tells whether to flush WAL before writing pages + * (true for pg_clog, false for multitrans and pg_subtrans). + */ + bool do_wal_flush; + + /* * Decide which of two page numbers is "older" for truncation purposes. We * need to use comparison of TransactionIds here in order to do the right * thing with wraparound XID arithmetic. Index: src/include/access/transam.h =================================================================== RCS file: /projects/cvsroot/pgsql/src/include/access/transam.h,v retrieving revision 1.60 diff -c -r1.60 transam.h *** src/include/access/transam.h 5 Jan 2007 22:19:51 -0000 1.60 --- src/include/access/transam.h 5 Apr 2007 21:51:58 -0000 *************** *** 14,19 **** --- 14,20 ---- #ifndef TRANSAM_H #define TRANSAM_H + #include "access/xlogdefs.h" /* ---------------- * Special transaction ID values *************** *** 114,124 **** */ extern bool TransactionIdDidCommit(TransactionId transactionId); extern bool TransactionIdDidAbort(TransactionId transactionId); ! extern void TransactionIdCommit(TransactionId transactionId); ! extern void TransactionIdAbort(TransactionId transactionId); extern void TransactionIdSubCommit(TransactionId transactionId); ! extern void TransactionIdCommitTree(int nxids, TransactionId *xids); ! 
extern void TransactionIdAbortTree(int nxids, TransactionId *xids); extern bool TransactionIdPrecedes(TransactionId id1, TransactionId id2); extern bool TransactionIdPrecedesOrEquals(TransactionId id1, TransactionId id2); extern bool TransactionIdFollows(TransactionId id1, TransactionId id2); --- 115,125 ---- */ extern bool TransactionIdDidCommit(TransactionId transactionId); extern bool TransactionIdDidAbort(TransactionId transactionId); ! extern void TransactionIdCommit(TransactionId transactionId, XLogRecPtr lsn); ! extern void TransactionIdAbort(TransactionId transactionId, XLogRecPtr lsn); extern void TransactionIdSubCommit(TransactionId transactionId); ! extern void TransactionIdCommitTree(int nxids, TransactionId *xids, XLogRecPtr lsn); ! extern void TransactionIdAbortTree(int nxids, TransactionId *xids, XLogRecPtr lsn); extern bool TransactionIdPrecedes(TransactionId id1, TransactionId id2); extern bool TransactionIdPrecedesOrEquals(TransactionId id1, TransactionId id2); extern bool TransactionIdFollows(TransactionId id1, TransactionId id2); Index: src/include/access/xact.h =================================================================== RCS file: /projects/cvsroot/pgsql/src/include/access/xact.h,v retrieving revision 1.85 diff -c -r1.85 xact.h *** src/include/access/xact.h 13 Mar 2007 00:33:42 -0000 1.85 --- src/include/access/xact.h 5 Apr 2007 21:51:58 -0000 *************** *** 16,21 **** --- 16,22 ---- #include "access/xlog.h" #include "nodes/pg_list.h" + #include "postmaster/walwriter.h" #include "storage/relfilenode.h" #include "utils/timestamp.h" *************** *** 41,46 **** --- 42,50 ---- extern bool DefaultXactReadOnly; extern bool XactReadOnly; + /* Deferred Fsync */ + extern bool DefaultXactCommitGuarantee; + extern void SetXactCommitGuarantee(bool RequestedXactCommitGuarantee); /* * start- and end-of-transaction callbacks for dynamically loaded modules */ *************** *** 145,150 **** --- 149,155 ---- extern void 
SetCurrentStatementStartTimestamp(void); extern int GetCurrentTransactionNestLevel(void); extern bool TransactionIdIsCurrentTransactionId(TransactionId xid); + extern bool TransactionIdIsFlushed(TransactionId xid); extern void CommandCounterIncrement(void); extern void StartTransactionCommand(void); extern void CommitTransactionCommand(void); *************** *** 179,182 **** --- 184,192 ---- extern void xact_redo(XLogRecPtr lsn, XLogRecord *record); extern void xact_desc(StringInfo buf, uint8 xl_info, char *rec); + extern void DeferredFsyncShmemInit(void); + extern Size DeferredFsyncShmemSize(void); + extern void FlushAnyDeferredFsyncXacts(bool loop_if_busy, bool have_lock); + extern bool TransactionIdIsFlushed(TransactionId xid); + #endif /* XACT_H */ Index: src/include/access/xlogdefs.h =================================================================== RCS file: /projects/cvsroot/pgsql/src/include/access/xlogdefs.h,v retrieving revision 1.17 diff -c -r1.17 xlogdefs.h *** src/include/access/xlogdefs.h 14 Feb 2007 05:00:40 -0000 1.17 --- src/include/access/xlogdefs.h 5 Apr 2007 21:51:58 -0000 *************** *** 33,38 **** --- 33,40 ---- uint32 xrecoff; /* byte offset of location in log file */ } XLogRecPtr; + #define XLogRecPtrIsInvalid(p) \ + (((p).xlogid == 0 && (p).xrecoff == 0) ? 
true : false) /* * Macros for comparing XLogRecPtrs Index: src/include/storage/lwlock.h =================================================================== RCS file: /projects/cvsroot/pgsql/src/include/storage/lwlock.h,v retrieving revision 1.35 diff -c -r1.35 lwlock.h *** src/include/storage/lwlock.h 3 Apr 2007 16:34:36 -0000 1.35 --- src/include/storage/lwlock.h 5 Apr 2007 21:51:59 -0000 *************** *** 61,66 **** --- 61,67 ---- BtreeVacuumLock, AddinShmemInitLock, AutovacuumLock, + DeferredFsyncLock, /* Individual lock IDs end here */ FirstBufMappingLock, FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS, Index: src/include/storage/pmsignal.h =================================================================== RCS file: /projects/cvsroot/pgsql/src/include/storage/pmsignal.h,v retrieving revision 1.17 diff -c -r1.17 pmsignal.h *** src/include/storage/pmsignal.h 15 Feb 2007 23:23:23 -0000 1.17 --- src/include/storage/pmsignal.h 5 Apr 2007 21:51:59 -0000 *************** *** 25,30 **** --- 25,31 ---- PMSIGNAL_PASSWORD_CHANGE, /* pg_auth file has changed */ PMSIGNAL_WAKEN_CHILDREN, /* send a SIGUSR1 signal to all backends */ PMSIGNAL_WAKEN_ARCHIVER, /* send a NOTIFY signal to xlog archiver */ + PMSIGNAL_WAKEN_WALWRITER, /* send a NOTIFY signal to WAL Writer */ PMSIGNAL_ROTATE_LOGFILE, /* send SIGUSR1 to syslogger to rotate logfile */ PMSIGNAL_START_AUTOVAC_LAUNCHER, /* start an autovacuum launcher */ PMSIGNAL_START_AUTOVAC_WORKER, /* start an autovacuum worker */ Index: src/include/utils/guc_tables.h =================================================================== RCS file: /projects/cvsroot/pgsql/src/include/utils/guc_tables.h,v retrieving revision 1.32 diff -c -r1.32 guc_tables.h *** src/include/utils/guc_tables.h 13 Mar 2007 14:32:25 -0000 1.32 --- src/include/utils/guc_tables.h 5 Apr 2007 21:51:59 -0000 *************** *** 51,56 **** --- 51,57 ---- RESOURCES_KERNEL, WAL, WAL_SETTINGS, + WAL_COMMITS, WAL_CHECKPOINTS, QUERY_TUNING, 
QUERY_TUNING_METHOD, Index: src/include/utils/tqual.h =================================================================== RCS file: /projects/cvsroot/pgsql/src/include/utils/tqual.h,v retrieving revision 1.66 diff -c -r1.66 tqual.h *** src/include/utils/tqual.h 25 Mar 2007 19:45:14 -0000 1.66 --- src/include/utils/tqual.h 5 Apr 2007 21:52:00 -0000 *************** *** 125,130 **** --- 125,140 ---- HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */ } HTSV_Result; + /* Action codes for HeapTupleSetVisibilityInfo */ + typedef enum + { + TUPLE_XMIN, /* check the tuple's xmin as a TransactionId */ + TUPLE_XMAX, /* check the tuple's xmax as a TransactionId */ + TUPLE_XMULTI, /* check the tuple's xmax as a MultitransactionId */ + TUPLE_XVAC, /* looking at xvac */ + TUPLE_XMAX_SUBTRANS /* looking at xmax of an aborted subtrans */ + } SetTupleVisibilityAction; + /* These are the "satisfies" test routines for the various snapshot types */ extern bool HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer);
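Note on the hint-bit gating in tqual.c: the rule that HeapTupleSetVisibilityInfo() applies for each SetTupleVisibilityAction code can be modelled in isolation, roughly as below. This is a simplified sketch under stated assumptions, not code from the patch. SketchTuple, SketchAction, the SK_* constants and sketch_set_visibility() are hypothetical names, the flag values are arbitrary, and the two booleans stand in for whatever TransactionIdIsFlushed() or MultiXactIdIsFlushed() would report for the tuple's xmin and xmax.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; not PostgreSQL's definitions. */
#define SK_XMIN_COMMITTED 0x0001
#define SK_XMIN_INVALID   0x0002
#define SK_XMAX_COMMITTED 0x0004
#define SK_XMAX_INVALID   0x0008

/* Mirrors the shape of the action codes added to tqual.h (names assumed). */
typedef enum
{
    SK_XMIN,            /* check xmin as a transaction id */
    SK_XMAX,            /* check xmax as a transaction id */
    SK_XMULTI,          /* check xmax as a multitransaction id */
    SK_XVAC,            /* looking at xvac: always allowed */
    SK_XMAX_SUBTRANS    /* own aborted subtransaction: always allowed */
} SketchAction;

/* Toy tuple header: hint bits plus precomputed "commit flushed" answers. */
typedef struct
{
    uint16_t infomask;
    bool     xmin_flushed;  /* stand-in for a flushed check on xmin */
    bool     xmax_flushed;  /* stand-in for a flushed check on xmax */
} SketchTuple;

/*
 * Set the requested hint bits unless the WAL writer is active and the
 * relevant commit has not been flushed yet. Returns true if bits were set.
 */
static bool
sketch_set_visibility(SketchTuple *tuple, bool wal_writer_active,
                      SketchAction action, uint16_t infomask)
{
    if (wal_writer_active)
    {
        switch (action)
        {
            case SK_XMIN:
                /* Only "committed" needs the flush check; aborts never defer. */
                if (infomask == SK_XMIN_COMMITTED && !tuple->xmin_flushed)
                    return false;
                break;
            case SK_XMAX:
                if (infomask == SK_XMAX_COMMITTED && !tuple->xmax_flushed)
                    return false;
                break;
            case SK_XMULTI:
                /* Multitransactions are checked whatever bit is requested. */
                if (!tuple->xmax_flushed)
                    return false;
                break;
            case SK_XVAC:
            case SK_XMAX_SUBTRANS:
                break;
        }
    }

    tuple->infomask |= infomask;
    return true;
}
```

In this model a skipped hint bit is not an error: the caller still returns its visibility answer, and a later visit to the tuple can set the bit once the commit record has reached disk.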