Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.191
diff -c -r1.191 config.sgml
*** doc/src/sgml/config.sgml 30 Sep 2008 10:52:09 -0000 1.191
--- doc/src/sgml/config.sgml 27 Oct 2008 18:32:03 -0000
***************
*** 5284,5289 ****
--- 5284,5315 ----
+
+ trace_recovery_messages (string)
+
+ trace_recovery_messages configuration parameter
+
+
+
+ Controls which message levels are written to the server log
+ for system modules needed for recovery processing. This allows
+ the user to override the normal setting of log_min_messages,
+ but only for specific messages. This is intended for use in
+ debugging Hot Standby.
+ Valid values are DEBUG5, DEBUG4,
+ DEBUG3, DEBUG2, DEBUG1,
+ INFO, NOTICE, WARNING,
+ ERROR, LOG, FATAL, and
+ PANIC. Each level includes all the levels that
+ follow it. The later the level, the fewer messages are sent
+ to the log. The default is WARNING. Note that
+ LOG has a different rank here than in
+ client_min_messages.
+ This parameter should be set only in postgresql.conf.
+
+
+
+
zero_damaged_pages (boolean)
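The interaction between trace_recovery_messages and log_min_messages described in the hunk above can be sketched roughly as follows. This is only an illustration: the enum, its numeric values, and both function names are hypothetical, and the "passes either threshold" rule is an assumption about how the override behaves, not code taken from this patch.

```c
#include <assert.h>

/*
 * Hypothetical severity codes, for illustration only. They follow the
 * ordering listed in the documentation above (LOG ranks after ERROR);
 * the numeric values are assumptions, not PostgreSQL's own.
 */
enum sketch_level
{
    SK_DEBUG5, SK_DEBUG4, SK_DEBUG3, SK_DEBUG2, SK_DEBUG1,
    SK_INFO, SK_NOTICE, SK_WARNING, SK_ERROR, SK_LOG, SK_FATAL, SK_PANIC
};

/* Would a message at msg_level be written, given a threshold? */
static int
sketch_is_logged(enum sketch_level msg_level, enum sketch_level threshold)
{
    return msg_level >= threshold;
}

/*
 * Assumed rule: a message from a recovery module is written if it passes
 * either the normal threshold or the recovery-specific override.
 */
static int
sketch_recovery_message_logged(enum sketch_level msg_level,
                               enum sketch_level log_min_messages,
                               enum sketch_level trace_recovery_messages)
{
    assert(msg_level <= SK_PANIC);
    return sketch_is_logged(msg_level, log_min_messages) ||
           sketch_is_logged(msg_level, trace_recovery_messages);
}
```

Under these assumptions, with log_min_messages at WARNING and trace_recovery_messages at DEBUG3, a DEBUG2 message from a recovery module is written while a DEBUG4 message is not.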
Index: doc/src/sgml/func.sgml
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/func.sgml,v
retrieving revision 1.451
diff -c -r1.451 func.sgml
*** doc/src/sgml/func.sgml 27 Oct 2008 09:37:46 -0000 1.451
--- doc/src/sgml/func.sgml 27 Oct 2008 18:32:03 -0000
***************
*** 12419,12424 ****
--- 12419,12615 ----
.
+
+ pg_is_in_recovery
+
+
+ pg_last_completed_xact_timestamp
+
+
+ pg_last_completed_xid
+
+
+ pg_recovery_pause
+
+
+ pg_recovery_continue
+
+
+ pg_recovery_pause_cleanup
+
+
+ pg_recovery_pause_xid
+
+
+ pg_recovery_pause_time
+
+
+ pg_recovery_stop
+
+
+
+ The functions shown in assist in archive recovery.
+ Except for the first three functions, these are restricted to superusers.
+ All of these functions can only be executed during recovery.
+
+
+
+ Recovery Control Functions
+
+
+ Name Return Type Description
+
+
+
+
+
+
+ pg_is_in_recovery()
+
+ bool
+ True if recovery is still in progress.
+
+
+
+ pg_last_completed_xact_timestamp()
+
+ timestamp with time zone
+ Returns the original completion timestamp with timezone of the
+ last completed transaction in the current recovery.
+
+
+
+
+ pg_last_completed_xid()
+
+ integer
+ Returns the transaction id (32-bit) of the last completed transaction
+ in the current recovery. Later-numbered transaction ids may already have
+ completed. This is unrelated to transactions on the source server.
+
+
+
+
+
+ pg_recovery_pause()
+
+ void
+ Pause recovery processing, unconditionally.
+
+
+
+ pg_recovery_continue()
+
+ void
+ If recovery is paused, continue processing.
+
+
+
+ pg_recovery_stop()
+
+ void
+ End recovery and begin normal processing.
+
+
+
+ pg_recovery_pause_xid()
+
+ void
+ Continue recovery until specified xid completes, if it is ever
+ seen, then pause recovery.
+
+
+
+
+ pg_recovery_pause_time()
+
+ void
+ Continue recovery until a transaction with specified timestamp
+ completes, if one is ever seen, then pause recovery.
+
+
+
+
+ pg_recovery_pause_cleanup()
+
+ void
+ Continue recovery until the next cleanup record, then pause.
+
+
+
+ pg_recovery_advance()
+
+ void
+ Advance recovery by the specified number of records, then pause.
+
+
+
+
+
+
+ pg_recovery_pause and pg_recovery_continue allow
+ a superuser to control the progress of recovery on the database server.
+ While recovery is paused, queries can be executed to determine how far
+ forward recovery should progress. Recovery can never go backwards
+ because previous values are overwritten. If the superuser wishes recovery
+ to complete and normal processing to start, execute
+ pg_recovery_stop.
+
+
+
+ Variations of the pause function exist, mainly to allow PITR to dynamically
+ control where it should progress to. pg_recovery_pause_xid and
+ pg_recovery_pause_time allow the specification of a trial
+ recovery target, similarly to .
+ Recovery will then progress to the specified point and then pause, rather
+ than stopping permanently, allowing assessment of whether this is the
+ desired stopping point for recovery.
+
+
+
+ pg_recovery_pause_cleanup allows recovery to progress only
+ as far as the next cleanup record. This is useful where a longer running
+ query needs to access the database in a consistent state and it is
+ more important that the query executes than it is that we keep processing
+ new WAL records. This can be used as shown:
+
+ select pg_recovery_pause_cleanup();
+
+ -- run very important query
+ select
+ from big_table1 join big_table2
+ on ...
+ where ...
+
+ select pg_recovery_continue();
+
+
+
+
+ pg_recovery_advance allows recovery to progress record by
+ record, for very careful analysis or debugging. Step size can be 1 or
+ more records. If recovery is not yet paused then pg_recovery_advance
+ will process the specified number of records then pause. If recovery
+ is already paused, recovery will continue for another N records before
+ pausing again.
+
+
+
+ If you pause recovery while the server is waiting for a WAL file when
+ operating in standby mode, the request will appear to have no effect
+ until the file arrives. Once the server begins processing WAL records
+ again it will notice the pause request and act upon it. This is not a
+ bug.
+
+
+
+ Pausing recovery will also prevent restartpoints from starting, since they
+ are triggered by events in the WAL stream. In all other ways processing
+ will continue; for example, the background writer will continue to clean
+ shared_buffers while paused.
+
+
The functions shown in calculate
the actual disk space usage of database objects.
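The pause, continue and advance semantics documented in the func.sgml hunk above can be modelled with a small state machine. The sketch below is an assumption-laden illustration, not the patch's implementation: SketchRecoveryCtl and every sketch_* function are hypothetical names, and the real code keeps its state in shared memory rather than a local struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical control block; the names here are not from the patch. */
typedef struct SketchRecoveryCtl
{
    bool paused;            /* is replay currently stopped? */
    int  advance_remaining; /* records still to replay before pausing */
} SketchRecoveryCtl;

/* Models pg_recovery_pause(): stop before the next record. */
static void
sketch_pause(SketchRecoveryCtl *ctl)
{
    ctl->paused = true;
    ctl->advance_remaining = 0;
}

/* Models pg_recovery_continue(): resume with no pending pause. */
static void
sketch_continue(SketchRecoveryCtl *ctl)
{
    ctl->paused = false;
    ctl->advance_remaining = 0;
}

/* Models pg_recovery_advance(n): replay n more records, then pause. */
static void
sketch_advance(SketchRecoveryCtl *ctl, int nrecords)
{
    assert(nrecords >= 1);
    ctl->paused = false;
    ctl->advance_remaining = nrecords;
}

/*
 * Called by the replay loop before each record. Returns true if the
 * record may be replayed, false if replay must wait.
 */
static bool
sketch_may_replay_next(SketchRecoveryCtl *ctl)
{
    if (ctl->paused)
        return false;
    if (ctl->advance_remaining > 0 && --ctl->advance_remaining == 0)
        ctl->paused = true;     /* this record is the last of the step */
    return true;
}
```

In this model, calling sketch_advance(&ctl, 2) on a paused control block lets exactly two records through before replay blocks again, which matches the documented behaviour of advancing from either a running or an already paused state.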
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.265
diff -c -r1.265 heapam.c
*** src/backend/access/heap/heapam.c 8 Oct 2008 01:14:44 -0000 1.265
--- src/backend/access/heap/heapam.c 27 Oct 2008 18:32:03 -0000
***************
*** 4033,4039 ****
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
! buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
--- 4033,4039 ----
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
! buffer = XLogReadBufferForCleanup(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
***************
*** 4082,4088 ****
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
! buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
--- 4082,4088 ----
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
! buffer = XLogReadBufferForCleanup(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
Index: src/backend/access/heap/pruneheap.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/heap/pruneheap.c,v
retrieving revision 1.16
diff -c -r1.16 pruneheap.c
*** src/backend/access/heap/pruneheap.c 13 Jul 2008 20:45:47 -0000 1.16
--- src/backend/access/heap/pruneheap.c 27 Oct 2008 18:32:03 -0000
***************
*** 85,90 ****
--- 85,98 ----
return;
/*
+ * We can't write WAL in recovery mode, so there's no point trying to
+ * clean the page. The master will likely issue a cleaning WAL record
+ * soon anyway, so this is no particular loss.
+ */
+ if (IsRecoveryProcessingMode())
+ return;
+
+ /*
* We prune when a previous UPDATE failed to find enough space on the page
* for a new tuple version, or when free space falls below the relation's
* fill-factor target (but not less than 10%).
Index: src/backend/access/transam/clog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/clog.c,v
retrieving revision 1.48
diff -c -r1.48 clog.c
*** src/backend/access/transam/clog.c 20 Oct 2008 19:18:18 -0000 1.48
--- src/backend/access/transam/clog.c 27 Oct 2008 18:32:03 -0000
***************
*** 459,464 ****
--- 459,467 ----
/*
* This must be called ONCE during postmaster or standalone-backend startup,
* after StartupXLOG has initialized ShmemVariableCache->nextXid.
+ *
+ * We access just a single clog page, so this action is atomic and safe
+ * for use if other processes are active during recovery.
*/
void
StartupCLOG(void)
Index: src/backend/access/transam/multixact.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/multixact.c,v
retrieving revision 1.28
diff -c -r1.28 multixact.c
*** src/backend/access/transam/multixact.c 1 Aug 2008 13:16:08 -0000 1.28
--- src/backend/access/transam/multixact.c 27 Oct 2008 18:32:03 -0000
***************
*** 1413,1420 ****
* MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact. Note that we
* may already have replayed WAL data into the SLRU files.
*
! * We don't need any locks here, really; the SLRU locks are taken
! * only because slru.c expects to be called with locks held.
*/
void
StartupMultiXact(void)
--- 1413,1423 ----
* MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact. Note that we
* may already have replayed WAL data into the SLRU files.
*
! * We want this operation to be atomic to ensure that other processes can
! * use MultiXact while we complete recovery. We access one page only from the
! * offset and members buffers, so once locks are acquired they will not be
! * dropped and re-acquired by SLRU code. So we take both locks at start, then
! * hold them all the way to the end.
*/
void
StartupMultiXact(void)
***************
*** 1426,1431 ****
--- 1429,1435 ----
/* Clean up offsets state */
LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
/*
* Initialize our idea of the latest page number.
***************
*** 1452,1461 ****
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(MultiXactOffsetControlLock);
-
/* And the same for members */
- LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
/*
* Initialize our idea of the latest page number.
--- 1456,1462 ----
***************
*** 1483,1488 ****
--- 1484,1490 ----
}
LWLockRelease(MultiXactMemberControlLock);
+ LWLockRelease(MultiXactOffsetControlLock);
/*
* Initialize lastTruncationPoint to invalid, ensuring that the first
***************
*** 1543,1549 ****
* SimpleLruTruncate would get confused. It seems best not to risk
* removing any data during recovery anyway, so don't truncate.
*/
! if (!InRecovery)
TruncateMultiXact();
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
--- 1545,1551 ----
* SimpleLruTruncate would get confused. It seems best not to risk
* removing any data during recovery anyway, so don't truncate.
*/
! if (!IsRecoveryProcessingMode())
TruncateMultiXact();
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
Index: src/backend/access/transam/rmgr.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/rmgr.c,v
retrieving revision 1.26
diff -c -r1.26 rmgr.c
*** src/backend/access/transam/rmgr.c 30 Sep 2008 10:52:11 -0000 1.26
--- src/backend/access/transam/rmgr.c 27 Oct 2008 18:32:03 -0000
***************
*** 21,26 ****
--- 21,27 ----
#include "commands/tablespace.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+ #include "utils/inval.h"
const RmgrData RmgrTable[RM_MAX_ID + 1] = {
***************
*** 32,38 ****
{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
{"FreeSpaceMap", fsm_redo, fsm_desc, NULL, NULL, NULL},
! {"Reserved 8", NULL, NULL, NULL, NULL, NULL},
{"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
{"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
--- 33,39 ----
{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
{"FreeSpaceMap", fsm_redo, fsm_desc, NULL, NULL, NULL},
! {"Relation", relation_redo, relation_desc, NULL, NULL, NULL},
{"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
{"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
Index: src/backend/access/transam/slru.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/slru.c,v
retrieving revision 1.44
diff -c -r1.44 slru.c
*** src/backend/access/transam/slru.c 1 Jan 2008 19:45:48 -0000 1.44
--- src/backend/access/transam/slru.c 27 Oct 2008 18:32:03 -0000
***************
*** 619,624 ****
--- 619,632 ----
if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
{
+ if (InRecovery)
+ {
+ ereport(LOG,
+ (errmsg("file \"%s\" doesn't exist, reading as zeroes",
+ path)));
+ MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
+ return true;
+ }
slru_errcause = SLRU_SEEK_FAILED;
slru_errno = errno;
close(fd);
***************
*** 628,633 ****
--- 636,649 ----
errno = 0;
if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
{
+ if (InRecovery)
+ {
+ ereport(LOG,
+ (errmsg("file \"%s\" doesn't exist, reading as zeroes",
+ path)));
+ MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
+ return true;
+ }
slru_errcause = SLRU_READ_FAILED;
slru_errno = errno;
close(fd);
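The two slru.c hunks above apply the same rule: if a seek or read of an SLRU page fails while in recovery, treat the page as all zeroes instead of reporting an error. A standalone sketch of that rule, using plain POSIX calls, might look like this. SKETCH_BLCKSZ and sketch_read_page are hypothetical names, the in_recovery argument stands in for the InRecovery global, and the log message is simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192  /* assumed page size for this sketch */

/*
 * Read one page at the given offset into page[]. On a failed seek or a
 * short read, return a zero-filled page if in_recovery is set; otherwise
 * report failure to the caller.
 */
static bool
sketch_read_page(int fd, off_t offset, char *page, bool in_recovery)
{
    assert(page != NULL);

    if (lseek(fd, offset, SEEK_SET) < 0 ||
        read(fd, page, SKETCH_BLCKSZ) != SKETCH_BLCKSZ)
    {
        if (in_recovery)
        {
            fprintf(stderr, "page unavailable, reading as zeroes\n");
            memset(page, 0, SKETCH_BLCKSZ);
            return true;
        }
        return false;
    }
    return true;
}
```

The design choice being illustrated is that recovery tolerates missing SLRU data on the expectation that later WAL replay will supply it, whereas normal operation still treats the same failure as an error.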
Index: src/backend/access/transam/subtrans.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/subtrans.c,v
retrieving revision 1.23
diff -c -r1.23 subtrans.c
*** src/backend/access/transam/subtrans.c 1 Aug 2008 13:16:08 -0000 1.23
--- src/backend/access/transam/subtrans.c 27 Oct 2008 18:32:03 -0000
***************
*** 223,257 ****
/*
* This must be called ONCE during postmaster or standalone-backend startup,
* after StartupXLOG has initialized ShmemVariableCache->nextXid.
- *
- * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
- * if there are none.
*/
void
StartupSUBTRANS(TransactionId oldestActiveXID)
{
! int startPage;
! int endPage;
- /*
- * Since we don't expect pg_subtrans to be valid across crashes, we
- * initialize the currently-active page(s) to zeroes during startup.
- * Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
- * the new page without regard to whatever was previously on disk.
- */
LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
! startPage = TransactionIdToPage(oldestActiveXID);
! endPage = TransactionIdToPage(ShmemVariableCache->nextXid);
!
! while (startPage != endPage)
! {
! (void) ZeroSUBTRANSPage(startPage);
! startPage++;
! }
! (void) ZeroSUBTRANSPage(startPage);
LWLockRelease(SubtransControlLock);
}
/*
--- 223,244 ----
/*
* This must be called ONCE during postmaster or standalone-backend startup,
* after StartupXLOG has initialized ShmemVariableCache->nextXid.
*/
void
StartupSUBTRANS(TransactionId oldestActiveXID)
{
! TransactionId xid = ShmemVariableCache->nextXid;
! int pageno = TransactionIdToPage(xid);
LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
! /*
! * Initialize our idea of the latest page number.
! */
! SubTransCtl->shared->latest_page_number = pageno;
LWLockRelease(SubtransControlLock);
+
}
/*
Index: src/backend/access/transam/twophase.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/twophase.c,v
retrieving revision 1.46
diff -c -r1.46 twophase.c
*** src/backend/access/transam/twophase.c 20 Oct 2008 19:18:18 -0000 1.46
--- src/backend/access/transam/twophase.c 27 Oct 2008 18:32:03 -0000
***************
*** 1710,1715 ****
--- 1710,1716 ----
xlrec.crec.xact_time = GetCurrentTimestamp();
xlrec.crec.nrels = nrels;
xlrec.crec.nsubxacts = nchildren;
+ xlrec.crec.slotId = MyProc->slotId;
rdata[0].data = (char *) (&xlrec);
rdata[0].len = MinSizeOfXactCommitPrepared;
rdata[0].buffer = InvalidBuffer;
***************
*** 1788,1793 ****
--- 1789,1795 ----
xlrec.arec.xact_time = GetCurrentTimestamp();
xlrec.arec.nrels = nrels;
xlrec.arec.nsubxacts = nchildren;
+ xlrec.arec.slotId = MyProc->slotId;
rdata[0].data = (char *) (&xlrec);
rdata[0].len = MinSizeOfXactAbortPrepared;
rdata[0].buffer = InvalidBuffer;
Index: src/backend/access/transam/xact.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xact.c,v
retrieving revision 1.266
diff -c -r1.266 xact.c
*** src/backend/access/transam/xact.c 20 Oct 2008 19:18:18 -0000 1.266
--- src/backend/access/transam/xact.c 27 Oct 2008 18:32:03 -0000
***************
*** 72,77 ****
--- 72,81 ----
*/
bool MyXactAccessedTempRel = false;
+ /*
+ * Bookkeeping for tracking emulated transactions in Recovery Procs.
+ */
+ static TransactionId latestObservedXid = InvalidTransactionId;
/*
* transaction states - transaction state from server perspective
***************
*** 139,144 ****
--- 143,150 ----
Oid prevUser; /* previous CurrentUserId setting */
bool prevSecDefCxt; /* previous SecurityDefinerContext setting */
bool prevXactReadOnly; /* entry-time xact r/o state */
+ bool xidMarkedInWAL; /* is this xid present in WAL yet? */
+ bool hasUnMarkedSubXids; /* had unmarked subxids */
struct TransactionStateData *parent; /* back link to parent */
} TransactionStateData;
***************
*** 167,172 ****
--- 173,180 ----
InvalidOid, /* previous CurrentUserId setting */
false, /* previous SecurityDefinerContext setting */
false, /* entry-time xact r/o state */
+ false, /* initial state for xidMarkedInWAL */
+ false, /* hasUnMarkedSubXids */
NULL /* link to parent state block */
};
***************
*** 235,241 ****
/* local function prototypes */
! static void AssignTransactionId(TransactionState s);
static void AbortTransaction(void);
static void AtAbort_Memory(void);
static void AtCleanup_Memory(void);
--- 243,249 ----
/* local function prototypes */
! static void AssignTransactionId(TransactionState s, int recursion_level);
static void AbortTransaction(void);
static void AtAbort_Memory(void);
static void AtCleanup_Memory(void);
***************
*** 329,335 ****
GetTopTransactionId(void)
{
if (!TransactionIdIsValid(TopTransactionStateData.transactionId))
! AssignTransactionId(&TopTransactionStateData);
return TopTransactionStateData.transactionId;
}
--- 337,343 ----
GetTopTransactionId(void)
{
if (!TransactionIdIsValid(TopTransactionStateData.transactionId))
! AssignTransactionId(&TopTransactionStateData, 0);
return TopTransactionStateData.transactionId;
}
***************
*** 359,365 ****
TransactionState s = CurrentTransactionState;
if (!TransactionIdIsValid(s->transactionId))
! AssignTransactionId(s);
return s->transactionId;
}
--- 367,373 ----
TransactionState s = CurrentTransactionState;
if (!TransactionIdIsValid(s->transactionId))
! AssignTransactionId(s, 0);
return s->transactionId;
}
***************
*** 376,381 ****
--- 384,480 ----
return CurrentTransactionState->transactionId;
}
+ /*
+ * Fill in additional transaction information for an XLogRecord.
+ * We do this here so we can inspect various transaction state data,
+ * plus no need to further clutter XLogInsert().
+ */
+ void
+ GetStandbyInfoForTransaction(RmgrId rmid, uint8 info, XLogRecData *rdata,
+ TransactionId *xid2, uint16 *info2)
+ {
+ int level;
+ int slotId;
+
+ if (!MyProc)
+ *info2 |= XLR2_INVALID_SLOT_ID;
+ else
+ {
+ slotId = MyProc->slotId;
+
+ if (slotId >= XLOG_MAX_SLOT_ID)
+ *info2 |= XLR2_INVALID_SLOT_ID;
+ else
+ *info2 = ((uint16) slotId) & XLR2_INFO2_MASK;
+ }
+
+ if (rmid == RM_XACT_ID && info == XLOG_XACT_ASSIGNMENT)
+ {
+ xl_xact_assignment *xlrec = (xl_xact_assignment *) rdata->data;
+
+ /*
+ * We set the flag for records written by AssignTransactionId
+ * to allow that record type to be handled by
+ * RecordKnownAssignedTransactionIds(). This looks a little
+ * strange, but it avoids the need to alter the API of XLogInsert.
+ */
+ if (xlrec->isSubXact)
+ *info2 |= XLR2_FIRST_SUBXID_RECORD;
+ else
+ *info2 |= XLR2_FIRST_XID_RECORD;
+ }
+ else
+ {
+ TransactionState s = CurrentTransactionState;
+
+ /*
+ * If we haven't assigned an xid yet, don't flag the record.
+ * We currently assign xids when we make database entries, so
+ * things like storage creation and oid assignment does not
+ * have xids assigned on them. So no need to mark xid2 either.
+ */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return;
+
+ level = GetCurrentTransactionNestLevel();
+
+ if (level >= 1 && !s->xidMarkedInWAL)
+ {
+ if (level == 1)
+ *info2 |= XLR2_FIRST_XID_RECORD;
+ else
+ {
+ *info2 |= XLR2_FIRST_SUBXID_RECORD;
+
+ if (level == 2 &&
+ !CurrentTransactionState->parent->xidMarkedInWAL)
+ {
+ *info2 |= XLR2_FIRST_XID_RECORD;
+ CurrentTransactionState->parent->xidMarkedInWAL = true;
+ }
+ }
+ CurrentTransactionState->xidMarkedInWAL = true;
+
+ /*
+ * Decide whether we need to mark subtrans or not, for this xid.
+ * Top-level transaction is level=1, so we need to be careful to
+ * start at the right subtransaction.
+ */
+ if (level > (PGPROC_MAX_CACHED_SUBXIDS + 1))
+ *info2 |= XLR2_MARK_SUBTRANS;
+ }
+
+ /*
+ * Set the secondary TransactionId for this record
+ */
+ if (*info2 & XLR2_FIRST_SUBXID_RECORD)
+ *xid2 = CurrentTransactionState->parent->transactionId;
+ else if (rmid == RM_HEAP2_ID)
+ *xid2 = InvalidTransactionId; // XXX: GetLatestRemovedXidIfAny();
+ }
+
+ elog(trace_recovery(DEBUG3), "info2 %d xid2 %d", *info2, *xid2);
+ }
/*
* AssignTransactionId
***************
*** 387,397 ****
* following its parent's.
*/
static void
! AssignTransactionId(TransactionState s)
{
bool isSubXact = (s->parent != NULL);
ResourceOwner currentOwner;
/* Assert that caller didn't screw up */
Assert(!TransactionIdIsValid(s->transactionId));
Assert(s->state == TRANS_INPROGRESS);
--- 486,499 ----
* following its parent's.
*/
static void
! AssignTransactionId(TransactionState s, int recursion_level)
{
bool isSubXact = (s->parent != NULL);
ResourceOwner currentOwner;
+ if (IsRecoveryProcessingMode())
+ elog(FATAL, "cannot assign TransactionIds during recovery");
+
/* Assert that caller didn't screw up */
Assert(!TransactionIdIsValid(s->transactionId));
Assert(s->state == TRANS_INPROGRESS);
***************
*** 401,407 ****
* than its parent.
*/
if (isSubXact && !TransactionIdIsValid(s->parent->transactionId))
! AssignTransactionId(s->parent);
/*
* Generate a new Xid and record it in PG_PROC and pg_subtrans.
--- 503,509 ----
* than its parent.
*/
if (isSubXact && !TransactionIdIsValid(s->parent->transactionId))
! AssignTransactionId(s->parent, recursion_level + 1);
/*
* Generate a new Xid and record it in PG_PROC and pg_subtrans.
***************
*** 413,419 ****
*/
s->transactionId = GetNewTransactionId(isSubXact);
! if (isSubXact)
SubTransSetParent(s->transactionId, s->parent->transactionId);
/*
--- 515,528 ----
*/
s->transactionId = GetNewTransactionId(isSubXact);
! /*
! * If we have overflowed the subxid cache then we must mark subtrans
! * with the parent xid. Prior to 8.4 we marked subtrans for each
! * subtransaction, though that is no longer necessary because the
! * way snapshots are searched in XidInMVCCSnapshot() has changed to
! * allow searching of both subxid cache and subtrans, not either/or.
! */
! if (isSubXact && MyProc->subxids.overflowed)
SubTransSetParent(s->transactionId, s->parent->transactionId);
/*
***************
*** 435,442 ****
}
PG_END_TRY();
CurrentResourceOwner = currentOwner;
- }
/*
* GetCurrentSubTransactionId
--- 544,609 ----
}
PG_END_TRY();
CurrentResourceOwner = currentOwner;
+ elog(trace_recovery(DEBUG2),
+ "AssignXactId xid %d nest %d recursion %d xidMarkedInWAL %s hasParent %s",
+ s->transactionId,
+ GetCurrentTransactionNestLevel(),
+ recursion_level,
+ s->xidMarkedInWAL ? "t" : "f",
+ s->parent ? "t" : "f");
+
+ /*
+ * WAL log this assignment, if required.
+ *
+ * With large numbers of connections our slotId may be at or above
+ * XLOG_MAX_SLOT_ID, in which case we need to log also.
+ */
+ if (recursion_level > 1 ||
+ (recursion_level == 1 && isSubXact) ||
+ (MyProc && MyProc->slotId >= XLOG_MAX_SLOT_ID))
+ {
+ XLogRecData rdata;
+ xl_xact_assignment xlrec;
+
+ xlrec.xassign = s->transactionId;
+ xlrec.isSubXact = (s->parent != NULL);
+ xlrec.slotId = MyProc->slotId;
+
+ if (xlrec.isSubXact)
+ xlrec.xparent = s->parent->transactionId;
+ else
+ xlrec.xparent = InvalidTransactionId;
+
+ START_CRIT_SECTION();
+
+ rdata.data = (char *) (&xlrec);
+ rdata.len = sizeof(xl_xact_assignment);
+ rdata.buffer = InvalidBuffer;
+ rdata.next = NULL;
+
+ /*
+ * These WAL records look like no other. We are assigning a
+ * TransactionId to upper levels of the transaction stack. The
+ * transaction level we are looking at may *not* be the *current*
+ * transaction. We have not yet assigned the xid for the current
+ * transaction, so the xid of this WAL record will be
+ * InvalidTransactionId, even though we are in a transaction.
+ * Got that?
+ *
+ * So we stuff the newly assigned xid into the WAL record and
+ * let WAL replay sort it out later.
+ */
+ (void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT, &rdata);
+
+ END_CRIT_SECTION();
+
+ /*
+ * Mark this transaction level, so we can avoid issuing WAL records
+ * for later subtransactions also.
+ */
+ s->xidMarkedInWAL = true;
+ }
+ }
/*
* GetCurrentSubTransactionId
***************
*** 884,889 ****
--- 1051,1063 ----
* This makes checkpoint's determination of which xacts are inCommit a
* bit fuzzy, but it doesn't matter.
*/
+ if (CurrentTransactionState->hasUnMarkedSubXids)
+ xlrec.flags |= XACT_COMPLETION_UNMARKED_SUBXIDS;
+ if (AtEOXact_Database_FlatFile_Update_Needed())
+ xlrec.flags |= XACT_COMPLETION_UPDATE_DB_FILE;
+ if (AtEOXact_Auth_FlatFile_Update_Needed())
+ xlrec.flags |= XACT_COMPLETION_UPDATE_AUTH_FILE;
+
START_CRIT_SECTION();
MyProc->inCommit = true;
***************
*** 891,896 ****
--- 1065,1072 ----
xlrec.xact_time = xactStopTimestamp;
xlrec.nrels = nrels;
xlrec.nsubxacts = nchildren;
+ xlrec.slotId = MyProc->slotId;
+
rdata[0].data = (char *) (&xlrec);
rdata[0].len = MinSizeOfXactCommit;
rdata[0].buffer = InvalidBuffer;
***************
*** 1204,1209 ****
--- 1380,1388 ----
nrels = smgrGetPendingDeletes(false, &rels, NULL);
nchildren = xactGetCommittedChildren(&children);
+ if (CurrentTransactionState->hasUnMarkedSubXids)
+ xlrec.flags |= XACT_COMPLETION_UNMARKED_SUBXIDS;
+
/* XXX do we really need a critical section here? */
START_CRIT_SECTION();
***************
*** 1217,1222 ****
--- 1396,1402 ----
}
xlrec.nrels = nrels;
xlrec.nsubxacts = nchildren;
+ xlrec.slotId = MyProc->slotId;
rdata[0].data = (char *) (&xlrec);
rdata[0].len = MinSizeOfXactAbort;
rdata[0].buffer = InvalidBuffer;
***************
*** 1523,1528 ****
--- 1703,1710 ----
s->childXids = NULL;
s->nChildXids = 0;
s->maxChildXids = 0;
+ s->xidMarkedInWAL = false;
+ s->hasUnMarkedSubXids = false;
GetUserIdAndContext(&s->prevUser, &s->prevSecDefCxt);
/* SecurityDefinerContext should never be set outside a transaction */
Assert(!s->prevSecDefCxt);
***************
*** 3746,3751 ****
--- 3928,3939 ----
/* Must CCI to ensure commands of subtransaction are seen as done */
CommandCounterIncrement();
+ /*
+ * Make sure we keep tracking xids that haven't marked WAL.
+ */
+ if (!s->xidMarkedInWAL || s->hasUnMarkedSubXids)
+ s->parent->hasUnMarkedSubXids = true;
+
/*
* Prior to 8.4 we marked subcommit in clog at this point. We now only
* perform that step, if required, as part of the atomic update of the
***************
*** 3865,3870 ****
--- 4053,4064 ----
s->state = TRANS_ABORT;
/*
+ * Make sure we keep tracking xids that haven't marked WAL.
+ */
+ if (!s->xidMarkedInWAL || s->hasUnMarkedSubXids)
+ s->parent->hasUnMarkedSubXids = true;
+
+ /*
* Reset user ID which might have been changed transiently. (See notes
* in AbortTransaction.)
*/
***************
*** 4206,4237 ****
return s->nChildXids;
}
/*
* XLOG support routines
*/
static void
! xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid)
{
TransactionId *sub_xids;
TransactionId max_xid;
int i;
- /* Mark the transaction committed in pg_clog */
- sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
- TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids);
-
/* Make sure nextXid is beyond any XID mentioned in the record */
max_xid = xid;
! for (i = 0; i < xlrec->nsubxacts; i++)
{
! if (TransactionIdPrecedes(max_xid, sub_xids[i]))
! max_xid = sub_xids[i];
}
if (TransactionIdFollowsOrEquals(max_xid,
ShmemVariableCache->nextXid))
{
ShmemVariableCache->nextXid = max_xid;
TransactionIdAdvance(ShmemVariableCache->nextXid);
}
--- 4400,4794 ----
return s->nChildXids;
}
+ void
+ LogCurrentRunningXacts(void)
+ {
+ RunningTransactions CurrRunningXacts = GetRunningTransactionData();
+ xl_xact_running_xacts xlrec;
+ XLogRecData rdata[3];
+ int lastrdata = 0;
+ XLogRecPtr recptr;
+
+ xlrec.xcnt = CurrRunningXacts->xcnt;
+ xlrec.subxcnt = CurrRunningXacts->subxcnt;
+ xlrec.latestRunningXid = CurrRunningXacts->latestRunningXid;
+ xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
+
+ /* Header */
+ rdata[0].data = (char *) (&xlrec);
+ rdata[0].len = MinSizeOfXactRunningXacts;
+ rdata[0].buffer = InvalidBuffer;
+
+ /* array of RunningXact */
+ if (xlrec.xcnt > 0)
+ {
+ rdata[0].next = &(rdata[1]);
+ rdata[1].data = (char *) CurrRunningXacts->xrun;
+ rdata[1].len = xlrec.xcnt * sizeof(RunningXact);
+ rdata[1].buffer = InvalidBuffer;
+ lastrdata = 1;
+ }
+
+ /* array of subtransaction xids */
+ if (xlrec.subxcnt > 0)
+ {
+ rdata[lastrdata].next = &(rdata[2]);
+ rdata[2].data = (char *) CurrRunningXacts->subxip;
+ rdata[2].len = xlrec.subxcnt * sizeof(TransactionId);
+ rdata[2].buffer = InvalidBuffer;
+ lastrdata = 2;
+ }
+
+ rdata[lastrdata].next = NULL;
+
+ START_CRIT_SECTION();
+
+ recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_RUNNING_XACTS, rdata);
+
+ END_CRIT_SECTION();
+
+ elog(trace_recovery(DEBUG1), "captured snapshot of running xacts %X/%X", recptr.xlogid, recptr.xrecoff);
+ }
+
+ /*
+ * Is the data available to allow valid snapshots?
+ */
+ bool
+ IsRunningXactDataIsValid(void)
+ {
+ if (TransactionIdIsValid(latestObservedXid))
+ return true;
+
+ return false;
+ }
+
+ #define XACT_IS_TOP_XACT false
+ #define XACT_IS_SUBXACT true
+ /*
+ * During recovery we maintain ProcArray with incoming xids
+ * when we first observe them in use. Uses local variables, so
+ * should only be called by Startup process.
+ *
+ * We record all xids that we know have been assigned. That includes
+ * all the xids on the WAL record, plus all unobserved xids that
+ * we can deduce have been assigned. We can deduce the existence of
+ * unobserved xids because we know xids are in sequence, with no gaps.
+ *
+ * XXX Be careful of what happens when we use pg_resetxlog.
+ */
+ void
+ RecordKnownAssignedTransactionIds(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+ TransactionId xid,
+ parent_xid;
+ int slotId;
+ PGPROC *proc;
+ TransactionId next_expected_xid = latestObservedXid;
+
+ /*
+ * Have we seen the first RunningXacts yet? If not, no need to
+ * maintain state.
+ */
+ if (!TransactionIdIsValid(latestObservedXid))
+ return;
+
+ /*
+ * If it's an assignment record, we need to extract data from
+ * the body of the record, rather than take header values. This
+ * is because an assignment record can be issued when
+ * GetCurrentTransactionIdIfAny() returns InvalidTransactionId.
+ * We also use the supplied slotId rather than the header value,
+ * so we can cope with backends above XLOG_MAX_SLOT_ID.
+ */
+ if (record->xl_rmid == RM_XACT_ID && info == XLOG_XACT_ASSIGNMENT)
+ {
+ xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
+
+ xid = xlrec->xassign;
+ parent_xid = xlrec->xparent;
+ slotId = xlrec->slotId;
+ }
+ else
+ {
+ xid = record->xl_xid;
+ parent_xid = record->xl_xid2;
+ slotId = XLogRecGetSlotId(record);
+ }
+
+ elog(trace_recovery(DEBUG4), "RecordKnown xid %d parent %d slot %d"
+ " latestObsvXid %d firstXid %s firstSubXid %s markSubtrans %s",
+ xid, parent_xid, slotId, latestObservedXid,
+ XLogRecIsFirstXidRecord(record) ? "t" : "f",
+ XLogRecIsFirstSubXidRecord(record) ? "t" : "f",
+ XLogRecMustMarkSubtrans(record) ? "t" : "f");
+
+ if (XLogRecIsFirstSubXidRecord(record))
+ Assert(TransactionIdIsValid(parent_xid) && TransactionIdPrecedes(parent_xid, xid));
+ else
+ Assert(!TransactionIdIsValid(parent_xid));
+
+ /*
+ * Identify the recovery proc that holds replay info for this xid
+ */
+ proc = SlotIdGetRecoveryProc(slotId);
+
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
+ /*
+ * Record the newly observed xid onto the correct proc.
+ */
+ if (XLogRecIsFirstXidRecord(record))
+ {
+ if (XLogRecIsFirstSubXidRecord(record))
+ {
+ /*
+ * If both flags are set, then we are seeing both the
+ * subtransaction xid and its top-level parent xid
+ * for the first time. So start the top-level transaction
+ * first, then add the subtransaction.
+ *
+ * Note that we don't need locks in all cases here
+ * because it is normal to start each of these atomically,
+ * in sequence.
+ */
+ ProcArrayStartRecoveryTransaction(proc, parent_xid, lsn, XACT_IS_TOP_XACT);
+ ProcArrayStartRecoveryTransaction(proc, xid, lsn, XACT_IS_SUBXACT);
+ }
+ else
+ {
+ /*
+ * First observation of top-level xid only.
+ */
+ ProcArrayStartRecoveryTransaction(proc, xid, lsn, XACT_IS_TOP_XACT);
+ }
+ }
+ else if (XLogRecIsFirstSubXidRecord(record))
+ {
+ /*
+ * First observation of subtransaction xid.
+ */
+ ProcArrayStartRecoveryTransaction(proc, xid, lsn, XACT_IS_SUBXACT);
+ }
+
+ /*
+ * When a newly observed xid arrives, it is frequently the case
+ * that it is *not* the next xid in sequence. When this occurs, we
+ * must treat the intervening xids as running also. So we maintain
+ * a special list of these UnobservedXids, so that snapshots can
+ * see what's happening.
+ *
+ * We maintain both recovery Procs *and* UnobservedXids because we
+ * need them both. Recovery procs allow us to store top-level xids
+ * and subtransactions separately, otherwise we wouldn't know
+ * when to overflow the subxid cache. UnobservedXids allow us to
+ * make sense of the out-of-order arrival of xids.
+ *
+ * Some examples:
+ * 1) latestObservedXid = 647
+ * next xid observed in WAL = 651 (a top-level transaction)
+ * so we add 648, 649, 650 to UnobservedXids
+ *
+ * 2) latestObservedXid = 769
+ * next xid observed in WAL = 771 (a subtransaction)
+ * so we add 770 to UnobservedXids
+ *
+ * 3) latestObservedXid = 769
+ * next xid observed in WAL = 810 (a subtransaction)
+ * 810's parent had not yet recorded WAL = 807
+ * so we add 770 thru 809 inclusive to UnobservedXids
+ * then remove 807
+ *
+ * 4) latestObservedXid = 769
+ * next xid observed in WAL = 771 (a subtransaction)
+ * 771's parent had not yet recorded WAL = 770
+ * so do nothing
+ *
+ * 5) latestObservedXid = 7747
+ * next xid observed in WAL = 7748 (a subtransaction)
+ * 7748's parent had not yet recorded WAL = 7742
+ * so we record 7748 as latestObservedXid and remove 7742 from UnobservedXids
+ */
+ TransactionIdAdvance(next_expected_xid);
+ if (!XLogRecIsFirstXidRecord(record) || !XLogRecIsFirstSubXidRecord(record))
+ {
+ /*
+ * Just have one xid to process, so fairly simple
+ */
+ if (next_expected_xid == xid)
+ {
+ Assert(!XidInUnobservedTransactions(xid));
+ Assert(!XLogRecIsFirstSubXidRecord(record) ||
+ !XidInUnobservedTransactions(parent_xid));
+ latestObservedXid = xid;
+ }
+ else if (TransactionIdPrecedes(next_expected_xid, xid))
+ {
+ UnobservedTransactionsAddXids(next_expected_xid, xid);
+ latestObservedXid = xid;
+ }
+ else
+ UnobservedTransactionsRemoveXid(xid);
+ }
+ else
+ {
+ TransactionId next_plus_one_xid = next_expected_xid;
+ TransactionIdAdvance(next_plus_one_xid);
+
+ /*
+ * Just remember when reading this logic that by definition we have
+ * Assert(TransactionIdPrecedes(parent_xid, xid))
+ */
+ if (next_expected_xid == parent_xid && next_plus_one_xid == xid)
+ {
+ Assert(!XidInUnobservedTransactions(xid));
+ Assert(!XidInUnobservedTransactions(parent_xid));
+ latestObservedXid = xid;
+ }
+ else if (next_expected_xid == xid)
+ {
+ latestObservedXid = xid;
+ UnobservedTransactionsRemoveXid(parent_xid);
+ }
+ else if (TransactionIdFollowsOrEquals(xid, next_plus_one_xid))
+ {
+ UnobservedTransactionsAddXids(next_expected_xid, xid);
+ latestObservedXid = xid;
+ UnobservedTransactionsRemoveXid(parent_xid);
+ }
+ else if (TransactionIdPrecedes(xid, next_expected_xid))
+ {
+ UnobservedTransactionsRemoveXid(xid);
+ UnobservedTransactionsRemoveXid(parent_xid);
+ }
+ else
+ elog(FATAL, "there are more combinations than you thought about");
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ /*
+ * Now we've updated the proc we can update subtrans, if appropriate.
+ * We must do this step last to avoid race conditions. See comments
+ * and code for AssignTransactionId().
+ */
+ if (XLogRecMustMarkSubtrans(record))
+ {
+ Assert(XLogRecIsFirstSubXidRecord(record));
+ elog(trace_recovery(DEBUG2),
+ "subtrans setting parent %d for xid %d", parent_xid, xid);
+ SubTransSetParent(xid, parent_xid);
+ }
+ }
+
/*
* XLOG support routines
*/
+ /*
+ * Before 8.4 this was a fairly short function, but now it performs many
+ * actions for which the order of execution is critical.
+ */
static void
! xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, bool preparedXact)
{
+ PGPROC *proc;
TransactionId *sub_xids;
TransactionId max_xid;
int i;
/* Make sure nextXid is beyond any XID mentioned in the record */
max_xid = xid;
! sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
!
! if (xlrec->nsubxacts > 0)
! max_xid = sub_xids[xlrec->nsubxacts - 1];
!
! /*
! * Even though there is a slotId on the xlrec header we use the slotId
! * from the body of the xlrec, to allow for cases where MaxBackends
! * is larger than can fit in the xlrec header.
! */
! proc = SlotIdGetRecoveryProc(xlrec->slotId);
!
! #ifdef USE_ASSERT_CHECKING
! if (!preparedXact)
{
! /*
! * Double check everything to make sure there are no mistakes
! * before we update the proc array.
! */
! if (xid != proc->xid)
! {
! if (XidInRecoveryProcs(xid) && !preparedXact)
! {
! elog(LOG, "xid %d slot %d proc->xid %d prep %s",
! xid, xlrec->slotId, proc->xid,
! (preparedXact ? "t" : "f"));
!
! ProcArrayDisplay(LOG);
! elog(FATAL, "accessed the wrong slot");
! }
! }
!
! if (XidInUnobservedTransactions(xid))
! {
! ProcArrayDisplay(LOG);
! UnobservedTransactionsDisplay(LOG);
! elog(FATAL, "xid %d still in UnobservedXids", xid);
! }
}
+ #endif
+
+ /*
+ * If requested, update the flat files for DB and Auth Files.
+ * These acquire AccessExclusiveLocks which will be released soon
+ * after we mark the commit in clog. These *must* be the last
+ * locks we take before updating clog to prevent deadlocks, and
+ * we also want to keep the window between this action and marking
+ * the commit as small as possible.
+ *
+ * XXXR does this handle relcache correctly, probably not.
+ */
+ if (XactCompletionUpdateDBFile(xlrec))
+ {
+ if (XactCompletionUpdateAuthFile(xlrec))
+ BuildFlatFiles(false, true, false);
+ else
+ BuildFlatFiles(true, true, false);
+ }
+
+ /* Mark the transaction committed in pg_clog */
+ TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids);
+
+ /*
+ * We must mark clog before we update the ProcArray. Only update
+ * if we have already initialised the state and we have previously
+ * added an xid to the proc. We need no lock to check xid since it
+ * is controlled by Startup process. It's possible for xids to
+ * appear that haven't been seen before. We don't need to check
+ * UnobservedXids because in the normal case this will already have
+ * happened, but there are cases where they might sneak through.
+ * Leave these for the periodic cleanup by XLOG_XACT_RUNNING_XACTS records.
+ */
+ if (TransactionIdIsValid(latestObservedXid) &&
+ TransactionIdIsValid(proc->xid) && !preparedXact)
+ ProcArrayEndTransaction(proc, max_xid);
+
+ /*
+ * Release locks and other resources here
+ */
+
+ /*
+ * Send any cache invalidations???
+ */
+
+ /* Make sure nextXid is beyond any XID mentioned in the record */
if (TransactionIdFollowsOrEquals(max_xid,
ShmemVariableCache->nextXid))
{
ShmemVariableCache->nextXid = max_xid;
+ ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
TransactionIdAdvance(ShmemVariableCache->nextXid);
}
***************
*** 4248,4275 ****
}
}
static void
! xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid)
{
TransactionId *sub_xids;
TransactionId max_xid;
int i;
- /* Mark the transaction aborted in pg_clog */
- sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
- TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids);
-
/* Make sure nextXid is beyond any XID mentioned in the record */
max_xid = xid;
for (i = 0; i < xlrec->nsubxacts; i++)
{
if (TransactionIdPrecedes(max_xid, sub_xids[i]))
max_xid = sub_xids[i];
}
if (TransactionIdFollowsOrEquals(max_xid,
ShmemVariableCache->nextXid))
{
ShmemVariableCache->nextXid = max_xid;
TransactionIdAdvance(ShmemVariableCache->nextXid);
}
--- 4805,4927 ----
}
}
+ /*
+ * Be careful with the order of execution, as with xact_redo_commit().
+ * The two functions are similar but differ in key places.
+ */
static void
! xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid, bool preparedXact)
{
+ PGPROC *proc;
TransactionId *sub_xids;
TransactionId max_xid;
int i;
/* Make sure nextXid is beyond any XID mentioned in the record */
max_xid = xid;
+ sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
+
+ /*
+ * Find the highest xid and remove unobserved xids if required.
+ */
+ if (XactCompletionHasUnMarkedSubxids(xlrec))
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
for (i = 0; i < xlrec->nsubxacts; i++)
{
if (TransactionIdPrecedes(max_xid, sub_xids[i]))
max_xid = sub_xids[i];
+ if (XactCompletionHasUnMarkedSubxids(xlrec))
+ UnobservedTransactionsRemoveXid(sub_xids[i]);
}
+
+ if (XactCompletionHasUnMarkedSubxids(xlrec))
+ LWLockRelease(ProcArrayLock);
+
+
+ if (!TransactionIdIsValid(latestObservedXid) ||
+ TransactionIdPrecedes(latestObservedXid, max_xid))
+ latestObservedXid = max_xid;
+
+ /*
+ * Even though there is a slotId on the xlrec header we use the slotId
+ * from the body of the xlrec, to allow for cases where MaxBackends
+ * is larger than can fit in the xlrec header.
+ */
+ proc = SlotIdGetRecoveryProc(xlrec->slotId);
+
+ /*
+ * It's possible that we wrote an abort record without having written
+ * anything else. If that happens we need to handle subtransactions.
+ * If there is more than one subtransaction it should already have been
+ * handled via an assignment record. So if the counter is behind then
+ * it's an error except when we have exactly one subtransaction.
+ */
+ if (TransactionIdIsValid(latestObservedXid) &&
+ TransactionIdPrecedes(latestObservedXid, max_xid))
+ {
+ if (xlrec->nsubxacts == 1)
+ latestObservedXid = max_xid;
+ else
+ {
+ ProcArrayDisplay(LOG);
+ UnobservedTransactionsDisplay(LOG);
+ elog(FATAL, "latestObservedXid %d not moved forwards to %d", latestObservedXid, max_xid);
+ }
+ }
+
+ #ifdef USE_ASSERT_CHECKING
+ if (!preparedXact)
+ {
+ /*
+ * Double check everything to make sure there are no mistakes
+ * before we update the proc array.
+ */
+ if (xid != proc->xid)
+ {
+ if (XidInRecoveryProcs(xid) && !preparedXact)
+ {
+ elog(LOG, "xid %d slot %d proc->xid %d prep %s",
+ xid, xlrec->slotId, proc->xid,
+ (preparedXact ? "t" : "f"));
+
+ ProcArrayDisplay(LOG);
+ elog(FATAL, "accessed the wrong slot");
+ }
+ }
+
+ if (XidInUnobservedTransactions(xid))
+ {
+ ProcArrayDisplay(LOG);
+ UnobservedTransactionsDisplay(LOG);
+ elog(FATAL, "xid %d still in UnobservedXids", xid);
+ }
+ }
+ #endif
+
+ /* Mark the transaction aborted in pg_clog */
+ TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids);
+
+ /*
+ * We must mark clog before we update the ProcArray. Only update
+ * if we have already initialised the state and we have previously
+ * added an xid to the proc. We need no lock to check xid since it
+ * is controlled by Startup process. It's possible for xids to
+ * appear that haven't been seen before. We don't need to check
+ * UnobservedXids because in the normal case this will already have
+ * happened, but there are cases where they might sneak through.
+ * Leave these for the periodic cleanup by XLOG_XACT_RUNNING_XACTS records.
+ */
+ if (TransactionIdIsValid(latestObservedXid) &&
+ TransactionIdIsValid(proc->xid) && !preparedXact)
+ ProcArrayEndTransaction(proc, max_xid);
+
+ /* Make sure nextXid is beyond any XID mentioned in the record */
if (TransactionIdFollowsOrEquals(max_xid,
ShmemVariableCache->nextXid))
{
ShmemVariableCache->nextXid = max_xid;
+ ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
TransactionIdAdvance(ShmemVariableCache->nextXid);
}
***************
*** 4295,4307 ****
{
xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record);
! xact_redo_commit(xlrec, record->xl_xid);
}
else if (info == XLOG_XACT_ABORT)
{
xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(record);
! xact_redo_abort(xlrec, record->xl_xid);
}
else if (info == XLOG_XACT_PREPARE)
{
--- 4947,4959 ----
{
xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record);
! xact_redo_commit(xlrec, record->xl_xid, false);
}
else if (info == XLOG_XACT_ABORT)
{
xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(record);
! xact_redo_abort(xlrec, record->xl_xid, false);
}
else if (info == XLOG_XACT_PREPARE)
{
***************
*** 4313,4328 ****
{
xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) XLogRecGetData(record);
! xact_redo_commit(&xlrec->crec, xlrec->xid);
RemoveTwoPhaseFile(xlrec->xid, false);
}
else if (info == XLOG_XACT_ABORT_PREPARED)
{
xl_xact_abort_prepared *xlrec = (xl_xact_abort_prepared *) XLogRecGetData(record);
! xact_redo_abort(&xlrec->arec, xlrec->xid);
RemoveTwoPhaseFile(xlrec->xid, false);
}
else
elog(PANIC, "xact_redo: unknown op code %u", info);
}
--- 4965,5008 ----
{
xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) XLogRecGetData(record);
! xact_redo_commit(&xlrec->crec, xlrec->xid, true);
RemoveTwoPhaseFile(xlrec->xid, false);
}
else if (info == XLOG_XACT_ABORT_PREPARED)
{
xl_xact_abort_prepared *xlrec = (xl_xact_abort_prepared *) XLogRecGetData(record);
! xact_redo_abort(&xlrec->arec, xlrec->xid, true);
RemoveTwoPhaseFile(xlrec->xid, false);
}
+ else if (info == XLOG_XACT_ASSIGNMENT)
+ {
+ /*
+ * This is a no-op since RecordKnownAssignedTransactionIds()
+ * already did all the work on this record for us.
+ */
+ return;
+ }
+ else if (info == XLOG_XACT_RUNNING_XACTS)
+ {
+ xl_xact_running_xacts *xlrec = (xl_xact_running_xacts *) XLogRecGetData(record);
+
+ /*
+ * Initialise if we have a valid snapshot to work with
+ */
+ if (TransactionIdIsValid(xlrec->latestRunningXid) &&
+ (!TransactionIdIsValid(latestObservedXid) ||
+ TransactionIdPrecedes(latestObservedXid, xlrec->latestRunningXid)))
+ {
+ latestObservedXid = xlrec->latestRunningXid;
+ ShmemVariableCache->latestCompletedXid = xlrec->latestCompletedXid;
+ elog(trace_recovery(DEBUG1),
+ "initial snapshot created; latestObservedXid = %d latestCompletedXid = %d",
+ latestObservedXid, xlrec->latestCompletedXid);
+ }
+
+ ProcArrayUpdateRecoveryTransactions(lsn, xlrec);
+ }
else
elog(PANIC, "xact_redo: unknown op code %u", info);
}
***************
*** 4387,4392 ****
--- 5067,5110 ----
}
}
+ static void
+ xact_desc_running_xacts(StringInfo buf, xl_xact_running_xacts *xlrec)
+ {
+ int xid_index,
+ subxid_index;
+ TransactionId *subxip = (TransactionId *) &(xlrec->xrun[xlrec->xcnt]);
+
+ appendStringInfo(buf, "nxids %u nsubxids %u latestRunningXid %d",
+ xlrec->xcnt,
+ xlrec->subxcnt,
+ xlrec->latestRunningXid);
+
+ for (xid_index = 0; xid_index < xlrec->xcnt; xid_index++)
+ {
+ RunningXact *rxact = (RunningXact *) xlrec->xrun;
+
+ appendStringInfo(buf, "; xid %d pid %d backend %d db %d role %d "
+ "vacflag %u nsubxids %u offset %d overflowed %s",
+ rxact[xid_index].xid,
+ rxact[xid_index].pid,
+ rxact[xid_index].slotId,
+ rxact[xid_index].databaseId,
+ rxact[xid_index].roleId,
+ rxact[xid_index].vacuumFlags,
+ rxact[xid_index].nsubxids,
+ rxact[xid_index].subx_offset,
+ (rxact[xid_index].overflowed ? "t" : "f"));
+
+ if (rxact[xid_index].nsubxids > 0)
+ {
+ appendStringInfo(buf, "; subxacts: ");
+ for (subxid_index = 0; subxid_index < rxact[xid_index].nsubxids; subxid_index++)
+ appendStringInfo(buf, " %u",
+ subxip[subxid_index + rxact[xid_index].subx_offset]);
+ }
+ }
+ }
+
void
xact_desc(StringInfo buf, uint8 xl_info, char *rec)
{
***************
*** 4424,4429 ****
--- 5142,5177 ----
appendStringInfo(buf, "abort %u: ", xlrec->xid);
xact_desc_abort(buf, &xlrec->arec);
}
+ else if (info == XLOG_XACT_ASSIGNMENT)
+ {
+ xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
+
+ /* ignore the main xid, it may be Invalid and misleading */
+ appendStringInfo(buf, "assignment: xid %u slotid %d",
+ xlrec->xassign, xlrec->slotId);
+ }
+ else if (info == XLOG_XACT_RUNNING_XACTS)
+ {
+ xl_xact_running_xacts *xlrec = (xl_xact_running_xacts *) rec;
+
+ appendStringInfo(buf, "running xacts: ");
+ xact_desc_running_xacts(buf, xlrec);
+ }
+ else if (info == XLOG_XACT_ASSIGNMENT)
+ {
+ xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
+
+ /* ignore the main xid, it may be Invalid and misleading */
+ appendStringInfo(buf, "assignment: xid %u slotid %d",
+ xlrec->xassign, xlrec->slotId);
+ }
+ else if (info == XLOG_XACT_RUNNING_XACTS)
+ {
+ xl_xact_running_xacts *xlrec = (xl_xact_running_xacts *) rec;
+
+ appendStringInfo(buf, "running xacts: ");
+ xact_desc_running_xacts(buf, xlrec);
+ }
else
appendStringInfo(buf, "UNKNOWN");
}
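
Reviewer's note on the UnobservedXids bookkeeping described in the
RecordKnownAssignedTransactionIds() comment above: the single-xid branch can
be sketched in isolation as below. This is only an illustrative sketch, not
the patch's implementation. Every name in it (Xid, unobserved_add_range,
unobserved_remove, unobserved_contains, observe_single_xid, MAX_UNOBSERVED)
is hypothetical. It assumes the add call covers the half-open range
[next_expected, xid), as example 1 in that comment suggests, and it
deliberately ignores xid wraparound, subtransaction parents, and
ProcArrayLock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_UNOBSERVED 64           /* hypothetical fixed capacity */

typedef uint32_t Xid;               /* stand-in for TransactionId */

static Xid  unobserved[MAX_UNOBSERVED];
static int  n_unobserved = 0;
static Xid  latest_observed = 0;

/* Add every xid in the half-open range [from, to) to the unobserved set. */
static void
unobserved_add_range(Xid from, Xid to)
{
    for (Xid x = from; x < to; x++)
    {
        assert(n_unobserved < MAX_UNOBSERVED);
        unobserved[n_unobserved++] = x;
    }
}

/* Is xid currently in the unobserved set? */
static bool
unobserved_contains(Xid xid)
{
    for (int i = 0; i < n_unobserved; i++)
        if (unobserved[i] == xid)
            return true;
    return false;
}

/* Remove xid if present; element order is not preserved. */
static void
unobserved_remove(Xid xid)
{
    for (int i = 0; i < n_unobserved; i++)
    {
        if (unobserved[i] == xid)
        {
            unobserved[i] = unobserved[--n_unobserved];
            return;
        }
    }
}

/*
 * Process one newly seen xid. Plain integer comparison is used here, so
 * wraparound is NOT handled (the patch uses TransactionIdPrecedes()).
 */
static void
observe_single_xid(Xid xid)
{
    Xid next_expected = latest_observed + 1;

    if (xid == next_expected)
        latest_observed = xid;          /* arrived in sequence */
    else if (xid > next_expected)
    {
        /* gap: treat the intervening xids as running but unobserved */
        unobserved_add_range(next_expected, xid);
        latest_observed = xid;
    }
    else
        unobserved_remove(xid);         /* late arrival fills a gap */
}
```

With latest_observed set to 647, calling observe_single_xid(651) leaves 648,
649 and 650 in the unobserved set, matching example 1; a later
observe_single_xid(649) removes 649 without moving latest_observed.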
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.319
diff -c -r1.319 xlog.c
*** src/backend/access/transam/xlog.c 23 Sep 2008 09:20:35 -0000 1.319
--- src/backend/access/transam/xlog.c 27 Oct 2008 18:32:03 -0000
***************
*** 49,54 ****
--- 49,55 ----
#include "utils/guc.h"
#include "utils/ps_status.h"
+ #define WAL_DEBUG
/* File path names (all relative to $PGDATA) */
#define BACKUP_LABEL_FILE "backup_label"
***************
*** 68,74 ****
int sync_method = DEFAULT_SYNC_METHOD;
#ifdef WAL_DEBUG
! bool XLOG_DEBUG = false;
#endif
/*
--- 69,77 ----
int sync_method = DEFAULT_SYNC_METHOD;
#ifdef WAL_DEBUG
! bool XLOG_DEBUG_FLUSH = false;
! bool XLOG_DEBUG_BGFLUSH = false;
! bool XLOG_DEBUG_REDO = true;
#endif
/*
***************
*** 113,119 ****
/*
* ThisTimeLineID will be same in all backends --- it identifies current
! * WAL timeline for the database system.
*/
TimeLineID ThisTimeLineID = 0;
--- 116,123 ----
/*
* ThisTimeLineID will be same in all backends --- it identifies current
! * WAL timeline for the database system. Zero is always a bug, so we
! * start with that to allow us to spot any errors.
*/
TimeLineID ThisTimeLineID = 0;
***************
*** 123,146 ****
/* Are we recovering using offline XLOG archives? */
static bool InArchiveRecovery = false;
/* Was the last xlog file restored from archive, or local? */
static bool restoredFromArchive = false;
/* options taken from recovery.conf */
static char *recoveryRestoreCommand = NULL;
! static bool recoveryTarget = false;
static bool recoveryTargetExact = false;
static bool recoveryTargetInclusive = true;
static bool recoveryLogRestartpoints = false;
static TransactionId recoveryTargetXid;
static TimestampTz recoveryTargetTime;
static TimestampTz recoveryLastXTime = 0;
/* if recoveryStopsHere returns true, it saves actual stop xid/time here */
static TransactionId recoveryStopXid;
static TimestampTz recoveryStopTime;
static bool recoveryStopAfter;
/*
* During normal operation, the only timeline we care about is ThisTimeLineID.
* During recovery, however, things are more complicated. To simplify life
--- 127,171 ----
/* Are we recovering using offline XLOG archives? */
static bool InArchiveRecovery = false;
+ /* Local copy of shared RecoveryProcessingMode state */
+ static bool LocalRecoveryProcessingMode = true;
+ static bool knownProcessingMode = false;
+
/* Was the last xlog file restored from archive, or local? */
static bool restoredFromArchive = false;
+ /* recovery target modes */
+ #define RECOVERY_TARGET_NONE 0
+ #define RECOVERY_TARGET_PAUSE_ALL 1
+ #define RECOVERY_TARGET_PAUSE_CLEANUP 2
+ #define RECOVERY_TARGET_PAUSE_XID 3
+ #define RECOVERY_TARGET_PAUSE_TIME 4
+ #define RECOVERY_TARGET_ADVANCE 5
+ #define RECOVERY_TARGET_STOP_IMMEDIATE 6
+ #define RECOVERY_TARGET_STOP_XID 7
+ #define RECOVERY_TARGET_STOP_TIME 8
+
/* options taken from recovery.conf */
static char *recoveryRestoreCommand = NULL;
! static int recoveryTargetMode = RECOVERY_TARGET_NONE;
static bool recoveryTargetExact = false;
static bool recoveryTargetInclusive = true;
static bool recoveryLogRestartpoints = false;
static TransactionId recoveryTargetXid;
static TimestampTz recoveryTargetTime;
+ static int recoveryTargetAdvance = 0;
+
static TimestampTz recoveryLastXTime = 0;
+ static TransactionId recoveryLastXid = InvalidTransactionId;
/* if recoveryStopsHere returns true, it saves actual stop xid/time here */
static TransactionId recoveryStopXid;
static TimestampTz recoveryStopTime;
static bool recoveryStopAfter;
+ /* is the database proven consistent yet? */
+ bool reachedSafeStartPoint = false;
+
/*
* During normal operation, the only timeline we care about is ThisTimeLineID.
* During recovery, however, things are more complicated. To simplify life
***************
*** 240,249 ****
* ControlFileLock: must be held to read/update control file or create
* new log file.
*
! * CheckpointLock: must be held to do a checkpoint (ensures only one
! * checkpointer at a time; currently, with all checkpoints done by the
! * bgwriter, this is just pro forma).
! *
*----------
*/
--- 265,294 ----
* ControlFileLock: must be held to read/update control file or create
* new log file.
*
! * CheckpointLock: must be held to do a checkpoint or restartpoint, ensuring
! * we get just one of those at any time. In 8.4+ recovery, both startup and
! * bgwriter processes may take restartpoints, so this locking must be strict
! * to ensure there are no mistakes.
! *
! * In 8.4 we progress through a number of states at startup. Initially, the
! * postmaster is in PM_STARTUP state and spawns the Startup process. We then
! * progress until the database is in a consistent state, then if we are in
! * InArchiveRecovery we go into PM_RECOVERY state. The bgwriter then starts
! * up and takes over responsibility for performing restartpoints. We then
! * progress until the end of recovery when we enter PM_RUN state upon
! * termination of the Startup process. In summary:
! *
! * PM_STARTUP state: Startup process performs restartpoints
! * PM_RECOVERY state: bgwriter process performs restartpoints
! * PM_RUN state: bgwriter process performs checkpoints
! *
! * These transitions are fairly delicate, with many things that need to
! * happen at the same time in order to change state successfully throughout
! * the system. Changing PM_STARTUP to PM_RECOVERY only occurs when we can
! * prove the databases are in a consistent state. Changing from PM_RECOVERY
! * to PM_RUN happens whenever recovery ends, which could be forced upon us
! * externally or it can occur because of damage or termination of the WAL
! * sequence.
*----------
*/
***************
*** 285,295 ****
--- 330,347 ----
/*
* Total shared-memory state for XLOG.
+ *
+ * This small structure is accessed by many backends, so we take care to
+ * pad out the parts of the structure so they can be accessed by separate
+ * CPUs without causing false sharing cache flushes. Padding is generous
+ * to allow for a wide variety of CPU architectures.
*/
+ #define XLOGCTL_BUFFER_SPACING 128
typedef struct XLogCtlData
{
/* Protected by WALInsertLock: */
XLogCtlInsert Insert;
+ char InsertPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlInsert)];
/* Protected by info_lck: */
XLogwrtRqst LogwrtRqst;
***************
*** 297,305 ****
--- 349,364 ----
uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */
TransactionId ckptXid;
XLogRecPtr asyncCommitLSN; /* LSN of newest async commit */
+ /* add data structure padding for above info_lck declarations */
+ char InfoPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogwrtRqst)
+ - sizeof(XLogwrtResult)
+ - sizeof(uint32)
+ - sizeof(TransactionId)
+ - sizeof(XLogRecPtr)];
/* Protected by WALWriteLock: */
XLogCtlWrite Write;
+ char WritePadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlWrite)];
/*
* These values do not change after startup, although the pointed-to pages
***************
*** 311,316 ****
--- 370,406 ----
int XLogCacheBlck; /* highest allocated xlog buffer index */
TimeLineID ThisTimeLineID;
+ /*
+ * IsRecoveryProcessingMode shows whether the postmaster is in a state
+ * earlier than PM_RUN. It is kept in shared memory so that it is also
+ * accessible in the EXEC_BACKEND case.
+ *
+ * We also retain a local state variable InRecovery. InRecovery=true
+ * means the code is being executed by Startup process and therefore
+ * always during Recovery Processing Mode. This allows us to identify
+ * code executed *during* Recovery Processing Mode but not necessarily
+ * by Startup process itself.
+ *
+ * Protected by mode_lck
+ */
+ bool SharedRecoveryProcessingMode;
+ slock_t mode_lck;
+
+ /*
+ * recovery target control information
+ *
+ * Protected by info_lck
+ */
+ int recoveryTargetMode;
+ TransactionId recoveryTargetXid;
+ TimestampTz recoveryTargetTime;
+ int recoveryTargetAdvance;
+
+ TimestampTz recoveryLastXTime;
+ TransactionId recoveryLastXid;
+
+ char InfoLockPadding[XLOGCTL_BUFFER_SPACING];
+
slock_t info_lck; /* locks shared variables shown above */
} XLogCtlData;
***************
*** 397,404 ****
--- 487,496 ----
static void readRecoveryCommandFile(void);
static void exitArchiveRecovery(TimeLineID endTLI,
uint32 endLogId, uint32 endLogSeg);
+ static void exitRecovery(void);
static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
+ static XLogRecPtr GetRedoLocationForCheckpoint(void);
static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 473,478 ****
--- 565,572 ----
XLogRecData dtbuf_rdt1[XLR_MAX_BKP_BLOCKS];
XLogRecData dtbuf_rdt2[XLR_MAX_BKP_BLOCKS];
XLogRecData dtbuf_rdt3[XLR_MAX_BKP_BLOCKS];
+ TransactionId xl_xid2 = InvalidTransactionId;
+ uint16 xl_info2 = 0;
pg_crc32 rdata_crc;
uint32 len,
write_len;
***************
*** 480,485 ****
--- 574,587 ----
bool updrqst;
bool doPageWrites;
bool isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ bool isRecoveryEnd = (rmid == RM_XLOG_ID &&
+ (info == XLOG_RECOVERY_END ||
+ info == XLOG_CHECKPOINT_ONLINE));
+
+ /* cross-check on whether we should be here or not */
+ if (IsRecoveryProcessingMode() && !isRecoveryEnd)
+ elog(FATAL, "cannot make new WAL entries during recovery "
+ "(RMgrId = %d info = %d)", rmid, info);
/* info's high bits are reserved for use by me */
if (info & XLR_INFO_MASK)
***************
*** 628,633 ****
--- 730,740 ----
if (len == 0 && !isLogSwitch)
elog(PANIC, "invalid xlog record length %u", len);
+ /*
+ * Get standby information before we do lock and critical section.
+ */
+ GetStandbyInfoForTransaction(rmid, info, rdata, &xl_xid2, &xl_info2);
+
START_CRIT_SECTION();
/* Now wait to get insert lock */
***************
*** 816,821 ****
--- 923,930 ----
record->xl_len = len; /* doesn't include backup blocks */
record->xl_info = info;
record->xl_rmid = rmid;
+ record->xl_xid2 = xl_xid2;
+ record->xl_info2 = xl_info2;
/* Now we can finish computing the record's CRC */
COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
***************
*** 823,847 ****
FIN_CRC32(rdata_crc);
record->xl_crc = rdata_crc;
- #ifdef WAL_DEBUG
- if (XLOG_DEBUG)
- {
- StringInfoData buf;
-
- initStringInfo(&buf);
- appendStringInfo(&buf, "INSERT @ %X/%X: ",
- RecPtr.xlogid, RecPtr.xrecoff);
- xlog_outrec(&buf, record);
- if (rdata->data != NULL)
- {
- appendStringInfo(&buf, " - ");
- RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
- }
- elog(LOG, "%s", buf.data);
- pfree(buf.data);
- }
- #endif
-
/* Record begin of record in appropriate places */
ProcLastRecPtr = RecPtr;
Insert->PrevRecord = RecPtr;
--- 932,937 ----
***************
*** 1720,1727 ****
XLogRecPtr WriteRqstPtr;
XLogwrtRqst WriteRqst;
! /* Disabled during REDO */
! if (InRedo)
return;
/* Quick exit if already known flushed */
--- 1810,1816 ----
XLogRecPtr WriteRqstPtr;
XLogwrtRqst WriteRqst;
! if (IsRecoveryProcessingMode())
return;
/* Quick exit if already known flushed */
***************
*** 1729,1735 ****
return;
#ifdef WAL_DEBUG
! if (XLOG_DEBUG)
elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
record.xlogid, record.xrecoff,
LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
--- 1818,1824 ----
return;
#ifdef WAL_DEBUG
! if (XLOG_DEBUG_FLUSH)
elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
record.xlogid, record.xrecoff,
LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
***************
*** 1809,1817 ****
* the bad page is encountered again during recovery then we would be
* unable to restart the database at all! (This scenario has actually
* happened in the field several times with 7.1 releases. Note that we
! * cannot get here while InRedo is true, but if the bad page is brought in
! * and marked dirty during recovery then CreateCheckPoint will try to
! * flush it at the end of recovery.)
*
* The current approach is to ERROR under normal conditions, but only
* WARNING during recovery, so that the system can be brought up even if
--- 1898,1906 ----
* the bad page is encountered again during recovery then we would be
* unable to restart the database at all! (This scenario has actually
* happened in the field several times with 7.1 releases. Note that we
! * cannot get here while IsRecoveryProcessingMode() is true, but if the bad
! * page is brought in and marked dirty during recovery, then a checkpoint
! * performed at the end of recovery will try to flush it.)
*
* The current approach is to ERROR under normal conditions, but only
* WARNING during recovery, so that the system can be brought up even if
***************
*** 1821,1827 ****
* and so we will not force a restart for a bad LSN on a data page.
*/
if (XLByteLT(LogwrtResult.Flush, record))
! elog(InRecovery ? WARNING : ERROR,
"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
record.xlogid, record.xrecoff,
LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
--- 1910,1916 ----
* and so we will not force a restart for a bad LSN on a data page.
*/
if (XLByteLT(LogwrtResult.Flush, record))
! elog(ERROR,
"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
record.xlogid, record.xrecoff,
LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
***************
*** 1879,1885 ****
return;
#ifdef WAL_DEBUG
! if (XLOG_DEBUG)
elog(LOG, "xlog bg flush request %X/%X; write %X/%X; flush %X/%X",
WriteRqstPtr.xlogid, WriteRqstPtr.xrecoff,
LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
--- 1968,1974 ----
return;
#ifdef WAL_DEBUG
! if (XLOG_DEBUG_BGFLUSH)
elog(LOG, "xlog bg flush request %X/%X; write %X/%X; flush %X/%X",
WriteRqstPtr.xlogid, WriteRqstPtr.xrecoff,
LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
***************
*** 2094,2100 ****
unlink(tmppath);
}
! elog(DEBUG2, "done creating and filling new WAL file");
/* Set flag to tell caller there was no existent file */
*use_existent = false;
--- 2183,2190 ----
unlink(tmppath);
}
! XLogFileName(tmppath, ThisTimeLineID, log, seg);
! elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
/* Set flag to tell caller there was no existent file */
*use_existent = false;
***************
*** 2400,2405 ****
--- 2490,2517 ----
xlogfname);
set_ps_display(activitymsg, false);
+ /*
+ * Calculate and write out a new minSafeStartPoint. This defines
+ * the latest LSN that might appear on-disk while we apply
+ * the WAL records in this file. If we crash during recovery
+ * we must reach this point again before we can prove
+ * database consistency. This is not a restartpoint: restartpoints
+ * define where we should start recovery from if we crash.
+ */
+ if (InArchiveRecovery)
+ {
+ uint32 nextLog = log;
+ uint32 nextSeg = seg;
+
+ NextLogSeg(nextLog, nextSeg);
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->minSafeStartPoint.xlogid = nextLog;
+ ControlFile->minSafeStartPoint.xrecoff = nextSeg * XLogSegSize;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ }
+
return fd;
}
if (errno != ENOENT) /* unexpected failure? */
***************
*** 2866,2871 ****
--- 2978,3003 ----
}
/*
+ * RecordIsCleanupRecord() determines whether or not the record
+ * will remove rows from data blocks. This is important because
+ * applying these records could affect the validity of MVCC snapshots,
+ * so there are various controls over replaying such records.
+ */
+ static bool
+ RecordIsCleanupRecord(XLogRecord *record)
+ {
+ RmgrId rmid = record->xl_rmid;
+ /* uint8 info = record->xl_info & ~XLR_INFO_MASK; */
+
+ /* if (rmid == RM_HEAP2_ID ||
+ * (rmid == RM_BTREE_ID && btree_needs_cleanup_lock(info))) */
+ if (rmid == RM_HEAP2_ID )
+ return true;
+
+ return false;
+ }
+
+ /*
* Restore the backup blocks present in an XLOG record, if any.
*
* We assume all of the record has been read into memory at *record.
***************
*** 2887,2892 ****
--- 3019,3033 ----
BkpBlock bkpb;
char *blk;
int i;
+ int mode;
+
+ /*
+ * What kind of lock do we need to apply the backup blocks?
+ */
+ if (RecordIsCleanupRecord(record))
+ mode = BUFFER_LOCK_CLEANUP;
+ else
+ mode = BUFFER_LOCK_EXCLUSIVE;
blk = (char *) XLogRecGetData(record) + record->xl_len;
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
***************
*** 2898,2904 ****
blk += sizeof(BkpBlock);
buffer = XLogReadBufferWithFork(bkpb.node, bkpb.fork, bkpb.block,
! true);
Assert(BufferIsValid(buffer));
page = (Page) BufferGetPage(buffer);
--- 3039,3045 ----
blk += sizeof(BkpBlock);
buffer = XLogReadBufferWithFork(bkpb.node, bkpb.fork, bkpb.block,
! true, mode);
Assert(BufferIsValid(buffer));
page = (Page) BufferGetPage(buffer);
***************
*** 4228,4233 ****
--- 4369,4375 ----
XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
SpinLockInit(&XLogCtl->info_lck);
+ SpinLockInit(&XLogCtl->mode_lck);
/*
* If we are not in bootstrap mode, pg_control should already exist. Read
***************
*** 4311,4316 ****
--- 4453,4459 ----
record->xl_prev.xlogid = 0;
record->xl_prev.xrecoff = 0;
record->xl_xid = InvalidTransactionId;
+ record->xl_xid2 = InvalidTransactionId;
record->xl_tot_len = SizeOfXLogRecord + sizeof(checkPoint);
record->xl_len = sizeof(checkPoint);
record->xl_info = XLOG_CHECKPOINT_SHUTDOWN;
***************
*** 4494,4500 ****
ereport(LOG,
(errmsg("recovery_target_xid = %u",
recoveryTargetXid)));
! recoveryTarget = true;
recoveryTargetExact = true;
}
else if (strcmp(tok1, "recovery_target_time") == 0)
--- 4637,4643 ----
ereport(LOG,
(errmsg("recovery_target_xid = %u",
recoveryTargetXid)));
! recoveryTargetMode = RECOVERY_TARGET_STOP_XID;
recoveryTargetExact = true;
}
else if (strcmp(tok1, "recovery_target_time") == 0)
***************
*** 4505,4511 ****
*/
if (recoveryTargetExact)
continue;
! recoveryTarget = true;
recoveryTargetExact = false;
/*
--- 4648,4654 ----
*/
if (recoveryTargetExact)
continue;
! recoveryTargetMode = RECOVERY_TARGET_STOP_TIME;
recoveryTargetExact = false;
/*
***************
*** 4678,4700 ****
unlink(recoveryPath); /* ignore any error */
/*
! * Rename the config file out of the way, so that we don't accidentally
! * re-enter archive recovery mode in a subsequent crash.
*/
- unlink(RECOVERY_COMMAND_DONE);
- if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
- ereport(FATAL,
- (errcode_for_file_access(),
- errmsg("could not rename file \"%s\" to \"%s\": %m",
- RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
ereport(LOG,
(errmsg("archive recovery complete")));
}
/*
! * For point-in-time recovery, this function decides whether we want to
! * stop applying the XLOG at or after the current record.
*
* Returns TRUE if we are stopping, FALSE otherwise. On TRUE return,
* *includeThis is set TRUE if we should apply this record before stopping.
--- 4821,4877 ----
unlink(recoveryPath); /* ignore any error */
/*
! * As of 8.4 we no longer rename the recovery.conf file out of the
! * way until after we have performed a full checkpoint. This ensures
! * that any crash between now and the end of the checkpoint does not
! * attempt to restart from a WAL file that is no longer available to us.
! * As soon as we remove recovery.conf we lose our restore_command and
! * cannot reaccess WAL files from the archive.
*/
ereport(LOG,
(errmsg("archive recovery complete")));
}
+ #ifdef DEBUG_RECOVERY_CONTROL
+ static void
+ LogRecoveryTargetModeInfo(void)
+ {
+ int lrecoveryTargetMode;
+ TransactionId lrecoveryTargetXid;
+ TimestampTz lrecoveryTargetTime;
+ int lrecoveryTargetAdvance;
+
+ TimestampTz lrecoveryLastXTime;
+ TransactionId lrecoveryLastXid;
+
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+
+ lrecoveryTargetMode = xlogctl->recoveryTargetMode;
+ lrecoveryTargetXid = xlogctl->recoveryTargetXid;
+ lrecoveryTargetTime = xlogctl->recoveryTargetTime;
+ lrecoveryTargetAdvance = xlogctl->recoveryTargetAdvance;
+ lrecoveryLastXTime = xlogctl->recoveryLastXTime;
+ lrecoveryLastXid = xlogctl->recoveryLastXid;
+
+ SpinLockRelease(&xlogctl->info_lck);
+ }
+
+ elog(LOG, "mode %d xid %u time %s adv %d",
+ lrecoveryTargetMode,
+ lrecoveryTargetXid,
+ timestamptz_to_str(lrecoveryTargetTime),
+ lrecoveryTargetAdvance);
+ }
+ #endif
+
/*
! * For archive recovery, this function decides whether we want to
! * pause or stop applying the XLOG at or after the current record.
*
* Returns TRUE if we are stopping, FALSE otherwise. On TRUE return,
* *includeThis is set TRUE if we should apply this record before stopping.
***************
*** 4704,4775 ****
static bool
recoveryStopsHere(XLogRecord *record, bool *includeThis)
{
! bool stopsHere;
! uint8 record_info;
! TimestampTz recordXtime;
/* We only consider stopping at COMMIT or ABORT records */
! if (record->xl_rmid != RM_XACT_ID)
! return false;
! record_info = record->xl_info & ~XLR_INFO_MASK;
! if (record_info == XLOG_XACT_COMMIT)
{
! xl_xact_commit *recordXactCommitData;
! recordXactCommitData = (xl_xact_commit *) XLogRecGetData(record);
! recordXtime = recordXactCommitData->xact_time;
}
! else if (record_info == XLOG_XACT_ABORT)
{
! xl_xact_abort *recordXactAbortData;
! recordXactAbortData = (xl_xact_abort *) XLogRecGetData(record);
! recordXtime = recordXactAbortData->xact_time;
! }
! else
! return false;
! /* Remember the most recent COMMIT/ABORT time for logging purposes */
! recoveryLastXTime = recordXtime;
! /* Do we have a PITR target at all? */
! if (!recoveryTarget)
! return false;
- if (recoveryTargetExact)
- {
/*
! * there can be only one transaction end record with this exact
! * transactionid
! *
! * when testing for an xid, we MUST test for equality only, since
! * transactions are numbered in the order they start, not the order
! * they complete. A higher numbered xid will complete before you about
! * 50% of the time...
! */
! stopsHere = (record->xl_xid == recoveryTargetXid);
! if (stopsHere)
! *includeThis = recoveryTargetInclusive;
! }
! else
! {
/*
! * there can be many transactions that share the same commit time, so
! * we stop after the last one, if we are inclusive, or stop at the
! * first one if we are exclusive
*/
! if (recoveryTargetInclusive)
! stopsHere = (recordXtime > recoveryTargetTime);
! else
! stopsHere = (recordXtime >= recoveryTargetTime);
! if (stopsHere)
! *includeThis = false;
}
if (stopsHere)
{
recoveryStopXid = record->xl_xid;
! recoveryStopTime = recordXtime;
recoveryStopAfter = *includeThis;
if (record_info == XLOG_XACT_COMMIT)
--- 4881,5123 ----
static bool
recoveryStopsHere(XLogRecord *record, bool *includeThis)
{
! bool stopsHere = false;
! bool pauseHere = false;
! bool paused = false;
! uint8 record_info = 0; /* valid iff (is_xact_completion_record) */
! TimestampTz recordXtime = 0;
! bool is_xact_completion_record = false;
/* We only consider stopping at COMMIT or ABORT records */
! if (record->xl_rmid == RM_XACT_ID)
{
! record_info = record->xl_info & ~XLR_INFO_MASK;
! if (record_info == XLOG_XACT_COMMIT)
! {
! xl_xact_commit *recordXactCommitData;
!
! recordXactCommitData = (xl_xact_commit *) XLogRecGetData(record);
! recordXtime = recordXactCommitData->xact_time;
! is_xact_completion_record = true;
! }
! else if (record_info == XLOG_XACT_ABORT)
! {
! xl_xact_abort *recordXactAbortData;
!
! recordXactAbortData = (xl_xact_abort *) XLogRecGetData(record);
! recordXtime = recordXactAbortData->xact_time;
! is_xact_completion_record = true;
! }
! /* Remember the most recent COMMIT/ABORT time for logging purposes */
! if (is_xact_completion_record)
! {
! recoveryLastXTime = recordXtime;
! recoveryLastXid = record->xl_xid;
! }
}
!
! do
{
! int prevRecoveryTargetMode = recoveryTargetMode;
! /*
! * Let's see if user has updated our recoveryTargetMode.
! */
! {
! /* use volatile pointer to prevent code rearrangement */
! volatile XLogCtlData *xlogctl = XLogCtl;
!
! SpinLockAcquire(&xlogctl->info_lck);
! recoveryTargetMode = xlogctl->recoveryTargetMode;
! if (recoveryTargetMode != RECOVERY_TARGET_NONE)
! {
! recoveryTargetXid = xlogctl->recoveryTargetXid;
! recoveryTargetTime = xlogctl->recoveryTargetTime;
! recoveryTargetAdvance = xlogctl->recoveryTargetAdvance;
! }
! if (is_xact_completion_record)
! {
! xlogctl->recoveryLastXTime = recordXtime;
! xlogctl->recoveryLastXid = record->xl_xid;
! }
! SpinLockRelease(&xlogctl->info_lck);
! }
! /* Decide how to act on any pause target */
! switch (recoveryTargetMode)
! {
! case RECOVERY_TARGET_NONE:
! /*
! * If we aren't paused and we're not looking to stop,
! * just exit out quickly and get on with recovery.
! */
! if (paused)
! ereport(LOG,
! (errmsg("recovery restarting")));
! return false;
! case RECOVERY_TARGET_PAUSE_ALL:
! pauseHere = true;
! break;
!
! case RECOVERY_TARGET_ADVANCE:
! if (paused)
! {
! if (recoveryTargetAdvance > 0)
! return false;
! }
! else if (recoveryTargetAdvance-- <= 0)
! pauseHere = true;
! break;
!
! case RECOVERY_TARGET_STOP_IMMEDIATE:
! case RECOVERY_TARGET_STOP_XID:
! case RECOVERY_TARGET_STOP_TIME:
! paused = false;
! break;
!
! /*
! * If we're paused and the mode has changed, reset to allow the new
! * settings to apply and maybe allow us to continue.
! */
! if (paused && prevRecoveryTargetMode != recoveryTargetMode)
! paused = false;
!
! case RECOVERY_TARGET_PAUSE_CLEANUP:
! /*
! * Advance until we see a cleanup record.
! */
! if (RecordIsCleanupRecord(record))
! pauseHere = true;
! break;
!
! case RECOVERY_TARGET_PAUSE_XID:
! /*
! * There can be only one transaction end record with this exact
! * transaction id.
! *
! * When testing for an xid, we MUST test for equality only, since
! * transactions are numbered in the order they start, not the order
! * they complete. A higher-numbered xid will complete before a
! * lower-numbered one about 50% of the time.
! */
! if (is_xact_completion_record)
! pauseHere = (record->xl_xid == recoveryTargetXid);
! break;
!
! case RECOVERY_TARGET_PAUSE_TIME:
! /*
! * there can be many transactions that share the same commit time, so
! * we pause after the last one, if we are inclusive, or pause at the
! * first one if we are exclusive
! */
! if (is_xact_completion_record)
! {
! if (recoveryTargetInclusive)
! pauseHere = (recoveryLastXTime > recoveryTargetTime);
! else
! pauseHere = (recoveryLastXTime >= recoveryTargetTime);
! }
! break;
!
! default:
! ereport(WARNING,
! (errmsg("unknown recovery mode %d, continuing recovery",
! recoveryTargetMode)));
! return false;
! }
!
! if (pauseHere && !paused)
! {
! if (is_xact_completion_record)
! {
! if (record_info == XLOG_XACT_COMMIT)
! ereport(LOG,
! (errmsg("recovery pausing before commit of transaction %u, time %s",
! record->xl_xid,
! timestamptz_to_str(recoveryLastXTime))));
! else
! ereport(LOG,
! (errmsg("recovery pausing before abort of transaction %u, time %s",
! record->xl_xid,
! timestamptz_to_str(recoveryLastXTime))));
! }
! else
! ereport(LOG,
! (errmsg("recovery pausing; last completed transaction %u, time %s",
! recoveryLastXid,
! timestamptz_to_str(recoveryLastXTime))));
!
! set_ps_display("recovery paused", false);
!
! paused = true;
! }
/*
! * Pause for a while before rechecking mode at top of loop.
! */
! if (paused)
! pg_usleep(200000L);
!
/*
! * We leave the loop at the bottom only if our recovery mode is
! * set (or has been recently reset) to one of the stop options.
*/
! } while (paused);
!
! /*
! * Decide how to act if stop target mode set. We run this separately from
! * pause to allow user to reset their stop target while paused.
! */
! switch (recoveryTargetMode)
! {
! case RECOVERY_TARGET_STOP_IMMEDIATE:
! ereport(LOG,
! (errmsg("recovery stopping immediately")));
! return true;
!
! case RECOVERY_TARGET_STOP_XID:
! /*
! * There can be only one transaction end record with this exact
! * transaction id.
! *
! * When testing for an xid, we MUST test for equality only, since
! * transactions are numbered in the order they start, not the order
! * they complete. A higher-numbered xid will complete before a
! * lower-numbered one about 50% of the time.
! */
! if (is_xact_completion_record)
! {
! stopsHere = (record->xl_xid == recoveryTargetXid);
! if (stopsHere)
! *includeThis = recoveryTargetInclusive;
! }
! break;
!
! case RECOVERY_TARGET_STOP_TIME:
! /*
! * there can be many transactions that share the same commit time, so
! * we stop after the last one, if we are inclusive, or stop at the
! * first one if we are exclusive
! */
! if (is_xact_completion_record)
! {
! if (recoveryTargetInclusive)
! stopsHere = (recoveryLastXTime > recoveryTargetTime);
! else
! stopsHere = (recoveryLastXTime >= recoveryTargetTime);
! if (stopsHere)
! *includeThis = false;
! }
! break;
}
if (stopsHere)
{
+ Assert(is_xact_completion_record);
recoveryStopXid = record->xl_xid;
! recoveryStopTime = recoveryLastXTime;
recoveryStopAfter = *includeThis;
if (record_info == XLOG_XACT_COMMIT)
***************
*** 4804,4809 ****
--- 5152,5319 ----
}
/*
+ * Utility function used by various user functions to set the recovery
+ * target mode. This allows user control over the progress of recovery.
+ */
+ static void
+ SetRecoveryTargetMode(int mode, TransactionId xid, TimestampTz ts, int advance)
+ {
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser to control recovery")));
+
+ if (!IsRecoveryProcessingMode())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("WAL control functions can only be executed during recovery.")));
+
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->recoveryTargetMode = mode;
+
+ if (mode == RECOVERY_TARGET_STOP_XID ||
+ mode == RECOVERY_TARGET_PAUSE_XID)
+ xlogctl->recoveryTargetXid = xid;
+ else if (mode == RECOVERY_TARGET_STOP_TIME ||
+ mode == RECOVERY_TARGET_PAUSE_TIME)
+ xlogctl->recoveryTargetTime = ts;
+ else if (mode == RECOVERY_TARGET_ADVANCE)
+ xlogctl->recoveryTargetAdvance = advance;
+
+ SpinLockRelease(&xlogctl->info_lck);
+ }
+
+ return;
+ }
+
+ /*
+ * Resets the recovery target mode, so that a paused recovery resumes.
+ * Returns void.
+ */
+ Datum
+ pg_recovery_continue(PG_FUNCTION_ARGS)
+ {
+ SetRecoveryTargetMode(RECOVERY_TARGET_NONE, InvalidTransactionId, 0, 0);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Pause recovery immediately. Stays paused until asked to play again.
+ * Returns void.
+ */
+ Datum
+ pg_recovery_pause(PG_FUNCTION_ARGS)
+ {
+ SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_ALL, InvalidTransactionId, 0, 0);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Pause recovery at the next cleanup record. Stays paused until asked to
+ * play again.
+ */
+ Datum
+ pg_recovery_pause_cleanup(PG_FUNCTION_ARGS)
+ {
+ SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_CLEANUP, InvalidTransactionId, 0, 0);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Pause recovery at stated xid, if ever seen. Once paused, stays paused
+ * until asked to play again.
+ */
+ Datum
+ pg_recovery_pause_xid(PG_FUNCTION_ARGS)
+ {
+ int xidi = PG_GETARG_INT32(0);
+ TransactionId xid = (TransactionId) xidi;
+
+ if (!TransactionIdIsNormal(xid))
+ elog(ERROR, "cannot specify special values for transaction id");
+
+ SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_XID, xid, 0, 0);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Pause recovery at stated timestamp, if ever reached. Once paused, stays paused
+ * until asked to play again.
+ */
+ Datum
+ pg_recovery_pause_time(PG_FUNCTION_ARGS)
+ {
+ TimestampTz ts = PG_GETARG_TIMESTAMPTZ(0);
+
+ SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_TIME, InvalidTransactionId, ts, 0);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * If paused, advance N records.
+ */
+ Datum
+ pg_recovery_advance(PG_FUNCTION_ARGS)
+ {
+ int adv = PG_GETARG_INT32(0);
+
+ if (adv < 1)
+ elog(ERROR, "recovery advance must be greater than or equal to 1");
+
+ SetRecoveryTargetMode(RECOVERY_TARGET_ADVANCE, InvalidTransactionId, 0, adv);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Forces recovery to stop now if paused, or at end of next record if playing.
+ */
+ Datum
+ pg_recovery_stop(PG_FUNCTION_ARGS)
+ {
+ SetRecoveryTargetMode(RECOVERY_TARGET_STOP_IMMEDIATE, InvalidTransactionId, 0, 0);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Returns bool with current recovery mode
+ */
+ Datum
+ pg_is_in_recovery(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_BOOL(IsRecoveryProcessingMode());
+ }
+
+ /*
+ * Returns timestamp of last completed transaction
+ */
+ Datum
+ pg_last_completed_xact_timestamp(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_TIMESTAMPTZ(recoveryLastXTime);
+ }
+
+ /*
+ * Returns xid of last completed transaction
+ */
+ Datum
+ pg_last_completed_xid(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_INT32(recoveryLastXid);
+ }
+
+ /*
* This must be called ONCE during postmaster or standalone-backend startup
*/
void
***************
*** 4813,4818 ****
--- 5323,5329 ----
CheckPoint checkPoint;
bool wasShutdown;
bool reachedStopPoint = false;
+ bool performedRecovery = false;
bool haveBackupLabel = false;
XLogRecPtr RecPtr,
LastRec,
***************
*** 4825,4830 ****
--- 5336,5343 ----
uint32 freespace;
TransactionId oldestActiveXID;
+ XLogCtl->SharedRecoveryProcessingMode = true;
+
/*
* Read control file and check XLOG status looks valid.
*
***************
*** 5038,5046 ****
--- 5551,5565 ----
if (minRecoveryLoc.xlogid != 0 || minRecoveryLoc.xrecoff != 0)
ControlFile->minRecoveryPoint = minRecoveryLoc;
ControlFile->time = (pg_time_t) time(NULL);
+ /* No need to hold ControlFileLock yet, we aren't up far enough */
UpdateControlFile();
/*
+ * Reset pgstat data, because it may be invalid after recovery.
+ */
+ pgstat_reset_all();
+
+ /*
* If there was a backup label file, it's done its job and the info
* has now been propagated into pg_control. We must get rid of the
* label file so that if we crash during recovery, we'll pick up at
***************
*** 5097,5103 ****
do
{
#ifdef WAL_DEBUG
! if (XLOG_DEBUG)
{
StringInfoData buf;
--- 5616,5627 ----
do
{
#ifdef WAL_DEBUG
! int loglevel = DEBUG3;
!
! if (XLogRecIsFirstUseOfXid(record) || rmid == RM_XACT_ID)
! loglevel = DEBUG2;
!
! if (loglevel >= trace_recovery_messages)
{
StringInfoData buf;
***************
*** 5143,5148 ****
--- 5667,5675 ----
if (record->xl_info & XLR_BKP_BLOCK_MASK)
RestoreBkpBlocks(record, EndRecPtr);
+ if (XLogRecIsFirstUseOfXid(record))
+ RecordKnownAssignedTransactionIds(EndRecPtr, record);
+
RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record);
/* Pop the error context stack */
***************
*** 5150,5155 ****
--- 5677,5712 ----
LastRec = ReadRecPtr;
+ /*
+ * Can we signal Postmaster to enter consistent recovery mode?
+ *
+ * There are two points in the log that we must pass. The first
+ * is minRecoveryPoint, which is the LSN at the time the
+ * base backup was taken that we are about to roll forward from.
+ * If recovery has ever crashed or been stopped, there is also
+ * a second point, minSafeStartPoint, which is the latest LSN
+ * that recovery could have reached prior to the crash.
+ *
+ * We must also have assembled sufficient information about
+ * transaction state to allow valid snapshots to be taken.
+ */
+ if (!reachedSafeStartPoint &&
+ IsRunningXactDataIsValid() &&
+ XLByteLE(ControlFile->minSafeStartPoint, EndRecPtr) &&
+ XLByteLE(ControlFile->minRecoveryPoint, EndRecPtr))
+ {
+ reachedSafeStartPoint = true;
+ if (InArchiveRecovery)
+ {
+ ereport(LOG,
+ (errmsg("database has now reached consistent state at %X/%X",
+ EndRecPtr.xlogid, EndRecPtr.xrecoff)));
+ StartCleanupDelayStats();
+ if (IsUnderPostmaster)
+ SendPostmasterSignal(PMSIGNAL_RECOVERY_START);
+ }
+ }
+
record = ReadRecord(NULL, LOG);
} while (record != NULL && recoveryContinue);
***************
*** 5171,5176 ****
--- 5728,5734 ----
/* there are no WAL records following the checkpoint */
ereport(LOG,
(errmsg("redo is not required")));
+ reachedSafeStartPoint = true;
}
}
***************
*** 5184,5192 ****
/*
* Complain if we did not roll forward far enough to render the backup
! * dump consistent.
*/
! if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
{
if (reachedStopPoint) /* stopped because of stop request */
ereport(FATAL,
--- 5742,5750 ----
/*
* Complain if we did not roll forward far enough to render the backup
! * dump consistent and start safely.
*/
! if (InRecovery && !reachedSafeStartPoint)
{
if (reachedStopPoint) /* stopped because of stop request */
ereport(FATAL,
***************
*** 5308,5346 ****
XLogCheckInvalidPages();
/*
! * Reset pgstat data, because it may be invalid after recovery.
*/
! pgstat_reset_all();
! /*
! * Perform a checkpoint to update all our recovery activity to disk.
! *
! * Note that we write a shutdown checkpoint rather than an on-line
! * one. This is not particularly critical, but since we may be
! * assigning a new TLI, using a shutdown checkpoint allows us to have
! * the rule that TLI only changes in shutdown checkpoints, which
! * allows some extra error checking in xlog_redo.
! */
! CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
- /*
- * Preallocate additional log files, if wanted.
- */
- PreallocXlogFiles(EndOfLog);
-
- /*
- * Okay, we're officially UP.
- */
- InRecovery = false;
-
- ControlFile->state = DB_IN_PRODUCTION;
- ControlFile->time = (pg_time_t) time(NULL);
- UpdateControlFile();
-
- /* start the archive_timeout timer running */
- XLogCtl->Write.lastSegSwitchTime = ControlFile->time;
-
/* initialize shared-memory copy of latest checkpoint XID/epoch */
XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
--- 5866,5879 ----
XLogCheckInvalidPages();
/*
! * Finally exit recovery and mark that in WAL. Pre-8.4 we wrote
! * a shutdown checkpoint here, but we ask bgwriter to do that now.
*/
! exitRecovery();
! performedRecovery = true;
}
/* initialize shared-memory copy of latest checkpoint XID/epoch */
XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
***************
*** 5349,5354 ****
--- 5882,5889 ----
ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);
+ ProcArrayClearRecoveryTransactions();
+
/* Start up the commit log and related stuff, too */
StartupCLOG();
StartupSUBTRANS(oldestActiveXID);
***************
*** 5374,5379 ****
--- 5909,6010 ----
readRecordBuf = NULL;
readRecordBufSize = 0;
}
+
+ /*
+ * Prior to 8.4 we wrote a Shutdown Checkpoint at the end of recovery.
+ * This could add minutes to the startup time, so we want bgwriter
+ * to perform it. This then frees the Startup process to complete so we can
+ * allow transactions and WAL inserts. We still write a checkpoint, but
+ * it will be an online checkpoint. Online checkpoints have a redo
+ * location that can be prior to the actual checkpoint record. So we want
+ * to derive that redo location *before* we let anybody else write WAL,
+ * otherwise we might miss some WAL records if we crash.
+ */
+ if (performedRecovery)
+ {
+ XLogRecPtr redo;
+
+ /*
+ * We must grab the pointer before anybody writes WAL
+ */
+ redo = GetRedoLocationForCheckpoint();
+
+ /*
+ * Set up information for the bgwriter, but if it is not active
+ * for whatever reason, perform the checkpoint ourselves.
+ */
+ if (SetRedoLocationForArchiveCheckpoint(redo))
+ {
+ /*
+ * Okay, we can come up now. Allow others to write WAL.
+ */
+ XLogCtl->SharedRecoveryProcessingMode = false;
+ elog(trace_recovery(DEBUG1), "WAL inserts enabled");
+
+ /*
+ * Now request checkpoint from bgwriter.
+ */
+ RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE);
+ }
+ else
+ {
+ /*
+ * Startup process performs the checkpoint, but defers
+ * the change in processing mode until afterwards.
+ */
+ CreateCheckPoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE);
+ }
+ }
+ else
+ {
+ /*
+ * No recovery, so let's just get on with it.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->state = DB_IN_PRODUCTION;
+ ControlFile->time = (pg_time_t) time(NULL);
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ }
+
+ /*
+ * Okay, we can come up now. Allow others to write WAL.
+ */
+ XLogCtl->SharedRecoveryProcessingMode = false;
+
+ /* start the archive_timeout timer running */
+ XLogCtl->Write.lastSegSwitchTime = (pg_time_t) time(NULL);
+ }
+
+ /*
+ * IsRecoveryProcessingMode()
+ *
+ * Fast test for whether we're still in recovery or not. We test the shared
+ * state each time only until we leave recovery mode. After that we never
+ * look again, relying upon the settings of our local state variables. This
+ * is designed to avoid the need for a separate initialisation step.
+ */
+ bool
+ IsRecoveryProcessingMode(void)
+ {
+ if (knownProcessingMode && !LocalRecoveryProcessingMode)
+ return false;
+
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ if (xlogctl == NULL)
+ return false;
+
+ SpinLockAcquire(&xlogctl->mode_lck);
+ LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode;
+ SpinLockRelease(&xlogctl->mode_lck);
+ }
+
+ knownProcessingMode = true;
+
+ return LocalRecoveryProcessingMode;
}
/*
***************
*** 5631,5650 ****
static void
LogCheckpointStart(int flags)
{
! elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
! (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
! (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
! (flags & CHECKPOINT_FORCE) ? " force" : "",
! (flags & CHECKPOINT_WAIT) ? " wait" : "",
! (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
! (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
}
/*
* Log end of a checkpoint.
*/
static void
! LogCheckpointEnd(void)
{
long write_secs,
sync_secs,
--- 6262,6285 ----
static void
LogCheckpointStart(int flags)
{
! if (flags & CHECKPOINT_RESTARTPOINT)
! elog(LOG, "restartpoint starting:%s",
! (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "");
! else
! elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
! (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
! (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
! (flags & CHECKPOINT_FORCE) ? " force" : "",
! (flags & CHECKPOINT_WAIT) ? " wait" : "",
! (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
! (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
}
/*
* Log end of a checkpoint.
*/
static void
! LogCheckpointEnd(int flags)
{
long write_secs,
sync_secs,
***************
*** 5667,5683 ****
CheckpointStats.ckpt_sync_end_t,
&sync_secs, &sync_usecs);
! elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
! "%d transaction log file(s) added, %d removed, %d recycled; "
! "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! CheckpointStats.ckpt_bufs_written,
! (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! CheckpointStats.ckpt_segs_added,
! CheckpointStats.ckpt_segs_removed,
! CheckpointStats.ckpt_segs_recycled,
! write_secs, write_usecs / 1000,
! sync_secs, sync_usecs / 1000,
! total_secs, total_usecs / 1000);
}
/*
--- 6302,6327 ----
CheckpointStats.ckpt_sync_end_t,
&sync_secs, &sync_usecs);
! if (flags & CHECKPOINT_RESTARTPOINT)
! elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! CheckpointStats.ckpt_bufs_written,
! (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! write_secs, write_usecs / 1000,
! sync_secs, sync_usecs / 1000,
! total_secs, total_usecs / 1000);
! else
! elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
! "%d transaction log file(s) added, %d removed, %d recycled; "
! "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! CheckpointStats.ckpt_bufs_written,
! (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! CheckpointStats.ckpt_segs_added,
! CheckpointStats.ckpt_segs_removed,
! CheckpointStats.ckpt_segs_recycled,
! write_secs, write_usecs / 1000,
! sync_secs, sync_usecs / 1000,
! total_secs, total_usecs / 1000);
}
/*
***************
*** 5702,5718 ****
XLogRecPtr recptr;
XLogCtlInsert *Insert = &XLogCtl->Insert;
XLogRecData rdata;
- uint32 freespace;
uint32 _logId;
uint32 _logSeg;
TransactionId *inCommitXids;
int nInCommit;
/*
* Acquire CheckpointLock to ensure only one checkpoint happens at a time.
! * (This is just pro forma, since in the present system structure there is
! * only one process that is allowed to issue checkpoints at any given
! * time.)
*/
LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
--- 6346,6361 ----
XLogRecPtr recptr;
XLogCtlInsert *Insert = &XLogCtl->Insert;
XLogRecData rdata;
uint32 _logId;
uint32 _logSeg;
TransactionId *inCommitXids;
int nInCommit;
+ bool leavingArchiveRecovery = false;
/*
* Acquire CheckpointLock to ensure only one checkpoint happens at a time.
! * That shouldn't be happening, but checkpoints are an important aspect
! * of our resilience, so we take no chances.
*/
LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
***************
*** 5727,5741 ****
--- 6370,6393 ----
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
/*
+ * Find out if this is the first checkpoint after archive recovery.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ leavingArchiveRecovery = (ControlFile->state == DB_IN_ARCHIVE_RECOVERY);
+ LWLockRelease(ControlFileLock);
+
+ /*
* Use a critical section to force system panic if we have trouble.
*/
START_CRIT_SECTION();
if (shutdown)
{
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->state = DB_SHUTDOWNING;
ControlFile->time = (pg_time_t) time(NULL);
UpdateControlFile();
+ LWLockRelease(ControlFileLock);
}
/*
***************
*** 5791,5840 ****
}
}
! /*
! * Compute new REDO record ptr = location of next XLOG record.
! *
! * NB: this is NOT necessarily where the checkpoint record itself will be,
! * since other backends may insert more XLOG records while we're off doing
! * the buffer flush work. Those XLOG records are logically after the
! * checkpoint, even though physically before it. Got that?
! */
! freespace = INSERT_FREESPACE(Insert);
! if (freespace < SizeOfXLogRecord)
! {
! (void) AdvanceXLInsertBuffer(false);
! /* OK to ignore update return flag, since we will do flush anyway */
! freespace = INSERT_FREESPACE(Insert);
! }
! INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
!
! /*
! * Here we update the shared RedoRecPtr for future XLogInsert calls; this
! * must be done while holding the insert lock AND the info_lck.
! *
! * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
! * pointing past where it really needs to point. This is okay; the only
! * consequence is that XLogInsert might back up whole buffers that it
! * didn't really need to. We can't postpone advancing RedoRecPtr because
! * XLogInserts that happen while we are dumping buffers must assume that
! * their buffer changes are not included in the checkpoint.
! */
{
! /* use volatile pointer to prevent code rearrangement */
! volatile XLogCtlData *xlogctl = XLogCtl;
! SpinLockAcquire(&xlogctl->info_lck);
! RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
! SpinLockRelease(&xlogctl->info_lck);
}
/*
- * Now we can release WAL insert lock, allowing other xacts to proceed
- * while we are flushing disk buffers.
- */
- LWLockRelease(WALInsertLock);
-
- /*
* If enabled, log checkpoint start. We postpone this until now so as not
* to log anything if we decided to skip the checkpoint.
*/
--- 6443,6470 ----
}
}
! if (leavingArchiveRecovery)
! checkPoint.redo = GetRedoLocationForArchiveCheckpoint();
! else
{
! /*
! * Compute new REDO record ptr = location of next XLOG record.
! *
! * NB: this is NOT necessarily where the checkpoint record itself will be,
! * since other backends may insert more XLOG records while we're off doing
! * the buffer flush work. Those XLOG records are logically after the
! * checkpoint, even though physically before it. Got that?
! */
! checkPoint.redo = GetRedoLocationForCheckpoint();
! /*
! * Now we can release WAL insert lock, allowing other xacts to proceed
! * while we are flushing disk buffers.
! */
! LWLockRelease(WALInsertLock);
}
/*
* If enabled, log checkpoint start. We postpone this until now so as not
* to log anything if we decided to skip the checkpoint.
*/
***************
*** 5941,5951 ****
XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);
/*
! * Update the control file.
*/
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (shutdown)
ControlFile->state = DB_SHUTDOWNED;
ControlFile->prevCheckPoint = ControlFile->checkPoint;
ControlFile->checkPoint = ProcLastRecPtr;
ControlFile->checkPointCopy = checkPoint;
--- 6571,6588 ----
XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);
/*
! * Update the control file. In 8.4, this routine becomes the primary
! * point for recording changes of state in the control file at the
! * end of recovery. Postmaster state already shows us being in
! * normal running mode, but it is only after this point that we
! * are completely free of reperforming a recovery if we crash. Note
! * that this is executed by bgwriter after the death of Startup process.
*/
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (shutdown)
ControlFile->state = DB_SHUTDOWNED;
+ else
+ ControlFile->state = DB_IN_PRODUCTION;
ControlFile->prevCheckPoint = ControlFile->checkPoint;
ControlFile->checkPoint = ProcLastRecPtr;
ControlFile->checkPointCopy = checkPoint;
***************
*** 5953,5958 ****
--- 6590,6610 ----
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ if (leavingArchiveRecovery)
+ {
+ /*
+ * Rename the config file out of the way, so that we don't accidentally
+ * re-enter archive recovery mode in a subsequent crash. Prior to
+ * 8.4 this step was performed at end of exitArchiveRecovery().
+ */
+ unlink(RECOVERY_COMMAND_DONE);
+ if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+ RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
+ }
+
/* Update shared-memory copy of checkpoint XID/epoch */
{
/* use volatile pointer to prevent code rearrangement */
***************
*** 5996,6012 ****
* Truncate pg_subtrans if possible. We can throw away all data before
* the oldest XMIN of any running transaction. No future transaction will
* attempt to reference any pg_subtrans entry older than that (see Asserts
! * in subtrans.c). During recovery, though, we mustn't do this because
! * StartupSUBTRANS hasn't been called yet.
*/
! if (!InRecovery)
TruncateSUBTRANS(GetOldestXmin(true, false));
/* All real work is done, but log before releasing lock. */
if (log_checkpoints)
! LogCheckpointEnd();
LWLockRelease(CheckpointLock);
}
/*
--- 6648,6719 ----
* Truncate pg_subtrans if possible. We can throw away all data before
* the oldest XMIN of any running transaction. No future transaction will
* attempt to reference any pg_subtrans entry older than that (see Asserts
! * in subtrans.c).
*/
! if (!shutdown)
TruncateSUBTRANS(GetOldestXmin(true, false));
/* All real work is done, but log before releasing lock. */
if (log_checkpoints)
! LogCheckpointEnd(flags);
LWLockRelease(CheckpointLock);
+
+ /*
+ * Take a snapshot of running transactions and write this to WAL.
+ * This allows us to reconstruct the state of running transactions
+ * during archive recovery, if required.
+ *
+ * If we are shutting down, or Startup process is completing crash
+ * recovery we don't need to write running xact data.
+ */
+ if (!shutdown && !IsRecoveryProcessingMode())
+ LogCurrentRunningXacts();
+ }
+
+ /*
+ * GetRedoLocationForCheckpoint()
+ *
+ * When !IsRecoveryProcessingMode() this must be called while holding
+ * WALInsertLock.
+ */
+ static XLogRecPtr
+ GetRedoLocationForCheckpoint()
+ {
+ XLogCtlInsert *Insert = &XLogCtl->Insert;
+ uint32 freespace;
+ XLogRecPtr redo;
+
+ freespace = INSERT_FREESPACE(Insert);
+ if (freespace < SizeOfXLogRecord)
+ {
+ (void) AdvanceXLInsertBuffer(false);
+ /* OK to ignore update return flag, since we will do flush anyway */
+ freespace = INSERT_FREESPACE(Insert);
+ }
+ INSERT_RECPTR(redo, Insert, Insert->curridx);
+
+ /*
+ * Here we update the shared RedoRecPtr for future XLogInsert calls; this
+ * must be done while holding the insert lock AND the info_lck.
+ *
+ * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
+ * pointing past where it really needs to point. This is okay; the only
+ * consequence is that XLogInsert might back up whole buffers that it
+ * didn't really need to. We can't postpone advancing RedoRecPtr because
+ * XLogInserts that happen while we are dumping buffers must assume that
+ * their buffer changes are not included in the checkpoint.
+ */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ RedoRecPtr = xlogctl->Insert.RedoRecPtr = redo;
+ SpinLockRelease(&xlogctl->info_lck);
+ }
+
+ return redo;
}
/*
***************
*** 6065,6071 ****
if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
if (!(RmgrTable[rmid].rm_safe_restartpoint()))
{
! elog(DEBUG2, "RM %d not safe to record restart point at %X/%X",
rmid,
checkPoint->redo.xlogid,
checkPoint->redo.xrecoff);
--- 6772,6778 ----
if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
if (!(RmgrTable[rmid].rm_safe_restartpoint()))
{
! elog(trace_recovery(DEBUG2), "RM %d not safe to record restart point at %X/%X",
rmid,
checkPoint->redo.xlogid,
checkPoint->redo.xrecoff);
***************
*** 6073,6103 ****
}
}
/*
! * OK, force data out to disk
*/
! CheckPointGuts(checkPoint->redo, CHECKPOINT_IMMEDIATE);
/*
! * Update pg_control so that any subsequent crash will restart from this
! * checkpoint. Note: ReadRecPtr gives the XLOG address of the checkpoint
! * record itself.
*/
- ControlFile->prevCheckPoint = ControlFile->checkPoint;
- ControlFile->checkPoint = ReadRecPtr;
- ControlFile->checkPointCopy = *checkPoint;
- ControlFile->time = (pg_time_t) time(NULL);
- UpdateControlFile();
ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
! (errmsg("recovery restart point at %X/%X",
! checkPoint->redo.xlogid, checkPoint->redo.xrecoff)));
! if (recoveryLastXTime)
! ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
! (errmsg("last completed transaction was at log time %s",
! timestamptz_to_str(recoveryLastXTime))));
! }
/*
* Write a NEXTOID log record
*/
--- 6780,6852 ----
}
}
+ RequestRestartPoint(ReadRecPtr, checkPoint, reachedSafeStartPoint);
+ }
+
+ /*
+ * As of 8.4, RestartPoints are always created by the bgwriter
+ * once we have reachedSafeStartPoint. We use bgwriter's shared memory
+ * area wherever we call it from, to keep better code structure.
+ */
+ void
+ CreateRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, int flags)
+ {
+ if (recoveryLogRestartpoints || log_checkpoints)
+ {
+ /*
+ * Prepare to accumulate statistics.
+ */
+
+ MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
+ CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+
+ LogCheckpointStart(CHECKPOINT_RESTARTPOINT | flags);
+ }
+
+ /*
+ * Acquire CheckpointLock to ensure only one restartpoint happens at a time.
+ * We rely on this lock to ensure that the startup process doesn't exit
+ * Recovery while we are half way through a restartpoint.
+ */
+ LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+
+ CheckPointGuts(restartPoint->redo, CHECKPOINT_RESTARTPOINT | flags);
+
/*
! * Update pg_control, using current time
*/
! LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
! ControlFile->prevCheckPoint = ControlFile->checkPoint;
! ControlFile->checkPoint = ReadPtr;
! ControlFile->checkPointCopy = *restartPoint;
! ControlFile->time = (pg_time_t) time(NULL);
! UpdateControlFile();
! LWLockRelease(ControlFileLock);
/*
! * Currently, there is no need to truncate pg_subtrans during recovery.
! * If we did that, we would need to have called StartupSUBTRANS()
! * already and then TruncateSUBTRANS() would go here.
*/
+ /* All real work is done, but log before releasing lock. */
+ if (recoveryLogRestartpoints || log_checkpoints)
+ LogCheckpointEnd(CHECKPOINT_RESTARTPOINT);
+
ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
! (errmsg("recovery restart point at %X/%X",
! restartPoint->redo.xlogid, restartPoint->redo.xrecoff)));
!
! ReportCleanupDelayStats();
+ if (recoveryLastXTime)
+ ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
+ (errmsg("last completed transaction was at log time %s",
+ timestamptz_to_str(recoveryLastXTime))));
+
+ LWLockRelease(CheckpointLock);
+ }
+
/*
* Write a NEXTOID log record
*/
***************
*** 6160,6166 ****
}
/*
! * XLOG resource manager's routines
*/
void
xlog_redo(XLogRecPtr lsn, XLogRecord *record)
--- 6909,6971 ----
}
/*
! * exitRecovery()
! *
! * Exit recovery state and write an XLOG_RECOVERY_END record. This is the
! * only record type that can record a change of timelineID. We assume
! * caller has already set ThisTimeLineID, if appropriate.
! */
! static void
! exitRecovery(void)
! {
! XLogRecData rdata;
!
! rdata.buffer = InvalidBuffer;
! rdata.data = (char *) (&ThisTimeLineID);
! rdata.len = sizeof(TimeLineID);
! rdata.next = NULL;
!
! /*
! * If a restartpoint is in progress, we will not be able to successfully
! * acquire CheckpointLock. If bgwriter is still in progress then send
! * a second signal to nudge bgwriter to go faster so we can avoid delay.
! * Then wait for lock, so we know the restartpoint has completed. We do
! * this because we don't want to interrupt the restartpoint half way
! * through, which might leave us in a mess and we want to be robust. We're
! * going to checkpoint soon anyway, so it's not wasted effort.
! */
! if (LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
! LWLockRelease(CheckpointLock);
! else
! {
! RequestRestartPointCompletion();
! ereport(trace_recovery(DEBUG1),
! (errmsg("startup process waiting for restartpoint to complete")));
! LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
! LWLockRelease(CheckpointLock);
! }
!
! /*
! * This is the only type of WAL message that can be inserted during
! * recovery. This ensures that we don't allow others to get access
! * until after we have changed state.
! */
! (void) XLogInsert(RM_XLOG_ID, XLOG_RECOVERY_END, &rdata);
!
! /*
! * We don't XLogFlush() here; otherwise we'll end up zeroing the WAL
! * file ourselves. So just let bgwriter's forthcoming checkpoint do
! * that for us.
! */
!
! InRecovery = false;
! }
!
! /*
! * XLOG resource manager's routines.
! *
! * Definitions of message info are in include/catalog/pg_control.h,
! * though not all messages relate to control file processing.
*/
void
xlog_redo(XLogRecPtr lsn, XLogRecord *record)
***************
*** 6190,6216 ****
MultiXactSetNextMXact(checkPoint.nextMulti,
checkPoint.nextMultiOffset);
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
! /*
! * TLI may change in a shutdown checkpoint, but it shouldn't decrease
*/
- if (checkPoint.ThisTimeLineID != ThisTimeLineID)
- {
- if (checkPoint.ThisTimeLineID < ThisTimeLineID ||
- !list_member_int(expectedTLIs,
- (int) checkPoint.ThisTimeLineID))
- ereport(PANIC,
- (errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
- checkPoint.ThisTimeLineID, ThisTimeLineID)));
- /* Following WAL records should be run with new TLI */
- ThisTimeLineID = checkPoint.ThisTimeLineID;
- }
RecoveryRestartPoint(&checkPoint);
}
else if (info == XLOG_CHECKPOINT_ONLINE)
{
CheckPoint checkPoint;
--- 6995,7041 ----
MultiXactSetNextMXact(checkPoint.nextMulti,
checkPoint.nextMultiOffset);
+ /* We know nothing was running on the master after this point */
+ ProcArrayClearRecoveryTransactions();
+
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
! /*
! * TLI no longer changes at shutdown checkpoint, since as of 8.4,
! * shutdown checkpoints only occur at shutdown. Much less confusing.
*/
RecoveryRestartPoint(&checkPoint);
}
+ else if (info == XLOG_RECOVERY_END)
+ {
+ TimeLineID tli;
+
+ memcpy(&tli, XLogRecGetData(record), sizeof(TimeLineID));
+
+ /*
+ * TLI may change when recovery ends, but it shouldn't decrease.
+ *
+ * This is the only WAL record that can tell us to change timelineID
+ * while we process WAL records.
+ *
+ * We can *choose* to stop recovery at any point, generating a
+ * new timelineID which is recorded using this record type.
+ */
+ if (tli != ThisTimeLineID)
+ {
+ if (tli < ThisTimeLineID ||
+ !list_member_int(expectedTLIs,
+ (int) tli))
+ ereport(PANIC,
+ (errmsg("unexpected timeline ID %u (after %u) at recovery end record",
+ tli, ThisTimeLineID)));
+ /* Following WAL records should be run with new TLI */
+ ThisTimeLineID = tli;
+ }
+ }
else if (info == XLOG_CHECKPOINT_ONLINE)
{
CheckPoint checkPoint;
***************
*** 6232,6238 ****
ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
! /* TLI should not change in an on-line checkpoint */
if (checkPoint.ThisTimeLineID != ThisTimeLineID)
ereport(PANIC,
(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
--- 7057,7063 ----
ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
! /* TLI must not change at a checkpoint */
if (checkPoint.ThisTimeLineID != ThisTimeLineID)
ereport(PANIC,
(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
***************
*** 6300,6305 ****
--- 7125,7137 ----
record->xl_prev.xlogid, record->xl_prev.xrecoff,
record->xl_xid);
+ appendStringInfo(buf, "; pxid %u %s %s len %u slot %d",
+ record->xl_xid2,
+ (XLogRecIsFirstXidRecord(record) ? "t" : "f"),
+ (XLogRecIsFirstSubXidRecord(record) ? "t" : "f"),
+ record->xl_len,
+ XLogRecGetSlotId(record));
+
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
if (record->xl_info & XLR_SET_BKP_BLOCK(i))
***************
*** 6468,6473 ****
--- 7300,7311 ----
errhint("archive_command must be defined before "
"online backups can be made safely.")));
+ if (IsRecoveryProcessingMode())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("WAL control functions cannot be executed during recovery.")));
+
backupidstr = text_to_cstring(backupid);
/*
***************
*** 6631,6636 ****
--- 7469,7480 ----
errmsg("WAL archiving is not active"),
errhint("archive_mode must be enabled at server start.")));
+ if (IsRecoveryProcessingMode())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("WAL control functions cannot be executed during recovery.")));
+
/*
* OK to clear forcePageWrites
*/
***************
*** 6782,6787 ****
--- 7626,7637 ----
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
(errmsg("must be superuser to switch transaction log files"))));
+ if (IsRecoveryProcessingMode())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("WAL control functions cannot be executed during recovery.")));
+
switchpoint = RequestXLogSwitch();
/*
***************
*** 6804,6809 ****
--- 7654,7665 ----
{
char location[MAXFNAMELEN];
+ if (IsRecoveryProcessingMode())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("WAL control functions cannot be executed during recovery.")));
+
/* Make sure we have an up-to-date local LogwrtResult */
{
/* use volatile pointer to prevent code rearrangement */
***************
*** 6831,6836 ****
--- 7687,7698 ----
XLogRecPtr current_recptr;
char location[MAXFNAMELEN];
+ if (IsRecoveryProcessingMode())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("WAL control functions cannot be executed during recovery.")));
+
/*
* Get the current end-of-WAL position ... shared lock is sufficient
*/
Index: src/backend/access/transam/xlogutils.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.59
diff -c -r1.59 xlogutils.c
*** src/backend/access/transam/xlogutils.c 30 Sep 2008 10:52:11 -0000 1.59
--- src/backend/access/transam/xlogutils.c 27 Oct 2008 18:32:03 -0000
***************
*** 228,234 ****
Buffer
XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init)
{
! return XLogReadBufferWithFork(rnode, MAIN_FORKNUM, blkno, init);
}
/*
--- 228,240 ----
Buffer
XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init)
{
! return XLogReadBufferWithFork(rnode, MAIN_FORKNUM, blkno, init, BUFFER_LOCK_EXCLUSIVE);
! }
!
! Buffer
! XLogReadBufferForCleanup(RelFileNode rnode, BlockNumber blkno, bool init)
! {
! return XLogReadBufferWithFork(rnode, MAIN_FORKNUM, blkno, init, BUFFER_LOCK_CLEANUP);
}
/*
***************
*** 238,244 ****
*/
Buffer
XLogReadBufferWithFork(RelFileNode rnode, ForkNumber forknum,
! BlockNumber blkno, bool init)
{
BlockNumber lastblock;
Buffer buffer;
--- 244,250 ----
*/
Buffer
XLogReadBufferWithFork(RelFileNode rnode, ForkNumber forknum,
! BlockNumber blkno, bool init, int mode)
{
BlockNumber lastblock;
Buffer buffer;
***************
*** 289,295 ****
Assert(BufferGetBlockNumber(buffer) == blkno);
}
! LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (!init)
{
--- 295,306 ----
Assert(BufferGetBlockNumber(buffer) == blkno);
}
! if (mode == BUFFER_LOCK_EXCLUSIVE)
! LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
! else if (mode == BUFFER_LOCK_CLEANUP)
! LockBufferForCleanup(buffer);
! else
! elog(FATAL, "invalid buffer lock mode %d", mode);
if (!init)
{
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.246
diff -c -r1.246 bootstrap.c
*** src/backend/bootstrap/bootstrap.c 30 Sep 2008 10:52:11 -0000 1.246
--- src/backend/bootstrap/bootstrap.c 27 Oct 2008 18:32:03 -0000
***************
*** 418,424 ****
case StartupProcess:
bootstrap_signals();
StartupXLOG();
! BuildFlatFiles(false);
proc_exit(0); /* startup done */
case BgWriterProcess:
--- 418,424 ----
case StartupProcess:
bootstrap_signals();
StartupXLOG();
! BuildFlatFiles(false, true, true);
proc_exit(0); /* startup done */
case BgWriterProcess:
Index: src/backend/commands/discard.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/commands/discard.c,v
retrieving revision 1.4
diff -c -r1.4 discard.c
*** src/backend/commands/discard.c 1 Jan 2008 19:45:49 -0000 1.4
--- src/backend/commands/discard.c 27 Oct 2008 18:32:03 -0000
***************
*** 65,71 ****
ResetAllOptions();
DropAllPreparedStatements();
PortalHashTableDeleteAll();
! Async_UnlistenAll();
ResetPlanCache();
ResetTempTableNamespace();
}
--- 65,72 ----
ResetAllOptions();
DropAllPreparedStatements();
PortalHashTableDeleteAll();
! if (!IsRecoveryProcessingMode())
! Async_UnlistenAll();
ResetPlanCache();
ResetTempTableNamespace();
}
Index: src/backend/commands/lockcmds.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/commands/lockcmds.c,v
retrieving revision 1.19
diff -c -r1.19 lockcmds.c
*** src/backend/commands/lockcmds.c 8 Sep 2008 00:47:40 -0000 1.19
--- src/backend/commands/lockcmds.c 27 Oct 2008 18:32:03 -0000
***************
*** 49,54 ****
--- 49,66 ----
*/
reloid = RangeVarGetRelid(relation, false);
+ /*
+ * During recovery we only accept these variations:
+ *
+ * LOCK TABLE foo -- parser translates as AccessExclusiveLock request
+ * LOCK TABLE foo IN AccessShareLock MODE
+ * LOCK TABLE foo IN AccessExclusiveLock MODE
+ */
+ if (IsRecoveryProcessingMode() &&
+ !(lockstmt->mode == AccessShareLock ||
+ lockstmt->mode == AccessExclusiveLock))
+ PreventCommandDuringRecovery();
+
if (lockstmt->mode == AccessShareLock)
aclresult = pg_class_aclcheck(reloid, GetUserId(),
ACL_SELECT);
Index: src/backend/commands/sequence.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/commands/sequence.c,v
retrieving revision 1.154
diff -c -r1.154 sequence.c
*** src/backend/commands/sequence.c 13 Jul 2008 20:45:47 -0000 1.154
--- src/backend/commands/sequence.c 27 Oct 2008 18:32:03 -0000
***************
*** 457,462 ****
--- 457,464 ----
rescnt = 0;
bool logit = false;
+ PreventCommandDuringRecovery();
+
/* open and AccessShareLock sequence */
init_sequence(relid, &elm, &seqrel);
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.53
diff -c -r1.53 bgwriter.c
*** src/backend/postmaster/bgwriter.c 14 Oct 2008 08:06:39 -0000 1.53
--- src/backend/postmaster/bgwriter.c 27 Oct 2008 18:32:03 -0000
***************
*** 49,54 ****
--- 49,55 ----
#include
#include "access/xlog_internal.h"
+ #include "catalog/pg_control.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
***************
*** 129,134 ****
--- 130,142 ----
int ckpt_flags; /* checkpoint flags, as defined in xlog.h */
+ /*
+ * When the Startup process wants bgwriter to perform a restartpoint, it
+ * sets these fields so that we can update the control file afterwards.
+ */
+ XLogRecPtr ReadPtr; /* Requested log pointer */
+ CheckPoint restartPoint; /* restartPoint data for ControlFile */
+
uint32 num_backend_writes; /* counts non-bgwriter buffer writes */
int num_requests; /* current # of requests */
***************
*** 165,171 ****
/* these values are valid when ckpt_active is true: */
static pg_time_t ckpt_start_time;
! static XLogRecPtr ckpt_start_recptr;
static double ckpt_cached_elapsed;
static pg_time_t last_checkpoint_time;
--- 173,179 ----
/* these values are valid when ckpt_active is true: */
static pg_time_t ckpt_start_time;
! static XLogRecPtr ckpt_start_recptr; /* not used if IsRecoveryProcessingMode */
static double ckpt_cached_elapsed;
static pg_time_t last_checkpoint_time;
***************
*** 197,202 ****
--- 205,211 ----
{
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
+ bool BgWriterRecoveryMode;
BgWriterShmem->bgwriter_pid = MyProcPid;
am_bg_writer = true;
***************
*** 355,370 ****
*/
PG_SETMASK(&UnBlockSig);
/*
* Loop forever
*/
for (;;)
{
- bool do_checkpoint = false;
- int flags = 0;
- pg_time_t now;
- int elapsed_secs;
-
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
--- 364,380 ----
*/
PG_SETMASK(&UnBlockSig);
+ BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+ if (BgWriterRecoveryMode)
+ elog(DEBUG1, "bgwriter starting during recovery, pid = %d",
+ BgWriterShmem->bgwriter_pid);
+
/*
* Loop forever
*/
for (;;)
{
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
***************
*** 372,499 ****
if (!PostmasterIsAlive(true))
exit(1);
- /*
- * Process any requests or signals received recently.
- */
- AbsorbFsyncRequests();
-
if (got_SIGHUP)
{
got_SIGHUP = false;
ProcessConfigFile(PGC_SIGHUP);
}
- if (checkpoint_requested)
- {
- checkpoint_requested = false;
- do_checkpoint = true;
- BgWriterStats.m_requested_checkpoints++;
- }
- if (shutdown_requested)
- {
- /*
- * From here on, elog(ERROR) should end with exit(1), not send
- * control back to the sigsetjmp block above
- */
- ExitOnAnyError = true;
- /* Close down the database */
- ShutdownXLOG(0, 0);
- /* Normal exit from the bgwriter is here */
- proc_exit(0); /* done */
- }
! /*
! * Force a checkpoint if too much time has elapsed since the last one.
! * Note that we count a timed checkpoint in stats only when this
! * occurs without an external request, but we set the CAUSE_TIME flag
! * bit even if there is also an external request.
! */
! now = (pg_time_t) time(NULL);
! elapsed_secs = now - last_checkpoint_time;
! if (elapsed_secs >= CheckPointTimeout)
{
! if (!do_checkpoint)
! BgWriterStats.m_timed_checkpoints++;
! do_checkpoint = true;
! flags |= CHECKPOINT_CAUSE_TIME;
}
!
! /*
! * Do a checkpoint if requested, otherwise do one cycle of
! * dirty-buffer writing.
! */
! if (do_checkpoint)
{
! /* use volatile pointer to prevent code rearrangement */
! volatile BgWriterShmemStruct *bgs = BgWriterShmem;
/*
! * Atomically fetch the request flags to figure out what kind of a
! * checkpoint we should perform, and increase the started-counter
! * to acknowledge that we've started a new checkpoint.
*/
! SpinLockAcquire(&bgs->ckpt_lck);
! flags |= bgs->ckpt_flags;
! bgs->ckpt_flags = 0;
! bgs->ckpt_started++;
! SpinLockRelease(&bgs->ckpt_lck);
! /*
! * We will warn if (a) too soon since last checkpoint (whatever
! * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
! * since the last checkpoint start. Note in particular that this
! * implementation will not generate warnings caused by
! * CheckPointTimeout < CheckPointWarning.
! */
! if ((flags & CHECKPOINT_CAUSE_XLOG) &&
! elapsed_secs < CheckPointWarning)
! ereport(LOG,
! (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
! elapsed_secs),
! errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!
! /*
! * Initialize bgwriter-private variables used during checkpoint.
! */
! ckpt_active = true;
! ckpt_start_recptr = GetInsertRecPtr();
! ckpt_start_time = now;
! ckpt_cached_elapsed = 0;
!
! /*
! * Do the checkpoint.
! */
! CreateCheckPoint(flags);
/*
! * After any checkpoint, close all smgr files. This is so we
! * won't hang onto smgr references to deleted files indefinitely.
*/
! smgrcloseall();
/*
! * Indicate checkpoint completion to any waiting backends.
*/
! SpinLockAcquire(&bgs->ckpt_lck);
! bgs->ckpt_done = bgs->ckpt_started;
! SpinLockRelease(&bgs->ckpt_lck);
! ckpt_active = false;
! /*
! * Note we record the checkpoint start time not end time as
! * last_checkpoint_time. This is so that time-driven checkpoints
! * happen at a predictable spacing.
! */
! last_checkpoint_time = now;
}
- else
- BgBufferSync();
-
- /* Check for archive_timeout and switch xlog files if necessary. */
- CheckArchiveTimeout();
-
- /* Nap for the configured time. */
- BgWriterNap();
}
}
--- 382,595 ----
if (!PostmasterIsAlive(true))
exit(1);
if (got_SIGHUP)
{
got_SIGHUP = false;
ProcessConfigFile(PGC_SIGHUP);
}
! if (BgWriterRecoveryMode)
{
! if (shutdown_requested)
! {
! /*
! * From here on, elog(ERROR) should end with exit(1), not send
! * control back to the sigsetjmp block above
! */
! ExitOnAnyError = true;
! /* Normal exit from the bgwriter is here */
! proc_exit(0); /* done */
! }
!
! if (!IsRecoveryProcessingMode())
! {
! elog(DEBUG2, "bgwriter changing from recovery to normal mode");
!
! InitXLOGAccess();
! BgWriterRecoveryMode = false;
!
! /*
! * Start time-driven events from now
! */
! last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
!
! /*
! * Notice that we do *not* act on a checkpoint_requested
! * state at this point. We have changed mode, so we wish to
! * perform a checkpoint not a restartpoint.
! */
! continue;
! }
!
! if (checkpoint_requested)
! {
! XLogRecPtr ReadPtr;
! CheckPoint restartPoint;
!
! checkpoint_requested = false;
!
! /*
! * Initialize bgwriter-private variables used during checkpoint.
! */
! ckpt_active = true;
! ckpt_start_time = (pg_time_t) time(NULL);
! ckpt_cached_elapsed = 0;
!
! /*
! * Get the requested values from shared memory that the
! * Startup process has put there for us.
! */
! SpinLockAcquire(&BgWriterShmem->ckpt_lck);
! ReadPtr = BgWriterShmem->ReadPtr;
! memcpy(&restartPoint, &BgWriterShmem->restartPoint, sizeof(CheckPoint));
! SpinLockRelease(&BgWriterShmem->ckpt_lck);
!
! /* Use smoothed writes, until interrupted if ever */
! CreateRestartPoint(ReadPtr, &restartPoint, 0);
!
! /*
! * After any checkpoint, close all smgr files. This is so we
! * won't hang onto smgr references to deleted files indefinitely.
! */
! smgrcloseall();
!
! ckpt_active = false;
! checkpoint_requested = false;
! }
! else
! {
! /* Clean buffers dirtied by recovery */
! BgBufferSync();
!
! /* Nap for the configured time. */
! BgWriterNap();
! }
}
! else /* Normal processing */
{
! bool do_checkpoint = false;
! int flags = 0;
! pg_time_t now;
! int elapsed_secs;
/*
! * Process any requests or signals received recently.
*/
! AbsorbFsyncRequests();
! if (checkpoint_requested)
! {
! checkpoint_requested = false;
! do_checkpoint = true;
! BgWriterStats.m_requested_checkpoints++;
! }
! if (shutdown_requested)
! {
! /*
! * From here on, elog(ERROR) should end with exit(1), not send
! * control back to the sigsetjmp block above
! */
! ExitOnAnyError = true;
! /* Close down the database */
! ShutdownXLOG(0, 0);
! /* Normal exit from the bgwriter is here */
! proc_exit(0); /* done */
! }
/*
! * Force a checkpoint if too much time has elapsed since the last one.
! * Note that we count a timed checkpoint in stats only when this
! * occurs without an external request, but we set the CAUSE_TIME flag
! * bit even if there is also an external request.
*/
! now = (pg_time_t) time(NULL);
! elapsed_secs = now - last_checkpoint_time;
! if (elapsed_secs >= CheckPointTimeout)
! {
! if (!do_checkpoint)
! BgWriterStats.m_timed_checkpoints++;
! do_checkpoint = true;
! flags |= CHECKPOINT_CAUSE_TIME;
! }
/*
! * Do a checkpoint if requested, otherwise do one cycle of
! * dirty-buffer writing.
*/
! if (do_checkpoint)
! {
! /* use volatile pointer to prevent code rearrangement */
! volatile BgWriterShmemStruct *bgs = BgWriterShmem;
!
! /*
! * Atomically fetch the request flags to figure out what kind of a
! * checkpoint we should perform, and increase the started-counter
! * to acknowledge that we've started a new checkpoint.
! */
! SpinLockAcquire(&bgs->ckpt_lck);
! flags |= bgs->ckpt_flags;
! bgs->ckpt_flags = 0;
! bgs->ckpt_started++;
! SpinLockRelease(&bgs->ckpt_lck);
!
! /*
! * We will warn if (a) too soon since last checkpoint (whatever
! * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
! * since the last checkpoint start. Note in particular that this
! * implementation will not generate warnings caused by
! * CheckPointTimeout < CheckPointWarning.
! */
! if ((flags & CHECKPOINT_CAUSE_XLOG) &&
! elapsed_secs < CheckPointWarning)
! ereport(LOG,
! (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
! elapsed_secs),
! errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!
! /*
! * Initialize bgwriter-private variables used during checkpoint.
! */
! ckpt_active = true;
! ckpt_start_recptr = GetInsertRecPtr();
! ckpt_start_time = now;
! ckpt_cached_elapsed = 0;
!
! /*
! * Do the checkpoint.
! */
! CreateCheckPoint(flags);
!
! /*
! * After any checkpoint, close all smgr files. This is so we
! * won't hang onto smgr references to deleted files indefinitely.
! */
! smgrcloseall();
!
! /*
! * Indicate checkpoint completion to any waiting backends.
! */
! SpinLockAcquire(&bgs->ckpt_lck);
! bgs->ckpt_done = bgs->ckpt_started;
! SpinLockRelease(&bgs->ckpt_lck);
!
! ckpt_active = false;
!
! /*
! * Note we record the checkpoint start time not end time as
! * last_checkpoint_time. This is so that time-driven checkpoints
! * happen at a predictable spacing.
! */
! last_checkpoint_time = now;
! }
! else
! BgBufferSync();
! /* Check for archive_timeout and switch xlog files if necessary. */
! CheckArchiveTimeout();
! /* Nap for the configured time. */
! BgWriterNap();
}
}
}
***************
*** 586,592 ****
(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
break;
pg_usleep(1000000L);
! AbsorbFsyncRequests();
udelay -= 1000000L;
}
--- 682,689 ----
(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
break;
pg_usleep(1000000L);
! if (!IsRecoveryProcessingMode())
! AbsorbFsyncRequests();
udelay -= 1000000L;
}
***************
*** 640,645 ****
--- 737,755 ----
if (!am_bg_writer)
return;
+ /* Perform minimal duties during recovery and skip wait if requested */
+ if (IsRecoveryProcessingMode())
+ {
+ BgBufferSync();
+
+ if (!shutdown_requested &&
+ !checkpoint_requested &&
+ IsCheckpointOnSchedule(progress))
+ BgWriterNap();
+
+ return;
+ }
+
/*
* Perform the usual bgwriter duties and take a nap, unless we're behind
* schedule, in which case we just try to catch up as quickly as possible.
***************
*** 714,729 ****
* However, it's good enough for our purposes, we're only calculating an
* estimate anyway.
*/
! recptr = GetInsertRecPtr();
! elapsed_xlogs =
! (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
! ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
! CheckPointSegments;
!
! if (progress < elapsed_xlogs)
{
! ckpt_cached_elapsed = elapsed_xlogs;
! return false;
}
/*
--- 824,842 ----
* However, it's good enough for our purposes, we're only calculating an
* estimate anyway.
*/
! if (!IsRecoveryProcessingMode())
{
! recptr = GetInsertRecPtr();
! elapsed_xlogs =
! (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
! ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
! CheckPointSegments;
!
! if (progress < elapsed_xlogs)
! {
! ckpt_cached_elapsed = elapsed_xlogs;
! return false;
! }
}
/*
***************
*** 965,970 ****
--- 1078,1164 ----
}
/*
+ * Always runs in Startup process (see xlog.c)
+ */
+ void
+ RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter)
+ {
+ /*
+ * Should we just do it ourselves?
+ */
+ if (!IsPostmasterEnvironment || !sendToBGWriter)
+ {
+ CreateRestartPoint(ReadPtr, restartPoint, CHECKPOINT_IMMEDIATE);
+ return;
+ }
+
+ /*
+ * Push requested values into shared memory, then signal to request restartpoint.
+ */
+ if (BgWriterShmem->bgwriter_pid == 0)
+ elog(LOG, "could not request restartpoint because bgwriter not running");
+
+ #ifdef NOT_USED
+ elog(LOG, "tli = %u nextXidEpoch = %u nextXid = %u nextOid = %u",
+ restartPoint->ThisTimeLineID,
+ restartPoint->nextXidEpoch,
+ restartPoint->nextXid,
+ restartPoint->nextOid);
+ #endif
+
+ SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ BgWriterShmem->ReadPtr = ReadPtr;
+ memcpy(&BgWriterShmem->restartPoint, restartPoint, sizeof(CheckPoint));
+ SpinLockRelease(&BgWriterShmem->ckpt_lck);
+
+ if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
+ elog(LOG, "could not signal for restartpoint: %m");
+ }
+
+ /*
+ * Sends another checkpoint request signal to bgwriter, which causes it
+ * to avoid smoothed writes and continue processing as if it had been
+ * called with CHECKPOINT_IMMEDIATE. This is used at the end of recovery.
+ */
+ void
+ RequestRestartPointCompletion(void)
+ {
+ if (BgWriterShmem->bgwriter_pid != 0 &&
+ kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
+ elog(LOG, "could not signal for restartpoint immediate: %m");
+ }
+
+ XLogRecPtr
+ GetRedoLocationForArchiveCheckpoint(void)
+ {
+ XLogRecPtr redo;
+
+ SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ redo = BgWriterShmem->ReadPtr;
+ SpinLockRelease(&BgWriterShmem->ckpt_lck);
+
+ return redo;
+ }
+
+ /*
+ * Store the information needed for a checkpoint at the end of recovery.
+ * Returns true if bgwriter can perform the checkpoint, or false if bgwriter
+ * is not active or otherwise unable to comply.
+ */
+ bool
+ SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo)
+ {
+ SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ BgWriterShmem->ReadPtr = redo;
+ SpinLockRelease(&BgWriterShmem->ckpt_lck);
+
+ if (BgWriterShmem->bgwriter_pid == 0 || !IsPostmasterEnvironment)
+ return false;
+
+ return true;
+ }
+
+ /*
* ForwardFsyncRequest
* Forward a file-fsync request from a backend to the bgwriter
*
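The ReadPtr handoff in SetRedoLocationForArchiveCheckpoint() and GetRedoLocationForArchiveCheckpoint() above follows a simple pattern: take the lock, copy one small struct, release the lock. The following is only a sketch of that pattern outside the backend. `RecPtr`, `Handoff`, `handoff_set` and `handoff_get` are hypothetical names, and a C11 `atomic_flag` is swapped in for the backend's spinlock (`ckpt_lck`); none of this is the patch's actual code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr as used in this patch (xlogid, xrecoff). */
typedef struct RecPtr
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} RecPtr;

/* Hypothetical stand-in for the relevant part of BgWriterShmemStruct. */
typedef struct Handoff
{
	atomic_flag lck;		/* plays the role of ckpt_lck */
	RecPtr		read_ptr;	/* plays the role of ReadPtr */
} Handoff;

/* Writer side: publish the redo location while holding the lock. */
static void
handoff_set(Handoff *h, RecPtr redo)
{
	while (atomic_flag_test_and_set(&h->lck))
		;					/* spin until the flag was previously clear */
	h->read_ptr = redo;
	atomic_flag_clear(&h->lck);
}

/* Reader side: copy the value out under the lock, then return the copy. */
static RecPtr
handoff_get(Handoff *h)
{
	RecPtr		redo;

	while (atomic_flag_test_and_set(&h->lck))
		;					/* spin until the flag was previously clear */
	redo = h->read_ptr;
	atomic_flag_clear(&h->lck);

	return redo;
}
```

Copying into a local before releasing the lock, as handoff_get() does, means the caller never sees a half-updated pair of fields. The signal that tells the reader a new value is available is a separate step and is not modelled here.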
Index: src/backend/postmaster/postmaster.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/postmaster/postmaster.c,v
retrieving revision 1.565
diff -c -r1.565 postmaster.c
*** src/backend/postmaster/postmaster.c 23 Sep 2008 20:35:38 -0000 1.565
--- src/backend/postmaster/postmaster.c 27 Oct 2008 18:32:03 -0000
***************
*** 230,237 ****
* We use a simple state machine to control startup, shutdown, and
* crash recovery (which is rather like shutdown followed by startup).
*
! * Normal child backends can only be launched when we are in PM_RUN state.
! * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
* In other states we handle connection requests by launching "dead_end"
* child processes, which will simply send the client an error message and
* quit. (We track these in the BackendList so that we can know when they
--- 230,239 ----
* We use a simple state machine to control startup, shutdown, and
* crash recovery (which is rather like shutdown followed by startup).
*
! * Normal child backends can only be launched when we are in PM_RUN or
! * PM_RECOVERY state. Any transaction started in PM_RECOVERY state will
! * be read-only for the whole of its life. (We also allow launch of normal
! * child backends in PM_WAIT_BACKUP state, but only for superusers.)
* In other states we handle connection requests by launching "dead_end"
* child processes, which will simply send the client an error message and
* quit. (We track these in the BackendList so that we can know when they
***************
*** 254,259 ****
--- 256,266 ----
{
PM_INIT, /* postmaster starting */
PM_STARTUP, /* waiting for startup subprocess */
+ PM_RECOVERY, /* consistent recovery mode; state only
+ * entered for archive and streaming recovery,
+ * and only after the point where
+ * all data is in a consistent state.
+ */
PM_RUN, /* normal "database is alive" state */
PM_WAIT_BACKUP, /* waiting for online backup mode to end */
PM_WAIT_BACKENDS, /* waiting for live backends to exit */
***************
*** 1302,1308 ****
* state that prevents it, start one. It doesn't matter if this
* fails, we'll just try again later.
*/
! if (BgWriterPID == 0 && pmState == PM_RUN)
BgWriterPID = StartBackgroundWriter();
/*
--- 1309,1315 ----
* state that prevents it, start one. It doesn't matter if this
* fails, we'll just try again later.
*/
! if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
BgWriterPID = StartBackgroundWriter();
/*
***************
*** 1651,1661 ****
(errcode(ERRCODE_CANNOT_CONNECT_NOW),
errmsg("the database system is shutting down")));
break;
- case CAC_RECOVERY:
- ereport(FATAL,
- (errcode(ERRCODE_CANNOT_CONNECT_NOW),
- errmsg("the database system is in recovery mode")));
- break;
case CAC_TOOMANY:
ereport(FATAL,
(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
--- 1658,1663 ----
***************
*** 1664,1669 ****
--- 1666,1672 ----
case CAC_WAITBACKUP:
/* OK for now, will check in InitPostgres */
break;
+ case CAC_RECOVERY:
case CAC_OK:
break;
}
***************
*** 2115,2122 ****
*/
if (pid == StartupPID)
{
StartupPID = 0;
! Assert(pmState == PM_STARTUP);
/* FATAL exit of startup is treated as catastrophic */
if (!EXIT_STATUS_0(exitstatus))
--- 2118,2127 ----
*/
if (pid == StartupPID)
{
+ bool leavingRecovery = (pmState == PM_RECOVERY);
+
StartupPID = 0;
! Assert(pmState == PM_STARTUP || pmState == PM_RECOVERY);
/* FATAL exit of startup is treated as catastrophic */
if (!EXIT_STATUS_0(exitstatus))
***************
*** 2124,2130 ****
LogChildExit(LOG, _("startup process"),
pid, exitstatus);
ereport(LOG,
! (errmsg("aborting startup due to startup process failure")));
ExitPostmaster(1);
}
--- 2129,2135 ----
LogChildExit(LOG, _("startup process"),
pid, exitstatus);
ereport(LOG,
! (errmsg("aborting startup due to startup process failure")));
ExitPostmaster(1);
}
***************
*** 2157,2166 ****
load_role();
/*
! * Crank up the background writer. It doesn't matter if this
! * fails, we'll just try again later.
*/
! Assert(BgWriterPID == 0);
BgWriterPID = StartBackgroundWriter();
/*
--- 2162,2171 ----
load_role();
/*
! * Check whether we need to start background writer, if not
! * already running.
*/
! if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
/*
***************
*** 2177,2184 ****
PgStatPID = pgstat_start();
/* at this point we are really open for business */
! ereport(LOG,
! (errmsg("database system is ready to accept connections")));
continue;
}
--- 2182,2193 ----
PgStatPID = pgstat_start();
/* at this point we are really open for business */
! if (leavingRecovery)
! ereport(LOG,
! (errmsg("database can now be accessed with read and write transactions")));
! else
! ereport(LOG,
! (errmsg("database system is ready to accept connections")));
continue;
}
***************
*** 2898,2904 ****
bn->pid = pid;
bn->cancel_key = MyCancelKey;
bn->is_autovacuum = false;
! bn->dead_end = (port->canAcceptConnections != CAC_OK &&
port->canAcceptConnections != CAC_WAITBACKUP);
DLAddHead(BackendList, DLNewElem(bn));
#ifdef EXEC_BACKEND
--- 2907,2914 ----
bn->pid = pid;
bn->cancel_key = MyCancelKey;
bn->is_autovacuum = false;
! bn->dead_end = (!(port->canAcceptConnections == CAC_RECOVERY ||
! port->canAcceptConnections == CAC_OK) &&
port->canAcceptConnections != CAC_WAITBACKUP);
DLAddHead(BackendList, DLNewElem(bn));
#ifdef EXEC_BACKEND
***************
*** 3845,3850 ****
--- 3855,3907 ----
PG_SETMASK(&BlockSig);
+ if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START))
+ {
+ Assert(pmState == PM_STARTUP);
+
+ /*
+ * Go to shutdown mode if a shutdown request was pending.
+ */
+ if (Shutdown > NoShutdown)
+ {
+ pmState = PM_WAIT_BACKENDS;
+ /* PostmasterStateMachine logic does the rest */
+ }
+ else
+ {
+ /*
+ * Startup process has entered recovery
+ */
+ pmState = PM_RECOVERY;
+
+ /*
+ * Load the flat authorization file into postmaster's cache. The
+ * startup process won't have recomputed this from the database yet,
+ * so it may change following recovery.
+ */
+ load_role();
+
+ /*
+ * Crank up the background writer. It doesn't matter if this
+ * fails, we'll just try again later.
+ */
+ Assert(BgWriterPID == 0);
+ BgWriterPID = StartBackgroundWriter();
+
+ /*
+ * Likewise, start other special children as needed.
+ */
+ Assert(PgStatPID == 0);
+ PgStatPID = pgstat_start();
+
+ /* We can now accept read-only connections */
+ ereport(LOG,
+ (errmsg("database system is ready to accept connections")));
+ ereport(LOG,
+ (errmsg("database can now be accessed with read only transactions")));
+ }
+ }
+
if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE))
{
/*
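The patched dead_end test in the hunk at line 2907 above is written as a negated OR followed by an AND. As a reading aid only, the sketch below checks that it is equivalent to "not one of the three accepted states". The enum is a local copy made for illustration, and `is_dead_end` and `is_dead_end_as_patched` are hypothetical helper names that do not appear in the patch.

```c
#include <stdbool.h>

/* Local copy of the connection-acceptance codes, for illustration only. */
typedef enum CAC_state
{
	CAC_OK,
	CAC_STARTUP,
	CAC_SHUTDOWN,
	CAC_RECOVERY,
	CAC_TOOMANY,
	CAC_WAITBACKUP
} CAC_state;

/* Flattened form: a backend is dead_end unless its state is one of three. */
static bool
is_dead_end(CAC_state cac)
{
	return cac != CAC_OK &&
		cac != CAC_RECOVERY &&
		cac != CAC_WAITBACKUP;
}

/* The expression in the same shape as the patched postmaster.c line. */
static bool
is_dead_end_as_patched(CAC_state cac)
{
	return (!(cac == CAC_RECOVERY ||
			  cac == CAC_OK) &&
			cac != CAC_WAITBACKUP);
}
```

Under this reading, a connection that arrives in PM_RECOVERY is tracked as a normal backend rather than a dead_end child, which matches the new CAC_RECOVERY case falling through to CAC_OK in the earlier hunk.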
Index: src/backend/storage/buffer/README
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/buffer/README,v
retrieving revision 1.14
diff -c -r1.14 README
*** src/backend/storage/buffer/README 21 Mar 2008 13:23:28 -0000 1.14
--- src/backend/storage/buffer/README 27 Oct 2008 18:32:03 -0000
***************
*** 264,266 ****
--- 264,275 ----
This ensures that the page image transferred to disk is reasonably consistent.
We might miss a hint-bit update or two but that isn't a problem, for the same
reasons mentioned under buffer access rules.
+
+ As of 8.4, the background writer starts during recovery mode when there is
+ some form of potentially extended recovery to perform. It performs the
+ same service as in normal processing, except that the checkpoints it
+ writes are technically restartpoints. Flushing outstanding WAL for dirty
+ buffers is also skipped, though there shouldn't ever be new WAL entries
+ at that time in any case. We could choose to start the background writer
+ immediately, but we hold off until we can prove the database is in a
+ consistent state so that the postmaster has a single, clean state change.
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.239
diff -c -r1.239 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c 20 Oct 2008 21:11:15 -0000 1.239
--- src/backend/storage/buffer/bufmgr.c 27 Oct 2008 18:32:03 -0000
***************
*** 70,76 ****
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
!
static Buffer ReadBuffer_relcache(Relation reln, ForkNumber forkNum,
BlockNumber blockNum, bool zeroPage, BufferAccessStrategy strategy);
--- 70,78 ----
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
! static long CleanupWaitSecs = 0;
! static int CleanupWaitUSecs = 0;
! static bool CleanupWaitStats = false;
static Buffer ReadBuffer_relcache(Relation reln, ForkNumber forkNum,
BlockNumber blockNum, bool zeroPage, BufferAccessStrategy strategy);
***************
*** 2341,2346 ****
--- 2343,2389 ----
return LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
}
+ void
+ StartCleanupDelayStats(void)
+ {
+ CleanupWaitSecs = 0;
+ CleanupWaitUSecs = 0;
+ CleanupWaitStats = true;
+ }
+
+ void
+ EndCleanupDelayStats(void)
+ {
+ CleanupWaitStats = false;
+ }
+
+ /*
+ * Called by Startup process whenever we request restartpoint
+ */
+ void
+ ReportCleanupDelayStats(void)
+ {
+ elog(trace_recovery(DEBUG2), "cleanup wait total=%ld.%03d s",
+ CleanupWaitSecs, CleanupWaitUSecs / 1000);
+ }
+
+ static void
+ CleanupDelayStats(TimestampTz start_ts, TimestampTz end_ts)
+ {
+ long wait_secs;
+ int wait_usecs;
+
+ TimestampDifference(start_ts, end_ts, &wait_secs, &wait_usecs);
+
+ CleanupWaitSecs += wait_secs;
+ CleanupWaitUSecs += wait_usecs;
+ if (CleanupWaitUSecs > 999999)
+ {
+ CleanupWaitSecs += 1;
+ CleanupWaitUSecs -= 1000000;
+ }
+ }
+
/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
***************
*** 2384,2389 ****
--- 2427,2434 ----
for (;;)
{
+ TimestampTz start_ts = 0;
+
/* Try to acquire lock */
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBufHdr(bufHdr);
***************
*** 2406,2414 ****
--- 2451,2464 ----
PinCountWaitBuf = bufHdr;
UnlockBufHdr(bufHdr);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ if (CleanupWaitStats)
+ start_ts = GetCurrentTimestamp();
/* Wait to be signaled by UnpinBuffer() */
ProcWaitForSignal();
PinCountWaitBuf = NULL;
+ if (CleanupWaitStats)
+ CleanupDelayStats(start_ts, GetCurrentTimestamp());
+
/* Loop back and try again */
}
}
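CleanupDelayStats() above keeps its running total as a seconds/microseconds pair and carries into the seconds field when the microsecond part reaches a full second. The sketch below isolates that arithmetic. `WaitTotal`, `wait_total_add` and `wait_total_millis` are hypothetical names, and the code assumes each interval arrives already normalized (microseconds in 0..999999), which is how TimestampDifference() is understood to report it.

```c
#include <assert.h>

/* Hypothetical stand-in for the CleanupWaitSecs/CleanupWaitUSecs pair. */
typedef struct WaitTotal
{
	long		secs;		/* whole seconds accumulated so far */
	int			usecs;		/* remaining microseconds, kept in 0..999999 */
} WaitTotal;

/*
 * Add one wait interval to the running total, carrying into secs when
 * the microsecond part reaches a full second.
 */
static void
wait_total_add(WaitTotal *total, long wait_secs, int wait_usecs)
{
	/* assumption: inputs are normalized by the caller */
	assert(wait_secs >= 0);
	assert(wait_usecs >= 0 && wait_usecs < 1000000);

	total->secs += wait_secs;
	total->usecs += wait_usecs;
	if (total->usecs > 999999)
	{
		total->secs += 1;
		total->usecs -= 1000000;
	}
}

/* Millisecond part, as printed after the decimal point by "%ld.%03d". */
static int
wait_total_millis(const WaitTotal *total)
{
	return total->usecs / 1000;
}
```

Because both the stored value and each new interval stay below one million microseconds, a single carry per addition is enough. For example, adding 1.6 s and then 0.7 s leaves secs = 2 and usecs = 300000.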
Index: src/backend/storage/freespace/freespace.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/freespace/freespace.c,v
retrieving revision 1.64
diff -c -r1.64 freespace.c
*** src/backend/storage/freespace/freespace.c 1 Oct 2008 14:59:23 -0000 1.64
--- src/backend/storage/freespace/freespace.c 27 Oct 2008 18:32:03 -0000
***************
*** 779,785 ****
* replay of the smgr truncation record to remove completely unused
* pages.
*/
! buf = XLogReadBufferWithFork(xlrec->node, FSM_FORKNUM, fsmblk, false);
if (BufferIsValid(buf))
{
fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
--- 779,786 ----
* replay of the smgr truncation record to remove completely unused
* pages.
*/
! buf = XLogReadBufferWithFork(xlrec->node, FSM_FORKNUM, fsmblk,
! false, BUFFER_LOCK_CLEANUP);
if (BufferIsValid(buf))
{
fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
Index: src/backend/storage/ipc/procarray.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/ipc/procarray.c,v
retrieving revision 1.46
diff -c -r1.46 procarray.c
*** src/backend/storage/ipc/procarray.c 4 Aug 2008 18:03:46 -0000 1.46
--- src/backend/storage/ipc/procarray.c 27 Oct 2008 18:32:03 -0000
***************
*** 17,22 ****
--- 17,29 ----
* as are the myProcLocks lists. They can be distinguished from regular
* backend PGPROCs at need by checking for pid == 0.
*
+ * The process array now also includes PGPROC structures representing
+ * transactions being recovered. The xid and subxids fields of these are valid,
+ * though that is all. They can also be distinguished from regular backend
+ * PGPROCs at need by checking for pid == 0. The proc array also has an
+ * additional array of UnobservedXids representing transactions that are
+ * known to be running on the master but for which we do not yet know the
+ * slotId, so cannot be assigned to the correct recovery proc.
*
* Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
***************
*** 33,56 ****
#include "access/subtrans.h"
#include "access/transam.h"
! #include "access/xact.h"
#include "access/twophase.h"
#include "miscadmin.h"
#include "storage/procarray.h"
#include "utils/snapmgr.h"
/* Our shared memory area */
typedef struct ProcArrayStruct
{
int numProcs; /* number of valid procs entries */
! int maxProcs; /* allocated size of procs array */
/*
* We declare procs[] as 1 entry because C wants a fixed-size array, but
* actually it is maxProcs entries long.
*/
PGPROC *procs[1]; /* VARIABLE LENGTH ARRAY */
} ProcArrayStruct;
static ProcArrayStruct *procArray;
--- 40,79 ----
#include "access/subtrans.h"
#include "access/transam.h"
! #include "access/xlog.h"
#include "access/twophase.h"
#include "miscadmin.h"
+ #include "storage/proc.h"
#include "storage/procarray.h"
#include "utils/snapmgr.h"
+ static RunningXactsData CurrentRunningXactsData;
+
+ /* Handy constant for an invalid xlog recptr */
+ static const XLogRecPtr InvalidXLogRecPtr = {0, 0};
+
+ void ProcArrayDisplay(int trace_level);
+
/* Our shared memory area */
typedef struct ProcArrayStruct
{
int numProcs; /* number of valid procs entries */
! int maxProcs; /* allocated size of total procs array */
!
! int maxRecoveryProcs; /* number of allocated recovery procs */
!
! int numUnobservedXids; /* number of valid unobserved xids */
! int maxUnobservedXids; /* allocated size of unobserved array */
! bool overflowUnobservedXids; /* array has overflowed */
/*
* We declare procs[] as 1 entry because C wants a fixed-size array, but
* actually it is maxProcs entries long.
*/
PGPROC *procs[1]; /* VARIABLE LENGTH ARRAY */
+
+ /* ARRAY OF UNOBSERVED TRANSACTION XIDs FOLLOWS */
} ProcArrayStruct;
static ProcArrayStruct *procArray;
***************
*** 100,107 ****
Size size;
size = offsetof(ProcArrayStruct, procs);
! size = add_size(size, mul_size(sizeof(PGPROC *),
! add_size(MaxBackends, max_prepared_xacts)));
return size;
}
--- 123,141 ----
Size size;
size = offsetof(ProcArrayStruct, procs);
!
! /* Normal processing */
! /* MyProc slots */
! size = add_size(size, mul_size(sizeof(PGPROC *), MaxBackends));
! size = add_size(size, mul_size(sizeof(PGPROC *), max_prepared_xacts));
!
! /* Recovery processing */
!
! /* Recovery Procs */
! size = add_size(size, mul_size(sizeof(PGPROC *), MaxBackends));
! /* UnobservedXids */
! size = add_size(size, mul_size(sizeof(TransactionId), MaxBackends));
! size = add_size(size, mul_size(sizeof(TransactionId), MaxBackends));
return size;
}
***************
*** 123,130 ****
--- 157,197 ----
/*
* We're the first - initialize.
*/
+ /* Normal processing */
procArray->numProcs = 0;
procArray->maxProcs = MaxBackends + max_prepared_xacts;
+
+ /* Recovery processing */
+ procArray->maxRecoveryProcs = MaxBackends;
+ procArray->maxProcs += procArray->maxRecoveryProcs;
+
+ procArray->maxUnobservedXids = 2 * MaxBackends;
+ procArray->numUnobservedXids = 0;
+ procArray->overflowUnobservedXids = false;
+
+ if (!IsUnderPostmaster)
+ {
+ int i;
+
+ /*
+ * Create and add the Procs for recovery emulation.
+ *
+ * We do this now, so that we can identify which Recovery Proc
+ * goes with each normal backend. Normal procs were allocated
+ * first so we can use the slotId of the *proc* to look up
+ * the Recovery Proc in the *procarray*. Recovery Procs never
+ * move around in the procarray, whereas normal procs do.
+ * e.g. Proc with slotId=7 is always associated with procarray[7]
+ * for recovery processing. see also
+ */
+ for (i = 0; i < procArray->maxRecoveryProcs; i++)
+ {
+ PGPROC *RecoveryProc = InitRecoveryProcess();
+
+ ProcArrayAdd(RecoveryProc);
+ }
+ elog(DEBUG3, "Added %d Recovery Procs", i);
+ }
}
}
***************
*** 213,218 ****
--- 280,332 ----
elog(LOG, "failed to find proc %p in ProcArray", proc);
}
+ /*
+ * ProcArrayStartRecoveryTransaction
+ *
+ * Update Recovery Proc to show transaction has started. There is no
+ * locking here. It is either handled by caller, or potentially
+ * ignored (see comments for GetNewTransactionId()).
+ *
+ * In recovery we supply an LSN also, to ensure we can tell which of
+ * several inputs is the latest information on the state of the proc.
+ *
+ * There is no ProcArrayStartNormalTransaction, that is handled by
+ * GetNewTransactionId in varsup.c
+ */
+ void
+ ProcArrayStartRecoveryTransaction(PGPROC *proc, TransactionId xid, XLogRecPtr lsn, bool isSubXact)
+ {
+ elog(trace_recovery(DEBUG4),
+ "start recovery xid = %u lsn = %X/%X %s",
+ xid, lsn.xlogid, lsn.xrecoff, (isSubXact ? "(SUB)" : ""));
+ /*
+ * Use volatile pointer to prevent code rearrangement; other backends
+ * could be examining my subxids info concurrently, and we don't want
+ * them to see an invalid intermediate state, such as incrementing
+ * nxids before filling the array entry. Note we are assuming that
+ * TransactionId and int fetch/store are atomic.
+ */
+ {
+ volatile PGPROC *myproc = proc;
+
+ proc->lsn = lsn;
+
+ if (!isSubXact)
+ myproc->xid = xid;
+ else
+ {
+ int nxids = myproc->subxids.nxids;
+
+ if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
+ {
+ myproc->subxids.xids[nxids] = xid;
+ myproc->subxids.nxids = nxids + 1;
+ }
+ else
+ myproc->subxids.overflowed = true;
+ }
+ }
+ }
/*
* ProcArrayEndTransaction -- mark a transaction as no longer running
***************
*** 220,226 ****
* This is used interchangeably for commit and abort cases. The transaction
* commit/abort must already be reported to WAL and pg_clog.
*
! * proc is currently always MyProc, but we pass it explicitly for flexibility.
* latestXid is the latest Xid among the transaction's main XID and
* subtransactions, or InvalidTransactionId if it has no XID. (We must ask
* the caller to pass latestXid, instead of computing it from the PGPROC's
--- 334,342 ----
* This is used interchangeably for commit and abort cases. The transaction
* commit/abort must already be reported to WAL and pg_clog.
*
! * In normal running proc is currently always MyProc, but in recovery we pass
! * one of the recovery procs.
! *
* latestXid is the latest Xid among the transaction's main XID and
* subtransactions, or InvalidTransactionId if it has no XID. (We must ask
* the caller to pass latestXid, instead of computing it from the PGPROC's
***************
*** 301,306 ****
--- 417,423 ----
proc->xid = InvalidTransactionId;
proc->lxid = InvalidLocalTransactionId;
proc->xmin = InvalidTransactionId;
+ proc->lsn = InvalidXLogRecPtr;
/* redundant, but just in case */
proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
***************
*** 311,316 ****
--- 428,575 ----
proc->subxids.overflowed = false;
}
+ /*
+ * ProcArrayClearRecoveryTransactions
+ *
+ * Called during recovery when we see a Shutdown checkpoint or EndRecovery
+ * record, or at the end of recovery processing.
+ */
+ void
+ ProcArrayClearRecoveryTransactions(void)
+ {
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
+ /*
+ * Reset Recovery Procs
+ */
+ for (index = 0; index < arrayP->maxRecoveryProcs; index++)
+ {
+ PGPROC *RecoveryProc = arrayP->procs[index];
+
+ ProcArrayClearTransaction(RecoveryProc);
+ }
+
+ /*
+ * Clear the UnobservedXids also
+ */
+ UnobservedTransactionsClearXids();
+
+ LWLockRelease(ProcArrayLock);
+ }
+
+ bool
+ XidInRecoveryProcs(TransactionId xid)
+ {
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ for (index = 0; index < arrayP->maxRecoveryProcs; index++)
+ {
+ PGPROC *RecoveryProc = arrayP->procs[index];
+
+ if (RecoveryProc->xid == xid)
+ return true;
+ }
+ return false;
+ }
+
+ void
+ ProcArrayDisplay(int trace_level)
+ {
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
+ for (index = 0; index < arrayP->maxRecoveryProcs; index++)
+ {
+ PGPROC *RecoveryProc = arrayP->procs[index];
+
+ if (TransactionIdIsValid(RecoveryProc->xid))
+ elog(trace_level,
+ "proc %d proc->xid %u proc->lsn %X/%X", index, RecoveryProc->xid,
+ RecoveryProc->lsn.xlogid, RecoveryProc->lsn.xrecoff);
+ }
+
+ UnobservedTransactionsDisplay(trace_level);
+
+ LWLockRelease(ProcArrayLock);
+ }
+
+ /*
+ * Use the data about running transactions on master to either create the
+ * initial state of the Recovery Procs, or maintain correctness of their
+ * state. This is almost the opposite of GetSnapshotData().
+ *
+ * Only used during recovery. Notice the signature is very similar to a
+ * _redo function.
+ */
+ void
+ ProcArrayUpdateRecoveryTransactions(XLogRecPtr lsn, xl_xact_running_xacts *xlrec)
+ {
+ PGPROC *proc;
+ int xid_index;
+ TransactionId *subxip = (TransactionId *) &(xlrec->xrun[xlrec->xcnt]);
+
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
+ ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+
+ for (xid_index = 0; xid_index < xlrec->xcnt; xid_index++)
+ {
+ RunningXact *rxact = (RunningXact *) xlrec->xrun;
+
+ proc = SlotIdGetRecoveryProc(rxact[xid_index].slotId);
+
+ elog(trace_recovery(DEBUG2),
+ "running xact proc->lsn %X/%X lsn %X/%X proc->xid %u xid %u",
+ proc->lsn.xlogid, proc->lsn.xrecoff,
+ lsn.xlogid, lsn.xrecoff, proc->xid, rxact[xid_index].xid);
+ /*
+ * If our state information is later for this proc, then
+ * overwrite it. It's possible for a commit and possibly
+ * a new transaction record to have arrived in WAL in between
+ * us doing GetRunningTransactionData() and grabbing the
+ * WALInsertLock, so we musn't assume we know best always.
+ */
+ if (XLByteLT(proc->lsn, lsn))
+ {
+ proc->lsn = lsn;
+ proc->xid = rxact[xid_index].xid;
+ /* proc-> pid stays 0 for Recovery Procs */
+ /* proc->slotId should never be touched */
+ proc->databaseId = rxact[xid_index].databaseId;
+ proc->roleId = rxact[xid_index].roleId;
+ proc->vacuumFlags = rxact[xid_index].vacuumFlags;
+
+ proc->subxids.nxids = rxact[xid_index].nsubxids;
+ proc->subxids.overflowed = rxact[xid_index].overflowed;
+
+ memcpy(proc->subxids.xids, subxip,
+ rxact[xid_index].nsubxids * sizeof(TransactionId));
+ }
+ }
+
+ /*
+ * We could look for Recovery Procs that weren't mentioned, but that's
+ * a lot of work for little benefit. We opt for a simple and cheap
+ * alternative: left prune the UnobservedXids array up to latestRunningXid.
+ * This is correct because at the time we take this snapshot, all
+ * completed transactions prior to latestRunningXid will be marked in
+ * WAL. So we won't ever see a WAL record for them again.
+ *
+ * We can't clear the array completely because race conditions allow
+ * things to slip through sometimes.
+ */
+ UnobservedTransactionsPruneXids(xlrec->latestRunningXid);
+
+ LWLockRelease(ProcArrayLock);
+
+ ProcArrayDisplay(trace_recovery(DEBUG1));
+ }
/*
* TransactionIdIsInProgress -- is given transaction running in some backend
***************
*** 655,661 ****
* but since PGPROC has only a limited cache area for subxact XIDs, full
* information may not be available. If we find any overflowed subxid arrays,
* we have to mark the snapshot's subxid data as overflowed, and extra work
! * will need to be done to determine what's running (see XidInMVCCSnapshot()
* in tqual.c).
*
* We also update the following backend-global variables:
--- 914,920 ----
* but since PGPROC has only a limited cache area for subxact XIDs, full
* information may not be available. If we find any overflowed subxid arrays,
* we have to mark the snapshot's subxid data as overflowed, and extra work
! * may need to be done to determine what's running (see XidInMVCCSnapshot()
* in tqual.c).
*
* We also update the following backend-global variables:
***************
*** 680,685 ****
--- 939,945 ----
int index;
int count = 0;
int subcount = 0;
+ bool suboverflowed = false;
Assert(snapshot != NULL);
***************
*** 706,725 ****
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
Assert(snapshot->subxip == NULL);
snapshot->subxip = (TransactionId *)
! malloc(arrayP->maxProcs * PGPROC_MAX_CACHED_SUBXIDS * sizeof(TransactionId));
if (snapshot->subxip == NULL)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
}
/*
* It is sufficient to get shared lock on ProcArrayLock, even if we are
* going to set MyProc->xmin.
*/
LWLockAcquire(ProcArrayLock, LW_SHARED);
/* xmax is always latestCompletedXid + 1 */
xmax = ShmemVariableCache->latestCompletedXid;
Assert(TransactionIdIsNormal(xmax));
--- 966,1007 ----
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
Assert(snapshot->subxip == NULL);
+ #define maxNumSubXids (arrayP->maxProcs * PGPROC_MAX_CACHED_SUBXIDS)
snapshot->subxip = (TransactionId *)
! malloc(maxNumSubXids * sizeof(TransactionId));
if (snapshot->subxip == NULL)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
}
+ /* XXX we expect to be able to undef this after testing */
+ #define UNOBSERVED_XIDS_CAN_OVERFLOW
+
+ #ifdef UNOBSERVED_XIDS_CAN_OVERFLOW
+ retry:
+ #endif
+
/*
* It is sufficient to get shared lock on ProcArrayLock, even if we are
* going to set MyProc->xmin.
*/
LWLockAcquire(ProcArrayLock, LW_SHARED);
+ #ifdef UNOBSERVED_XIDS_CAN_OVERFLOW
+ /*
+ * If UnobservedXids has overflowed then we cannot make a valid snapshot.
+ * This will only ever happen in recovery processing and only when
+ */
+ if (arrayP->overflowUnobservedXids)
+ {
+ LWLockRelease(ProcArrayLock);
+ elog(WARNING, "unable to obtain valid snapshot: unobserved xids overflow");
+ pg_usleep(10000L);
+ goto retry;
+ }
+ #endif
+
/* xmax is always latestCompletedXid + 1 */
xmax = ShmemVariableCache->latestCompletedXid;
Assert(TransactionIdIsNormal(xmax));
***************
*** 771,779 ****
}
/*
! * Save subtransaction XIDs if possible (if we've already overflowed,
! * there's no point). Note that the subxact XIDs must be later than
! * their parent, so no need to check them against xmin. We could
* filter against xmax, but it seems better not to do that much work
* while holding the ProcArrayLock.
*
--- 1053,1060 ----
}
/*
! * Save subtransaction XIDs. Note that the subxact XIDs must be later
! * than their parent, so no need to check them against xmin. We could
* filter against xmax, but it seems better not to do that much work
* while holding the ProcArrayLock.
*
***************
*** 784,806 ****
*
* Again, our own XIDs are not included in the snapshot.
*/
! if (subcount >= 0 && proc != MyProc)
! {
! if (proc->subxids.overflowed)
! subcount = -1; /* overflowed */
! else
{
int nxids = proc->subxids.nxids;
if (nxids > 0)
{
memcpy(snapshot->subxip + subcount,
(void *) proc->subxids.xids,
nxids * sizeof(TransactionId));
subcount += nxids;
}
}
}
}
if (!TransactionIdIsValid(MyProc->xmin))
--- 1065,1157 ----
*
* Again, our own XIDs are not included in the snapshot.
*/
! if (proc != MyProc)
{
int nxids = proc->subxids.nxids;
if (nxids > 0)
{
+ if (proc->subxids.overflowed)
+ suboverflowed = true;
+
memcpy(snapshot->subxip + subcount,
(void *) proc->subxids.xids,
nxids * sizeof(TransactionId));
subcount += nxids;
}
+
+ }
+ }
+
+ /*
+ * Also check for unobserved xids. There is no need to test
+ * IsRecoveryProcessingMode() first: the list is always empty once
+ * normal processing begins, so the loop below exits immediately
+ * and costs almost nothing.
+ */
+ for (index = 0; index < arrayP->numUnobservedXids; index++)
+ {
+ volatile TransactionId *UnobservedXids;
+ TransactionId xid;
+
+ UnobservedXids = (TransactionId *) &(arrayP->procs[arrayP->maxProcs]);
+
+ /* Fetch xid just once - see GetNewTransactionId */
+ xid = UnobservedXids[index];
+
+ /*
+ * If there are no more visible xids, we're done. This works
+ * because UnobservedXids is maintained in strict ascending order.
+ */
+ if (!TransactionIdIsNormal(xid) || TransactionIdPrecedes(xid, xmax))
+ break;
+
+ /*
+ * Typically, there will be space in the snapshot. We know that the
+ * unobserved xids are being run by one of the procs marked with
+ * an xid of InvalidTransactionId, so we will have ignored that above,
+ * and the xidcache for that proc will have been empty also.
+ *
+ * We put the unobserved xid anywhere in the snapshot. The xid might
+ * be a top-level or it might be a subtransaction, but it won't
+ * change the answer to XidInMVCCSnapshot() whichever it is. That's
+ * just as well, since we don't know which it is, by definition.
+ */
+ if (count < arrayP->maxProcs)
+ snapshot->xip[count++] = xid;
+ else
+ {
+ /*
+ * If there is no space left in subxid cache then we will be forced
+ * to look in Subtrans to check for subtransactions when we
+ * run XidInMVCCSnapshot(). If we still have unobserved
+ * transactions we know they won't be found in subtrans,
+ * so we have to abort our attempt to make a snapshot.
+ */
+ #ifdef UNOBSERVED_XIDS_CAN_OVERFLOW
+ if (subcount >= maxNumSubXids)
+ {
+ LWLockRelease(ProcArrayLock);
+ elog(WARNING, "unable to obtain valid snapshot: subxid overflow");
+ pg_usleep(10000L);
+ goto retry;
}
+ #endif
+
+ /*
+ * Store unobserved xids in the subxid cache instead.
+ */
+ snapshot->subxip[subcount++] = xid;
}
+
+ /*
+ * We don't really need xmin during recovery, but let's derive
+ * it anyway for consistency. It is possible that an unobserved
+ * xid could be xmin if there is contention between long-lived
+ * transactions.
+ */
+ if (TransactionIdPrecedes(xid, xmin))
+ xmin = xid;
}
if (!TransactionIdIsValid(MyProc->xmin))
***************
*** 824,829 ****
--- 1175,1181 ----
snapshot->xmax = xmax;
snapshot->xcnt = count;
snapshot->subxcnt = subcount;
+ snapshot->suboverflowed = suboverflowed;
snapshot->curcid = GetCurrentCommandId(false);
***************
*** 839,844 ****
--- 1191,1424 ----
}
/*
+ * GetRunningTransactionData -- returns information about running transactions.
+ *
+ * Similar to GetSnapshotData but returning more information. We include
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes. We
+ * include slotId and databaseId for each PGPROC. We also keep track
+ * of which subtransactions go with each PGPROC, information which is lost
+ * when we GetSnapshotData.
+ *
+ * This is never executed when IsRecoveryProcessingMode(), so there is no
+ * need to look at UnobservedXids.
+ *
+ * We don't worry about updating other counters, we want to keep this as
+ * simple as possible and leave GetSnapshotData() as the primary code for
+ * that bookkeeping.
+ */
+ RunningTransactions
+ GetRunningTransactionData(void)
+ {
+ ProcArrayStruct *arrayP = procArray;
+ RunningTransactions CurrentRunningXacts = (RunningTransactions) &CurrentRunningXactsData;
+ RunningXact *rxact;
+ TransactionId *subxip;
+ TransactionId latestRunningXid = InvalidTransactionId;
+ TransactionId prev_latestRunningXid = InvalidTransactionId;
+ TransactionId latestCompletedXid;
+ int numAttempts = 0;
+ int index;
+ int count = 0;
+ int subcount = 0;
+ bool suboverflowed = false;
+
+ /*
+ * Allocating space for maxProcs xids is usually overkill; numProcs would
+ * be sufficient. But it seems better to do the malloc while not holding
+ * the lock, so we can't look at numProcs. Likewise, we allocate much
+ * more subxip storage than is probably needed.
+ *
+ * Should only be allocated for bgwriter, since only ever executed
+ * during checkpoints.
+ */
+ if (CurrentRunningXacts->xrun == NULL)
+ {
+ /*
+ * First call
+ */
+ CurrentRunningXacts->xrun = (RunningXact *)
+ malloc(arrayP->maxProcs * sizeof(RunningXact));
+ if (CurrentRunningXacts->xrun == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+ Assert(CurrentRunningXacts->subxip == NULL);
+ CurrentRunningXacts->subxip = (TransactionId *)
+ malloc(maxNumSubXids * sizeof(TransactionId));
+ if (CurrentRunningXacts->subxip == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+ }
+
+ rxact = CurrentRunningXacts->xrun;
+ subxip = CurrentRunningXacts->subxip;
+
+ /*
+ * Loop until we get a valid snapshot. See exit conditions below.
+ */
+ for (;;)
+ {
+ count = 0;
+ subcount = 0;
+ suboverflowed = false;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ latestCompletedXid = ShmemVariableCache->latestCompletedXid;
+
+ /*
+ * Spin over procArray checking xid, and subxids. Shared lock is enough
+ * because new transactions don't use locks at all, so LW_EXCLUSIVE
+ * wouldn't be enough to prevent them, so don't bother.
+ */
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ volatile PGPROC *proc = arrayP->procs[index];
+ TransactionId xid;
+ int nxids;
+
+ /* Fetch xid just once - see GetNewTransactionId */
+ xid = proc->xid;
+
+ /*
+ * We store all xids, even XIDs >= xmax and our own XID, if any.
+ * But we don't store transactions that don't have a TransactionId
+ * yet because they will not show as running on a standby server.
+ */
+ if (!TransactionIdIsValid(xid))
+ continue;
+
+ rxact[count].xid = xid;
+ rxact[count].slotId = proc->slotId;
+ rxact[count].databaseId = proc->databaseId;
+ rxact[count].roleId = proc->roleId;
+ rxact[count].vacuumFlags = proc->vacuumFlags;
+
+ if (TransactionIdPrecedes(latestRunningXid, xid))
+ latestRunningXid = xid;
+
+ /*
+ * Save subtransaction XIDs.
+ *
+ * The other backend can add more subxids concurrently, but cannot
+ * remove any. Hence it's important to fetch nxids just once. Should
+ * be safe to use memcpy, though. (We needn't worry about missing any
+ * xids added concurrently, because they must postdate xmax.)
+ *
+ * Again, our own XIDs *are* included in the snapshot.
+ */
+ nxids = proc->subxids.nxids;
+
+ if (nxids > 0)
+ {
+ TransactionId *subxids = (TransactionId *) proc->subxids.xids;
+
+ rxact[count].subx_offset = subcount;
+
+ memcpy(subxip + subcount,
+ (void *) proc->subxids.xids,
+ nxids * sizeof(TransactionId));
+ subcount += nxids;
+
+ if (proc->subxids.overflowed)
+ {
+ rxact[count].overflowed = true;
+ suboverflowed = true;
+ }
+ else if (TransactionIdPrecedes(latestRunningXid, subxids[nxids - 1]))
+ latestRunningXid = subxids[nxids - 1];
+ }
+
+ rxact[count].nsubxids = nxids;
+
+ count++;
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ /*
+ * If there are no procs with TransactionIds allocated we need to
+ * find what the last xid assigned was. This takes and releases
+ * XidGenLock, but that shouldn't cause contention in this case.
+ * We could do this as well if the snapshot overflowed, but in
+ * that case we think contention on XidGenLock might be high, so we punt.
+ *
+ * By the time we do this, another proc may have incremented the
+ * nextxid, so we must rescan the procarray to check whether
+ * there are new running transactions or whether the counter is
+ * the same as before. If transactions appear and disappear
+ * faster than we can do this, we're in trouble. So retry at
+ * most MAX_SNAPSHOT_ATTEMPTS (3) times before giving up.
+ *
+ * We do it this way to avoid needing to grab XidGenLock in all
+ * cases, which is hardly ever actually required.
+ */
+ if (count > 0)
+ break;
+ else
+ {
+ #define MAX_SNAPSHOT_ATTEMPTS 3
+ if (numAttempts >= MAX_SNAPSHOT_ATTEMPTS)
+ {
+ latestRunningXid = InvalidTransactionId;
+ break;
+ }
+
+ latestRunningXid = ReadNewTransactionId();
+ TransactionIdRetreat(latestRunningXid);
+
+ if (prev_latestRunningXid == latestRunningXid)
+ break;
+
+ prev_latestRunningXid = latestRunningXid;
+ numAttempts++;
+ }
+ }
+
+ CurrentRunningXacts->xcnt = count;
+ CurrentRunningXacts->subxcnt = subcount;
+ CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
+ if (!suboverflowed)
+ CurrentRunningXacts->latestRunningXid = latestRunningXid;
+ else
+ CurrentRunningXacts->latestRunningXid = InvalidTransactionId;
+
+ #define RUNNING_XACT_DEBUG
+ #ifdef RUNNING_XACT_DEBUG
+ elog(trace_recovery(DEBUG3),
+ "logging running xacts xcnt %d subxcnt %d latestCompletedXid %u latestRunningXid %u",
+ CurrentRunningXacts->xcnt,
+ CurrentRunningXacts->subxcnt,
+ CurrentRunningXacts->latestCompletedXid,
+ CurrentRunningXacts->latestRunningXid);
+
+ for (index = 0; index < CurrentRunningXacts->xcnt; index++)
+ {
+ int j;
+ elog(trace_recovery(DEBUG3),
+ "xid %u pid %d backend %d db %u role %u nsubxids %d offset %d vf %u, overflow %s",
+ CurrentRunningXacts->xrun[index].xid,
+ CurrentRunningXacts->xrun[index].pid,
+ CurrentRunningXacts->xrun[index].slotId,
+ CurrentRunningXacts->xrun[index].databaseId,
+ CurrentRunningXacts->xrun[index].roleId,
+ CurrentRunningXacts->xrun[index].nsubxids,
+ CurrentRunningXacts->xrun[index].subx_offset,
+ CurrentRunningXacts->xrun[index].vacuumFlags,
+ CurrentRunningXacts->xrun[index].overflowed ? "t" : "f");
+ for (j = 0; j < CurrentRunningXacts->xrun[index].nsubxids; j++)
+ elog(trace_recovery(DEBUG3),
+ "subxid offset %d j %d xid %u",
+ CurrentRunningXacts->xrun[index].subx_offset, j,
+ CurrentRunningXacts->subxip[j + CurrentRunningXacts->xrun[index].subx_offset]);
+ }
+ #endif
+
+ return CurrentRunningXacts;
+ }
+
+ /*
* GetTransactionsInCommit -- Get the XIDs of transactions that are committing
*
* Constructs an array of XIDs of transactions that are currently in commit
***************
*** 968,973 ****
--- 1548,1577 ----
}
/*
+ * SlotIdGetRecoveryProc -- get a PGPROC for a given SlotId
+ *
+ * Run during recovery to identify which PGPROC to access.
+ * Throws ERROR if not found, or we pass an invalid value.
+ *
+ * see comments in CreateSharedProcArray()
+ */
+ PGPROC *
+ SlotIdGetRecoveryProc(int slotId)
+ {
+ if (slotId < 0 || slotId >= MaxBackends)
+ elog(ERROR, "invalid slotId %d", slotId);
+
+ Assert(procArray->procs[slotId] != NULL);
+
+ /*
+ * No need to acquire ProcArrayLock to identify proc, we just
+ * use the slotId as an array offset directly, since we assigned
+ * these at start.
+ */
+ return procArray->procs[slotId];
+ }
+
+ /*
* BackendXidGetPid -- get a backend's pid given its XID
*
* Returns 0 if not found or it's a prepared transaction. Note that
***************
*** 1367,1369 ****
--- 1971,2155 ----
}
#endif /* XIDCACHE_DEBUG */
+
+ /*
+ * Must be called with ProcArrayLock held.
+ */
+ void
+ UnobservedTransactionsAddXids(TransactionId firstXid, TransactionId lastXid)
+ {
+ TransactionId ixid = firstXid;
+ int index = procArray->numUnobservedXids;
+ TransactionId *UnobservedXids;
+
+ UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]);
+
+ Assert(TransactionIdPrecedes(firstXid, lastXid));
+
+ /*
+ * UnobservedXids is maintained as an ascending list of xids, with no gaps.
+ * Incoming xids are always higher than previous entries, so we just add
+ * them directly to the end of the array.
+ */
+ while (ixid != lastXid)
+ {
+ /*
+ * check to see if we have space to store more UnobservedXids
+ */
+ if (index >= procArray->maxUnobservedXids)
+ {
+ UnobservedTransactionsDisplay(WARNING);
+ elog(FATAL, "No more entries in UnobservedXids array");
+ /* procArray->overflowUnobservedXids = true; */
+ break;
+ }
+
+ /*
+ * append ixid to UnobservedXids
+ */
+ Assert(!TransactionIdIsValid(UnobservedXids[index]));
+ Assert(index == 0 || TransactionIdPrecedes(UnobservedXids[index - 1], ixid));
+
+ elog(trace_recovery(DEBUG4), "Adding UnobservedXid %u", ixid);
+ UnobservedXids[index++] = ixid;
+
+ TransactionIdAdvance(ixid);
+ }
+
+ procArray->numUnobservedXids = index;
+ }
+
+ /*
+ * Must be called with ProcArrayLock held.
+ */
+ void
+ UnobservedTransactionsRemoveXid(TransactionId xid)
+ {
+ int index;
+ bool found = false;
+ TransactionId *UnobservedXids;
+
+ UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]);
+
+ elog(trace_recovery(DEBUG4), "Remove UnobservedXid = %u", xid);
+
+ /*
+ * If the xid precedes the first entry in the array, it is no longer
+ * tracked here (for example, we've already cleared it), so ignore the
+ * request. Otherwise it is an ERROR if the xid is not present.
+ */
+ if (procArray->numUnobservedXids > 0 &&
+ TransactionIdPrecedes(xid, UnobservedXids[0]))
+ return;
+
+ /*
+ * XXX we could use bsearch, if this has significant overhead.
+ */
+ for (index = 0; index < procArray->numUnobservedXids; index++)
+ {
+ if (!found)
+ {
+ if (UnobservedXids[index] == xid)
+ found = true;
+ }
+ else
+ {
+ UnobservedXids[index - 1] = UnobservedXids[index];
+ }
+ }
+
+ if (found)
+ UnobservedXids[--procArray->numUnobservedXids] = InvalidTransactionId;
+
+ if (!found)
+ {
+ UnobservedTransactionsDisplay(LOG);
+ elog(ERROR, "could not remove unobserved xid = %u", xid);
+ }
+ }
+
+ void
+ UnobservedTransactionsPruneXids(TransactionId limitXid)
+ {
+ int index;
+ int pruneUpToThisIndex = 0;
+ TransactionId *UnobservedXids;
+
+ UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]);
+
+ elog(trace_recovery(DEBUG4), "Prune UnobservedXids up to %u", limitXid);
+
+ for (index = 0; index < procArray->numUnobservedXids; index++)
+ {
+ if (TransactionIdPrecedes(UnobservedXids[index], limitXid))
+ pruneUpToThisIndex = index + 1;
+ else
+ {
+ /*
+ * Anything to delete?
+ */
+ if (pruneUpToThisIndex == 0)
+ return;
+
+ /*
+ * Move unpruned values to start of array
+ */
+ UnobservedXids[index - pruneUpToThisIndex] = UnobservedXids[index];
+ UnobservedXids[index] = InvalidTransactionId;
+ }
+ }
+
+ procArray->numUnobservedXids -= pruneUpToThisIndex;
+ }
+
+ void
+ UnobservedTransactionsClearXids(void)
+ {
+ int index;
+ TransactionId *UnobservedXids;
+
+ elog(trace_recovery(DEBUG4), "Clear UnobservedXids");
+
+ UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]);
+
+ for (index = 0; index < procArray->numUnobservedXids; index++)
+ {
+ UnobservedXids[index] = InvalidTransactionId;
+ }
+
+ procArray->numUnobservedXids = 0;
+ }
+
+ void
+ UnobservedTransactionsDisplay(int trace_level)
+ {
+ #define UNOBSV_XACTS_DEBUG
+ #ifdef UNOBSV_XACTS_DEBUG
+ int index;
+ TransactionId *UnobservedXids;
+
+ UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]);
+
+ for (index = 0; index < procArray->numUnobservedXids; index++)
+ {
+ elog(trace_level, "%d unobserved[%d] = %u ",
+ procArray->numUnobservedXids, index, UnobservedXids[index]);
+ }
+ #endif
+ }
+
+ bool
+ XidInUnobservedTransactions(TransactionId xid)
+ {
+ int index;
+ TransactionId *UnobservedXids;
+
+ UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]);
+
+ for (index = 0; index < procArray->numUnobservedXids; index++)
+ {
+ if (UnobservedXids[index] == xid)
+ return true;
+ }
+ return false;
+ }
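Reviewer note on the "XXX we could use bsearch" comment in UnobservedTransactionsRemoveXid(): because UnobservedXids is kept in strict ascending order, lookup can be done by binary search. The following is only a minimal standalone sketch, not part of the patch. The typedef, xid_precedes(), unobserved_find() and unobserved_remove() are hypothetical stand-ins written for illustration; xid_precedes() assumes the usual circular comparison for normal xids and that all entries lie within 2^31 of each other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

/* Circular "a is older than b" test, valid for normal xids only (assumption). */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/* Binary search; returns the index of xid in xids[0..n-1], or -1 if absent. */
static int
unobserved_find(const TransactionId *xids, int n, TransactionId xid)
{
	int			lo = 0;
	int			hi = n - 1;

	assert(n >= 0);

	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (xids[mid] == xid)
			return mid;
		if (xid_precedes(xids[mid], xid))
			lo = mid + 1;		/* target is in the upper half */
		else
			hi = mid - 1;		/* target is in the lower half */
	}
	return -1;
}

/*
 * Remove xid and close the gap, zeroing the vacated last slot.
 * Returns the new entry count, or -1 if xid was not present.
 */
static int
unobserved_remove(TransactionId *xids, int n, TransactionId xid)
{
	int			pos = unobserved_find(xids, n, xid);

	if (pos < 0)
		return -1;

	memmove(&xids[pos], &xids[pos + 1],
			(size_t) (n - pos - 1) * sizeof(TransactionId));
	xids[n - 1] = 0;			/* 0 plays the role of InvalidTransactionId */
	return n - 1;
}
```

The search itself is O(log n), but closing the gap still moves up to n entries, so this would only help if the linear scan shows up as significant overhead.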
Index: src/backend/storage/lmgr/lock.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/lmgr/lock.c,v
retrieving revision 1.184
diff -c -r1.184 lock.c
*** src/backend/storage/lmgr/lock.c 1 Aug 2008 13:16:09 -0000 1.184
--- src/backend/storage/lmgr/lock.c 27 Oct 2008 18:32:03 -0000
***************
*** 490,495 ****
--- 490,504 ----
if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
elog(ERROR, "unrecognized lock mode: %d", lockmode);
+ if (IsRecoveryProcessingMode() &&
+ locktag->locktag_type == LOCKTAG_OBJECT &&
+ lockmode > AccessShareLock)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot acquire lockmode %s on database objects while recovery is in progress",
+ lockMethodTable->lockModeNames[lockmode]),
+ errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
+
#ifdef LOCK_DEBUG
if (LOCK_DEBUG_ENABLED(locktag))
elog(LOG, "LockAcquire: lock [%u,%u] %s",
Index: src/backend/storage/lmgr/proc.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/lmgr/proc.c,v
retrieving revision 1.201
diff -c -r1.201 proc.c
*** src/backend/storage/lmgr/proc.c 9 Jun 2008 18:23:05 -0000 1.201
--- src/backend/storage/lmgr/proc.c 27 Oct 2008 18:32:03 -0000
***************
*** 103,108 ****
--- 103,110 ----
size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
/* MyProcs, including autovacuum */
size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
+ /* RecoveryProcs, including recovery actions by autovacuum */
+ size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
/* ProcStructLock */
size = add_size(size, sizeof(slock_t));
***************
*** 152,157 ****
--- 154,160 ----
PGPROC *procs;
int i;
bool found;
+ int slotId = 0;
/* Create the ProcGlobal shared structure */
ProcGlobal = (PROC_HDR *)
***************
*** 178,183 ****
--- 181,187 ----
/*
* Pre-create the PGPROC structures and create a semaphore for each.
*/
+
procs = (PGPROC *) ShmemAlloc((MaxConnections) * sizeof(PGPROC));
if (!procs)
ereport(FATAL,
***************
*** 188,193 ****
--- 192,198 ----
{
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = ProcGlobal->freeProcs;
+ procs[i].slotId = slotId++; /* once set, never changed */
ProcGlobal->freeProcs = MAKE_OFFSET(&procs[i]);
}
***************
*** 201,209 ****
--- 206,232 ----
{
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = ProcGlobal->autovacFreeProcs;
+ procs[i].slotId = slotId++; /* once set, never changed */
ProcGlobal->autovacFreeProcs = MAKE_OFFSET(&procs[i]);
}
+ /*
+ * Create enough Recovery Procs so there is a shadow proc for every
+ * normal proc. Recovery procs don't need semaphores.
+ */
+ procs = (PGPROC *) ShmemAlloc((MaxBackends) * sizeof(PGPROC));
+ if (!procs)
+ ereport(FATAL,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of shared memory")));
+ MemSet(procs, 0, MaxBackends * sizeof(PGPROC));
+ for (i = 0; i < MaxBackends; i++)
+ {
+ procs[i].links.next = ProcGlobal->freeProcs;
+ procs[i].slotId = -1;
+ ProcGlobal->freeProcs = MAKE_OFFSET(&procs[i]);
+ }
+
MemSet(AuxiliaryProcs, 0, NUM_AUXILIARY_PROCS * sizeof(PGPROC));
for (i = 0; i < NUM_AUXILIARY_PROCS; i++)
{
***************
*** 278,284 ****
/*
* Initialize all fields of MyProc, except for the semaphore which was
! * prepared for us by InitProcGlobal.
*/
SHMQueueElemInit(&(MyProc->links));
MyProc->waitStatus = STATUS_OK;
--- 301,307 ----
/*
* Initialize all fields of MyProc, except for the semaphore which was
! * prepared for us by InitProcGlobal. Never touch the slotId.
*/
SHMQueueElemInit(&(MyProc->links));
MyProc->waitStatus = STATUS_OK;
***************
*** 322,327 ****
--- 345,432 ----
}
/*
+ * InitRecoveryProcess -- initialize a per-master process data structure
+ * for use when emulating transactions in recovery
+ */
+ PGPROC *
+ InitRecoveryProcess(void)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile PROC_HDR *procglobal = ProcGlobal;
+ SHMEM_OFFSET myOffset;
+ PGPROC *ThisProc = NULL;
+
+ /*
+ * ProcGlobal should be set up already (if we are a backend, we inherit
+ * this by fork() or EXEC_BACKEND mechanism from the postmaster).
+ */
+ if (procglobal == NULL)
+ elog(PANIC, "proc header uninitialized");
+
+ /*
+ * Try to get a proc struct from the free list. If this fails, we must be
+ * out of PGPROC structures (recovery procs don't use semaphores).
+ */
+ SpinLockAcquire(ProcStructLock);
+
+ myOffset = procglobal->freeProcs;
+
+ if (myOffset != INVALID_OFFSET)
+ {
+ ThisProc = (PGPROC *) MAKE_PTR(myOffset);
+ procglobal->freeProcs = ThisProc->links.next;
+ SpinLockRelease(ProcStructLock);
+ }
+ else
+ {
+ /*
+ * Should never reach here if shared memory is allocated correctly.
+ */
+ SpinLockRelease(ProcStructLock);
+ elog(FATAL, "too many procs - could not create recovery proc");
+ }
+
+ /*
+ * xid will be set later as WAL records arrive for this recovery proc
+ */
+ ThisProc->xid = InvalidTransactionId;
+
+ /*
+ * The backendid of the recovery proc stays at InvalidBackendId. There
+ * is a direct 1:1 correspondence between a master backendid and this
+ * proc, but that same backendid may also be in use during recovery,
+ * so if we set this field we would have duplicate backendids.
+ */
+ ThisProc->backendId = InvalidBackendId;
+
+ /*
+ * The following are not used in recovery
+ */
+ ThisProc->pid = 0;
+
+ SHMQueueElemInit(&(ThisProc->links));
+ ThisProc->waitStatus = STATUS_OK;
+ ThisProc->lxid = InvalidLocalTransactionId;
+ ThisProc->xmin = InvalidTransactionId;
+ ThisProc->databaseId = InvalidOid;
+ ThisProc->roleId = InvalidOid;
+ ThisProc->inCommit = false;
+ ThisProc->vacuumFlags = 0;
+ ThisProc->lwWaiting = false;
+ ThisProc->lwExclusive = false;
+ ThisProc->lwWaitLink = NULL;
+ ThisProc->waitLock = NULL;
+ ThisProc->waitProcLock = NULL;
+
+ /*
+ * There is little else to do. The recovery proc is never used to
+ * acquire buffers, nor will we ever acquire LWLocks using the proc.
+ * Deadlock checker is not active during recovery.
+ */
+ return ThisProc;
+ }
+
+ /*
* InitProcessPhase2 -- make MyProc visible in the shared ProcArray.
*
* This is separate from InitProcess because we can't acquire LWLocks until
Index: src/backend/tcop/postgres.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/tcop/postgres.c,v
retrieving revision 1.557
diff -c -r1.557 postgres.c
*** src/backend/tcop/postgres.c 30 Sep 2008 10:52:13 -0000 1.557
--- src/backend/tcop/postgres.c 27 Oct 2008 18:32:03 -0000
***************
*** 3261,3267 ****
* We have to build the flat file for pg_database, but not for the
* user and group tables, since we won't try to do authentication.
*/
! BuildFlatFiles(true);
}
/*
--- 3261,3267 ----
* We have to build the flat file for pg_database, but not for the
* user and group tables, since we won't try to do authentication.
*/
! BuildFlatFiles(true, false, false);
}
/*
Index: src/backend/tcop/utility.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/tcop/utility.c,v
retrieving revision 1.299
diff -c -r1.299 utility.c
*** src/backend/tcop/utility.c 10 Oct 2008 13:48:05 -0000 1.299
--- src/backend/tcop/utility.c 27 Oct 2008 18:32:03 -0000
***************
*** 296,301 ****
--- 296,302 ----
break;
case TRANS_STMT_PREPARE:
+ PreventCommandDuringRecovery();
if (!PrepareTransactionBlock(stmt->gid))
{
/* report unsuccessful commit in completionTag */
***************
*** 305,315 ****
--- 306,318 ----
break;
case TRANS_STMT_COMMIT_PREPARED:
+ PreventCommandDuringRecovery();
PreventTransactionChain(isTopLevel, "COMMIT PREPARED");
FinishPreparedTransaction(stmt->gid, true);
break;
case TRANS_STMT_ROLLBACK_PREPARED:
+ PreventCommandDuringRecovery();
PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED");
FinishPreparedTransaction(stmt->gid, false);
break;
***************
*** 631,636 ****
--- 634,640 ----
break;
case T_GrantStmt:
+ PreventCommandDuringRecovery();
ExecuteGrantStmt((GrantStmt *) parsetree);
break;
***************
*** 801,806 ****
--- 805,811 ----
case T_NotifyStmt:
{
NotifyStmt *stmt = (NotifyStmt *) parsetree;
+ PreventCommandDuringRecovery();
Async_Notify(stmt->conditionname);
}
***************
*** 809,814 ****
--- 814,820 ----
case T_ListenStmt:
{
ListenStmt *stmt = (ListenStmt *) parsetree;
+ PreventCommandDuringRecovery();
Async_Listen(stmt->conditionname);
}
***************
*** 817,822 ****
--- 823,829 ----
case T_UnlistenStmt:
{
UnlistenStmt *stmt = (UnlistenStmt *) parsetree;
+ PreventCommandDuringRecovery();
if (stmt->conditionname)
Async_Unlisten(stmt->conditionname);
***************
*** 836,845 ****
--- 843,854 ----
break;
case T_ClusterStmt:
+ PreventCommandDuringRecovery();
cluster((ClusterStmt *) parsetree, isTopLevel);
break;
case T_VacuumStmt:
+ PreventCommandDuringRecovery();
vacuum((VacuumStmt *) parsetree, InvalidOid, true, NULL, false,
isTopLevel);
break;
***************
*** 950,955 ****
--- 959,965 ----
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("must be superuser to do CHECKPOINT")));
+ PreventCommandDuringRecovery();
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
break;
***************
*** 957,962 ****
--- 967,974 ----
{
ReindexStmt *stmt = (ReindexStmt *) parsetree;
+ PreventCommandDuringRecovery();
+
switch (stmt->kind)
{
case OBJECT_INDEX:
***************
*** 2386,2388 ****
--- 2398,2409 ----
return lev;
}
+
+ void
+ PreventCommandDuringRecovery(void)
+ {
+ if (IsRecoveryProcessingMode())
+ ereport(ERROR,
+ (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+ errmsg("cannot be run until recovery completes")));
+ }
Index: src/backend/utils/adt/txid.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/adt/txid.c,v
retrieving revision 1.7
diff -c -r1.7 txid.c
*** src/backend/utils/adt/txid.c 12 May 2008 20:02:02 -0000 1.7
--- src/backend/utils/adt/txid.c 27 Oct 2008 18:32:03 -0000
***************
*** 338,343 ****
--- 338,349 ----
txid val;
TxidEpoch state;
+ if (IsRecoveryProcessingMode())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot assign txid while recovery is in progress"),
+ errhint("Only read-only queries can execute during recovery.")));
+
load_xid_epoch(&state);
val = convert_xid(GetTopTransactionId(), &state);
Index: src/backend/utils/cache/inval.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/cache/inval.c,v
retrieving revision 1.87
diff -c -r1.87 inval.c
*** src/backend/utils/cache/inval.c 9 Sep 2008 18:58:08 -0000 1.87
--- src/backend/utils/cache/inval.c 27 Oct 2008 18:32:03 -0000
***************
*** 1235,1237 ****
--- 1235,1314 ----
++relcache_callback_count;
}
+
+ /*
+ * --------------------------------------------------
+ * Recovery handling for Rmgr RM_RELATION_ID
+ * --------------------------------------------------
+ */
+
+ /*
+ * Redo for relation invalidation messages
+ */
+ static void
+ relation_redo_inval(xl_rel_inval *xlrec)
+ {
+ }
+
+ /*
+ * Redo for relation lock messages
+ */
+ static void
+ relation_redo_lock(xl_rel_lock *xlrec)
+ {
+ }
+
+ void
+ relation_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ if (info == XLOG_RELATION_INVAL)
+ {
+ xl_rel_inval *xlrec = (xl_rel_inval *) XLogRecGetData(record);
+
+ relation_redo_inval(xlrec);
+ }
+ else if (info == XLOG_RELATION_LOCK)
+ {
+ xl_rel_lock *xlrec = (xl_rel_lock *) XLogRecGetData(record);
+
+ relation_redo_lock(xlrec);
+ }
+ else
+ elog(PANIC, "relation_redo: unknown op code %u", info);
+ }
+
+ static void
+ relation_desc_inval(StringInfo buf, xl_rel_inval *xlrec)
+ {
+ }
+
+ static void
+ relation_desc_lock(StringInfo buf, xl_rel_lock *xlrec)
+ {
+ }
+
+ void
+ relation_desc(StringInfo buf, uint8 xl_info, char *rec)
+ {
+ uint8 info = xl_info & ~XLR_INFO_MASK;
+
+ if (info == XLOG_RELATION_INVAL)
+ {
+ xl_rel_inval *xlrec = (xl_rel_inval *) rec;
+
+ appendStringInfo(buf, "inval: ");
+ relation_desc_inval(buf, xlrec);
+ }
+ else if (info == XLOG_RELATION_LOCK)
+ {
+ xl_rel_lock *xlrec = (xl_rel_lock *) rec;
+
+ appendStringInfo(buf, "lock: ");
+ relation_desc_lock(buf, xlrec);
+ }
+ else
+ appendStringInfo(buf, "UNKNOWN");
+ }
+
Index: src/backend/utils/error/elog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/error/elog.c,v
retrieving revision 1.208
diff -c -r1.208 elog.c
*** src/backend/utils/error/elog.c 17 Oct 2008 22:56:16 -0000 1.208
--- src/backend/utils/error/elog.c 27 Oct 2008 18:32:03 -0000
***************
*** 2544,2546 ****
--- 2544,2563 ----
return false;
}
+
+ /*
+ * If trace_level is at or above trace_recovery_messages, emit the message
+ * at LOG level; otherwise keep the requested level. In the latter case it
+ * is still shown if log_min_messages is low enough to include that level.
+ *
+ * Intention is to keep this for at least the whole of the 8.4 production
+ * release, so we can more easily diagnose production problems in the field.
+ */
+ int
+ trace_recovery(int trace_level)
+ {
+ if (trace_level >= trace_recovery_messages)
+ return LOG;
+
+ return trace_level;
+ }
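Reviewer note on trace_recovery(): the mapping can be checked in isolation. This is a minimal standalone sketch, not part of the patch. The enum and demo_trace_recovery() are hypothetical names, the numeric values are assumptions chosen only to preserve the relative order of the levels, and the threshold is passed as a parameter instead of being read from the trace_recovery_messages setting.

```c
#include <assert.h>

/*
 * Illustrative severity levels. The numeric values are assumptions for this
 * sketch; only their relative order (DEBUG4 < DEBUG3 < LOG < WARNING)
 * matters here.
 */
enum demo_level
{
	DEMO_DEBUG4 = 11,
	DEMO_DEBUG3 = 12,
	DEMO_LOG = 15,
	DEMO_WARNING = 19
};

/* Same rule as trace_recovery(), with the threshold passed in explicitly. */
static int
demo_trace_recovery(int trace_level, int threshold)
{
	assert(trace_level > 0 && threshold > 0);

	if (trace_level >= threshold)
		return DEMO_LOG;		/* at or above the threshold: emit as LOG */

	return trace_level;			/* below the threshold: level is unchanged */
}
```

With the threshold at DEMO_DEBUG3, a DEMO_DEBUG3 message is returned as DEMO_LOG while a DEMO_DEBUG4 message keeps its level. With the threshold at DEMO_WARNING (the documented default), both keep their original levels.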
Index: src/backend/utils/init/flatfiles.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/init/flatfiles.c,v
retrieving revision 1.35
diff -c -r1.35 flatfiles.c
*** src/backend/utils/init/flatfiles.c 12 Jun 2008 09:12:31 -0000 1.35
--- src/backend/utils/init/flatfiles.c 27 Oct 2008 18:32:03 -0000
***************
*** 678,686 ****
/*
* This routine is called once during database startup, after completing
* WAL replay if needed. Its purpose is to sync the flat files with the
! * current state of the database tables. This is particularly important
! * during PITR operation, since the flat files will come from the
! * base backup which may be far out of sync with the current state.
*
* In theory we could skip rebuilding the flat files if no WAL replay
* occurred, but it seems best to just do it always. We have to
--- 678,687 ----
/*
* This routine is called once during database startup, after completing
* WAL replay if needed. Its purpose is to sync the flat files with the
! * current state of the database tables.
! *
! * In 8.4 we also run this during xact_redo_commit() if the transaction
! * wrote a new database or auth flat file.
*
* In theory we could skip rebuilding the flat files if no WAL replay
* occurred, but it seems best to just do it always. We have to
***************
*** 696,702 ****
* something corrupt in the authid/authmem catalogs.
*/
void
! BuildFlatFiles(bool database_only)
{
ResourceOwner owner;
RelFileNode rnode;
--- 697,703 ----
* something corrupt in the authid/authmem catalogs.
*/
void
! BuildFlatFiles(bool database_only, bool acquire_locks, bool release_locks)
{
ResourceOwner owner;
RelFileNode rnode;
***************
*** 713,723 ****
rnode.dbNode = 0;
rnode.relNode = DatabaseRelationId;
/*
* We don't have any hope of running a real relcache, but we can use the
* same fake-relcache facility that WAL replay uses.
- *
- * No locking is needed because no one else is alive yet.
*/
rel_db = CreateFakeRelcacheEntry(rnode);
write_database_file(rel_db, true);
--- 714,736 ----
rnode.dbNode = 0;
rnode.relNode = DatabaseRelationId;
+ if (!acquire_locks && release_locks)
+ elog(FATAL, "BuildFlatFiles called with invalid parameters");
+
+ if (acquire_locks)
+ {
+ #ifdef HAVE_RECOVERY_LOCKING
+ LockSharedObject(DatabaseRelationId, InvalidOid, 0,
+ AccessExclusiveLock);
+
+ LockSharedObject(AuthIdRelationId, InvalidOid, 0,
+ AccessExclusiveLock);
+ #endif
+ }
+
/*
* We don't have any hope of running a real relcache, but we can use the
* same fake-relcache facility that WAL replay uses.
*/
rel_db = CreateFakeRelcacheEntry(rnode);
write_database_file(rel_db, true);
***************
*** 744,749 ****
--- 757,778 ----
CurrentResourceOwner = NULL;
ResourceOwnerDelete(owner);
+
+ /*
+ * If we don't release locks it is because we presume that all
+ * locks will be released by the end of xact_redo_commit().
+ */
+ if (release_locks)
+ {
+ #ifdef HAVE_RECOVERY_LOCKING
+ /* XXXR change these to lock releases */
+ LockSharedObject(DatabaseRelationId, InvalidOid, 0,
+ AccessExclusiveLock);
+
+ LockSharedObject(AuthIdRelationId, InvalidOid, 0,
+ AccessExclusiveLock);
+ #endif
+ }
}
***************
*** 859,864 ****
--- 888,907 ----
ForceSyncCommit();
}
+ /*
+ * Exported so that transaction commit can set flags telling redo to rebuild the flat files
+ */
+ bool
+ AtEOXact_Database_FlatFile_Update_Needed(void)
+ {
+ return TransactionIdIsValid(database_file_update_subid);
+ }
+
+ bool
+ AtEOXact_Auth_FlatFile_Update_Needed(void)
+ {
+ return TransactionIdIsValid(auth_file_update_subid);
+ }
/*
* This routine is called during transaction prepare.
Index: src/backend/utils/init/postinit.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/init/postinit.c,v
retrieving revision 1.186
diff -c -r1.186 postinit.c
*** src/backend/utils/init/postinit.c 23 Sep 2008 09:20:36 -0000 1.186
--- src/backend/utils/init/postinit.c 27 Oct 2008 18:32:03 -0000
***************
*** 489,497 ****
--- 489,503 ----
* Start a new transaction here before first access to db, and get a
* snapshot. We don't have a use for the snapshot itself, but we're
* interested in the secondary effect that it sets RecentGlobalXmin.
+ * If we are connecting during recovery, make sure the initial
+ * transaction is read-only and force all subsequent transactions
+ * to be read-only as well.
*/
if (!bootstrap)
{
+ if (IsRecoveryProcessingMode())
+ SetConfigOption("default_transaction_read_only", "true",
+ PGC_POSTMASTER, PGC_S_OVERRIDE);
StartTransactionCommand();
(void) GetTransactionSnapshot();
}
***************
*** 515,521 ****
*/
if (!bootstrap)
LockSharedObject(DatabaseRelationId, MyDatabaseId, 0,
! RowExclusiveLock);
/*
* Recheck the flat file copy of pg_database to make sure the target
--- 521,527 ----
*/
if (!bootstrap)
LockSharedObject(DatabaseRelationId, MyDatabaseId, 0,
! (IsRecoveryProcessingMode() ? AccessShareLock : RowExclusiveLock));
/*
* Recheck the flat file copy of pg_database to make sure the target
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.475
diff -c -r1.475 guc.c
*** src/backend/utils/misc/guc.c 6 Oct 2008 13:05:36 -0000 1.475
--- src/backend/utils/misc/guc.c 27 Oct 2008 18:32:03 -0000
***************
*** 114,119 ****
--- 114,121 ----
extern bool synchronize_seqscans;
extern bool fullPageWrites;
+ int trace_recovery_messages = DEBUG1;
+
#ifdef TRACE_SORT
extern bool trace_sort;
#endif
***************
*** 2588,2593 ****
--- 2590,2605 ----
},
{
+ {"trace_recovery_messages", PGC_SUSET, LOGGING_WHEN,
+ gettext_noop("Sets the message levels that are logged during recovery."),
+ gettext_noop("Each level includes all the levels that follow it. The later"
+ " the level, the fewer messages are sent.")
+ },
+ &trace_recovery_messages,
+ DEBUG1, server_message_level_options, NULL, NULL
+ },
+
+ {
{"track_functions", PGC_SUSET, STATS_COLLECTOR,
gettext_noop("Collects function-level statistics on database activity."),
NULL
Index: src/backend/utils/time/tqual.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/time/tqual.c,v
retrieving revision 1.110
diff -c -r1.110 tqual.c
*** src/backend/utils/time/tqual.c 26 Mar 2008 16:20:47 -0000 1.110
--- src/backend/utils/time/tqual.c 27 Oct 2008 18:32:03 -0000
***************
*** 86,92 ****
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid)
{
! if (TransactionIdIsValid(xid))
{
/* NB: xid must be known committed here! */
XLogRecPtr commitLSN = TransactionIdGetCommitLSN(xid);
--- 86,92 ----
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid)
{
! if (!IsRecoveryProcessingMode() && TransactionIdIsValid(xid))
{
/* NB: xid must be known committed here! */
XLogRecPtr commitLSN = TransactionIdGetCommitLSN(xid);
***************
*** 1238,1263 ****
return true;
/*
! * If the snapshot contains full subxact data, the fastest way to check
! * things is just to compare the given XID against both subxact XIDs and
! * top-level XIDs. If the snapshot overflowed, we have to use pg_subtrans
! * to convert a subxact XID to its parent XID, but then we need only look
! * at top-level XIDs not subxacts.
*/
! if (snapshot->subxcnt >= 0)
{
! /* full data, so search subxip */
! int32 j;
!
! for (j = 0; j < snapshot->subxcnt; j++)
! {
! if (TransactionIdEquals(xid, snapshot->subxip[j]))
return true;
}
! /* not there, fall through to search xip[] */
! }
! else
{
/* overflowed, so convert xid to top-level */
xid = SubTransGetTopmostTransaction(xid);
--- 1238,1257 ----
return true;
/*
! * Compare the given XID against subxact XIDs.
*/
! for (i = 0; i < snapshot->subxcnt; i++)
{
! if (TransactionIdEquals(xid, snapshot->subxip[i]))
return true;
}
! /*
! * If the snapshot overflowed, we have to use pg_subtrans to convert a
! * subxact XID to its parent XID, but then we need only look at top-level
! * XIDs not subxacts.
! */
! if (snapshot->suboverflowed)
{
/* overflowed, so convert xid to top-level */
xid = SubTransGetTopmostTransaction(xid);
***************
*** 1270,1275 ****
--- 1264,1272 ----
return false;
}
+ /*
+ * Compare the given XID against top-level XIDs.
+ */
for (i = 0; i < snapshot->xcnt; i++)
{
if (TransactionIdEquals(xid, snapshot->xip[i]))
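Note on the tqual.c hunk above: the revised lookup always scans subxip[] first, then maps the XID to its top-level parent only when the snapshot overflowed, and finally scans xip[]. The following is a minimal standalone sketch of that ordering, not the patch's code. All `Demo*`/`demo_*` names are hypothetical, XIDs are plain integers compared with `==` (no wraparound handling), the xmin/xmax range checks are omitted, and `demo_topmost` is a toy stand-in for SubTransGetTopmostTransaction().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DemoXid;

typedef struct DemoSnapshot
{
    const DemoXid *xip;           /* top-level XIDs in progress */
    size_t         xcnt;
    const DemoXid *subxip;        /* known subtransaction XIDs in progress */
    size_t         subxcnt;
    bool           suboverflowed; /* true if subxip[] may be incomplete */
} DemoSnapshot;

/* Hypothetical stand-in for a pg_subtrans parent lookup. */
typedef DemoXid (*DemoTopmostFn) (DemoXid xid);

static bool
demo_xid_in_snapshot(DemoXid xid, const DemoSnapshot *snap, DemoTopmostFn topmost)
{
    size_t i;

    assert(snap != NULL && topmost != NULL);

    /* Step 1: always search whatever subxact XIDs the snapshot holds. */
    for (i = 0; i < snap->subxcnt; i++)
        if (snap->subxip[i] == xid)
            return true;

    /* Step 2: if subxip[] overflowed, convert the XID to its top-level parent. */
    if (snap->suboverflowed)
        xid = topmost(xid);

    /* Step 3: search the top-level XIDs. */
    for (i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return true;

    return false;
}

/* Toy parent map: XIDs 200..299 are treated as children of XID 100. */
static DemoXid
demo_topmost(DemoXid xid)
{
    return (xid >= 200 && xid < 300) ? 100 : xid;
}

/* Usage helper: one running top-level XID (100), one known subxact (201). */
static bool
demo_lookup(DemoXid xid, bool overflowed)
{
    static const DemoXid xip[] = {100};
    static const DemoXid subxip[] = {201};
    DemoSnapshot snap = {xip, 1, subxip, 1, overflowed};

    return demo_xid_in_snapshot(xid, &snap, demo_topmost);
}
```

With these toy values, XID 250 is reported as in progress only when the snapshot is marked overflowed, because only then is it mapped to its parent 100 before xip[] is searched.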
Index: src/bin/pg_controldata/pg_controldata.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/bin/pg_controldata/pg_controldata.c,v
retrieving revision 1.41
diff -c -r1.41 pg_controldata.c
*** src/bin/pg_controldata/pg_controldata.c 24 Sep 2008 08:59:42 -0000 1.41
--- src/bin/pg_controldata/pg_controldata.c 27 Oct 2008 18:32:03 -0000
***************
*** 197,202 ****
--- 197,205 ----
printf(_("Minimum recovery ending location: %X/%X\n"),
ControlFile.minRecoveryPoint.xlogid,
ControlFile.minRecoveryPoint.xrecoff);
+ printf(_("Minimum safe starting location: %X/%X\n"),
+ ControlFile.minSafeStartPoint.xlogid,
+ ControlFile.minSafeStartPoint.xrecoff);
printf(_("Maximum data alignment: %u\n"),
ControlFile.maxAlign);
/* we don't print floatFormat since can't say much useful about it */
Index: src/bin/pg_resetxlog/pg_resetxlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/bin/pg_resetxlog/pg_resetxlog.c,v
retrieving revision 1.68
diff -c -r1.68 pg_resetxlog.c
*** src/bin/pg_resetxlog/pg_resetxlog.c 24 Sep 2008 09:00:44 -0000 1.68
--- src/bin/pg_resetxlog/pg_resetxlog.c 27 Oct 2008 18:32:03 -0000
***************
*** 595,600 ****
--- 595,602 ----
ControlFile.prevCheckPoint.xrecoff = 0;
ControlFile.minRecoveryPoint.xlogid = 0;
ControlFile.minRecoveryPoint.xrecoff = 0;
+ ControlFile.minSafeStartPoint.xlogid = 0;
+ ControlFile.minSafeStartPoint.xrecoff = 0;
/* Now we can force the recorded xlog seg size to the right thing. */
ControlFile.xlog_seg_size = XLogSegSize;
Index: src/include/miscadmin.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/miscadmin.h,v
retrieving revision 1.203
diff -c -r1.203 miscadmin.h
*** src/include/miscadmin.h 9 Oct 2008 17:24:05 -0000 1.203
--- src/include/miscadmin.h 27 Oct 2008 18:32:03 -0000
***************
*** 221,226 ****
--- 221,232 ----
/* in tcop/postgres.c */
extern void check_stack_depth(void);
+ /* in tcop/utility.c */
+ extern void PreventCommandDuringRecovery(void);
+
+ /* in utils/misc/guc.c */
+ extern int trace_recovery_messages;
+ int trace_recovery(int trace_level);
/*****************************************************************************
* pdir.h -- *
Index: src/include/access/rmgr.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/rmgr.h,v
retrieving revision 1.18
diff -c -r1.18 rmgr.h
*** src/include/access/rmgr.h 30 Sep 2008 10:52:13 -0000 1.18
--- src/include/access/rmgr.h 27 Oct 2008 18:32:03 -0000
***************
*** 24,29 ****
--- 24,30 ----
#define RM_TBLSPC_ID 5
#define RM_MULTIXACT_ID 6
#define RM_FREESPACE_ID 7
+ #define RM_RELATION_ID 8
#define RM_HEAP2_ID 9
#define RM_HEAP_ID 10
#define RM_BTREE_ID 11
Index: src/include/access/xact.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xact.h,v
retrieving revision 1.95
diff -c -r1.95 xact.h
*** src/include/access/xact.h 11 Aug 2008 11:05:11 -0000 1.95
--- src/include/access/xact.h 27 Oct 2008 18:32:03 -0000
***************
*** 17,22 ****
--- 17,23 ----
#include "access/xlog.h"
#include "nodes/pg_list.h"
#include "storage/relfilenode.h"
+ #include "utils/snapshot.h"
#include "utils/timestamp.h"
***************
*** 84,95 ****
--- 85,114 ----
#define XLOG_XACT_ABORT 0x20
#define XLOG_XACT_COMMIT_PREPARED 0x30
#define XLOG_XACT_ABORT_PREPARED 0x40
+ #define XLOG_XACT_ASSIGNMENT 0x50
+ #define XLOG_XACT_RUNNING_XACTS 0x60
+ /* 0x70 can also be used, if required */
+
+ typedef struct xl_xact_assignment
+ {
+ TransactionId xassign; /* assigned xid */
+ TransactionId xparent; /* assigned xid's parent, if any */
+ bool isSubXact; /* is a subtransaction */
+ int slotId; /* slotId in procarray */
+ } xl_xact_assignment;
+
+ /*
+ * xl_xact_running_xacts is in utils/snapshot.h (not snapmgr.h) so it can
+ * be passed around to the same places as snapshots.
+ */
typedef struct xl_xact_commit
{
TimestampTz xact_time; /* time of commit */
int nrels; /* number of RelFileForks */
int nsubxacts; /* number of subtransaction XIDs */
+ int slotId; /* slotId in procarray */
+ uint flags; /* info flags */
/* Array of RelFileFork(s) to drop at commit */
RelFileFork xnodes[1]; /* VARIABLE LENGTH ARRAY */
/* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
***************
*** 102,107 ****
--- 121,128 ----
TimestampTz xact_time; /* time of abort */
int nrels; /* number of RelFileForks */
int nsubxacts; /* number of subtransaction XIDs */
+ int slotId; /* slotId in procarray */
+ uint flags; /* info flags */
/* Array of RelFileFork(s) to drop at abort */
RelFileFork xnodes[1]; /* VARIABLE LENGTH ARRAY */
/* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */
***************
*** 109,114 ****
--- 130,143 ----
#define MinSizeOfXactAbort offsetof(xl_xact_abort, xnodes)
+ #define XACT_COMPLETION_UNMARKED_SUBXIDS 0x01
+ #define XACT_COMPLETION_UPDATE_DB_FILE 0x02
+ #define XACT_COMPLETION_UPDATE_AUTH_FILE 0x04
+
+ #define XactCompletionHasUnMarkedSubxids(xlrec) ((xlrec)->flags & XACT_COMPLETION_UNMARKED_SUBXIDS)
+ #define XactCompletionUpdateDBFile(xlrec) ((xlrec)->flags & XACT_COMPLETION_UPDATE_DB_FILE)
+ #define XactCompletionUpdateAuthFile(xlrec) ((xlrec)->flags & XACT_COMPLETION_UPDATE_AUTH_FILE)
+
/*
* COMMIT_PREPARED and ABORT_PREPARED are identical to COMMIT/ABORT records
* except that we have to store the XID of the prepared transaction explicitly
***************
*** 185,190 ****
--- 214,227 ----
extern int xactGetCommittedChildren(TransactionId **ptr);
+ extern void LogCurrentRunningXacts(void);
+ extern void GetStandbyInfoForTransaction(RmgrId rmid, uint8 info,
+ XLogRecData *rdata,
+ TransactionId *xid2,
+ uint16 *info2);
+ extern void RecordKnownAssignedTransactionIds(XLogRecPtr lsn, XLogRecord *record);
+ extern bool IsRunningXactDataIsValid(void);
+
extern void xact_redo(XLogRecPtr lsn, XLogRecord *record);
extern void xact_desc(StringInfo buf, uint8 xl_info, char *rec);
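Note on the xact.h hunks above: the new `flags` field carries XACT_COMPLETION_* bits from commit time to redo. The sketch below shows one way the two flat-file bits could be set and consumed. The flag values are copied from the patch; everything named `demo_*` or `Demo*` is hypothetical, and the mapping from flags to a rebuild decision is an assumption for illustration, not something the patch text in this section confirms.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the patch. */
#define XACT_COMPLETION_UNMARKED_SUBXIDS  0x01
#define XACT_COMPLETION_UPDATE_DB_FILE    0x02
#define XACT_COMPLETION_UPDATE_AUTH_FILE  0x04

/* Commit side: fold the two "update needed" answers into the flags word. */
static unsigned int
demo_commit_flags(bool db_file_changed, bool auth_file_changed)
{
    unsigned int flags = 0;

    if (db_file_changed)
        flags |= XACT_COMPLETION_UPDATE_DB_FILE;
    if (auth_file_changed)
        flags |= XACT_COMPLETION_UPDATE_AUTH_FILE;
    return flags;
}

/* Hypothetical summary of what redo would have to rebuild. */
typedef enum DemoRebuild
{
    DEMO_REBUILD_NONE,
    DEMO_REBUILD_DATABASE_ONLY,
    DEMO_REBUILD_ALL
} DemoRebuild;

/* Redo side (assumed policy): auth changes force a full rebuild. */
static DemoRebuild
demo_redo_rebuild(unsigned int flags)
{
    /* Only the three defined bits may be set. */
    assert((flags & ~0x07u) == 0);

    if (flags & XACT_COMPLETION_UPDATE_AUTH_FILE)
        return DEMO_REBUILD_ALL;
    if (flags & XACT_COMPLETION_UPDATE_DB_FILE)
        return DEMO_REBUILD_DATABASE_ONLY;
    return DEMO_REBUILD_NONE;
}
```

In the patch itself the commit-side inputs would presumably come from AtEOXact_Database_FlatFile_Update_Needed() and AtEOXact_Auth_FlatFile_Update_Needed(); here they are plain booleans.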
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlog.h,v
retrieving revision 1.88
diff -c -r1.88 xlog.h
*** src/include/access/xlog.h 12 May 2008 08:35:05 -0000 1.88
--- src/include/access/xlog.h 27 Oct 2008 18:32:03 -0000
***************
*** 46,55 ****
TransactionId xl_xid; /* xact id */
uint32 xl_tot_len; /* total len of entire record */
uint32 xl_len; /* total len of rmgr data */
! uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
! /* Depending on MAXALIGN, there are either 2 or 6 wasted bytes here */
/* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */
--- 46,63 ----
TransactionId xl_xid; /* xact id */
uint32 xl_tot_len; /* total len of entire record */
uint32 xl_len; /* total len of rmgr data */
! uint8 xl_info; /* flag bits, see below (XLR_ entries) */
RmgrId xl_rmid; /* resource manager for this record */
+ uint16 xl_info2; /* more flag bits, see below (XLR2_ entries) */
! /*
! * Next we have an additional entry that can have multiple meanings.
! * If XLR2_FIRST_SUBXID_RECORD is set we interpret this as the parent xid.
! * If XLR2_ROW_REMOVAL is set we interpret this as latestRemovedXid.
! */
! TransactionId xl_xid2;
!
! /* Above structure has 8 byte alignment */
/* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */
***************
*** 85,90 ****
--- 93,131 ----
*/
#define XLR_BKP_REMOVABLE 0x01
+ /*
+ * XLOG uses only the high 4 bits of xl_info2 for flags.
+ *
+ * The other 12 bits hold the slotId, allowing up to XLOG_MAX_SLOT_ID
+ * distinct slotIds in the WAL record. This doesn't prevent having more
+ * than that number of backends; it just means that any backend whose
+ * slotId does not fit in 12 bits must write a specific WAL record
+ * during AssignTransactionId().
+ */
+ #define XLR2_INFO2_MASK 0x0FFF
+ #define XLOG_MAX_SLOT_ID 4096
+ /*
+ * xl_info2 flag bits
+ */
+ #define XLR2_INVALID_SLOT_ID 0x8000
+ #define XLR2_FIRST_XID_RECORD 0x4000
+ #define XLR2_FIRST_SUBXID_RECORD 0x2000
+ #define XLR2_MARK_SUBTRANS 0x1000
+
+ #define XLR2_XID_MASK 0x6000
+
+ #define XLogRecGetSlotId(record) \
+ ( \
+ ((record)->xl_info2 & XLR2_INVALID_SLOT_ID) ? \
+ -1 : \
+ (int)((record)->xl_info2 & XLR2_INFO2_MASK) \
+ )
+
+ #define XLogRecIsFirstXidRecord(record) ((record)->xl_info2 & XLR2_FIRST_XID_RECORD)
+ #define XLogRecIsFirstSubXidRecord(record) ((record)->xl_info2 & XLR2_FIRST_SUBXID_RECORD)
+ #define XLogRecIsFirstUseOfXid(record) ((record)->xl_info2 & XLR2_XID_MASK)
+ #define XLogRecMustMarkSubtrans(record) ((record)->xl_info2 & XLR2_MARK_SUBTRANS)
+
/* Sync methods */
#define SYNC_METHOD_FSYNC 0
#define SYNC_METHOD_FDATASYNC 1
***************
*** 133,139 ****
} XLogRecData;
extern TimeLineID ThisTimeLineID; /* current TLI */
! extern bool InRecovery;
extern XLogRecPtr XactLastRecEnd;
/* these variables are GUC parameters related to XLOG */
--- 174,187 ----
} XLogRecData;
extern TimeLineID ThisTimeLineID; /* current TLI */
! /*
! * Prior to 8.4, all activity during recovery was carried out by the Startup
! * process. This local variable continues to be used in many parts of the
! * code to indicate actions taken by RecoveryManagers. Other processes that
! * potentially perform work during recovery should check
! * IsRecoveryProcessingMode(); see the XLogCtl notes in xlog.c.
! */
! extern bool InRecovery;
extern XLogRecPtr XactLastRecEnd;
/* these variables are GUC parameters related to XLOG */
***************
*** 166,171 ****
--- 214,220 ----
/* These indicate the cause of a checkpoint request */
#define CHECKPOINT_CAUSE_XLOG 0x0010 /* XLOG consumption */
#define CHECKPOINT_CAUSE_TIME 0x0020 /* Elapsed time */
+ #define CHECKPOINT_RESTARTPOINT 0x0040 /* Restartpoint during recovery */
/* Checkpoint statistics */
typedef struct CheckpointStatsData
***************
*** 197,202 ****
--- 246,253 ----
extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
+ extern bool IsRecoveryProcessingMode(void);
+
extern void UpdateControlFile(void);
extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
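Note on the xlog.h hunks above: xl_info2 packs four flag bits and a 12-bit slotId into one 16-bit field. A 12-bit field holds the values 0..4095, so XLOG_MAX_SLOT_ID = 4096 reads as a count of slotIds rather than the highest usable value. The following standalone sketch illustrates the decoding. The mask and flag values are copied from the patch; the `demo_*` helpers are hypothetical, and marking an out-of-range slotId with XLR2_INVALID_SLOT_ID is an assumption about how the packing side might behave.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the patch. */
#define XLR2_INFO2_MASK          0x0FFF
#define XLR2_INVALID_SLOT_ID     0x8000
#define XLR2_FIRST_XID_RECORD    0x4000
#define XLR2_FIRST_SUBXID_RECORD 0x2000
#define XLR2_MARK_SUBTRANS       0x1000
#define XLR2_XID_MASK            0x6000

/* Hypothetical helper: build an xl_info2 value from a slotId and flag bits. */
static uint16_t
demo_pack_info2(int slot_id, uint16_t flags)
{
    /* Flags must live entirely in the high 4 bits. */
    assert((flags & XLR2_INFO2_MASK) == 0);

    /* Assumption: a slotId that does not fit in 12 bits is marked invalid. */
    if (slot_id < 0 || slot_id > XLR2_INFO2_MASK)
        return (uint16_t) (flags | XLR2_INVALID_SLOT_ID);

    return (uint16_t) (flags | (uint16_t) slot_id);
}

/* Same logic as XLogRecGetSlotId(), applied to a bare 16-bit value. */
static int
demo_get_slot_id(uint16_t info2)
{
    if (info2 & XLR2_INVALID_SLOT_ID)
        return -1;
    return (int) (info2 & XLR2_INFO2_MASK);
}

/* Same logic as XLogRecIsFirstUseOfXid(): either "first xid" bit is set. */
static int
demo_is_first_use_of_xid(uint16_t info2)
{
    return (info2 & XLR2_XID_MASK) != 0;
}
```

For example, packing slotId 17 with XLR2_FIRST_XID_RECORD gives 0x4011, which decodes back to slotId 17 with the first-use test true.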
Index: src/include/access/xlog_internal.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlog_internal.h,v
retrieving revision 1.24
diff -c -r1.24 xlog_internal.h
*** src/include/access/xlog_internal.h 11 Aug 2008 11:05:11 -0000 1.24
--- src/include/access/xlog_internal.h 27 Oct 2008 18:32:03 -0000
***************
*** 17,22 ****
--- 17,23 ----
#define XLOG_INTERNAL_H
#include "access/xlog.h"
+ #include "catalog/pg_control.h"
#include "fmgr.h"
#include "pgtime.h"
#include "storage/block.h"
***************
*** 71,77 ****
/*
* Each page of XLOG file has a header like this:
*/
! #define XLOG_PAGE_MAGIC 0xD063 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
--- 72,78 ----
/*
* Each page of XLOG file has a header like this:
*/
! #define XLOG_PAGE_MAGIC 0x5352 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
***************
*** 245,250 ****
--- 246,254 ----
extern pg_time_t GetLastSegSwitchTime(void);
extern XLogRecPtr RequestXLogSwitch(void);
+ extern void CreateRestartPoint(const XLogRecPtr ReadPtr,
+ const CheckPoint *restartPoint, int flags);
+
/*
* These aren't in xlog.h because I'd rather not include fmgr.h there.
*/
***************
*** 255,259 ****
--- 259,273 ----
extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS);
+ extern Datum pg_recovery_continue(PG_FUNCTION_ARGS);
+ extern Datum pg_recovery_pause(PG_FUNCTION_ARGS);
+ extern Datum pg_recovery_pause_cleanup(PG_FUNCTION_ARGS);
+ extern Datum pg_recovery_pause_xid(PG_FUNCTION_ARGS);
+ extern Datum pg_recovery_pause_time(PG_FUNCTION_ARGS);
+ extern Datum pg_recovery_advance(PG_FUNCTION_ARGS);
+ extern Datum pg_recovery_stop(PG_FUNCTION_ARGS);
+ extern Datum pg_is_in_recovery(PG_FUNCTION_ARGS);
+ extern Datum pg_last_completed_xact_timestamp(PG_FUNCTION_ARGS);
+ extern Datum pg_last_completed_xid(PG_FUNCTION_ARGS);
#endif /* XLOG_INTERNAL_H */
Index: src/include/access/xlogutils.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlogutils.h,v
retrieving revision 1.26
diff -c -r1.26 xlogutils.h
*** src/include/access/xlogutils.h 11 Aug 2008 11:05:11 -0000 1.26
--- src/include/access/xlogutils.h 27 Oct 2008 18:32:03 -0000
***************
*** 25,32 ****
BlockNumber nblocks);
extern Buffer XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init);
extern Buffer XLogReadBufferWithFork(RelFileNode rnode, ForkNumber forknum,
! BlockNumber blkno, bool init);
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
--- 25,33 ----
BlockNumber nblocks);
extern Buffer XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init);
+ extern Buffer XLogReadBufferForCleanup(RelFileNode rnode, BlockNumber blkno, bool init);
extern Buffer XLogReadBufferWithFork(RelFileNode rnode, ForkNumber forknum,
! BlockNumber blkno, bool init, int mode);
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
Index: src/include/catalog/pg_control.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/catalog/pg_control.h,v
retrieving revision 1.42
diff -c -r1.42 pg_control.h
*** src/include/catalog/pg_control.h 23 Sep 2008 09:20:39 -0000 1.42
--- src/include/catalog/pg_control.h 27 Oct 2008 18:32:03 -0000
***************
*** 21,27 ****
/* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION 843
/*
* Body of CheckPoint XLOG records. This is declared here because we keep
--- 21,28 ----
/* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION 847
! /* XXX change me */
/*
* Body of CheckPoint XLOG records. This is declared here because we keep
***************
*** 46,52 ****
#define XLOG_NOOP 0x20
#define XLOG_NEXTOID 0x30
#define XLOG_SWITCH 0x40
!
/* System status indicator */
typedef enum DBState
--- 47,53 ----
#define XLOG_NOOP 0x20
#define XLOG_NEXTOID 0x30
#define XLOG_SWITCH 0x40
! #define XLOG_RECOVERY_END 0x50
/* System status indicator */
typedef enum DBState
***************
*** 102,107 ****
--- 103,109 ----
CheckPoint checkPointCopy; /* copy of last check point record */
XLogRecPtr minRecoveryPoint; /* must replay xlog to here */
+ XLogRecPtr minSafeStartPoint; /* safe point after recovery crashes */
/*
* This data is used to check for hardware-architecture compatibility of
Index: src/include/catalog/pg_proc.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/catalog/pg_proc.h,v
retrieving revision 1.520
diff -c -r1.520 pg_proc.h
*** src/include/catalog/pg_proc.h 14 Oct 2008 17:12:33 -0000 1.520
--- src/include/catalog/pg_proc.h 27 Oct 2008 18:32:03 -0000
***************
*** 3199,3204 ****
--- 3199,3226 ----
DATA(insert OID = 2851 ( pg_xlogfile_name PGNSP PGUID 12 1 0 0 f f t f i 1 25 "25" _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ ));
DESCR("xlog filename, given an xlog location");
+ DATA(insert OID = 3101 ( pg_recovery_continue PGNSP PGUID 12 1 0 0 f f t f v 0 2278 "" _null_ _null_ _null_ pg_recovery_continue _null_ _null_ _null_ ));
+ DESCR("if recovery is paused, continue with recovery");
+ DATA(insert OID = 3102 ( pg_recovery_pause PGNSP PGUID 12 1 0 0 f f t f v 0 2278 "" _null_ _null_ _null_ pg_recovery_pause _null_ _null_ _null_ ));
+ DESCR("pause recovery until recovery target reset");
+ DATA(insert OID = 3103 ( pg_recovery_pause_cleanup PGNSP PGUID 12 1 0 0 f f t f v 0 2278 "" _null_ _null_ _null_ pg_recovery_pause_cleanup _null_ _null_ _null_ ));
+ DESCR("continue recovery until cleanup record arrives, then pause recovery");
+ DATA(insert OID = 3104 ( pg_recovery_pause_xid PGNSP PGUID 12 1 0 0 f f t f v 1 2278 "23" _null_ _null_ _null_ pg_recovery_pause_xid _null_ _null_ _null_ ));
+ DESCR("continue recovery until specified xid completes, if ever seen, then pause recovery");
+ DATA(insert OID = 3105 ( pg_recovery_pause_time PGNSP PGUID 12 1 0 0 f f t f v 1 2278 "1184" _null_ _null_ _null_ pg_recovery_pause_time _null_ _null_ _null_ ));
+ DESCR("continue recovery until a transaction with specified timestamp completes, if ever seen, then pause recovery");
+ DATA(insert OID = 3106 ( pg_recovery_advance PGNSP PGUID 12 1 0 0 f f t f v 1 2278 "23" _null_ _null_ _null_ pg_recovery_advance _null_ _null_ _null_ ));
+ DESCR("continue recovery exactly specified number of records, then pause recovery");
+ DATA(insert OID = 3107 ( pg_recovery_stop PGNSP PGUID 12 1 0 0 f f t f v 0 2278 "" _null_ _null_ _null_ pg_recovery_stop _null_ _null_ _null_ ));
+ DESCR("stop recovery immediately");
+
+ DATA(insert OID = 3110 ( pg_is_in_recovery PGNSP PGUID 12 1 0 0 f f t f v 0 16 "" _null_ _null_ _null_ pg_is_in_recovery _null_ _null_ _null_ ));
+ DESCR("true if server is in recovery");
+ DATA(insert OID = 3111 ( pg_last_completed_xact_timestamp PGNSP PGUID 12 1 0 0 f f t f v 0 1184 "" _null_ _null_ _null_ pg_last_completed_xact_timestamp _null_ _null_ _null_ ));
+ DESCR("timestamp of last commit or abort record that arrived during recovery, if any");
+ DATA(insert OID = 3112 ( pg_last_completed_xid PGNSP PGUID 12 1 0 0 f f t f v 0 28 "" _null_ _null_ _null_ pg_last_completed_xid _null_ _null_ _null_ ));
+ DESCR("xid of last commit or abort record that arrived during recovery, if any");
+
DATA(insert OID = 2621 ( pg_reload_conf PGNSP PGUID 12 1 0 0 f f t f v 0 16 "" _null_ _null_ _null_ pg_reload_conf _null_ _null_ _null_ ));
DESCR("reload configuration files");
DATA(insert OID = 2622 ( pg_rotate_logfile PGNSP PGUID 12 1 0 0 f f t f v 0 16 "" _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ ));
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.12
diff -c -r1.12 bgwriter.h
*** src/include/postmaster/bgwriter.h 11 Aug 2008 11:05:11 -0000 1.12
--- src/include/postmaster/bgwriter.h 27 Oct 2008 18:32:03 -0000
***************
*** 12,17 ****
--- 12,18 ----
#ifndef _BGWRITER_H
#define _BGWRITER_H
+ #include "catalog/pg_control.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
***************
*** 25,30 ****
--- 26,36 ----
extern void BackgroundWriterMain(void);
extern void RequestCheckpoint(int flags);
+ extern void RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter);
+ extern void RequestRestartPointCompletion(void);
+ extern XLogRecPtr GetRedoLocationForArchiveCheckpoint(void);
+ extern bool SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo);
+
extern void CheckpointWriteDelay(int flags, double progress);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.115
diff -c -r1.115 bufmgr.h
*** src/include/storage/bufmgr.h 11 Aug 2008 11:05:11 -0000 1.115
--- src/include/storage/bufmgr.h 27 Oct 2008 18:32:03 -0000
***************
*** 58,63 ****
--- 58,66 ----
#define BUFFER_LOCK_SHARE 1
#define BUFFER_LOCK_EXCLUSIVE 2
+ /* Not used by LockBuffer, but is used by XLogReadBuffer... */
+ #define BUFFER_LOCK_CLEANUP 3
+
/*
* These routines are beaten on quite heavily, hence the macroization.
*/
***************
*** 190,195 ****
--- 193,202 ----
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+ extern void StartCleanupDelayStats(void);
+ extern void EndCleanupDelayStats(void);
+ extern void ReportCleanupDelayStats(void);
+
extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
Index: src/include/storage/pmsignal.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/pmsignal.h,v
retrieving revision 1.20
diff -c -r1.20 pmsignal.h
*** src/include/storage/pmsignal.h 19 Jun 2008 21:32:56 -0000 1.20
--- src/include/storage/pmsignal.h 27 Oct 2008 18:32:03 -0000
***************
*** 22,27 ****
--- 22,28 ----
*/
typedef enum
{
+ PMSIGNAL_RECOVERY_START, /* move to PM_RECOVERY state */
PMSIGNAL_PASSWORD_CHANGE, /* pg_auth file has changed */
PMSIGNAL_WAKEN_ARCHIVER, /* send a NOTIFY signal to xlog archiver */
PMSIGNAL_ROTATE_LOGFILE, /* send SIGUSR1 to syslogger to rotate logfile */
Index: src/include/storage/proc.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/proc.h,v
retrieving revision 1.106
diff -c -r1.106 proc.h
*** src/include/storage/proc.h 15 Apr 2008 20:28:47 -0000 1.106
--- src/include/storage/proc.h 27 Oct 2008 18:32:03 -0000
***************
*** 14,19 ****
--- 14,20 ----
#ifndef _PROC_H_
#define _PROC_H_
+ #include "access/xlog.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
***************
*** 92,97 ****
--- 93,100 ----
bool inCommit; /* true if within commit critical section */
uint8 vacuumFlags; /* vacuum-related flags, see above */
+ XLogRecPtr lsn; /* Last LSN which maintained state of Recovery Proc */
+ int slotId; /* slot number in procarray, never changes once set, OK to reuse */
/* Info about LWLock the process is currently waiting for, if any. */
bool lwWaiting; /* true if waiting for an LW lock */
***************
*** 157,162 ****
--- 160,166 ----
extern Size ProcGlobalShmemSize(void);
extern void InitProcGlobal(void);
extern void InitProcess(void);
+ extern PGPROC *InitRecoveryProcess(void);
extern void InitProcessPhase2(void);
extern void InitAuxiliaryProcess(void);
extern bool HaveNFreeProcs(int n);
Index: src/include/storage/procarray.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/procarray.h,v
retrieving revision 1.23
diff -c -r1.23 procarray.h
*** src/include/storage/procarray.h 4 Aug 2008 18:03:46 -0000 1.23
--- src/include/storage/procarray.h 27 Oct 2008 18:32:03 -0000
***************
*** 14,19 ****
--- 14,20 ----
#ifndef PROCARRAY_H
#define PROCARRAY_H
+ #include "access/xact.h"
#include "storage/lock.h"
#include "utils/snapshot.h"
***************
*** 23,32 ****
--- 24,41 ----
extern void ProcArrayAdd(PGPROC *proc);
extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
+ extern void ProcArrayStartRecoveryTransaction(PGPROC *proc, TransactionId xid,
+ XLogRecPtr lsn, bool isSubXact);
extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
extern void ProcArrayClearTransaction(PGPROC *proc);
+ extern void ProcArrayClearRecoveryTransactions(void);
+ extern bool XidInRecoveryProcs(TransactionId xid);
+ extern void ProcArrayDisplay(int trace_level);
+ extern void ProcArrayUpdateRecoveryTransactions(XLogRecPtr lsn,
+ xl_xact_running_xacts *xlrec);
extern Snapshot GetSnapshotData(Snapshot snapshot);
+ extern RunningTransactions GetRunningTransactionData(void);
extern bool TransactionIdIsInProgress(TransactionId xid);
extern bool TransactionIdIsActive(TransactionId xid);
***************
*** 37,42 ****
--- 46,52 ----
extern PGPROC *BackendPidGetProc(int pid);
extern int BackendXidGetPid(TransactionId xid);
+ extern PGPROC *SlotIdGetRecoveryProc(int slotid);
extern bool IsBackendPid(int pid);
extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
***************
*** 51,54 ****
--- 61,72 ----
int nxids, const TransactionId *xids,
TransactionId latestXid);
+ extern void UnobservedTransactionsAddXids(TransactionId firstXid,
+ TransactionId lastXid);
+ extern void UnobservedTransactionsRemoveXid(TransactionId xid);
+ extern void UnobservedTransactionsPruneXids(TransactionId limitXid);
+ extern void UnobservedTransactionsClearXids(void);
+ extern void UnobservedTransactionsDisplay(int trace_level);
+ extern bool XidInUnobservedTransactions(TransactionId xid);
+
#endif /* PROCARRAY_H */
Index: src/include/utils/flatfiles.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/utils/flatfiles.h,v
retrieving revision 1.6
diff -c -r1.6 flatfiles.h
*** src/include/utils/flatfiles.h 15 Oct 2005 02:49:46 -0000 1.6
--- src/include/utils/flatfiles.h 27 Oct 2008 18:32:03 -0000
***************
*** 19,25 ****
extern char *database_getflatfilename(void);
extern char *auth_getflatfilename(void);
! extern void BuildFlatFiles(bool database_only);
extern void AtPrepare_UpdateFlatFiles(void);
extern void AtEOXact_UpdateFlatFiles(bool isCommit);
--- 19,25 ----
extern char *database_getflatfilename(void);
extern char *auth_getflatfilename(void);
! extern void BuildFlatFiles(bool database_only, bool acquire_locks, bool release_locks);
extern void AtPrepare_UpdateFlatFiles(void);
extern void AtEOXact_UpdateFlatFiles(bool isCommit);
***************
*** 27,32 ****
--- 27,35 ----
SubTransactionId mySubid,
SubTransactionId parentSubid);
+ extern bool AtEOXact_Database_FlatFile_Update_Needed(void);
+ extern bool AtEOXact_Auth_FlatFile_Update_Needed(void);
+
extern Datum flatfile_update_trigger(PG_FUNCTION_ARGS);
extern void flatfile_twophase_postcommit(TransactionId xid, uint16 info,
Index: src/include/utils/inval.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/utils/inval.h,v
retrieving revision 1.44
diff -c -r1.44 inval.h
*** src/include/utils/inval.h 9 Sep 2008 18:58:09 -0000 1.44
--- src/include/utils/inval.h 27 Oct 2008 18:32:03 -0000
***************
*** 15,20 ****
--- 15,21 ----
#define INVAL_H
#include "access/htup.h"
+ #include "access/xlog.h"
#include "utils/relcache.h"
***************
*** 60,63 ****
--- 61,90 ----
extern void inval_twophase_postcommit(TransactionId xid, uint16 info,
void *recdata, uint32 len);
+ /* Relation recovery handlers */
+ extern void relation_redo(XLogRecPtr lsn, XLogRecord *record);
+ extern void relation_desc(StringInfo buf, uint8 xl_info, char *rec);
+
+ /*
+ * relation-related XLOG entries
+ *
+ */
+
+ /*
+ * XLOG allows some information to be stored in the high 4 bits of the
+ * log record's xl_info field
+ */
+ #define XLOG_RELATION_INVAL 0x00
+ #define XLOG_RELATION_LOCK 0x10
+
+ typedef struct xl_rel_inval
+ {
+ bool dummy;
+ } xl_rel_inval;
+
+ typedef struct xl_rel_lock
+ {
+ bool dummy;
+ } xl_rel_lock;
+
#endif /* INVAL_H */
Index: src/include/utils/snapshot.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/utils/snapshot.h,v
retrieving revision 1.3
diff -c -r1.3 snapshot.h
*** src/include/utils/snapshot.h 12 May 2008 20:02:02 -0000 1.3
--- src/include/utils/snapshot.h 27 Oct 2008 18:32:03 -0000
***************
*** 49,55 ****
uint32 xcnt; /* # of xact ids in xip[] */
TransactionId *xip; /* array of xact IDs in progress */
/* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */
! int32 subxcnt; /* # of xact ids in subxip[], -1 if overflow */
TransactionId *subxip; /* array of subxact IDs in progress */
/*
--- 49,56 ----
uint32 xcnt; /* # of xact ids in xip[] */
TransactionId *xip; /* array of xact IDs in progress */
/* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */
! uint32 subxcnt; /* # of xact ids in subxip[] */
! bool suboverflowed; /* true means at least one subxid cache overflowed */
TransactionId *subxip; /* array of subxact IDs in progress */
/*
***************
*** 63,68 ****
--- 64,133 ----
} SnapshotData;
/*
+ * Declarations for GetRunningTransactionData(). Similar to Snapshots, but
+ * not quite. This has nothing at all to do with visibility on this server,
+ * so this is completely separate from snapmgr.c and snapmgr.h.
+ */
+ typedef struct RunningXact
+ {
+ /* Items matching PGPROC entries */
+ TransactionId xid; /* xact ID in progress */
+ int pid; /* backend's process id, or 0 */
+ int slotId; /* backend's slotId */
+ Oid databaseId; /* OID of database this backend is using */
+ Oid roleId; /* OID of role using this backend */
+ uint8 vacuumFlags; /* vacuum-related flags, see proc.h */
+
+ /* Items matching XidCache */
+ bool overflowed;
+ int nsubxids; /* # of subxact ids for this xact only */
+
+ /* Additional info */
+ uint32 subx_offset; /* offset in subxip[] of this xact's first
+ * subxid; zero if nsubxids == 0
+ */
+ } RunningXact;
+
+ typedef struct RunningXactsData
+ {
+ uint32 xcnt; /* # of xact ids in xrun[] */
+ uint32 subxcnt; /* total # of xact ids in subxip[] */
+ TransactionId latestRunningXid; /* Initial setting of LatestObservedXid */
+ TransactionId latestCompletedXid;
+
+ RunningXact *xrun; /* array of RunningXact structs */
+
+ /*
+ * subxip is held as a single contiguous array, so no space is wasted,
+ * and this helps the data fit into one XLogRecord. We keep track of
+ * which subxids go with each top-level xid via the start offset
+ * (subx_offset) held in each RunningXact struct.
+ */
+ TransactionId *subxip; /* array of subxact IDs in progress */
+
+ } RunningXactsData;
+
+ typedef RunningXactsData *RunningTransactions;
+
+ /*
+ * When we write running xact data to WAL, we use this structure.
+ */
+ typedef struct xl_xact_running_xacts
+ {
+ int xcnt; /* # of xact ids in xrun[] */
+ int subxcnt; /* # of xact ids in subxip[] */
+ TransactionId latestRunningXid; /* Initial setting of LatestObservedXid */
+ TransactionId latestCompletedXid;
+
+ /* Array of RunningXact(s) */
+ RunningXact xrun[1]; /* VARIABLE LENGTH ARRAY */
+
+ /* ARRAY OF RUNNING SUBTRANSACTION XIDs FOLLOWS */
+ } xl_xact_running_xacts;
+
+ #define MinSizeOfXactRunningXacts offsetof(xl_xact_running_xacts, xrun)
+
+ /*
* Result codes for HeapTupleSatisfiesUpdate. This should really be in
* tqual.h, but we want to avoid including that file elsewhere.
*/
Index: src/test/regress/parallel_schedule
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/test/regress/parallel_schedule,v
retrieving revision 1.49
diff -c -r1.49 parallel_schedule
*** src/test/regress/parallel_schedule 4 Oct 2008 21:56:55 -0000 1.49
--- src/test/regress/parallel_schedule 27 Oct 2008 18:32:03 -0000
***************
*** 67,75 ****
ignore: random
# ----------
! # Another group of parallel tests
# ----------
! test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update namespace prepared_xacts delete
test: privileges
test: misc
--- 67,75 ----
ignore: random
# ----------
! # Another group of parallel tests (prepared_xacts test removed)
# ----------
! test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update namespace delete
test: privileges
test: misc