Обсуждение: Hot Standby (v9d)
Latest version of Hot Standby includes new deferred cancellation for conflict resolution and further refactoring. http://wiki.postgresql.org/wiki/Hot_Standby#Usage [Please note that max_standby_delay parameter has been set to default to zero (0), to allow investigation of usability. i.e. queries will be cancelled fairly quickly if they conflict. Please set this in recovery.conf to any value you consider reasonable and report findings.] GENERAL PLANS * Bug fix v9 over next few days * Put corrected version into GIT * Produce outstanding items as patch-on-patch via GIT * Work with reviewers to nibble away until fully committed OUTSTANDING ITEMS (as of Version 9d) Reported problems (Identified by) (Target release) * None (as yet!) - please post all bugs to list All v9 bugs have been fixed. This version has passed initial bash and function tests and we will continue testing this version tomorrow. Main work/required additional items (Proposed by) (Target release) * Complete Prepared Transaction support (Simon) Performance tuning opportunities (Proposed by) (Target release) * tuning of RecordKnownAssignedTransactionIds() using hash table (Heikki) Usability enhancements * don't cancel all currently connected users every time we drop a tablespace that has temp files being used within that tablespace - should be able to parse the filenames to get pids to cancel * don't kill all currently connected users every time we drop a user - should be able to match list of current users against connected roles * more work possible on btree deletes to reduce impact - should be able to work out a reasonable latestRemovedXid Testing of any kind welcome, the heavier the better. Save the database and logs if it crashes, please. Performance data especially valuable. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Вложения
Simon Riggs wrote: > * Put corrected version into GIT > * Produce outstanding items as patch-on-patch via GIT I've applied the hot standby patch and recovery infra v9 patch to branches in my git repository, and pushed those to: git://git.postgresql.org/git/~hlinnaka/pgsql.git You can clone that to get started. I've set those branches up so that the hot standby branch is branched off from the recovery infra branch. I'd like to keep the two separate, as the recovery infra patch is useful on it's own, and can be reviewed separately. As a teaser, I made a couple of minor changes after importing your patches. For the sake of the archives, I've also included those changes as patches here. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com >From 6d583356063c9d3c8d0e69233a40065bc5d7bde1 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas <heikki@enterprisedb.com> Date: Fri, 23 Jan 2009 14:37:05 +0200 Subject: [PATCH] Remove padding in XLogCtl; might be a good idea, but let's keep it simple for now. Also remove the XLogCtl->mode_lck spinlock, which is pretty pointless for a single boolean that's only written by one process. --- src/backend/access/transam/xlog.c | 24 +----------------------- 1 files changed, 1 insertions(+), 23 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index fcf5657..42c4552 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -340,18 +340,11 @@ typedef struct XLogCtlWrite /* * Total shared-memory state for XLOG. - * - * This small structure is accessed by many backends, so we take care to - * pad out the parts of the structure so they can be accessed by separate - * CPUs without causing false sharing cache flushes. Padding is generous - * to allow for a wide variety of CPU architectures. */ -#define XLOGCTL_BUFFER_SPACING 128 typedef struct XLogCtlData { /* Protected by WALInsertLock: */ XLogCtlInsert Insert; - char InsertPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlInsert)]; /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; @@ -359,16 +352,9 @@ typedef struct XLogCtlData uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */ TransactionId ckptXid; XLogRecPtr asyncCommitLSN; /* LSN of newest async commit */ - /* add data structure padding for above info_lck declarations */ - char InfoPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogwrtRqst) - - sizeof(XLogwrtResult) - - sizeof(uint32) - - sizeof(TransactionId) - - sizeof(XLogRecPtr)]; /* Protected by WALWriteLock: */ XLogCtlWrite Write; - char WritePadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlWrite)]; /* * These values do not change after startup, although the pointed-to pages @@ -390,11 +376,8 @@ typedef struct XLogCtlData * always during Recovery Processing Mode. This allows us to identify * code executed *during* Recovery Processing Mode but not necessarily * by Startup process itself. - * - * Protected by mode_lck */ bool SharedRecoveryProcessingMode; - slock_t mode_lck; /* * recovery target control information @@ -410,8 +393,6 @@ typedef struct XLogCtlData TransactionId recoveryLastXid; XLogRecPtr recoveryLastRecPtr; - char InfoLockPadding[XLOGCTL_BUFFER_SPACING]; - slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -4399,7 +4380,6 @@ XLOGShmemInit(void) XLogCtl->XLogCacheBlck = XLOGbuffers - 1; XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages); SpinLockInit(&XLogCtl->info_lck); - SpinLockInit(&XLogCtl->mode_lck); /* * If we are not in bootstrap mode, pg_control should already exist. Read @@ -6183,9 +6163,7 @@ IsRecoveryProcessingMode(void) if (xlogctl == NULL) return false; - SpinLockAcquire(&xlogctl->mode_lck); - LocalRecoveryProcessingMode = XLogCtl->SharedRecoveryProcessingMode; - SpinLockRelease(&xlogctl->mode_lck); + LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode; } knownProcessingMode = true; -- 1.5.6.5 >From 4061ac8f84cc699bf0f417689f853791089ed472 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas <heikki@enterprisedb.com> Date: Fri, 23 Jan 2009 15:55:33 +0200 Subject: [PATCH] Remove knownProcessingMode variable. --- src/backend/access/transam/xlog.c | 19 ++++++++----------- 1 files changed, 8 insertions(+), 11 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7e480e2..e64fb48 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -126,9 +126,11 @@ bool InRecovery = false; /* Are we recovering using offline XLOG archives? */ static bool InArchiveRecovery = false; -/* Local copy of shared RecoveryProcessingMode state */ +/* + * Local copy of shared RecoveryProcessingMode state. True actually + * means "not known, need to check the shared state" + */ static bool LocalRecoveryProcessingMode = true; -static bool knownProcessingMode = false; /* Was the last xlog file restored from archive, or local? */ static bool restoredFromArchive = false; @@ -5608,21 +5610,16 @@ StartupXLOG(void) bool IsRecoveryProcessingMode(void) { - if (knownProcessingMode && !LocalRecoveryProcessingMode) + if (!LocalRecoveryProcessingMode) return false; - + else { /* use volatile pointer to prevent code rearrangement */ volatile XLogCtlData *xlogctl = XLogCtl; - SpinLockAcquire(&xlogctl->mode_lck); - LocalRecoveryProcessingMode = XLogCtl->SharedRecoveryProcessingMode; - SpinLockRelease(&xlogctl->mode_lck); + LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode; + return LocalRecoveryProcessingMode; } - - knownProcessingMode = true; - - return LocalRecoveryProcessingMode; } /* -- 1.5.6.5
On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > * Put corrected version into GIT > > * Produce outstanding items as patch-on-patch via GIT > > I've applied the hot standby patch and recovery infra v9 patch to > branches in my git repository, and pushed those to: > > git://git.postgresql.org/git/~hlinnaka/pgsql.git > > You can clone that to get started. > > I've set those branches up so that the hot standby branch is branched > off from the recovery infra branch. I'd like to keep the two separate, > as the recovery infra patch is useful on it's own, and can be reviewed > separately. > > As a teaser, I made a couple of minor changes after importing your > patches. For the sake of the archives, I've also included those changes > as patches here. OK, I'll look at those. I've fixed 6 bugs today (!), so I'd rather go for the new version coming out in an hour. That's too many to unpick unfortunately. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote: > I made a couple of minor changes after importing your > patches. I've applied them both to v9g, attached here. If you wouldn't mind redoing the initial step, I will promise not to do anything else to the code, except via patch on GIT. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Вложения
Simon Riggs wrote: > On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote: > >> I made a couple of minor changes after importing your >> patches. > > I've applied them both to v9g, attached here. > > If you wouldn't mind redoing the initial step, I will promise not to do > anything else to the code, except via patch on GIT. No problem. I don't need to do it from scratch, I'll just apply the changes from that patch as an incremental commit. Done, you can see it in my git repo now too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> @@ -1601,6 +1602,24 @@ BufferProcessRecoveryConflictsIfAny(volatile BufferDesc *bufHdr)
>         {
>                 XLogRecPtr      bufLSN = BufferGetLSN(bufHdr);
>  
> +               /*
> +                * If the buffer is recent we may need to cancel ourselves
> +                * rather than risk returning a wrong answer. This test is
> +                * too conservative, but it is correct.
> +                *
>>> +                * We only need to cancel the current subtransaction.
> +                * Once we've handled the error then other subtransactions can
> +                * continue processing. Note that we do *not* reset the
> +                * BufferRecoveryConflictLSN at subcommit/abort, but we do
> +                * reset it if we release our last remaining sbapshot.
> +                * see SnapshotResetXmin()
> +                *
Is it really enough to cancel just the current subtransaction? What if 
it's a serializable transaction?
--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
			
		
On Fri, 2009-01-23 at 18:22 +0200, Heikki Linnakangas wrote:
> > @@ -1601,6 +1602,24 @@ BufferProcessRecoveryConflictsIfAny(volatile BufferDesc *bufHdr)
> >         {
> >                 XLogRecPtr      bufLSN = BufferGetLSN(bufHdr);
> >  
> > +               /*
> > +                * If the buffer is recent we may need to cancel ourselves
> > +                * rather than risk returning a wrong answer. This test is
> > +                * too conservative, but it is correct.
> > +                *
> >>> +                * We only need to cancel the current subtransaction.
> > +                * Once we've handled the error then other subtransactions can
> > +                * continue processing. Note that we do *not* reset the
> > +                * BufferRecoveryConflictLSN at subcommit/abort, but we do
> > +                * reset it if we release our last remaining sbapshot.
> > +                * see SnapshotResetXmin()
> > +                *
> 
> Is it really enough to cancel just the current subtransaction? What if 
> it's a serializable transaction?
I did originally think that when I first looked at the problem. I'm
sorry if I say that a lot. 
If you have a serializable transaction with subtransactions that suffers
a serializability error it only cancels the current subtransaction. That
means it's snapshot is still valid and can be used again. By analogy, as
long as a transaction does not see any data that is inconsistent with
its snapshot it seems OK for it to continue. So I think it is correct.
(Bizarrely, this might mean that if we did this programatically in a
loop we might keep the system busy for some time while it continually
re-reads data and fails. But that's another story).
You remind me that we can now do what Kevin has requested and throw a
errcode(ERRCODE_T_R_SERIALIZATION_FAILURE) at this point, which I agree
is the most easily understood way of describing this error.
(I was sorely tempted to make it "snapshot too old", as a joke).
-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support
			
		Simon Riggs wrote: > If you have a serializable transaction with subtransactions that suffers > a serializability error it only cancels the current subtransaction. That > means it's snapshot is still valid and can be used again. By analogy, as > long as a transaction does not see any data that is inconsistent with > its snapshot it seems OK for it to continue. So I think it is correct. Yeah, you're right. How bizarre. > (I was sorely tempted to make it "snapshot too old", as a joke). Yeah, that is a very describing message, actually. If there wasn't any precedence to that, I believe we might have came up with exactly that message ourselves. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Simon Riggs wrote: >> If you have a serializable transaction with subtransactions that suffers >> a serializability error it only cancels the current subtransaction. That >> means it's snapshot is still valid and can be used again. By analogy, as >> long as a transaction does not see any data that is inconsistent with >> its snapshot it seems OK for it to continue. So I think it is correct. > > Yeah, you're right. How bizarre. It was argued this way to me way back when subtransactions were written. Originally I had written it in such a way as to abort the whole transaction, on the rationale that if you're blindly restarting the subtransaction after a serialization error, it would result in a (maybe infinite) loop. I think the reason it only aborts the subxact is that that's what all other errors do, so why would this one act differently. Hmm, now that I think about it, I think it was deadlock errors not serialization errors ... -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>>> >> > If you have a serializable transaction with subtransactions that > suffers > a serializability error it only cancels the current subtransaction. This is a complete tangent to your work, but I wonder if this is really right. I mean it's not as if we could have srrialized the transaction as a whole with respect to whatever other transaction we failed.
On Thu, 2009-01-22 at 22:35 +0000, Simon Riggs wrote: > * Bug fix v9 over next few days version 9g - please use this for testing now -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Вложения
On Fri, 2009-01-23 at 20:13 +0000, Greg Stark wrote: > > If you have a serializable transaction with subtransactions that > > suffers > > a serializability error it only cancels the current subtransaction. > > This is a complete tangent to your work, but I wonder if this is > really right. I mean it's not as if we could have srrialized the > transaction as a whole with respect to whatever other transaction we > failed. Isn't this back to the discussion about PostgreSQL serializability versus true serializability? I agree that's not true serializability. Regards,Jeff Davis
Simon Riggs wrote: > On Thu, 2009-01-22 at 22:35 +0000, Simon Riggs wrote: > > >> * Bug fix v9 over next few days >> > > version 9g - please use this for testing now > > I'm doing some test runs with this now. I notice an old flatfiles related bug has reappeared: master: =# create database test; slave: =# select datname from pg_database where datname like 'test';datname ---------test (1 row) postgres=# \c test FATAL: database "test" does not exist Previous connection kept
On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote: > > version 9g - please use this for testing now > I'm doing some test runs with this now. I notice an old flatfiles > related bug has reappeared: I'm seeing an off-by-one error on xmax, in some cases. That then causes the flat file update to not pick up correct info, even though it executed in other ways as intended. If you run two create databases and then test only the first, it appears to have worked as intended. These bugs are result of recent refactoring and it will take a few days to shake some of them out. We've had more than 20 already so we're beating them back, but we're not done yet. Thanks for the report, will publish this and other fixes on Monday. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Sat, 2009-01-24 at 11:20 +0000, Simon Riggs wrote: > On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote: > > > > version 9g - please use this for testing now > > > I'm doing some test runs with this now. I notice an old flatfiles > > related bug has reappeared: > > I'm seeing an off-by-one error on xmax, in some cases. That then causes > the flat file update to not pick up correct info, even though it > executed in other ways as intended. If you run two create databases and > then test only the first, it appears to have worked as intended. > > These bugs are result of recent refactoring and it will take a few days > to shake some of them out. We've had more than 20 already so we're > beating them back, but we're not done yet. I was at a loss to explain how this could have slipped through our tests. It appears that the error was corrected following each checkpoint as a result of ProcArrayUpdateRunningXacts(). Our tests were performed after a short delay, which typically would be greater than the deliberately short setting of checkpoint_timeout/archive_timeout and so by the time we looked the error was gone and masked the problem. We're setting checkpoint_timeout to 30 mins now to avoid the delay... -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs wrote: > On Sat, 2009-01-24 at 11:20 +0000, Simon Riggs wrote: > >> On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote: >> >> >>>> version 9g - please use this for testing now >>>> >>> I'm doing some test runs with this now. I notice an old flatfiles >>> related bug has reappeared: >>> >> I'm seeing an off-by-one error on xmax, in some cases. That then causes >> the flat file update to not pick up correct info, even though it >> executed in other ways as intended. If you run two create databases and >> then test only the first, it appears to have worked as intended. >> >> These bugs are result of recent refactoring and it will take a few days >> to shake some of them out. We've had more than 20 already so we're >> beating them back, but we're not done yet. >> > > I was at a loss to explain how this could have slipped through our > tests. It appears that the error was corrected following each checkpoint > as a result of ProcArrayUpdateRunningXacts(). Our tests were performed > after a short delay, which typically would be greater than the > deliberately short setting of checkpoint_timeout/archive_timeout and so > by the time we looked the error was gone and masked the problem. We're > setting checkpoint_timeout to 30 mins now to avoid the delay... > > That makes sense - I had archive_timeout set at 5 minutes, and I would have checked before 5 minutes were up. Cheers Mark
On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote: > I'm doing some test runs with this now. I notice an old flatfiles > related bug has reappeared: Bug fix patch against git repo. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Вложения
I skimmed through the Hot Standby patch for a preliminary review. I noted the
following things, some minor tweaks, some just questions. None of the things I
noted are big issues unless some of the questions uncover issues.
1) This code is obviously a cut-pasto:
+       else if (strcmp(tok1, "max_standby_delay") == 0)
+       {
+           errno = 0;
+           maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0);
+           if (errno == EINVAL || errno == ERANGE)
+               ereport(FATAL,
+                (errmsg("max_standby_delay is not a valid number: \"%s\"",
+                        tok2)));
Have you ever tested the behaviour with max_standby_delay = -1 ?
Also, the default max_standby_delay is currently 0 -- ie, kill queries
immediately upon detecting any conflicts at all -- which I don't really think
anyone would be happy with. I still *strongly* feel the default has to be the
non-destructive conservative -1.
2) This hard coded constant of 500ms seems arbitrary to me. If your use case
is a heavily-used reporting engine you might get this message all the time. I
think this either has to be configurable (warn_standby_delay?) or based on
max_standby_delay or some other parameter somehow.
+       /*
+        * log that we have been waiting for a while now...
+        */
+       if (!logged && standbyWait_ms > 500)
3) These two blocks of code seem unsatisfactory:
!           /*
!            * Keep track of the block number of the lastBlockVacuumed, so
!            * we can scan those blocks as well during WAL replay. This then
!            * provides concurrency protection and allows btrees to be used
!            * while in recovery.
!            */
!           if (lastBlockVacuumed > vstate->lastBlockVacuumed)
!               vstate->lastBlockVacuumed = lastBlockVacuumed;
! 
+           /*
+            * XXXHS we don't actually need to read the block, we
+            * just need to confirm it is unpinned. If we had a special call
+            * into the buffer manager we could optimise this so that
+            * if the block is not in shared_buffers we confirm it as unpinned.
+            */
+           buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
+           if (BufferIsValid(buffer))
+           {
+               LockBufferForCleanup(buffer);           
+               UnlockReleaseBuffer(buffer);
+           }
Aside from the XXX comment (I thought we actually had such a call now, but if
not shouldn't we just add one instead of carping?) I'm not convinced this
handles all the cases that can arise. Notable, what happens if a previous
vacuum died in the middle of the scan?
I think we have a vacuum id which we use already with btrees to the same
issue. It seems to me this be more robust if you stamped the xlog record with
that id number and ensured it matched the id of the lastBloockVacuumed? Then
you could start from 0 if the id changes.
4) Why is this necessary?
+   if (IsRecoveryProcessingMode() && 
+       locktag->locktag_type == LOCKTAG_OBJECT &&
+       lockmode > AccessShareLock)
+       ereport(ERROR,
+               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("cannot acquire lockmode %s on database objects while recovery is in progress", 
+                                   lockMethodTable->lockModeNames[lockmode]),
+                errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
+ 
Obviously we can't lock records (SELECT FOR UPDATE) since that requires write
access. But shouldn't we be able to do manual LOCK TABLE calls?
I hope this isn't the only interlock we have against trying to do write i/o or
DDL against tables...?
5) The code for killing off conflicting transactions looks odd to me but I
haven't really traced through it to see what it's doing. It seems to kill off
just one process? What if there are several holding the lock in question?
Also, it doesn't seem to take into account how long the various transactions
have held the lock?
Is there a risk of, for example, if you have a long report running against a
table and lots of OLTP requests against the same table it seems the conflict
resolution code would fire randomly into the crowd and hit mostly innocent
OLTP transactions until eventually it found the culprit report?
Also, it kills of backends using SIGINT which I *think* Tom went through and
made safe earlier this release, right?
In any case if the backend doesn't die off promptly we wait forever with no
further warnings or log messages. I would think we should at least print some
sort of message here, even if it's a "can't happen" elog. The doubling thing
is probably unnecessary too in this case.
+               if (!XLogRecPtrIsValid(conflict_LSN))
+               {
+                   /* wait awhile for it to die */
+                   pg_usleep(wontDieWait * 5000L);
+                   wontDieWait *= 2;
+               }
+           }
6) I still don't understand why you need unobserved_xids. We don't need this
in normal running, an xid we don't know for certain is committed is exactly
the same as a transaction we know is currently running or aborted. So why do
you need it during HS?
The comment says:
+  * This is very important because if we leave 
+  * those xids out of the snapshot then they will appear to be already complete. 
+  * Later, when they have actually completed this could lead to confusion as to 
+  * whether those xids are visible or not, blowing a huge hole in MVCC. 
+  * We need 'em.
But that doesn't sound rational to me. I'm not sure what "confusion" this
would cause. If they later actually complete then any existing snapshots would
still not see them. And any later snapshots wouldn't be confused by the
earlier conclusions.
--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
			
		
On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
> I skimmed through the Hot Standby patch for a preliminary review. I noted the
> following things, some minor tweaks, some just questions. None of the things I
> noted are big issues unless some of the questions uncover issues.
Thanks for your attention.
> 1) This code is obviously a cut-pasto:
> 
> +       else if (strcmp(tok1, "max_standby_delay") == 0)
> +       {
> +           errno = 0;
> +           maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0);
> +           if (errno == EINVAL || errno == ERANGE)
> +               ereport(FATAL,
> +                (errmsg("max_standby_delay is not a valid number: \"%s\"",
> +                        tok2)));
> 
> Have you ever tested the behaviour with max_standby_delay = -1 ?
> Also, the default max_standby_delay is currently 0 -- ie, kill queries
> immediately upon detecting any conflicts at all -- which I don't really think
> anyone would be happy with. 
Agreed. As explained when I published that patch it is deliberately
severe to allow testing of conflict resolution and feedback on it.
> I still *strongly* feel the default has to be the
> non-destructive conservative -1.
I don't. Primarily, we must support high availability. It is much better
if we get people saying "I get my queries cancelled" and we say RTFM and
change parameter X, than if people say "my failover was 12 hours behind
when I needed it to be 10 seconds behind and I lost a $1 million because
of downtime of Postgres" and we say RTFM and change parameter X.
> 2) This hard coded constant of 500ms seems arbitrary to me. If your use case
> is a heavily-used reporting engine you might get this message all the time. I
> think this either has to be configurable (warn_standby_delay?) or based on
> max_standby_delay or some other parameter somehow.
> 
> +       /*
> +        * log that we have been waiting for a while now...
> +        */
> +       if (!logged && standbyWait_ms > 500)
Greg, that's a DEBUG5 message.
> 3) These two blocks of code seem unsatisfactory:
> 
> !           /*
> !            * Keep track of the block number of the lastBlockVacuumed, so
> !            * we can scan those blocks as well during WAL replay. This then
> !            * provides concurrency protection and allows btrees to be used
> !            * while in recovery.
> !            */
> !           if (lastBlockVacuumed > vstate->lastBlockVacuumed)
> !               vstate->lastBlockVacuumed = lastBlockVacuumed;
> ! 
> 
> 
> +           /*
> +            * XXXHS we don't actually need to read the block, we
> +            * just need to confirm it is unpinned. If we had a special call
> +            * into the buffer manager we could optimise this so that
> +            * if the block is not in shared_buffers we confirm it as unpinned.
> +            */
> +           buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
> +           if (BufferIsValid(buffer))
> +           {
> +               LockBufferForCleanup(buffer);           
> +               UnlockReleaseBuffer(buffer);
> +           }
> 
> Aside from the XXX comment (I thought we actually had such a call now, but if
> not shouldn't we just add one instead of carping?)
We should add many things. The equivalent optimisation of VACUUM in
normal running has not been done either. Considering we have both HOT
and visibility map enhanced VACUUM, its not a priority.
>  I'm not convinced this
> handles all the cases that can arise. Notable, what happens if a previous
> vacuum died in the middle of the scan?
Nothing, because it won't then have removed any heap rows.
> I think we have a vacuum id which we use already with btrees to the same
> issue. It seems to me this be more robust if you stamped the xlog record with
> that id number and ensured it matched the id of the lastBloockVacuumed? Then
> you could start from 0 if the id changes.
> 4) Why is this necessary?
> 
> +   if (IsRecoveryProcessingMode() && 
> +       locktag->locktag_type == LOCKTAG_OBJECT &&
> +       lockmode > AccessShareLock)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                errmsg("cannot acquire lockmode %s on database objects while recovery is in progress", 
> +                                   lockMethodTable->lockModeNames[lockmode]),
> +                errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
> + 
> 
> Obviously we can't lock records (SELECT FOR UPDATE) since that requires write
> access. But shouldn't we be able to do manual LOCK TABLE calls?
Read the rest of the comments on the locking code section.
> I hope this isn't the only interlock we have against trying to do write i/o or
> DDL against tables...?
No it's not.
> 5) The code for killing off conflicting transactions looks odd to me but I
> haven't really traced through it to see what it's doing. It seems to kill off
> just one process? 
No, why do you think that?
> What if there are several holding the lock in question?
Covered.
> Also, it doesn't seem to take into account how long the various transactions
> have held the lock?
True. Why would that matter?
> Is there a risk of, for example, if you have a long report running against a
> table and lots of OLTP requests against the same table 
> it seems the conflict resolution code would fire randomly into the
> crowd 
OK, that's where I stop bothering to respond.
-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support
			
		On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote: > On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote: > Agreed. As explained when I published that patch it is deliberately > severe to allow testing of conflict resolution and feedback on it. > > > I still *strongly* feel the default has to be the > > non-destructive conservative -1. > > I don't. Primarily, we must support high availability. It is much better > if we get people saying "I get my queries cancelled" and we say RTFM and > change parameter X, than if people say "my failover was 12 hours behind > when I needed it to be 10 seconds behind and I lost a $1 million because > of downtime of Postgres" and we say RTFM and change parameter X. If the person was stupid enough to configure it for such as thing they deserve to the lose the money. Not to mention we have already lost them as a user because they will blame postgresql regardless of reality as evidenced by their inability to RTFM (or have a vendor that RTFMs) in the first place. I got to vote with Greg on this one. Sincerely, Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
Gregory Stark wrote: > 6) I still don't understand why you need unobserved_xids. We don't need this > in normal running, an xid we don't know for certain is committed is exactly > the same as a transaction we know is currently running or aborted. So why do > you need it during HS? In normal operation, any transaction that's in-progress has an entry in ProcArray. GetSnapshot() gathers the xids of all those in-progress transactions, so that they're seen as not-committed even when the they're later marked as committed in clog. In HS, we might see the first WAL record of transaction 10 before we see the first WAL record of transaction 9. Without unobserved_xids, if you take a snapshot in the standby between those two WAL records, xid 10 is included in the snapshot, but 9 is not. If xact 9 later commits, it's marked in clog as committed, and it will suddenly appear as visible to the snapshot. To avoid that, when we replay the first WAL record of xact 10, we also add 9 to the unobserved xid array, so that it's included in snapshots. So, you can think of the unobserved xids array as an extension of ProcArray. The entries are like light-weight PGPROC entries. In fact I proposed earlier to simply create dummy PGPROC entries instead. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote: > I still don't understand why you need unobserved_xids. We don't need > this in normal running, an xid we don't know for certain is committed > is exactly the same as a transaction we know is currently running or > aborted. So why do you need it during HS? All running transactions need to be part of the snapshot. This is true on both standby and primary. On primary we allocate new xids from a single counter, so there are no gaps. Newly assigned xids go straight into the procarray and then are removed later when they commit/abort. In standby we only know what is in WAL. Xid assignment is not currently WAL logged, so either we choose to WAL log each new xid and face the performance consequences (IMHO, unacceptable), or we choose a different strategy. UnobservedXids is that different strategy. Florian Pflug had a different strategy, but he did have a strategy. > The comment says: > > + * This is very important because if we leave > + * those xids out of the snapshot then they will appear to be > already complete. > + * Later, when they have actually completed this could lead to > confusion as to > + * whether those xids are visible or not, blowing a huge hole in > MVCC. > + * We need 'em. > > But that doesn't sound rational to me. I'm not sure what "confusion" > this would cause. If they later actually complete then any existing > snapshots would still not see them. And any later snapshots wouldn't > be confused by the earlier conclusions. If a xact is running when we take a snapshot, yet we do not include it then bad things will happen, but not til later. If a xact is not in snapshot *and* less than xamx then we presume it completed prior to the snapshot. If that xact did subsequently commit *after* we took the snapshot, then it's absence from the snapshot will make that data visible if the xact committed, making it look like it committed *before* the snapshot was taken. So the "confusion" results in an MVCC violation and so we must handle this case correctly, or die. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
"Joshua D. Drake" <jd@commandprompt.com> writes:
> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
>>> I still *strongly* feel the default has to be the
>>> non-destructive conservative -1.
>> 
>> I don't. Primarily, we must support high availability. It is much better
>> if we get people saying "I get my queries cancelled" and we say RTFM and
>> change parameter X, than if people say "my failover was 12 hours behind
>> when I needed it to be 10 seconds behind and I lost a $1 million because
>> of downtime of Postgres" and we say RTFM and change parameter X.
> If the person was stupid enough to configure it for such as thing they
> deserve to the lose the money.
Well, those unexpectedly cancelled queries could have represented
critical functionality too.  I think this argument calls the entire
approach into question.  If there is no safe setting for the parameter
then we need to find a way to not have the parameter.
        regards, tom lane
			
		* Tom Lane <tgl@sss.pgh.pa.us> [090128 15:02]: > Well, those unexpectedly cancelled queries could have represented > critical functionality too. I think this argument calls the entire > approach into question. If there is no safe setting for the parameter > then we need to find a way to not have the parameter. That's what we currently have without HS, but people aren't completely satisfied with it either ;-) a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
(sorry for top posting -- blame apple) I don't see anything "dangerous" with either setting. For use cases where the backup is the primary purpose then killing queries is fine. For use cases where the maching is a reporting machine then saving large amounts of archived logs is fine. Engineering is about tradeoffs and these two use cases are intrinsically in conflict. The alternative is postponing vacuuming on the master which is imho even worse than the disease. -- Greg On 28 Jan 2009, at 19:56, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote: >>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote: >>>> I still *strongly* feel the default has to be the >>>> non-destructive conservative -1. >>> >>> I don't. Primarily, we must support high availability. It is much >>> better >>> if we get people saying "I get my queries cancelled" and we say >>> RTFM and >>> change parameter X, than if people say "my failover was 12 hours >>> behind >>> when I needed it to be 10 seconds behind and I lost a $1 million >>> because >>> of downtime of Postgres" and we say RTFM and change parameter X. > >> If the person was stupid enough to configure it for such as thing >> they >> deserve to the lose the money. > > Well, those unexpectedly cancelled queries could have represented > critical functionality too. I think this argument calls the entire > approach into question. If there is no safe setting for the parameter > then we need to find a way to not have the parameter. > > regards, tom lane
Put another way: your characterization is no more true than claiming there's no "safe" setting for statement_timeout since a large value means clog could overflow your disk and your tables could bloat. (I note we default statement_timeout to off though) -- Greg On 28 Jan 2009, at 19:56, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote: >>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote: >>>> I still *strongly* feel the default has to be the >>>> non-destructive conservative -1. >>> >>> I don't. Primarily, we must support high availability. It is much >>> better >>> if we get people saying "I get my queries cancelled" and we say >>> RTFM and >>> change parameter X, than if people say "my failover was 12 hours >>> behind >>> when I needed it to be 10 seconds behind and I lost a $1 million >>> because >>> of downtime of Postgres" and we say RTFM and change parameter X. > >> If the person was stupid enough to configure it for such as thing >> they >> deserve to the lose the money. > > Well, those unexpectedly cancelled queries could have represented > critical functionality too. I think this argument calls the entire > approach into question. If there is no safe setting for the parameter > then we need to find a way to not have the parameter. > > regards, tom lane
On Wed, 2009-01-28 at 21:41 +0200, Heikki Linnakangas wrote: > So, you can think of the unobserved xids array as an extension of > ProcArray. The entries are like light-weight PGPROC entries. In fact I > proposed earlier to simply create dummy PGPROC entries instead. Which we don't do because we don't know whether we are dealing with top-level xids or subtransactions of already observed top-level xids. Either way we have to rearrange things when we move from unobserved to observed. A major difference is that what we have now works and what we might have instead may not, which is being illustrated by recent testing. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote: > "my failover was 12 hours behind when I needed it to be 10 seconds > behind and I lost a $1 million because of downtime of Postgres" The same could be said for warm standby right now. Or Slony-I, for that matter. I think that we can reasonably expect anyone implementing asynchronous replication for HA to properly monitor the lag time. There are many sources of latency in the process, so I don't think anyone can expect 10 seconds without actually monitoring to verify what the actual lag time is. I apologize if my post is based on ignorance, I haven't followed your patch as closely as others involved in this discussion. Regards,Jeff Davis
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote: >>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote: >>>> I still *strongly* feel the default has to be the >>>> non-destructive conservative -1. >>> I don't. Primarily, we must support high availability. It is much better >>> if we get people saying "I get my queries cancelled" and we say RTFM and >>> change parameter X, than if people say "my failover was 12 hours behind >>> when I needed it to be 10 seconds behind and I lost a $1 million because >>> of downtime of Postgres" and we say RTFM and change parameter X. > >> If the person was stupid enough to configure it for such as thing they >> deserve to the lose the money. > > Well, those unexpectedly cancelled queries could have represented > critical functionality too. I think this argument calls the entire > approach into question. If there is no safe setting for the parameter > then we need to find a way to not have the parameter. We've gone through that already. Different ideas were hashed out around September. There's four basic feasible approaches to what to do when an incoming WAL record conflicts with a running read-only query: 1. Kill the query. (max_standby_delay=0) 2. Wait for the query to finish before continuing (max_standby_delay=-1) 3. Have a feedback loop from standby to master, feeding an OldestXmin to the master, preventing it from removing tuples that are still needed in the standby. 4. Allow the query to continue, knowing that it will return wrong results. I don't consider 4 to be an option. Option 3 has its own set of drawbacks, as a standby can then cause bloat in the master, and in any case we're not going to have it in this release. And then there's some middle ground, like wait a while and then kill the query (max_standby_delay > 0). I don't see any way around the fact that when a tuple is removed, it's gone and can't be accessed by queries. Either you don't remove it, or you kill the query. I think the max_standby_delay setting is fairly easy to explain. It shouldn't be too hard for a DBA to set it correctly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2009-01-28 at 11:33 -0800, Joshua D. Drake wrote: > On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote: > > On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote: > > > Agreed. As explained when I published that patch it is deliberately > > severe to allow testing of conflict resolution and feedback on it. > > > > > I still *strongly* feel the default has to be the > > > non-destructive conservative -1. > > > > I don't. Primarily, we must support high availability. It is much better > > if we get people saying "I get my queries cancelled" and we say RTFM and > > change parameter X, than if people say "my failover was 12 hours behind > > when I needed it to be 10 seconds behind and I lost a $1 million because > > of downtime of Postgres" and we say RTFM and change parameter X. > > If the person was stupid enough to configure it for such as thing they > deserve to the lose the money. It was never intended to be 0, that was just for testing, as I said. But a smallish integer number of seconds, say 10, 60, 300 or at most 600 is reasonable. Crash barriers can be moved, falling off a cliff is permanent. It's easy to change the parameter if you don't like it, and too late to change it if we set the default wrong. My name is on the patch and I take responsibility for such failures. I'm not going to turn round and say "but Josh said", kinda like Stan Laurel, if it fails. I've never called any user stupid and never will, not while they use Postgres, at least... If we get the default wrong then its down to us. Never cancelling queries is definitely a wrong default choice for an HA server. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Wed, 2009-01-28 at 14:56 -0500, Tom Lane wrote: > Well, those unexpectedly cancelled queries could have represented > critical functionality too. I think this argument calls the entire > approach into question. If there is no safe setting for the parameter > then we need to find a way to not have the parameter. I see the opposite: We don't know what tradeoffs, if any, the user is willing to put up with, so we need input. It is about resource prioritisation and not for us to decide, since these reflect business objectives not internal twangy things like join_collapse_limit. The essential choice is "What would you like the max failover time to be?". Some users want one server with max 5 mins behind, some want two servers, one with 0 seconds behind, one with 12 hours behind -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs wrote: > The essential choice is "What would you like the max failover time to > be?". Some users want one server with max 5 mins behind, some want two > servers, one with 0 seconds behind, one with 12 hours behind It's not quite that simple. Setting max_standby_delay=5mins means that you're willing to wait 5 minutes for each query to die. Which means that in worst case you have to stop for 5 minutes at every single vacuum record, and fall behind much more than 5 minutes. You could make it more like that by tracking the timestamps in commit records, and/or having some sort of a moving average logic in the timeout, where you allow more waiting if you haven't waited for a long time, and kill queries more aggressively if you've had to wait a lot recently. It should also be noted that the control functions allow you to connect to the database and manually pause/resume the replay. So you can for example set max_standby_delay=0 during the day, but pause the replay manually before starting a nightly report. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2009-01-28 at 22:47 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > The essential choice is "What would you like the max failover time to > > be?". Some users want one server with max 5 mins behind, some want two > > servers, one with 0 seconds behind, one with 12 hours behind > > It's not quite that simple. In this case, yes it is. > Setting max_standby_delay=5mins means that > you're willing to wait 5 minutes for each query to die. Which means that > in worst case you have to stop for 5 minutes at every single vacuum > record, and fall behind much more than 5 minutes. That's not how this patch works. > You could make it more like that by tracking the timestamps in commit > records Which is what we do. > It should also be noted that the control functions allow you to connect > to the database and manually pause/resume the replay. So you can for > example set max_standby_delay=0 during the day, but pause the replay > manually before starting a nightly report. Yes, thank you for bringing balance to the discussions. Please everybody read this before commenting further. http://wiki.postgresql.org/wiki/Hot_Standby#Usage -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Wed, 2009-01-28 at 22:47 +0200, Heikki Linnakangas wrote: > It's not quite that simple. Setting max_standby_delay=5mins means that > you're willing to wait 5 minutes for each query to die. Which means that > in worst case you have to stop for 5 minutes at every single vacuum > record, and fall behind much more than 5 minutes. Just trying to follow along: are you talking about the situation where there are (for example) a continuous stream of "select pg_sleep(600)" on the standby, and a series of rapid VACUUMs on the primary? This situation might be more likely now that we have partial VACUUMs. > It should also be noted that the control functions allow you to connect > to the database and manually pause/resume the replay. So you can for > example set max_standby_delay=0 during the day, but pause the replay > manually before starting a nightly report. > That's a very cool feature. Regards,Jeff Davis
>> I don't. Primarily, we must support high availability. It is much better >> if we get people saying "I get my queries cancelled" and we say RTFM and >> change parameter X, than if people say "my failover was 12 hours behind >> when I needed it to be 10 seconds behind and I lost a $1 million because >> of downtime of Postgres" and we say RTFM and change parameter X. > > If the person was stupid enough to configure it for such as thing they > deserve to the lose the money. Not to mention we have already lost them > as a user because they will blame postgresql regardless of reality as > evidenced by their inability to RTFM (or have a vendor that RTFMs) in > the first place. > > I got to vote with Greg on this one. I vote with Simon. The thing is that if you get some queries cancelled, you'll realize you have a problem. Now you have several options for what to do to fix it. Having your failover be 12 hours behind (or 12 months behind) is something that it would be much easier to not realize. ...Robert
Simon Riggs <simon@2ndQuadrant.com> writes: > On Wed, 2009-01-28 at 14:56 -0500, Tom Lane wrote: > >> Well, those unexpectedly cancelled queries could have represented >> critical functionality too. I think this argument calls the entire >> approach into question. If there is no safe setting for the parameter >> then we need to find a way to not have the parameter. > > I see the opposite: We don't know what tradeoffs, if any, the user is > willing to put up with, so we need input. Well, if you see it that way then it seems to me you should be arguing for making max_standby_delay a mandatory parameter. Without it don't start allow connections. I hadn't considered that and am not exactly sure where I would stand on it. > The essential choice is "What would you like the max failover time to > be?". Some users want one server with max 5 mins behind, some want two > servers, one with 0 seconds behind, one with 12 hours behind Sure. But if they don't configure one then we shouldn't impose one. You're thinking of precisely one use case and taking positive action to interrupt the user's requests on the basis of it. But there are plenty of other use cases. I claim the default has to be to do as the user instructed without intervention. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
Robert Haas <robertmhaas@gmail.com> writes:
> I vote with Simon.  The thing is that if you get some queries
> cancelled, you'll realize you have a problem.
... or if you don't, they couldn't have been all that critical.
> Having your failover be 12 hours
> behind (or 12 months behind) is something that it would be much easier
> to not realize.
Okay, I'm sold, positive max_standby_delay should be the default.
        regards, tom lane
			
		On Wed, 2009-01-28 at 22:19 +0200, Heikki Linnakangas wrote: > Tom Lane wrote: ... > > Well, those unexpectedly cancelled queries could have represented > > critical functionality too. I think this argument calls the entire > > approach into question. If there is no safe setting for the parameter > > then we need to find a way to not have the parameter. > > We've gone through that already. Different ideas were hashed out around > September. There's four basic feasible approaches to what to do when an > incoming WAL record conflicts with a running read-only query: > > 1. Kill the query. (max_standby_delay=0) > 2. Wait for the query to finish before continuing (max_standby_delay=-1) > 3. Have a feedback loop from standby to master, feeding an OldestXmin to > the master, preventing it from removing tuples that are still needed in > the standby. > 4. Allow the query to continue, knowing that it will return wrong results. > > I don't consider 4 to be an option. Option 3 has its own set of > drawbacks, as a standby can then cause bloat in the master, and in any > case we're not going to have it in this release. And then there's some > middle ground, like wait a while and then kill the query > (max_standby_delay > 0). > > I don't see any way around the fact that when a tuple is removed, it's > gone and can't be accessed by queries. Either you don't remove it, or > you kill the query. Actually we came up with a solution to this - use filesystem level snapshots (like LVM2+XFS or ZFS), and redirect backends with long-running queries to use fs snapshot mounted to a different mountpoint. I don't think Simon has yet put full support for it in code, but it is clearly _the_ solution for those who want to eat the cake and have it too. > > I think the max_standby_delay setting is fairly easy to explain. It > shouldn't be too hard for a DBA to set it correctly. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com >
Hannu Krosing wrote: > Actually we came up with a solution to this - use filesystem level > snapshots (like LVM2+XFS or ZFS), and redirect backends with > long-running queries to use fs snapshot mounted to a different > mountpoint. > > I don't think Simon has yet put full support for it in code, but it is > clearly _the_ solution for those who want to eat the cake and have it > too. > > How does that work if you're using mutiple file systems via tablespaces (e.g. indexes in a different TS)? cheers andrew
Hannu Krosing <hannu@krosing.net> writes: > Actually we came up with a solution to this - use filesystem level > snapshots (like LVM2+XFS or ZFS), and redirect backends with > long-running queries to use fs snapshot mounted to a different > mountpoint. Uhm, how do you determine which snapshot to direct the backend to? There could have been several generations of tuples in that tid since your query started. Do you take a snapshot every time there's a vacuum-snapshot conflict and record which snapshot goes with that snapshot? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB'sPostgreSQL training!
>> I don't see any way around the fact that when a tuple is removed, it's >> gone and can't be accessed by queries. Either you don't remove it, or >> you kill the query. > > Actually we came up with a solution to this - use filesystem level > snapshots (like LVM2+XFS or ZFS), and redirect backends with > long-running queries to use fs snapshot mounted to a different > mountpoint. > > I don't think Simon has yet put full support for it in code, but it is > clearly _the_ solution for those who want to eat the cake and have it > too. I think _the_ solution is to notice when you're about to vacuum a page that is still visible to a running backend on the standby, and save that page off to a separate cache of old page versions (perhaps using the relation fork mechanism). I suspect filesystem-level snapshots can be made to work, but it's never going to be a portable solution, and I suspect you're going to need a lot of the same bookkeeping anyway (like, when you have page X in shared buffers, which version of page X is it, anyway?). ...Robert
On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote: > > Hannu Krosing wrote: > > Actually we came up with a solution to this - use filesystem level > > snapshots (like LVM2+XFS or ZFS), and redirect backends with > > long-running queries to use fs snapshot mounted to a different > > mountpoint. > > > > I don't think Simon has yet put full support for it in code, but it is > > clearly _the_ solution for those who want to eat the cake and have it > > too. > How does that work if you're using mutiple file systems via tablespaces > (e.g. indexes in a different TS)? It's a great idea and easy to do, but I can't do everything in one go. The main reasons not to are multiple file system difficulties and lack of a mainstream Linux solution, and more simply lack of time and test resources. So not now, maybe later. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote: > > Hannu Krosing wrote: > > Actually we came up with a solution to this - use filesystem level > > snapshots (like LVM2+XFS or ZFS), and redirect backends with > > long-running queries to use fs snapshot mounted to a different > > mountpoint. > > > > I don't think Simon has yet put full support for it in code, but it is > > clearly _the_ solution for those who want to eat the cake and have it > > too. > > > > > > > How does that work if you're using mutiple file systems via tablespaces > (e.g. indexes in a different TS)? Basically the same way we do WAL shipping up to 8.3. That is using external scripts. Once we have enough experience, we could try to move tha actual snapshot-mount-switch inside the core -- ------------------------------------------ Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote: > >> I don't see any way around the fact that when a tuple is removed, it's > >> gone and can't be accessed by queries. Either you don't remove it, or > >> you kill the query. > > > > Actually we came up with a solution to this - use filesystem level > > snapshots (like LVM2+XFS or ZFS), and redirect backends with > > long-running queries to use fs snapshot mounted to a different > > mountpoint. > > > > I don't think Simon has yet put full support for it in code, but it is > > clearly _the_ solution for those who want to eat the cake and have it > > too. > > I think _the_ solution is to notice when you're about to vacuum a page > that is still visible to a running backend on the standby, and save > that page off to a separate cache of old page versions (perhaps using > the relation fork mechanism). I suspect filesystem-level snapshots > can be made to work, but it's never going to be a portable solution, > and I suspect you're going to need a lot of the same bookkeeping > anyway (like, when you have page X in shared buffers, which version of > page X is it, anyway?). The idea was to switch file pointers inside the backend needing old versions, (and then flush cache if needed) so the only bookkeeping you need is which fs snapshots you need to keep and which can be released. -- ------------------------------------------ Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote: > I think _the_ solution is to notice when you're about to vacuum a page > that is still visible to a running backend on the standby, and save > that page off to a separate cache of old page versions (perhaps using > the relation fork mechanism). I'll let you write that, for the next release... The problem with all of this is we've been talking about it for 8 months now and various opinions are held by people. What is being presented is a broad set of options (summarised from Wiki) 1. Wait then cancel a) always for databases, tablespaces and locks b) deferred cancelation for buffer changes 2. We connect to Primary server from Standby server and keep a transaction open using contrib/dblink functions, then commit as soon as we are done. 3. We pause recovery by issuing a pg_recovery_pause() function, or start server in pause mode using recovery_started_paused = on. Yes, it's a critical area to the success of the feature. But this is enough for first release and for us to get user feedback. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Tue, 2009-02-03 at 13:50 +0000, Gregory Stark wrote: > Hannu Krosing <hannu@krosing.net> writes: > > > Actually we came up with a solution to this - use filesystem level > > snapshots (like LVM2+XFS or ZFS), and redirect backends with > > long-running queries to use fs snapshot mounted to a different > > mountpoint. > > Uhm, how do you determine which snapshot to direct the backend to? There could > have been several generations of tuples in that tid since your query started. > Do you take a snapshot every time there's a vacuum-snapshot conflict and > record which snapshot goes with that snapshot? The most sensible thing to do seems to wait for some configurable period (say a few seconds or a few minutes), delaying WAL apply, and then to do the snaphot, mount it and redirect all running transactions to use _current_ filesystem snapshot, and then resume WAL application. As each of the transactions running on saved fs snapshots complete, they are switced back to main/live fs view. When the last there transaction ends, the snapshot is unmounted and released. -- ------------------------------------------ Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
Hi, On 02/03/2009 02:26 PM, Hannu Krosing wrote: >> I don't see any way around the fact that when a tuple is removed, it's >> gone and can't be accessed by queries. Either you don't remove it, or >> you kill the query. > Actually we came up with a solution to this - use filesystem level > snapshots (like LVM2+XFS or ZFS), and redirect backends with > long-running queries to use fs snapshot mounted to a different > mountpoint. Isn't that really, really expensive? A single write on the master logical volume yields writes of PE size for _every_ single snapshot (the first time the block is touched) - considering that there could quite many such snapshots I don't think that this is really feasible - io quite possible might be saturated. The default PE size is 4MB - but on most bigger systems it is set to a bigger size, so its just getting worse for bigger systems. Sure, one might say, that this is an LVM deficiency - but I do knot know of any snapshot-able block layer doing it that way. Andres
On Tue, Feb 3, 2009 at 9:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote: >> I think _the_ solution is to notice when you're about to vacuum a page >> that is still visible to a running backend on the standby, and save >> that page off to a separate cache of old page versions (perhaps using >> the relation fork mechanism). > > I'll let you write that, for the next release... LOL. How many sponsorship dollars are available for that project? > The problem with all of this is we've been talking about it for 8 months > now and various opinions are held by people. What is being presented is > a broad set of options (summarised from Wiki) I think everyone understands that these are things we might want to do down the line, not things we need to have now. For this release, I was under the impression that we'd pretty much settled on implementing (1) and maybe (3) but not (2) from the below list. > 1. Wait then cancel > a) always for databases, tablespaces and locks > b) deferred cancelation for buffer changes > > 2. We connect to Primary server from Standby server and keep a > transaction open using contrib/dblink functions, then commit as soon as > we are done. > > 3. We pause recovery by issuing a pg_recovery_pause() function, or start > server in pause mode using recovery_started_paused = on. > > Yes, it's a critical area to the success of the feature. But this is > enough for first release and for us to get user feedback. I completely agree. ...Robert
On Tue, 2009-02-03 at 15:55 +0100, Andres Freund wrote: > Hi, > > On 02/03/2009 02:26 PM, Hannu Krosing wrote: > >> I don't see any way around the fact that when a tuple is removed, it's > >> gone and can't be accessed by queries. Either you don't remove it, or > >> you kill the query. > > Actually we came up with a solution to this - use filesystem level > > snapshots (like LVM2+XFS or ZFS), and redirect backends with > > long-running queries to use fs snapshot mounted to a different > > mountpoint. > Isn't that really, really expensive? > > A single write on the master logical volume yields writes of PE size > for _every_ single snapshot (the first time the block is touched) - > considering that there could quite many such snapshots I don't think > that this is really feasible - io quite possible might be saturated. If we did that we would provide an option to select the MVCC snapshot that went with the filesystem snapshot. There need not be one per user. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Tue, 2009-02-03 at 10:19 -0500, Robert Haas wrote: > On Tue, Feb 3, 2009 at 9:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote: > >> I think _the_ solution is to notice when you're about to vacuum a page > >> that is still visible to a running backend on the standby, and save > >> that page off to a separate cache of old page versions (perhaps using > >> the relation fork mechanism). > > > > I'll let you write that, for the next release... > > LOL. How many sponsorship dollars are available for that project? > > > The problem with all of this is we've been talking about it for 8 months > > now and various opinions are held by people. What is being presented is > > a broad set of options (summarised from Wiki) > > I think everyone understands that these are things we might want to do > down the line, not things we need to have now. For this release, I > was under the impression that we'd pretty much settled on implementing > (1) and maybe (3) but not (2) from the below list. (2) is something that you can always do manually if you need it. So no reason to support it in HS code explicitly. Once you keep trx open on master, 1 and 3 should not happen anymore until you close that trx. > > 1. Wait then cancel > > a) always for databases, tablespaces and locks > > b) deferred cancelation for buffer changes > > > > 2. We connect to Primary server from Standby server and keep a > > transaction open using contrib/dblink functions, then commit as soon as > > we are done. > > > > 3. We pause recovery by issuing a pg_recovery_pause() function, or start > > server in pause mode using recovery_started_paused = on. > > > > Yes, it's a critical area to the success of the feature. But this is > > enough for first release and for us to get user feedback. > > I completely agree. > > ...Robert >
On Tue, 2009-02-03 at 14:28 +0000, Simon Riggs wrote: > On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote: > > > > Hannu Krosing wrote: > > > Actually we came up with a solution to this - use filesystem level > > > snapshots (like LVM2+XFS or ZFS), and redirect backends with > > > long-running queries to use fs snapshot mounted to a different > > > mountpoint. > > > > > > I don't think Simon has yet put full support for it in code, but it is > > > clearly _the_ solution for those who want to eat the cake and have it > > > too. > > > How does that work if you're using mutiple file systems via tablespaces > > (e.g. indexes in a different TS)? > > It's a great idea and easy to do, but I can't do everything in one go. > > The main reasons not to are multiple file system difficulties and lack > of a mainstream Linux solution, and more simply lack of time and test > resources. More general, but also lot harder, solution would be going back to roots and implement what original postgres 4.2 and earlier versions were meant to do - namely VACUUM was not meant to just discard older versions , but rather move it to WORM storage (write once read many was all the rage back then :) . If we did that in a way that each relation, at least on warm standby , has its own "archive" fork, possibly in a separate tablespace for cheaper storage, then we could basically apply WAL's as fast we want and just move the old versions to "archive". It will be slower(ish), especially for HOT updates, but may be a good solution for lots of usecases. And the decision to do the archiving on master and WAL-copy to slave, or just do it on slave only could probably be left to user. Reintroducing keeping old tuples "forever" would also allow us to bring back time travel feature, that is SELECT .... AS OF 'yesterday afternoon'::timestamp; Which was thrown out at the times we got WAL-logging. To be really useful we should also have some way to know trx timestamps, but that can be easily done using ticker feature from Slony - SkyTools/pgQ, which could be run a a separate server thread similar to what we do with background writer, autovacuum etc. now. > > So not now, maybe later. >
On Tue, 2009-02-03 at 18:09 +0200, Hannu Krosing wrote: > On Tue, 2009-02-03 at 14:28 +0000, Simon Riggs wrote: > > On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote: > > > > > > Hannu Krosing wrote: > > > > Actually we came up with a solution to this - use filesystem level > > > > snapshots (like LVM2+XFS or ZFS), and redirect backends with > > > > long-running queries to use fs snapshot mounted to a different > > > > mountpoint. > > > > > > > > I don't think Simon has yet put full support for it in code, but it is > > > > clearly _the_ solution for those who want to eat the cake and have it > > > > too. > > > > > How does that work if you're using mutiple file systems via tablespaces > > > (e.g. indexes in a different TS)? > > > > It's a great idea and easy to do, but I can't do everything in one go. > > > > The main reasons not to are multiple file system difficulties and lack > > of a mainstream Linux solution, and more simply lack of time and test > > resources. > > More general, but also lot harder, solution would be going back to roots > and implement what original postgres 4.2 and earlier versions were meant > to do - namely VACUUM was not meant to just discard older versions , but > rather move it to WORM storage (write once read many was all the rage > back then :) . > > If we did that in a way that each relation, at least on warm standby , > has its own "archive" fork, possibly in a separate tablespace for > cheaper storage, then we could basically apply WAL's as fast we want and > just move the old versions to "archive". It will be slower(ish), > especially for HOT updates, but may be a good solution for lots of > usecases. > > And the decision to do the archiving on master and WAL-copy to slave, or > just do it on slave only could probably be left to user. > > Reintroducing keeping old tuples "forever" would also allow us to bring > back time travel feature, that is > > SELECT .... AS OF 'yesterday afternoon'::timestamp; > > Which was thrown out at the times we got WAL-logging. > > To be really useful we should also have some way to know trx timestamps, > but that can be easily done using ticker feature from Slony - > SkyTools/pgQ, which could be run a a separate server thread similar to > what we do with background writer, autovacuum etc. now. That might be the way to do the "Named Restore Points" that is frequently requested. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support