Discussion: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Steve Kehlet wrote:
> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
> just dropped new binaries in place) but it wouldn't start up. I found this
> in the logs:
>
> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
> database system was shut down at 2015-05-27 13:12:55 PDT
> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
> starting up
> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
> transaction 1

I am debugging a problem today that looks very similar to this. AFAICT the
problem is that WAL replay of an online checkpoint in which multixact files
are removed fails because replay tries to read a file that has already been
removed.

(I was nervous about removing the check to omit reading pg_multixact files
while in recovery. Looks like my hunch was right, though the actual problem
is not what I was fearing.)

I think the fix for this is to verify whether the file exists on disk before
reading it; if it doesn't, assume the truncation has already happened and
that it's not necessary to remove it.

> I found [this report from a couple days ago](https://bugs.archlinux.org/task/45071)
> from someone else that looks like the same problem.

Right :-( I think a patch like this should be able to fix it ... not tested yet.

--
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
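The "check whether the file exists before reading it" idea from the message above can be sketched as a standalone toy. All names here (in_recovery, slru_page_exists, read_page_for_truncation) are illustrative stand-ins, not the actual PostgreSQL symbols:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the SLRU machinery; illustrative names only. */
static bool in_recovery = true;
static bool segment_exists[4] = {false, false, true, true}; /* 0 and 1 already truncated */

static bool slru_page_exists(int pageno)
{
    return segment_exists[pageno];
}

/*
 * Sketch of the proposed fix: during recovery, skip pages whose backing
 * file has already been truncated away, instead of erroring out as the
 * pre-fix code did.  Returns true if the page was read, false if it was
 * skipped as already-truncated.
 */
static bool read_page_for_truncation(int pageno)
{
    if (in_recovery && !slru_page_exists(pageno))
        return false;           /* truncation already replayed; nothing to do */
    /* ... the real code would read the SLRU page here ... */
    return true;
}
```

This only models the control flow being discussed; the real patch operates on the SLRU layer in slru.c/multixact.c.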
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Steve Kehlet
Date:
On Wed, May 27, 2015 at 3:21 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think a patch like this should be able to fix it ... not tested yet.
Thanks Alvaro. I got a compile error, so I looked for other uses of SimpleLruDoesPhysicalPageExist and added MultiXactOffsetCtl. Does this look right?
+ (!InRecovery || SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno)))
It compiled then, but it didn't seem to change anything: postgres gave the same errors and still would not start up.
I'm recompiling it now just to be sure I didn't make any mistakes. I have to leave for the day in a few minutes so I thought I'd check on the above argument with you. Thanks again!
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Steve Kehlet wrote:
> On Wed, May 27, 2015 at 3:21 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> I think a patch like this should be able to fix it ... not tested yet.
>
> Thanks Alvaro. I got a compile error, so looked for other uses of
> SimpleLruDoesPhysicalPageExist and added MultiXactOffsetCtl, does this look
> right?
>
> + (!InRecovery || SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno)))

Meh, I sent you the wrong version of the patch. Yeah, that's obviously better.

> It compiled then, but it didn't seem to change anything: postgres gave the
> same errors and still would not start up.

I'm setting up a reproducer for the problem to verify that the patch fixes it (evidently not). Maybe I've messed up the conditional or something else ...

--
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>> database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
>> transaction 1
>
> I am debugging today a problem currently that looks very similar to
> this. AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Hmm, so what exactly is the sequence of events here? It's possible that I'm not thinking clearly just now, but it seems to me that if we're replaying the same checkpoint we replayed previously, the offset of the oldest multixact will be the first file that we didn't remove.

However, I can see that there could be a problem if we try to replay an older checkpoint after having already replayed a newer one. For example, if a standby replays checkpoint A truncating the members multixact and performs a restartpoint, and then replays checkpoint B truncating the members multixact again but without performing a restartpoint, and then is shut down, it will resume replay from checkpoint A, and trouble will ensue. Is that the scenario, or is there something else?

> I think the fix to this is to verify whether the file exists on disk
> before reading it; if it doesn't, assume the truncation has already
> happened and that it's not necessary to remove it.

That might be an OK fix, but this implementation doesn't seem very clean. If we're going to remove the invariant that MultiXactState->oldestOffset will always be valid after replaying a checkpoint, then we should be explicit about that and add a flag indicating whether or not it's currently valid. Shoving nextOffset in there and hoping that's good enough seems like a bad idea to me.

I think we should modify the API for find_multixact_start. Let's have it return a Boolean and return oldestOffset via an out parameter. If !InRecovery, it will always return true and set the out parameter; but if in recovery, it is allowed to return false without setting the out parameter. Both values can get stored in MultiXactState, and we can adjust the logic elsewhere to disregard oldestOffset when the accompanying flag is false.

This still leaves open an ugly possibility: can we reach normal running without a valid oldestOffset? If so, until the next checkpoint happens, autovacuum has no clue whether it needs to worry. There's got to be a fix for that, but it escapes me at the moment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
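The find_multixact_start API change proposed in the message above can be sketched as a standalone toy. The types and the stubbed lookup are hypothetical stand-ins (only the function name comes from the discussion; "multi * 2" is a placeholder, not the real SLRU arithmetic):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactId;
typedef uint32_t MultiXactOffset;

/* Hypothetical stand-ins: pretend segments below multi 100 were truncated. */
static bool in_recovery = true;

static bool offsets_segment_exists(MultiXactId multi)
{
    return multi >= 100;
}

/*
 * Sketch of the proposed API: return false instead of erroring out when the
 * offsets file is missing during recovery; on success, pass the start
 * offset back via an out parameter.
 */
static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result)
{
    if (in_recovery && !offsets_segment_exists(multi))
        return false;           /* caller records that oldestOffset is unknown */
    *result = multi * 2;        /* placeholder for the real SLRU lookup */
    return true;
}

/* Convenience wrapper for exercising the boolean result. */
static bool start_known(MultiXactId multi)
{
    MultiXactOffset dummy;
    return find_multixact_start(multi, &dummy);
}
```

Callers would then store both the offset and the boolean in MultiXactState and consult the flag before trusting oldestOffset.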
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Robert Haas wrote:
> On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Steve Kehlet wrote:
>>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>>> just dropped new binaries in place) but it wouldn't start up. I found this
>>> in the logs:
>>>
>>> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>> database system was shut down at 2015-05-27 13:12:55 PDT
>>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
>>> starting up
>>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
>>> transaction 1
>>
>> I am debugging today a problem currently that looks very similar to
>> this. AFAICT the problem is that WAL replay of an online checkpoint in
>> which multixact files are removed fails because replay tries to read a
>> file that has already been removed.
>
> Hmm, so what exactly is the sequence of events here? It's possible
> that I'm not thinking clearly just now, but it seems to me that if
> we're replaying the same checkpoint we replayed previously, the offset
> of the oldest multixact will be the first file that we didn't remove.

Well, I'm not very clear on what the problematic case is. The scenario where I first saw this reported was a pg_basebackup taken on a very large database: the master could have truncated multixact, so the standby receives a truncated directory but actually tries to apply a checkpoint that is much older than what the master has currently transmitted as pg_multixact contents.

>> I think the fix to this is to verify whether the file exists on disk
>> before reading it; if it doesn't, assume the truncation has already
>> happened and that it's not necessary to remove it.
>
> That might be an OK fix, but this implementation doesn't seem very
> clean. If we're going to remove the invariant that
> MultiXactState->oldestOffset will always be valid after replaying a
> checkpoint, then we should be explicit about that and add a flag
> indicating whether or not it's currently valid. Shoving nextOffset in
> there and hoping that's good enough seems like a bad idea to me.
>
> I think we should modify the API for find_multixact_start. Let's have
> it return a Boolean and return oldestOffset via an out parameter. If
> !InRecovery, it will always return true and set the out parameter; but
> if in recovery, it is allowed to return false without setting the out
> parameter. Both values can get stored in MultiXactState, and we can
> adjust the logic elsewhere to disregard oldestOffset when the
> accompanying flag is false.

Sounds good. I think I prefer that multixact creation be rejected altogether if the new flag is false. Is that what you mean when you say "adjust the logic"?

> This still leaves open an ugly possibility: can we reach normal
> running without a valid oldestOffset? If so, until the next
> checkpoint happens, autovacuum has no clue whether it needs to worry.
> There's got to be a fix for that, but it escapes me at the moment.

I think the fix for that issue is to set the oldest offset in TrimMultiXact. That way, once WAL replay has finished, we're certain that we have a valid oldest offset with which to create new multixacts.

I'm also wondering whether the call to DetermineSafeOldestOffset in StartupMultiXact is good. At that point, we haven't replayed any WAL yet, so the oldest multi might be pointing at a file that has already been removed -- again considering the pg_basebackup scenario where the multixact files are copied much later than pg_control, so the checkpoint to replay is old but the pg_multixact contents have already been truncated in the master and are copied truncated.

--
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Wed, May 27, 2015 at 10:14 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Well I'm not very clear on what's the problematic case. The scenario I
> actually saw this first reported was a pg_basebackup taken on a very
> large database, so the master could have truncated multixact and the
> standby receives a truncated directory but actually tries to apply a
> checkpoint that is much older than what the master currently has
> transmitted as pg_multixact contents.

OK, that makes sense.

>> That might be an OK fix, but this implementation doesn't seem very
>> clean. If we're going to remove the invariant that
>> MultiXactState->oldestOffset will always be valid after replaying a
>> checkpoint, then we should be explicit about that and add a flag
>> indicating whether or not it's currently valid. Shoving nextOffset in
>> there and hoping that's good enough seems like a bad idea to me.
>>
>> I think we should modify the API for find_multixact_start. Let's have
>> it return a Boolean and return oldestOffset via an out parameter. If
>> !InRecovery, it will always return true and set the out parameter; but
>> if in recovery, it is allowed to return false without setting the out
>> parameter. Both values can get stored in MultiXactState, and we can
>> adjust the logic elsewhere to disregard oldestOffset when the
>> accompanying flag is false.
>
> Sounds good. I think I prefer that multixact creation is rejected
> altogether if the new flag is false. Is that what you mean when you say
> "adjust the logic"?

No. I'm not sure quite what you mean here. We can't reject multixact creation during normal running, and during recovery, we won't create any really new multixacts, but we must replay the creation of multixacts.

What I meant was stuff like this:

    if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
        (MultiXactState->nextOffset - MultiXactState->oldestOffset
         > MULTIXACT_MEMBER_SAFE_THRESHOLD))

I meant that we'd change the second prong of the test to check MultiXactState->nextOffsetValid && MultiXactState->nextOffset - MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD. And likewise change anything else that relies on oldestOffset. Or else we guarantee that we can't reach those points until the oldestOffset is valid, and then check that it is with an Assert() or elog().

>> This still leaves open an ugly possibility: can we reach normal
>> running without a valid oldestOffset? If so, until the next
>> checkpoint happens, autovacuum has no clue whether it needs to worry.
>> There's got to be a fix for that, but it escapes me at the moment.
>
> I think the fix to that issue is to set the oldest offset on
> TrimMultiXact. That way, once WAL replay finished we're certain that we
> have a valid oldest offset to create new multixacts with.
>
> I'm also wondering whether the call to DetermineSafeOldestOffset on
> StartupMultiXact is good. At that point, we haven't replayed any WAL
> yet, so the oldest multi might be pointing at a file that has already
> been removed -- again considering the pg_basebackup scenario where the
> multixact files are copied much later than pg_control, so the checkpoint
> to replay is old but the pg_multixact contents have already been
> truncated in the master and are copied truncated.

Moving the call from StartupMultiXact() to TrimMultiXact() seems like a good idea. I'm not sure why we didn't do that before.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
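The guarded "second prong" described above can be sketched as a self-contained toy. The validity flag is the hypothetical one under discussion, and the threshold value here is illustrative only (the real constant is defined in multixact.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactOffset;

/* Illustrative value only; the real constant lives in multixact.c. */
#define MULTIXACT_MEMBER_SAFE_THRESHOLD ((MultiXactOffset) 1 << 30)

/*
 * Sketch of the adjusted test: member-space pressure can only force
 * aggressive vacuuming when oldestOffset is actually known.  The unsigned
 * subtraction keeps the distance computation wraparound-safe even when
 * nextOffset has wrapped past oldestOffset numerically.
 */
static bool members_force_vacuum(bool oldestOffsetValid,
                                 MultiXactOffset nextOffset,
                                 MultiXactOffset oldestOffset)
{
    return oldestOffsetValid &&
        (MultiXactOffset) (nextOffset - oldestOffset) > MULTIXACT_MEMBER_SAFE_THRESHOLD;
}
```

With the flag false (offset unknown, e.g. during recovery), the pressure check simply never fires, which is the "disregard oldestOffset" behavior being proposed.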
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>> database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
>> transaction 1
>
> I am debugging today a problem currently that looks very similar to
> this. AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Steve: Can you tell us more about how you shut down the old cluster? Did you by any chance perform an immediate shutdown? Do you have the actual log messages that were written when the system was shut down for the upgrade?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>> database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
>> transaction 1
>
> I am debugging today a problem currently that looks very similar to
> this. AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Wait a minute, wait a minute. There's a serious problem with this theory, at least in Steve's scenario. This message:

    2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut
    down at 2015-05-27

That message implies a *clean shutdown*. If he had performed an immediate shutdown or just pulled the plug, it would have said "database system was interrupted" or some such.

There may be bugs in redo, also, but they don't explain what happened to Steve.

Steve, is there any chance we can get your pg_controldata output and a list of all the files in pg_clog?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Thu, May 28, 2015 at 8:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Steve Kehlet wrote:
>>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>>> just dropped new binaries in place) but it wouldn't start up. I found this
>>> in the logs:
>>>
>>> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>> database system was shut down at 2015-05-27 13:12:55 PDT
>>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is
>>> starting up
>>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of
>>> transaction 1
>>
>> I am debugging today a problem currently that looks very similar to
>> this. AFAICT the problem is that WAL replay of an online checkpoint in
>> which multixact files are removed fails because replay tries to read a
>> file that has already been removed.
>
> Wait a minute, wait a minute. There's a serious problem with this
> theory, at least in Steve's scenario. This message:
>
> 2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut
> down at 2015-05-27
>
> That message implies a *clean shutdown*. If he had performed an
> immediate shutdown or just pulled the plug, it would have said
> "database system was interrupted" or some such.
>
> There may be bugs in redo, also, but they don't explain what happened to Steve.
>
> Steve, is there any chance we can get your pg_controldata output and a
> list of all the files in pg_clog?

Err, make that pg_multixact/members, which I assume is at issue here. You didn't show us the DETAIL line from this message, which would presumably clarify:

    FATAL: could not access status of transaction 1

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Thu, May 28, 2015 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Steve, is there any chance we can get your pg_controldata output and a
>> list of all the files in pg_clog?
>
> Err, make that pg_multixact/members, which I assume is at issue here.
> You didn't show us the DETAIL line from this message, which would
> presumably clarify:
>
> FATAL: could not access status of transaction 1

And I'm still wrong, probably. The new code in 9.4.2 cares about being able to look at an *offsets* file to find the corresponding member offset. So most likely it is an offsets file that is missing here.

The question is, how are we ending up with an offsets file that is referenced by the control file but not actually present on disk? It seems like it would be good to compare the pg_controldata output to what is actually present in pg_multixact/offsets (hopefully that's the right directory, now that I'm on my third try) and try to understand what is going on here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Thu, May 28, 2015 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> [ speculation ]

OK, I finally managed to reproduce this, after some off-list help from Steve Kehlet (the reporter), Alvaro, and Thomas Munro. Here's how to do it:

1. Install any pre-9.3 version of the server and generate enough multixacts to create at least TWO new segments. When you shut down the server, all segments except for the most current one will be removed. At this point, the only thing in $PGDATA/pg_multixact/offsets should be a single file, and the name of that file should not be 0000 or 0001.

2. Use pg_upgrade to upgrade to 9.3.4. It is possible that versions < 9.3.4 will also work here, but you must not use 9.3.5 or higher, because 9.3.5 includes Bruce's commit 3d2e18510, which arranged to preserve relminmxid and datminmxid values. At this point, pg_controldata on the new cluster should show an oldestMultiXid value greater than 1 (copied from the old cluster), but all the datminmxid values are 1. Also, initdb will have left behind a bogus 0000 file in pg_multixact/offsets.

3. Move to 9.3.5 (or 9.3.6), not via pg_upgrade, but just by dropping in the new binaries. Follow the instructions in the 9.3.5 release notes; since you created at least TWO new segments in step one, there will be no 0001 file, and the query there will say that you should remove the bogus 0000 file. So do that, leaving just the good file in pg_multixact/offsets. At this point, pg_multixact/offsets is OK, and pg_controldata still says that oldestMultiXid > 1, so that is also OK. The only problem is that we've got some bogus datminmxid values floating around. Our next step will be to convince vacuum to propagate the bogus datminmxid values back into pg_controldata.

4. Consume at least one transaction ID (e.g. SELECT txid_current()) and then do this:

    postgres=# set vacuum_freeze_min_age = 0;
    SET
    postgres=# set vacuum_freeze_table_age = 0;
    SET
    postgres=# vacuum;
    VACUUM

Setting the GUCs forces full table scans, so that we advance relfrozenxid. But notice that we were careful not to just run VACUUM FREEZE, which would have also advanced relminmxid, which, for purposes of reproducing this bug, is not what we want to happen. So relminmxid is still (incorrectly) set to 1 for every database. However, since the vacuum did advance relfrozenxid, it will call vac_truncate_clog, which will call SetMultiXactIdLimit, which will propagate the bogus datminmxid = 1 setting into shared memory. (In my testing, this step doesn't work if performed on 9.3.4; you have to do it on 9.3.5. I think that's because of Tom's commit 78db307bb, but I believe in a more complex test scenario you might be able to get this to happen on 9.3.4 also.) I believe it's the case that an autovacuum of even a single table can substitute for this step if it happens to advance relfrozenxid but not relminmxid.

5. The next checkpoint, or the shutdown checkpoint in any event, will propagate the bogus value of 1 from shared memory back into the control file.

6. Now try to start 9.3.7. It will see the bogus oldestMultiXid = 1 value in the control file, attempt to read the corresponding offsets file, and die.

In the process of investigating this, we found a few other things that seem like they may also be bugs:

- As noted upthread, replaying an older checkpoint after a newer checkpoint has already happened may lead to similar problems. This may be possible when starting from an online base backup; or when restarting a standby that did not perform a restartpoint when replaying the last checkpoint before the shutdown.

- pg_upgrade sets datminmxid = old_cluster.controldata.chkpnt_nxtmulti, which is correct only if there are ZERO multixacts in use at the time of the upgrade. It would be best, I think, to set this to the same value it had in the old cluster, but if we're going to use a blanket value, I think it needs to be chkpnt_oldstMulti.

- There's a third possible problem related to boundary cases in SlruScanDirCbRemoveMembers, but I don't understand that one well enough to explain it. Maybe Thomas can jump in here and explain the concern.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
"Joshua D. Drake"
Date:
On 05/28/2015 12:56 PM, Robert Haas wrote:
>

FTR: Robert, you have been a Samurai on this issue. Our many thanks.

Sincerely,

jD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Robert Haas wrote:
> On Thu, May 28, 2015 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> [ speculation ]
>
> OK, I finally managed to reproduce this, after some off-list help from
> Steve Kehlet (the reporter), Alvaro, and Thomas Munro. Here's how to
> do it:

It's a long list of steps, but if you consider them carefully, it becomes clear that they are natural steps that a normal production system would go through -- essentially the only one that's really time-critical is the decision to pg_upgrade with a version before 9.3.5.

> In the process of investigating this, we found a few other things that
> seem like they may also be bugs:
>
> - As noted upthread, replaying an older checkpoint after a newer
>   checkpoint has already happened may lead to similar problems. This
>   may be possible when starting from an online base backup; or when
>   restarting a standby that did not perform a restartpoint when
>   replaying the last checkpoint before the shutdown.

I'm going through this one now, as it's closely related and caused issues for us.

--
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Thomas Munro
Date:
On Fri, May 29, 2015 at 7:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 28, 2015 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> [ speculation ]
>
> [...] However, since the vacuum did advance relfrozenxid, it will call
> vac_truncate_clog, which will call SetMultiXactIdLimit, which will
> propagate the bogus datminmxid = 1 setting into shared memory.

Ah!

> [...]
>
> - There's a third possible problem related to boundary cases in
>   SlruScanDirCbRemoveMembers, but I don't understand that one well
>   enough to explain it. Maybe Thomas can jump in here and explain the
>   concern.

I noticed something in passing which is probably not harmful, and not relevant to this bug report; it was just a bit confusing while testing: SlruScanDirCbRemoveMembers never deletes any files if rangeStart == rangeEnd.

In practice, if you have an idle cluster with a lot of multixact data and you VACUUM FREEZE all databases and then CHECKPOINT, you might be surprised to see no member files going away quite yet, but they'll eventually be truncated by a future checkpoint, once rangeEnd has had a chance to advance to the next page due to more multixacts being created.

If we want to fix this one day, maybe the right thing to do is to treat the rangeStart == rangeEnd case the same way we treat rangeStart < rangeEnd, that is, to assume that the range of pages isn't wrapped/inverted in this case. Although we don't have the actual start and end offset values to compare here, we know that for them to fall on the same page, the start offset index must be <= the end offset index (since we added the new error to prevent member space wrapping, we never allow the end to get close enough to the start to fall on the same page).

Like this (not tested):

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..4d0bcc4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2755,7 +2755,7 @@ SlruScanDirCbRemoveMembers(SlruCtl ctl, char *filename, int segpage,
 	/* Recheck the deletion condition. If it still holds, perform deletion */
 	if ((range->rangeStart > range->rangeEnd &&
 		 segpage > range->rangeEnd && segpage < range->rangeStart) ||
-		(range->rangeStart < range->rangeEnd &&
+		(range->rangeStart <= range->rangeEnd &&
 		 (segpage < range->rangeStart || segpage > range->rangeEnd)))
 		SlruDeleteSegment(ctl, filename);

--
Thomas Munro
http://www.enterprisedb.com
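The deletion condition from the diff above, extracted into a standalone predicate so the rangeStart == rangeEnd behavior is easy to exercise. This is a hypothetical standalone restatement (with the proposed "<="), not the actual multixact.c code, which takes an SlruCtl and a filename:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Is the segment page outside the keep-range [rangeStart, rangeEnd]?
 * The keep-range may wrap around (rangeStart > rangeEnd).  With the
 * proposed "<=", rangeStart == rangeEnd is treated as a non-wrapped
 * keep-range of a single page, so every other segment is removable,
 * instead of nothing being removable at all.
 */
static bool segment_is_removable(int segpage, int rangeStart, int rangeEnd)
{
    return (rangeStart > rangeEnd &&            /* keep-range wraps around */
            segpage > rangeEnd && segpage < rangeStart) ||
           (rangeStart <= rangeEnd &&           /* non-wrapped keep-range */
            (segpage < rangeStart || segpage > rangeEnd));
}
```

With the original strict "<", the rangeStart == rangeEnd case falls through both arms and nothing is ever deleted, which matches the "no member files going away quite yet" behavior described above.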
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > FTR: Robert, you have been a Samurai on this issue. Our many thanks. Thanks! I really appreciate the kind words. So, in thinking through this situation further, it seems to me that the situation is pretty dire: 1. If you pg_upgrade to 9.3 before 9.3.5, then you may have relminmxid or datminmxid values which are 1 instead of the correct value. Setting the value to 1 was too far in the past if your MXID counter is < 2B, and too far in the future if your MXID counter is > 2B. 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid values which are equal to the next-mxid counter instead of the correct value; in other words, they are two new. 3. If you pg_upgrade to 9.3.5, 9.3.6, 9.4.0, or 9.4.1, then you will have the first problem for tables in template databases, and the second one for the rest. (See 866f3017a.) 4. Wrong relminmxid or datminmxid values can eventually propagate into the control file, as demonstrated in my previous post. Therefore, we can't count on relminmxid to be correct, we can't count on datminmxid to be correct, and we can't count on the control file to be correct. That's a sack of sad. 5. If the values are too far in the past, then nothing really terrible will happen unless you upgrade to 9.3.7 or 9.4.2, at which point the system will refuse to start. Forcing a VACUUM FREEZE on every database, including the unconnectable ones, should fix this and allow you to upgrade safely - which you want to do, because 9.3.7 and 9.4.2 fix a different set of multixact data loss bugs. 6. If the values are too far in the future, the system may fail to prevent wraparound, leading to data loss. I am not totally clear on whether a VACUUM FREEZE will fix this problem. It seems like the chances are better if you are running at least 9.3.5+ or 9.4.X, because of 78db307bb. But I'm not sure how complete a fix that is. So what do we do about this? I have a few ideas: A. 
Most obviously, we should fix pg_upgrade so that it installs chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so that we stop creating new instances of this problem. That won't get us out of the hole we've dug for ourselves, but we can at least try to stop digging. (This is assuming I'm right that chkpnt_nxtmulti is the wrong thing - anyone want to double-check me on that one?) B. We need to change find_multixact_start() to fail softly. This is important because it's legitimate for it to fail in recovery, as discussed upthread, and also because we probably want to eliminate the fail-to-start hazard introduced in 9.4.2 and 9.3.7. find_multixact_start() is used in three places, and they each require separate handling: - In SetMultiXactIdLimit, find_multixact_start() is used to set MultiXactState->oldestOffset, which is used to determine how aggressively to vacuum. If find_multixact_start() fails, we don't know how aggressively we need to vacuum to prevent members wraparound; it's probably best to decide to vacuum as aggressively as possible. Of course, if we're in recovery, we won't vacuum either way; the fact that it fails softly is good enough. - In DetermineSafeOldestOffset, find_multixact_start() is used to set MultiXactState->offsetStopLimit. If it fails here, we don't know when to refuse multixact creation to prevent wraparound. Again, in recovery, that's fine. If it happens in normal running, it's not clear what to do. Refusing multixact creation is an awfully blunt instrument. Maybe we can scan pg_multixact/offsets to determine a workable stop limit: the first file greater than the current file that exists, minus two segments, is a good stop point. Perhaps we ought to use this mechanism here categorically, not just when find_multixact_start() fails. It might be more robust than what we have now. - In TruncateMultiXact, find_multixact_start() is used to set the truncation point for the members SLRU. 
If it fails here, I'm guessing the right solution is not to truncate anything - instead, rely on intense vacuuming to eventually advance oldestMXact to a value whose member data still exists; truncate then. C. I think we should also change TruncateMultiXact() to truncate offsets first, and then members. As things stand, if we truncate members first, we increase the risk of seeing an offset that will fail when passed to find_multixact_start(), because TruncateMultiXact() might get interrupted before it finishes. That seems like an unnecessary risk. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
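Point 1's "too far in the past if your MXID counter is < 2B, and too far in the future if > 2B" follows from multixact IDs being compared on a 2^32 circle. A minimal sketch of that wraparound-aware comparison, in the spirit of the backend's MultiXactIdPrecedes (the Python below is purely illustrative, not PostgreSQL source):

```python
# Sketch of the modulo-2^32 "precedes" comparison used for multixact
# IDs, mirroring the signed-difference trick in MultiXactIdPrecedes.

def mxid_precedes(a: int, b: int) -> bool:
    """True if multixact ID a logically precedes b on the 2^32 circle."""
    diff = (a - b) & 0xFFFFFFFF
    # Reinterpret as signed 32-bit: negative means a < b.
    if diff >= 2**31:
        diff -= 2**32
    return diff < 0

# With the real counter below 2 billion, mxid 1 is in the past:
assert mxid_precedes(1, 1_000_000_000)
# But once the counter passes 2^31, mxid 1 "wraps" into the future:
assert mxid_precedes(3_000_000_000, 1)
```

This is why a blindly installed datminmxid of 1 flips from harmlessly old to dangerously "future" once the counter crosses the 2^31 midpoint.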
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date
Robert Haas wrote: > 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid > values which are equal to the next-mxid counter instead of the correct > value; in other words, they are too new. What you describe is what happens if you upgrade from 9.2 or earlier. For this case we use this call: exec_prog(UTILITY_LOG_FILE, NULL, true, "\"%s/pg_resetxlog\" -m %u,%u \"%s\"", new_cluster.bindir, old_cluster.controldata.chkpnt_nxtmulti + 1, old_cluster.controldata.chkpnt_nxtmulti, new_cluster.pgdata); This uses the old cluster's nextMulti value as oldestMulti in the new cluster, and that value+1 is used as nextMulti. This is correct: we don't want to preserve any of the multixact state from the previous cluster; anything before that value can be truncated with no loss of critical data. In fact, there is no critical data before that value at all. If you upgrade from 9.3, this other call is used instead: /* * we preserve all files and contents, so we must preserve both "next" * counters here and the oldest multi present on system. */ exec_prog(UTILITY_LOG_FILE, NULL, true, "\"%s/pg_resetxlog\" -O %u -m %u,%u \"%s\"", new_cluster.bindir, old_cluster.controldata.chkpnt_nxtmxoff, old_cluster.controldata.chkpnt_nxtmulti, old_cluster.controldata.chkpnt_oldstMulti, new_cluster.pgdata); In this case we use the oldestMulti from the old cluster as oldestMulti in the new cluster, which is also correct. > A. Most obviously, we should fix pg_upgrade so that it installs > chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so > that we stop creating new instances of this problem. That won't get > us out of the hole we've dug for ourselves, but we can at least try to > stop digging. (This is assuming I'm right that chkpnt_nxtmulti is the > wrong thing - anyone want to double-check me on that one?) I don't think there's anything that we need to fix here. > B. We need to change find_multixact_start() to fail softly. 
This is > important because it's legitimate for it to fail in recovery, as > discussed upthread, and also because we probably want to eliminate the > fail-to-start hazard introduced in 9.4.2 and 9.3.7. > find_multixact_start() is used in three places, and they each require > separate handling: > > - In SetMultiXactIdLimit, find_multixact_start() is used to set > MultiXactState->oldestOffset, which is used to determine how > aggressively to vacuum. If find_multixact_start() fails, we don't > know how aggressively we need to vacuum to prevent members wraparound; > it's probably best to decide to vacuum as aggressively as possible. > Of course, if we're in recovery, we won't vacuum either way; the fact > that it fails softly is good enough. Sounds good. > - In DetermineSafeOldestOffset, find_multixact_start() is used to set > MultiXactState->offsetStopLimit. If it fails here, we don't know when > to refuse multixact creation to prevent wraparound. Again, in > recovery, that's fine. If it happens in normal running, it's not > clear what to do. Refusing multixact creation is an awfully blunt > instrument. Maybe we can scan pg_multixact/offsets to determine a > workable stop limit: the first file greater than the current file that > exists, minus two segments, is a good stop point. Perhaps we ought to > use this mechanism here categorically, not just when > find_multixact_start() fails. It might be more robust than what we > have now. Blunt instruments have the desirable property of being simple. We don't want any more clockwork here, I think --- this stuff is pretty complicated already. As far as I understand, if during normal running we see that find_multixact_start has failed, sufficient vacuuming should get it straight eventually with no loss of data. > - In TruncateMultiXact, find_multixact_start() is used to set the > truncation point for the members SLRU. 
If it fails here, I'm guessing > the right solution is not to truncate anything - instead, rely on > intense vacuuming to eventually advance oldestMXact to a value whose > member data still exists; truncate then. Fine. > C. I think we should also change TruncateMultiXact() to truncate > offsets first, and then members. As things stand, if we truncate > members first, we increase the risk of seeing an offset that will fail > when passed to find_multixact_start(), because TruncateMultiXact() > might get interrupted before it finishes. That seem like an > unnecessary risk. Not sure about this point. We did it the way you propose previously, and found it to be a problem because sometimes we tried to read an offset file that was no longer there. Do we really read member files anywhere? I thought we only tried to read offset files. If we remove member files, what is it that we try to read and find not to be present? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date
Alvaro Herrera wrote: > Robert Haas wrote: > > > 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid > > values which are equal to the next-mxid counter instead of the correct > > value; in other words, they are too new. > > What you describe is what happens if you upgrade from 9.2 or earlier. Oh, you're referring to pg_database values, not the ones in pg_control. Ugh :-( This invalidates my argument that there's nothing to fix, obviously ... it's clearly broken as is. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date
On Thu, May 28, 2015 at 10:41 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid >> values which are equal to the next-mxid counter instead of the correct >> value; in other words, they are too new. > > [ discussion of how the control file's oldestMultiXid gets set ] I'm talking about the datminmxid in pg_database. You're talking about the contents of pg_control. Those are two different things. The relevant code is not what you quote, but rather this: /* set pg_database.datminmxid */ PQclear(executeQueryOrDie(conn_template1, "UPDATE pg_catalog.pg_database " "SET datminmxid = '%u'", old_cluster.controldata.chkpnt_nxtmulti)); Tom previously observed this to be wrong, here: http://www.postgresql.org/message-id/9879.1405877821@sss.pgh.pa.us Although Tom was correct to note that it's wrong, nothing ever got fixed. :-( >> A. Most obviously, we should fix pg_upgrade so that it installs >> chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so >> that we stop creating new instances of this problem. That won't get >> us out of the hole we've dug for ourselves, but we can at least try to >> stop digging. (This is assuming I'm right that chkpnt_nxtmulti is the >> wrong thing - anyone want to double-check me on that one?) > > I don't think there's anything that we need to fix here. I see your followup now agreeing this is broken. Since I wrote the previous email, I've had two new ideas that I think are both better than the above. 1. Figure out the oldest multixact offset that actually exists in pg_multixacts/offsets, and use that value. If any older MXIDs still exist, they won't be able to be looked up anyway, so if they wrap around, it doesn't matter. The only value that needs to be reliable in order to do this is pg_controldata's NextMultiXactId, which to the best of my knowledge is not implicated in any of these bugs. 
pg_upgrade can check that the offsets file containing that value exists, and if not bail out. Then, start stepping backwards a file at a time. When it hits a missing file, the first multixact in the next file is a safe value of datfrozenxid for every database in the new cluster. If older MXIDs exist, they're unreadable anyway, so if they wrap, nothing lost. If the value is older than necessary, the first vacuum in each database will fix it. We have to be careful: if we step back too many files, such that our proposed datfrozenxid might wrap, then we've got a confusing situation and had better bail out - or at least think really carefully about what to do. 2. When we're upgrading from a version 9.3 or higher, copy the EXACT datminmxid from each old database to the corresponding new database. This seems like it ought to be really simple. >> - In DetermineSafeOldestOffset, find_multixact_start() is used to set >> MultiXactState->offsetStopLimit. If it fails here, we don't know when >> to refuse multixact creation to prevent wraparound. Again, in >> recovery, that's fine. If it happens in normal running, it's not >> clear what to do. Refusing multixact creation is an awfully blunt >> instrument. Maybe we can scan pg_multixact/offsets to determine a >> workable stop limit: the first file greater than the current file that >> exists, minus two segments, is a good stop point. Perhaps we ought to >> use this mechanism here categorically, not just when >> find_multixact_start() fails. It might be more robust than what we >> have now. > > Blunt instruments have the desirable property of being simple. We don't > want any more clockwork here, I think --- this stuff is pretty > complicated already. As far as I understand, if during normal running > we see that find_multixact_start has failed, sufficient vacuuming should > get it straight eventually with no loss of data. Unfortunately, I don't believe that to be true. 
If find_multixact_start() fails, we have no idea how close we are to the member wraparound point. Sure, we can start vacuuming, but the user can be creating new, large multixacts at top speed while we're doing that, which could cause us to wrap around before we can finish vacuuming. Furthermore, if we adopted the blunt instrument, users who are in this situation would update to 9.4.3 (or whenever these fixes get released) and find that they can't create new MXIDs for a possibly very protracted period of time. That amounts to an outage for which users won't thank us. Looking at the files in the directory seems pretty simple in this case, and quite a bit more fail-safe than what we're doing right now. The current logic purports to leave a one-file gap in the member space, but there's no guarantee that the gap really exists on disk the way we think it does. With this approach, we can be certain that there is a gap. And that is a darned good thing to be certain about. >> C. I think we should also change TruncateMultiXact() to truncate >> offsets first, and then members. As things stand, if we truncate >> members first, we increase the risk of seeing an offset that will fail >> when passed to find_multixact_start(), because TruncateMultiXact() >> might get interrupted before it finishes. That seem like an >> unnecessary risk. > > Not sure about this point. We did it the way you propose previously, > and found it to be a problem because sometimes we tried to read an > offset file that was no longer there. Do we really read member files > anywhere? I thought we only tried to read offset files. If we remove > member files, what is it that we try to read and find not to be present? Do you have a link to the previous discussion? I mean, the problem we're having right now is that sometimes we have an offset, but the corresponding member isn't there. So clearly offsets reference members. Do members also reference offsets? I didn't think so, but life is full of surprises. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
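Robert's "idea 1" above, stepping backwards one offsets file at a time from the segment holding NextMultiXactId, might look roughly like this sketch. The per-segment constant and function name are assumptions, and the "stepped back so far it might wrap" bail-out is reduced to a comment:

```python
# Sketch (assumed names, not pg_upgrade code) of idea 1: start from
# the offsets segment containing NextMultiXactId, verify it exists,
# then walk backwards until a file is missing. The first multixact of
# the oldest surviving contiguous segment is a safe datminmxid.

MXIDS_PER_SEGMENT = 2048 * 32  # offsets per page * pages per segment (assumed)

def safe_datminmxid(next_mxid: int, segments_on_disk: set) -> int:
    seg = next_mxid // MXIDS_PER_SEGMENT
    if seg not in segments_on_disk:
        raise RuntimeError("segment for NextMultiXactId missing - bail out")
    # NB: the real logic must also bail out if it steps back so many
    # files that the proposed value could wrap around; this sketch
    # ignores the 2^32 wraparound for simplicity.
    while seg > 0 and (seg - 1) in segments_on_disk:
        seg -= 1
    return seg * MXIDS_PER_SEGMENT

segs = {3, 4, 5}  # segments 0-2 already truncated away
nxt = 5 * MXIDS_PER_SEGMENT + 17
assert safe_datminmxid(nxt, segs) == 3 * MXIDS_PER_SEGMENT
```

Any still-older MXIDs are unreadable either way, so a value that is merely older than strictly necessary is corrected by the first vacuum in each database.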
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Thomas Munro
Date
On Fri, May 29, 2015 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote: > A. Most obviously, we should fix pg_upgrade so that it installs > chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so > that we stop creating new instances of this problem. That won't get > us out of the hole we've dug for ourselves, but we can at least try to > stop digging. (This is assuming I'm right that chkpnt_nxtmulti is the > wrong thing - anyone want to double-check me on that one?) Yes, it seems like this could lead to truncation of multixacts still referenced by tuples, leading to errors when updating, locking, vacuuming. Why don't we have reports of that? > B. We need to change find_multixact_start() to fail softly. This is > important because it's legitimate for it to fail in recovery, as > discussed upthread, and also because we probably want to eliminate the > fail-to-start hazard introduced in 9.4.2 and 9.3.7. > find_multixact_start() is used in three places, and they each require > separate handling: Here is an experimental WIP patch that changes StartupMultiXact and SetMultiXactIdLimit to find the oldest multixact that exists on disk (by scanning the directory), and uses that if it is more recent than the oldestMultiXactId from shmem, when calling DetermineSafeOldestOffset. I'm not all that happy with it, see below, but let me know what you think. Using unpatched master, I reproduced the startup error with a bit of a shortcut: 1. initdb, generate enough multixacts to get more than one offsets file 2. ALTER DATABASE template0 ALLOW_CONNECTIONS true;, vacuumdb --freeze --all, CHECKPOINT 3. verify that pg_control now holds a large oldestMultiXactId, and note NextMultiXactId 4. shutdown, pg_resetxlog -m (NextMultiXactId from 3),1 pg_data 5. start up: fails Apply this patch, and it starts up successfully. What are the repro steps for the replay problem? Is a basebackup of a large database undergoing truncation and some good timing needed? 
> - In SetMultiXactIdLimit, find_multixact_start() is used to set > MultiXactState->oldestOffset, which is used to determine how > aggressively to vacuum. If find_multixact_start() fails, we don't > know how aggressively we need to vacuum to prevent members wraparound; > it's probably best to decide to vacuum as aggressively as possible. > Of course, if we're in recovery, we won't vacuum either way; the fact > that it fails softly is good enough. Isn't it enough to use the start offset for the most recent of the oldest multixact ID and the oldest multixact found by scanning pg_multixact/offsets? In this patch, it does that, but I'm not happy with the time the work is done, it just doesn't seem right for SetMultiXactIdLimit to be scanning that directory. The result of that operation should only change when files have been truncated anyway, and the truncate code was already doing a filesystem scan. Maybe the truncate code should store the earliest multixact ID found on disk in shared memory, so that SetMultiXactIdLimit can use it for free. I tried to get that working but couldn't figure out where it should be initialised -- StartupMultiXact is too late (StartupXLOG calls SetMultiXactIdLimit before that), but BootstrapMultiXact and MultiXactShmemInit didn't seem like the right places either. > - In DetermineSafeOldestOffset, find_multixact_start() is used to set > MultiXactState->offsetStopLimit. If it fails here, we don't know when > to refuse multixact creation to prevent wraparound. Again, in > recovery, that's fine. If it happens in normal running, it's not > clear what to do. Refusing multixact creation is an awfully blunt > instrument. Maybe we can scan pg_multixact/offsets to determine a > workable stop limit: the first file greater than the current file that > exists, minus two segments, is a good stop point. Perhaps we ought to > use this mechanism here categorically, not just when > find_multixact_start() fails. 
It might be more robust than what we > have now. Done in this patch -- the truncate code calls DetermineSafeOldestOffset with the earliest SLRU found by scanning if that's more recent than the shmem value, and then DetermineSafeOldestOffset applies the step-back-one-whole-segment logic to that as before. > - In TruncateMultiXact, find_multixact_start() is used to set the > truncation point for the members SLRU. If it fails here, I'm guessing > the right solution is not to truncate anything - instead, rely on > intense vacuuming to eventually advance oldestMXact to a value whose > member data still exists; truncate then. TruncateMultiXact already contained logic to do nothing at all if oldestMXact is older than the earliest it can find on disk. I moved that code into find_earliest_multixact_on_disk() to be able to use it elsewhere too, in this patch. > C. I think we should also change TruncateMultiXact() to truncate > offsets first, and then members. As things stand, if we truncate > members first, we increase the risk of seeing an offset that will fail > when passed to find_multixact_start(), because TruncateMultiXact() > might get interrupted before it finishes. That seem like an > unnecessary risk. I don't see why the order matters. find_multixact_start() doesn't read the members, only the offsets SLRU (ie the index into members, not the contents of members). As I understand it, the only time we need to access the members themselves is when we encounter multixacts in tuple headers (updating, locking or vacuuming). If you have truncated multixacts referenced in your tuples then you have a different form of corruption than the pg_upgrade-tramples-on-oldestMultiXactId case we're trying to handle gracefully here. -- Thomas Munro http://www.enterprisedb.com
Attachments
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Christoph Berg
Date
Re: Robert Haas 2015-05-29 <CA+TgmoZZDJN38TFqYdGAGj-ap+ZKRQSrGbq4Eu_ZreFRYk+osA@mail.gmail.com> > > FTR: Robert, you have been a Samurai on this issue. Our many thanks. > > Thanks! I really appreciate the kind words. I'm still watching with admiration. This list of steps-to-reproduce is the longest and at the same time best I've ever seen. If anyone ever asks me why debugging in open source projects is better than closed source, I'll point them to this thread. Christoph -- cb@df7cb.de | http://www.df7cb.de/
Thomas Munro <thomas.munro@enterprisedb.com> writes: > On Fri, May 29, 2015 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> B. We need to change find_multixact_start() to fail softly. > Here is an experimental WIP patch that changes StartupMultiXact and > SetMultiXactIdLimit to find the oldest multixact that exists on disk > (by scanning the directory), and uses that if it is more recent than > the oldestMultiXactId from shmem, Not sure about the details of this patch, but I was planning to propose what I think is the same thing: the way to make find_multixact_start() fail softly is to have it find the oldest actually existing file if the one that should be there isn't. This would preserve the idea that we aren't trusting the multixact tracking data to be completely right, which was the point of 78db307bb and which evidently is still essential. It would also obviate the need for other places to contain similar logic. regards, tom lane
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date
On Fri, May 29, 2015 at 10:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: >> On Fri, May 29, 2015 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> B. We need to change find_multixact_start() to fail softly. > >> Here is an experimental WIP patch that changes StartupMultiXact and >> SetMultiXactIdLimit to find the oldest multixact that exists on disk >> (by scanning the directory), and uses that if it is more recent than >> the oldestMultiXactId from shmem, > > Not sure about the details of this patch, but I was planning to propose > what I think is the same thing: the way to make find_multixact_start() > fail softly is to have it find the oldest actually existing file if the > one that should be there isn't. Working on that now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date
On Fri, May 29, 2015 at 12:43 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Working on that now. OK, here's a patch. Actually two patches, differing only in whitespace, for 9.3 and for master (ha!). I now think that the root of the problem here is that DetermineSafeOldestOffset() and SetMultiXactIdLimit() were largely ignorant of the possibility that they might be called at points in time when the cluster was inconsistent. SetMultiXactIdLimit() bracketed certain parts of its logic with if (!InRecovery), but those guards were ineffective because it gets called before InRecovery is set in the first place. It seems pretty clear that we can't effectively determine anything about member wraparound until the cluster is consistent. Before then, there might be files missing from the offsets or members SLRUs which get put back during replay. There could even be gaps in the sequence of files, with some things having made it to disk before the crash (or having made it into the backup) and others not. So all the work of determining what the safe stop points and vacuum thresholds for members are needs to be postponed until TrimMultiXact() time. And that's fine, because we don't need this information in recovery anyway - it only affects behavior in normal running. So this patch does the following: 1. Moves the call to DetermineSafeOldestOffset() that appears in StartupMultiXact() to TrimMultiXact(), so that we don't try to do this until we're consistent. Also, instead of passing MultiXactState->oldestMultiXactId, pass the newer of that value and the earliest offset that exists on disk. That way, it won't try to read data that's not there. Note that the second call to DetermineSafeOldestOffset() in TruncateMultiXact() doesn't need a similar guard, because we already bail out of that function early if the multixacts we're going to truncate away don't exist. 2. 
Adds a new flag MultiXactState->didTrimMultiXact to indicate whether we've finished TrimMultiXact(), and arranges for SetMultiXactIdLimit() to use that rather than InRecovery to test whether it's safe to do complicated things that might require that the cluster is consistent. This is a slight behavior change, since formerly we would have tried to do that stuff very early in the startup process, and now it won't happen until somebody completes a vacuum operation. If that's a problem, we could consider doing it in TrimMultiXact(), but I don't think it's safe the way it was. The new flag also prevents oldestOffset from being set while in recovery; I think it would be safe to do that in recovery once we've reached consistency, but I don't believe it's necessary. 3. Arranges for TrimMultiXact() to set oldestOffset. This is necessary because, previously, we relied on SetMultiXactIdLimit doing that during early startup or during recovery, and that's no longer true. Here too we set oldestOffset keeping in mind that our notion of the oldest multixact may point to something that doesn't exist; if so, we use the oldest MXID that does. 4. Modifies TruncateMultiXact() so that it doesn't re-scan the SLRU directory on every call to find the oldest file that exists. Instead, it arranges to remember the value from the first scan and then updates it thereafter to reflect its own truncation activity. This isn't absolutely necessary, but because this oldest-file logic is used in multiple places (TrimMultiXact, SetMultiXactIdLimit, and TruncateMultiXact all need it directly or indirectly) caching the value seems like a better idea than recomputing it frequently. I have tested that this patch fixes Steve Kehlet's problem, or at least what I believe to be Steve Kehlet's problem based on the reproduction scenario I described upthread. I believe it will also fix the problems with starting up from a base backup that Alvaro mentioned upthread. 
It won't fix the fact that pg_upgrade is putting a wrong value into everybody's datminmxid field, which should really be addressed too, but I've been working on this for about three days virtually non-stop and I don't have the energy to tackle it right now. If anyone feels the urge to step into that breach, I think what it needs to do is: when upgrading from a 9.3-or-later instance, copy over each database's datminmxid into the corresponding database in the new cluster. Aside from that, it's very possible that despite my best efforts this has serious bugs. Review and testing would be very much appreciated. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
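The "fail softly" behaviour described above boils down to: ask for a multixact's start position, and if its offsets segment is gone, fall back to the oldest multixact whose data still exists rather than erroring out. A hypothetical sketch (names and the per-segment constant are assumptions, not the patch's code):

```python
# Sketch of a fail-soft find_multixact_start: return the exact answer
# when the offsets segment exists, otherwise fall back to the oldest
# surviving segment's first multixact. Illustrative only.

MXIDS_PER_SEGMENT = 2048 * 32  # assumed

def find_multixact_start_soft(mxid: int, segments_on_disk: set):
    """Return (start_mxid, exact). exact=False means we fell back."""
    seg = mxid // MXIDS_PER_SEGMENT
    if seg in segments_on_disk:
        return mxid, True
    if not segments_on_disk:
        return None, False
    # Oldest surviving segment (ignoring wraparound for the sketch).
    return min(segments_on_disk) * MXIDS_PER_SEGMENT, False

mxid, exact = find_multixact_start_soft(10, {3, 4})
assert (mxid, exact) == (3 * MXIDS_PER_SEGMENT, False)
```

Callers then treat the fallback case conservatively: vacuum as aggressively as possible, or skip truncation entirely until the oldest MXID advances to data that still exists.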
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Bruce Momjian
Date
On Thu, May 28, 2015 at 07:24:26PM -0400, Robert Haas wrote: > On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > > FTR: Robert, you have been a Samurai on this issue. Our many thanks. > > Thanks! I really appreciate the kind words. > > So, in thinking through this situation further, it seems to me that > the situation is pretty dire: > > 1. If you pg_upgrade to 9.3 before 9.3.5, then you may have relminmxid > or datminmxid values which are 1 instead of the correct value. > Setting the value to 1 was too far in the past if your MXID counter is > < 2B, and too far in the future if your MXID counter is > 2B. > > 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid > values which are equal to the next-mxid counter instead of the correct > value; in other words, they are too new. > > 3. If you pg_upgrade to 9.3.5, 9.3.6, 9.4.0, or 9.4.1, then you will > have the first problem for tables in template databases, and the > second one for the rest. (See 866f3017a.) I think we need to step back and look at the brain power required to unravel the mess we have made regarding multi-xact and fixes. (I bet few people can even remember which multi-xact fixes went into which releases --- I can't.) Instead of working on actual features, we are having to do this complex diagnosis because we didn't do a thorough analysis at the time a pattern of multi-xact bugs started to appear. Many projects deal with this compound bug debt regularly, but we have mostly avoided this fate. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Josh Berkus
Date
All, Just saw what looks like a report of this issue on 9.2. https://github.com/wal-e/wal-e/issues/177 -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Steve Kehlet
Date
On Fri, May 29, 2015 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
OK, here's a patch.
I grabbed branch REL9_4_STABLE from git, and Robert got me a 9.4-specific patch. I rebuilt, installed, and postgres started up successfully! I did a bunch of checks, had our app run several thousand SQL queries against it, had a colleague check it out, and it looks good. Looking at top and ps, I don't see anything funny (e.g. no processes spinning cpu, etc), things look normal. Let me know if I can provide anything else.
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date
On Fri, May 29, 2015 at 5:14 PM, Josh Berkus <josh@agliodbs.com> wrote: > Just saw what looks like a report of this issue on 9.2. > > https://github.com/wal-e/wal-e/issues/177 Urk. That looks awfully similar, but I don't think any of the code that is affected here exists in 9.2, or that any of the fixes involved were back-patched to 9.2. So that might be something else altogether. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Thomas Munro
Date
On Sat, May 30, 2015 at 10:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, May 29, 2015 at 5:14 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Just saw what looks like a report of this issue on 9.2. >> >> https://github.com/wal-e/wal-e/issues/177 > > Urk. That looks awfully similar, but I don't think any of the code > that is affected here exists in 9.2, or that any of the fixes involved > were back-patched to 9.2. So that might be something else altogether. Not only that, the pg_multixact/members *directory* is reported missing, which is a different problem entirely. -- Thomas Munro http://www.enterprisedb.com
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Andres Freund
Date
On 2015-05-30 10:41:01 +1200, Thomas Munro wrote: > On Sat, May 30, 2015 at 10:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, May 29, 2015 at 5:14 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> Just saw what looks like a report of this issue on 9.2. > >> > >> https://github.com/wal-e/wal-e/issues/177 > > > > Urk. That looks awfully similar, but I don't think any of the code > > that is affected here exists in 9.2, or that any of the fixes involved > > were back-patched to 9.2. So that might be something else altogether. > > Not only that, the pg_multixact/members *directory* is reported > missing, which is a different problem entirely. I don't read that from the log; what makes you think that's the case? AFAICS it's "just" the 0146 member file that is missing?
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Thomas Munro
Date
On Sat, May 30, 2015 at 10:48 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-05-30 10:41:01 +1200, Thomas Munro wrote: >> On Sat, May 30, 2015 at 10:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> > On Fri, May 29, 2015 at 5:14 PM, Josh Berkus <josh@agliodbs.com> wrote: >> >> Just saw what looks like a report of this issue on 9.2. >> >> >> >> https://github.com/wal-e/wal-e/issues/177 >> > >> > Urk. That looks awfully similar, but I don't think any of the code >> > that is affected here exists in 9.2, or that any of the fixes involved >> > were back-patched to 9.2. So that might be something else altogether. >> >> Not only that, the pg_multixact/members *directory* is reported >> missing, which is a different problem entirely. > > I don't read that from the log, what does make you think that that's the > case. Afaics it's "just" the 0146 member file that is missing? That's the error message, but then further down: "I have confirmed that directory "pg_multixact/members" does not existing in the restored data directory. I can see this directory and the file if i restore a few days old backup. I have used WAL-E backups a number of time from this server but this is the first time I am running into this issue." -- Thomas Munro http://www.enterprisedb.com
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Andres Freund
Date
On 2015-05-30 10:55:30 +1200, Thomas Munro wrote: > That's the error message, but then further down: Ooops. > "I have confirmed that directory "pg_multixact/members" does not > existing in the restored data directory. > > I can see this directory and the file if i restore a few days old > backup. I have used WAL-E backups a number of time from this server > but this is the first time I am running into this issue." That sounds like a WAL-E problem rather than a postgres one then. I think it's exceedingly unlikely - although I perhaps should be careful with that given the recent history - that postgres contains a bug that deletes an entire directory on the standby/while cloning.
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Andres Freund
Date
On 2015-05-29 15:49:53 -0400, Bruce Momjian wrote: > I think we need to step back and look at the brain power required to > unravel the mess we have made regarding multi-xact and fixes. (I bet > few people can even remember which multi-xact fixes went into which > releases --- I can't.) Instead of working on actual features, we are > having to do this complex diagnosis because we didn't do a thorough > analysis at the time a pattern of multi-xact bugs started to appear. > Many projects deal with this compound bug debt regularly, but we have > mostly avoided this fate. What are the consequences of that observation you're imagining?
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Andres Freund
Date
On 2015-05-29 15:08:11 -0400, Robert Haas wrote: > It seems pretty clear that we can't effectively determine anything > about member wraparound until the cluster is consistent. I wonder if this doesn't actually hint at a bigger problem. Currently, to determine where we need to truncate SlruScanDirectory() is used. That, afaics, could actually be a problem during recovery when we're not consistent. Consider the scenario where a very large database is copied while running. At the start of the backup we'll determine at which checkpoint recovery will start and store it in the label. After that the copy will start, copying everything slowly. That works because we expect recovery to fix things up. The problem I see WRT multixacts is that the copied state of pg_multixact could be wildly different from the one at the label's checkpoint. During recovery, before reaching the first checkpoint, we'll create multixact files as used at the time of the checkpoint. But the rest of pg_multixact may be ahead by 2**31 xacts. For clog and such that's not a problem because the truncation etc. points are all stored in WAL - during recovery we just replay the truncations that happened on the master, there's no need to look at the data directory. And we won't access the clog before being consistent. But for multixacts it's different. To keep the pg_multixact/*/* directories from accumulating stale segments we have to do truncations during recovery. As there are currently no truncation records, we have to do that by scanning the data directory. But that state could be "from the future". I considered for a second whether the solution for that could be to not truncate while inconsistent - but I think that doesn't solve anything, as then we can end up with directories where every single offsets/member file exists. We could possibly try to fix that by always truncating away slru segments in offsets that we know to be too old to exist in a valid database. But achieving the same for members fries my brain. 
It also seems awfully risky. I think at least for 9.5+ we should a) invent proper truncation records for pg_multixact b) start storing oldestValidMultiOffset in pg_control. The current hack of scanning the directories to get knowledge we should have is a pretty bad hack, and we should not continue using it forever. I think we might end up needing to do a) even in the backbranches. Am I missing something? This problem isn't conflicting with most of the fixes you describe, so I'll continue with reviewing those. Greetings, Andres Freund
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date
On Fri, May 29, 2015 at 3:08 PM, Robert Haas <robertmhaas@gmail.com> wrote: > It won't fix the fact that pg_upgrade is putting > a wrong value into everybody's datminmxid field, which should really > be addressed too, but I've been working on this for about three days > virtually non-stop and I don't have the energy to tackle it right now. > If anyone feels the urge to step into that breach, I think what it > needs to do is: when upgrading from a 9.3-or-later instance, copy over > each database's datminmxid into the corresponding database in the new > cluster. Bruce was kind enough to spend some time on IM with me this afternoon, and I think this may actually be OK. What pg_upgrade does is: 1. First, put next-xid into the relminmxid for all tables, including catalog tables. This is the correct behavior for upgrades from a pre-9.3 release, and is correct for catalog tables in general. 2. Next, restoring the schema dump will set the relminmxid values for all non-catalog tables to the value dumped from the old cluster. At this point, everything is fine provided that we are coming from a release 9.3 or newer. But if the old cluster is pre-9.3, it will have dumped *zero* values for all of its relminmxid values; so all of the user tables go from the correct value they had after step 1 to an incorrect value. 3. Finally, if the old cluster is pre-9.3, repeat step 1, undoing the damage done in step 2. This is a bit convoluted, but I don't know of a reason why it shouldn't work. Sorry for the false alarm. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date
On Fri, May 29, 2015 at 9:46 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-05-29 15:08:11 -0400, Robert Haas wrote: >> It seems pretty clear that we can't effectively determine anything >> about member wraparound until the cluster is consistent. > > I wonder if this doesn't actually hints at a bigger problem. Currently, > to determine where we need to truncate SlruScanDirectory() is > used. That, afaics, could actually be a problem during recovery when > we're not consistent. I agree. I actually meant to mention this in my previous email, but, owing to exhaustion and burnout, didn't. > I think at least for 9.5+ we should a) invent proper truncation records > for pg_multixact b) start storing oldestValidMultiOffset in pg_control. > The current hack of scanning the directories to get knowledge we should > have is a pretty bad hack, and we should not continue using it forever. > I think we might end up needing to do a) even in the backbranches. That may be the right thing to do. I'm concerned that changing the behavior of master too much will make every subsequent fix twice as hard, because we'll have to do one fix in master and another fix in the back-branches. I'm also concerned that it will create even more convoluted failure scenarios. The failure-to-start problem discussed on this thread requires a chain of four (maybe three) different PostgreSQL versions in order to create it, and the more things we go change, the harder it's going to be to reason about this stuff. The diseased and rotting elephant in the room here is that clusters with bogus relminmxid, datminmxid, and/or oldestMultiXid values may exist in the wild and we really have no plan to get rid of them. 78db307bb may have helped somewhat - although I haven't grokked what it's about well enough to be sure - but it's certainly not a complete solution, as this bug report itself illustrates rather well. 
Unless we figure out some clever solution that is not now apparent to me, or impose a hard pg_upgrade compatibility break at some point, we basically can't count on pg_control's "oldest multixact" information to be correct ever again. We may be running into clusters 15 years from now that have problems that are just holdovers from what was fixed in 9.3.5. One thing I think we should definitely do is add one or two additional fields to pg_controldata that get filled in by pg_upgrade. One of them should be "the oldest known catversion in the lineage of this cluster" and the other should be "the most recent catversion in the lineage of this cluster before this one". Or maybe we should store PG_VERSION_NUM values. Or store both things. I think that would make troubleshooting this kind of problem a lot easier - just from the pg_controldata output, you'd be able to tell whether the cluster had been pg_upgraded, whether it had been pg_upgraded once or multiple times, and at least some of the versions involved, without relying on the user's memory of what they did and when. Fortunately, Steve Kehlet had a pretty clear idea of what his history was, but not all users know that kind of thing, and I've wanted it more than once while troubleshooting. Another thing I think we should do is add a field to pg_class that is propagated by pg_upgrade and stores the most recent PG_VERSION_NUM that is known to have performed a scan_all vacuum of the table. This would allow us to do things in the future like (a) force a full-table vacuum of any table that hasn't been vacuumed since $BUGGYRELEASE or (b) advise users to manually inspect the values and manually perform said vacuum or (c) only believe that certain information about a table is accurate if it's been full-scanned by a vacuum newer than $BUGGYRELEASE. It could also be used as part of a strategy for reclaiming HEAP_MOVED_IN/HEAP_MOVED_OFF; e.g. 
you can't upgrade to 10.5, which repurposes those bits, unless you've done a scan_all vacuum on every table with a release new enough to guarantee that they're not used for their historical purpose. > This problem isn't conflicting with most of the fixes you describe, so > I'll continue with reviewing those. Thank you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Thomas Munro
Date
On Sat, May 30, 2015 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-05-29 15:08:11 -0400, Robert Haas wrote: >> It seems pretty clear that we can't effectively determine anything >> about member wraparound until the cluster is consistent. > > I wonder if this doesn't actually hints at a bigger problem. Currently, > to determine where we need to truncate SlruScanDirectory() is > used. That, afaics, could actually be a problem during recovery when > we're not consistent. > > Consider the scenario where a very large database is copied while > running. At the start of the backup we'll determine at which checkpoint > recovery will start and store it in the label. After that the copy will > start, copying everything slowly. That works because we expect recovery > to fix things up. The problem I see WRT multixacts is that the copied > state of pg_multixact could be wildly different from the one at the > label's checkpoint. During recovery, before reaching the first > checkpoint, we'll create multixact files as used at the time of the > checkpoint. But the rest of pg_multixact may be ahead 2**31 xacts. Yes, I think the code in TruncateMultiXact that scans for the earliest multixact only works when the segment files span at most 2^31 of multixact space. If they span more than that, MultiXactIdPrecedes is no longer able to provide a total ordering, so the result of the scan may be wrong, depending on the order in which it encounters the files. Incidentally, your description of that scenario gave me an idea for how to reproduce a base backup that 9.4.2 or master can't start. I tried this first: 1. Set up with max_wal_senders = 1, wal_level = hot_standby, initdb 2. Create enough multixacts to fill a couple of segments in pg_multixacts/offsets using "explode_mxact_members 99 1000" (create foo table first) 3. 
Start a base backup with logs, but break in src/backend/replication/basebackup.c after sendFileWithContent(BACKUP_LABEL_FILE, labelfile); and before sending the contents of the data dir (including pg_multixacts)... (or just put a big sleep in there) 4. UPDATE pg_database SET datallowconn = true; vacuumdb --freeze --all; CHECKPOINT; see that offsets/0000 is now gone and oldestMultiXid is 98001 in pg_control 5. ... allow the server backend to continue; the basebackup completes. Inspecting the new data directory, I see that offsets/0000 is not present as expected, and pg_control contains the oldestMultiXid 98001. Since pg_control was copied after pg_multixacts and my database didn't move between those copies, it points to a valid multixact (unlike the pg_upgrade scenario) and is able to start up, but does something different again, which may or may not be good, I'm not sure: LOG: database system was interrupted; last known up at 2015-05-30 14:30:23 NZST LOG: file "pg_multixact/offsets/0000" doesn't exist, reading as zeroes LOG: redo starts at 0/7000028 LOG: consistent recovery state reached at 0/70C8898 LOG: redo done at 0/70C8898 LOG: last completed transaction was at log time 2015-05-30 14:30:17.261436+12 LOG: database system is ready to accept connections My next theory about how to get a FATAL during startup is something like this: Break in basebackup.c in between copying pg_multixacts and copying pg_control (simulating a very large/slow file copy, perhaps if 'base' happens to get copied after 'pg_multixacts', though I don't know if that's possible), and while it's stopped, generate some offsets segments, vacuum --freeze --all, checkpoint and then create a few more multixacts, then checkpoint again (so that oldestMultiXact is not equal to nextMultiXact). Continue. Now pg_control's oldestMultiXactId points at a segment file that didn't exist when pg_multixacts was copied. 
I haven't managed to get this to work (ie produce a FATAL) and I'm out of time for a little while, but wanted to share this idea in case it helps someone. -- Thomas Munro http://www.enterprisedb.com
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date
Bruce Momjian wrote: > I think we need to step back and look at the brain power required to > unravel the mess we have made regarding multi-xact and fixes. (I bet > few people can even remember which multi-xact fixes went into which > releases --- I can't.) Instead of working on actual features, we are > having to do this complex diagnosis because we didn't do a thorough > analysis at the time a pattern of multi-xact bugs started to appear. > Many projects deal with this compound bug debt regularly, but we have > mostly avoided this fate. Well, it's pretty obvious that if we had had a glimpse of the nature of the issues back then, we wouldn't have committed the patch. The number of ends that we left loose we now know to be huge, but we didn't know that back then. (I, at least, certainly didn't.) Simon told me when this last one showed up that what we need at this point is a way to turn the whole thing off to stop it from affecting users anymore. I would love to be able to do that, because the whole situation has become stressful, but I don't see a way. Heck, if we could implement Heikki's TED idea or something similar, I would be up for back-patching it so that people can pg_upgrade from postgresql-9.3 to postgresql-ted-9.3, and just forget any further multixact pain. Don't think that's really doable, though. As far as I can see, for branches 9.3 and 9.4 the best we can do is soldier on and get these bugs fixed, hoping that this time they are really the last [serious] ones. For 9.5, I concur with Andres that we'd do good to change the way truncations are done by WAL-logging more stuff and keeping more data in pg_control, to avoid all these nasty games. And for 9.6, find a better representation of the data so that the durable data is stored separately from the volatile data. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date
Andres Freund wrote: > I considered for a second whether the solution for that could be to not > truncate while inconsistent - but I think that doesn't solve anything as > then we can end up with directories where every single offsets/member > file exists. Hang on a minute. We don't need to scan any files to determine the truncate point for offsets; we have the valid range for them in pg_control, as nextMulti + oldestMulti. And using those end points, we can look for the offsets corresponding to each, and determine the member files corresponding to the whole set; it doesn't matter what other files exist, we just remove them all. In other words, maybe we can get away with considering truncation separately for offset and members on recovery: do it like today for offsets (i.e. at each restartpoint), but do it only in TrimMultiXact for members. One argument against this idea is that we may not want to keep a full set of member files on standbys (due to disk space usage), but that's what will happen unless we truncate during replay. > I think at least for 9.5+ we should a) invent proper truncation records > for pg_multixact b) start storing oldestValidMultiOffset in pg_control. > The current hack of scanning the directories to get knowledge we should > have is a pretty bad hack, and we should not continue using it forever. > I think we might end up needing to do a) even in the backbranches. Definitely agree with WAL-logging truncations; also +1 on backpatching that to 9.3. We already have experience with adding extra WAL records on minor releases, and it didn't seem to have bitten too hard. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Andres Freund
Date
On 2015-05-30 00:52:37 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > I considered for a second whether the solution for that could be to not > > truncate while inconsistent - but I think that doesn't solve anything as > > then we can end up with directories where every single offsets/member > > file exists. > > Hang on a minute. We don't need to scan any files to determine the > truncate point for offsets; we have the valid range for them in > pg_control, as nextMulti + oldestMulti. And using those end points, we > can look for the offsets corresponding to each, and determine the member > files corresponding to the whole set; it doesn't matter what other files > exist, we just remove them all. In other words, maybe we can get away > with considering truncation separately for offset and members on > recovery: do it like today for offsets (i.e. at each restartpoint), but > do it only in TrimMultiXact for members. Is oldestMulti, nextMulti - 1 really suitable for this? Are both actually guaranteed to exist in the offsets slru and be valid? Hm. I guess you intend to simply truncate everything else, but just in offsets? > One argument against this idea is that we may not want to keep a full > set of member files on standbys (due to disk space usage), but that's > what will happen unless we truncate during replay. I think that argument is pretty much the death-knell. > > I think at least for 9.5+ we should a) invent proper truncation records > > for pg_multixact b) start storing oldestValidMultiOffset in pg_control. > > The current hack of scanning the directories to get knowledge we should > > have is a pretty bad hack, and we should not continue using it forever. > > I think we might end up needing to do a) even in the backbranches. > > Definitely agree with WAL-logging truncations; also +1 on backpatching > that to 9.3. We already have experience with adding extra WAL records > on minor releases, and it didn't seem to have bitten too hard. 
I'm inclined to agree. My only problem is that I'm not sure whether we can find a way of doing all this without adding a pg_control field. Let me try to sketch this out: 1) We continue determining the oldest SlruScanDirectory(SlruScanDirCbFindEarliest) on the master to find the oldest offsets segment to truncate. Alternatively, if we determine it to be safe, we could use oldestMulti to find that. 2) SlruScanDirCbRemoveMembers is changed to return the range of members to remove, instead of doing itself 3) We wal log [oldest offset segment guaranteed to not be alive, nextmulti) for offsets, and [oldest members segment guaranteed to not be alive, nextmultioff), and redo truncations for the entire range during recovery. I'm pretty tired right now, but this sounds doable. Greetings, Andres Freund
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date
On Sat, May 30, 2015 at 8:55 PM, Andres Freund <andres@anarazel.de> wrote: > Is oldestMulti, nextMulti - 1 really suitable for this? Are both > actually guaranteed to exist in the offsets slru and be valid? Hm. I > guess you intend to simply truncate everything else, but just in > offsets? oldestMulti in theory is the right thing, I think, but in actuality we know that some people have 1 here instead of the correct value. >> One argument against this idea is that we may not want to keep a full >> set of member files on standbys (due to disk space usage), but that's >> what will happen unless we truncate during replay. > > I think that argument is pretty much the death-knell. Yes. Truncating on the standby is really not optional. >> > I think at least for 9.5+ we should a) invent proper truncation records >> > for pg_multixact b) start storing oldestValidMultiOffset in pg_control. >> > The current hack of scanning the directories to get knowledge we should >> > have is a pretty bad hack, and we should not continue using it forever. >> > I think we might end up needing to do a) even in the backbranches. >> >> Definitely agree with WAL-logging truncations; also +1 on backpatching >> that to 9.3. We already have experience with adding extra WAL records >> on minor releases, and it didn't seem to have bitten too hard. > > I'm inclined to agree. My only problem is that I'm not sure whether we > can find a way of doing all this without adding a pg_control field. Let > me try to sketch this out: > > 1) We continue determining the oldest SlruScanDirectory(SlruScanDirCbFindEarliest) > on the master to find the oldest offsets segment to > truncate. Alternatively, if we determine it to be safe, we could use > oldestMulti to find that. 
> 2) SlruScanDirCbRemoveMembers is changed to return the range of members > to remove, instead of doing itself > 3) We wal log [oldest offset segment guaranteed to not be alive, > nextmulti) for offsets, and [oldest members segment guaranteed to not be alive, > nextmultioff), and redo truncations for the entire range during > recovery. > > I'm pretty tired right now, but this sounds doable. I'm probably biased here, but I think we should finish reviewing, testing, and committing my patch before we embark on designing this. So far we have no reports of trouble attributable to the lack of the WAL-logging support discussed here, as opposed to several reports of trouble from the status quo within days of release. I'm having trouble reconstructing the series of events where what you're worried about here really becomes a problem, and I think we ought to have a very clear idea about that before back-patching changes of this type. It's true that if the state of the SLRU directory is in the future, because recovery was restarted from an earlier checkpoint, we might replay a checkpoint and remove some of those files from the future. But so what? By the time we've reached the minimum recovery point, they will have been recreated by the same WAL records that created them in the first place. If, in the previous replay, we had wrapped all the way around, some of the stuff we keep may actually already have been overwritten by future WAL records, but they'll be overwritten again. Now, that could mess up our determination of which members to remove, I guess, but I'm not clear that actually matters either: if the offsets space has wrapped around, the members space will certainly have wrapped around as well, so we can remove anything we like at this stage and we're still OK. I agree this is ugly the way it is, but where is the actual bug? As far as your actual outline goes, I think if we do this, we need to be very careful about step #2. 
Right now, we decide what we need to keep and then remove everything else, but that's kind of wonky because new stuff may be getting created at the same time, so we keep adjusting our idea of exactly what needs to be removed. It would be far better to invert that logic: decide what needs to be removed - presumably, everything from the oldest member that now exists up until some later point - and then remove precisely that stuff and nothing else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Incomplete review, done in a relative rush: On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: > OK, here's a patch. Actually two patches, differing only in > whitespace, for 9.3 and for master (ha!). I now think that the root > of the problem here is that DetermineSafeOldestOffset() and > SetMultiXactIdLimit() were largely ignorant of the possibility that > they might be called at points in time when the cluster was > inconsistent. A cause perhaps closer to the root is commit f741300 moving truncation from VACUUM to checkpoints. CLOG has given us deep experience with VACUUM-time truncation. Commit f6a6c46d and this patch are about bringing CHECKPOINT-time truncation up to the same level. Speaking of commit f6a6c46d, it seems logical that updating allocation stop limits should happen proximate to truncation. That's currently the case for CLOG (vac_truncate_clog() does both) and pg_multixact/members (checkpoint's TruncateMultiXact() call does both). However, pg_multixact/offsets is truncated from TruncateMultiXact(), but vac_truncate_clog() updates its limit. I did not distill an errant test case, but this is fishy. > SetMultiXactIdLimit() bracketed certain parts of its > logic with if (!InRecovery), but those guards were ineffective because > it gets called before InRecovery is set in the first place. SetTransactionIdLimit() checks InRecovery for the same things, and it is called at nearly the same moments as SetMultiXactIdLimit(). Do you have a sense of whether it is subject to similar problems as a result? > 1. Moves the call to DetermineSafeOldestOffset() that appears in > StartupMultiXact() to TrimMultiXact(), so that we don't try to do this > until we're consistent. Also, instead of passing > MultiXactState->oldestMultiXactId, pass the newer of that value and > the earliest offset that exists on disk. That way, it won't try to > read data that's not there. 
Perhaps emit a LOG message when we do that, since it's our last opportunity to point to the potential data loss? > + * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1 > + * had bugs that could allow users who reached those release through s/release/releases/ > @@ -2859,6 +2947,14 @@ TruncateMultiXact(void) > SimpleLruTruncate(MultiXactOffsetCtl, > MultiXactIdToOffsetPage(oldestMXact)); > > + /* Update oldest-on-disk value in shared memory. */ > + earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE; > + if (earliest < FirstMultiXactId) > + earliest = FirstMultiXactId; > + LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE); > + Assert(MultiXactState->oldestMultiXactOnDiskValid); > + MultiXactState->oldestMultiXactOnDiskValid = earliest; That last line needs s/Valid//, I presume. Is it okay that oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its Valid indicator remaining true? > +static MultiXactOffset > +GetOldestMultiXactOnDisk(void) > +{ > + SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc); > + earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE; > + if (earliest < FirstMultiXactId) > + earliest = FirstMultiXactId; SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not represent a wrapped state. When the files do represent a wrapped state, MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest() result is sensitive to readdir() order. This code exists specifically to help us make do in the face of wrong catalog and pg_control entries. We may have wrapped as a result of those same catalog/pg_control entries, so I think this function ought to account for wrap. I haven't given enough thought to what exactly it should do. Thanks, nm
On Fri, May 29, 2015 at 10:37:57AM +1200, Thomas Munro wrote: > On Fri, May 29, 2015 at 7:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > - There's a third possible problem related to boundary cases in > > SlruScanDirCbRemoveMembers, but I don't understand that one well > > enough to explain it. Maybe Thomas can jump in here and explain the > > concern. > > I noticed something in passing which is probably not harmful, and not > relevant to this bug report, it was just a bit confusing while > testing: SlruScanDirCbRemoveMembers never deletes any files if > rangeStart == rangeEnd. In practice, if you have an idle cluster with > a lot of multixact data and you VACUUM FREEZE all databases and then > CHECKPOINT, you might be surprised to see no member files going away > quite yet, but they'll eventually be truncated by a future checkpoint, > once rangeEnd has had a chance to advance to the next page due to more > multixacts being created. > > If we want to fix this one day, maybe the right thing to do is to > treat the rangeStart == rangeEnd case the same way we treat rangeStart > < rangeEnd, that is, to assume that the range of pages isn't > wrapped/inverted in this case. I agree. Because we round rangeStart down to a segment boundary, oldest and next member offsets falling on the same page typically implies rangeStart<rangeEnd. Only when the page they share happens to be the first page of a segment does one observe rangeStart==RangeEnd. While testing this (with inconsistent-multixact-fix-master.patch applied, FWIW), I noticed a nearby bug with a similar symptom. TruncateMultiXact() omits the nextMXact==oldestMXact special case found in each other find_multixact_start() caller, so it reads the offset of a not-yet-created MultiXactId. The usual outcome is to get rangeStart==0, so we truncate less than we could. This can't make us truncate excessively, because nextMXact==oldestMXact implies no table contains any mxid. 
If nextMXact happens to be the first of a segment, an error is possible. Procedure: 1. Make a fresh cluster. 2. UPDATE pg_database SET datallowconn = true 3. Consume precisely 131071 mxids. Number of offsets per mxid is unimportant. 4. vacuumdb --freeze --all Expected state after those steps: $ pg_controldata | grep NextMultiXactId Latest checkpoint's NextMultiXactId: 131072 Checkpoint will fail like this: 26699 2015-05-31 17:22:33.134 GMT LOG: statement: checkpoint 26661 2015-05-31 17:22:33.134 GMT DEBUG: performing replication slot checkpoint 26661 2015-05-31 17:22:33.136 GMT ERROR: could not access status of transaction 131072 26661 2015-05-31 17:22:33.136 GMT DETAIL: Could not open file "pg_multixact/offsets/0002": No such file or directory. 26699 2015-05-31 17:22:33.234 GMT ERROR: checkpoint request failed 26699 2015-05-31 17:22:33.234 GMT HINT: Consult recent messages in the server log for details. 26699 2015-05-31 17:22:33.234 GMT STATEMENT: checkpoint This does not block startup, and creating one mxid hides the problem again. Thus, it is not a top-priority bug like some other parts of this thread. I mention it today mostly so it doesn't surprise hackers testing other fixes. Thanks, nm
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Andres Freund
Date
On 2015-05-31 07:51:59 -0400, Robert Haas wrote: > > 1) We continue determining the oldest SlruScanDirectory(SlruScanDirCbFindEarliest) > > on the master to find the oldest offsets segment to > > truncate. Alternatively, if we determine it to be safe, we could use > > oldestMulti to find that. > > 2) SlruScanDirCbRemoveMembers is changed to return the range of members > > to remove, instead of doing itself > > 3) We wal log [oldest offset segment guaranteed to not be alive, > > nextmulti) for offsets, and [oldest members segment guaranteed to not be alive, > > nextmultioff), and redo truncations for the entire range during > > recovery. > > > > I'm pretty tired right now, but this sounds doable. > > I'm probably biased here, but I think we should finish reviewing, > testing, and committing my patch before we embark on designing this. Probably, yes. I am wondering whether doing this immediately won't end up making some things simpler and more robust though. > So far we have no reports of trouble attributable to the lack of the > WAL-logging support discussed here, as opposed to several reports of > trouble from the status quo within days of release. The lack of WAL logging actually has caused problems in the 9.3.3 (?) era, where we didn't do any truncation during recovery... > By the time we've reached the minimum recovery point, they will have > been recreated by the same WAL records that created them in the first > place. I'm not sure that's true. I think we could end up erroneously removing files that were included in the base backup. Anyway, let's focus on your patch for now. > If, in the previous > replay, we had wrapped all the way around, some of the stuff we keep > may actually already have been overwritten by future WAL records, but > they'll be overwritten again. 
Now, that could mess up our > determination of which members to remove, I guess, but I'm not clear > that actually matters either: if the offsets space has wrapped around, > the members space will certainly have wrapped around as well, so we > can remove anything we like at this stage and we're still OK. I agree > this is ugly the way it is, but where is the actual bug? I'm more worried about the cases where we didn't ever actually "badly wrap around" (i.e. overwrite needed data); but where that's not clear on the standby because the base backup isn't in a consistent state. Greetings, Andres Freund
On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah@leadboat.com> wrote:
> Incomplete review, done in a relative rush:

Thanks.

> On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
>> OK, here's a patch. Actually two patches, differing only in
>> whitespace, for 9.3 and for master (ha!). I now think that the root
>> of the problem here is that DetermineSafeOldestOffset() and
>> SetMultiXactIdLimit() were largely ignorant of the possibility that
>> they might be called at points in time when the cluster was
>> inconsistent.
>
> A cause perhaps closer to the root is commit f741300 moving truncation from
> VACUUM to checkpoints. CLOG has given us deep experience with VACUUM-time
> truncation. Commit f6a6c46d and this patch are about bringing CHECKPOINT-time
> truncation up to the same level.
>
> Speaking of commit f6a6c46d, it seems logical that updating allocation stop
> limits should happen proximate to truncation. That's currently the case for
> CLOG (vac_truncate_clog() does both) and pg_multixact/members (checkpoint's
> TruncateMultiXact() call does both). However, pg_multixact/offsets is
> truncated from TruncateMultiXact(), but vac_truncate_clog() updates its limit.
> I did not distill an errant test case, but this is fishy.

Good point. Because we confine ourselves to using half the offset space, it seems much harder for us to get into trouble here than it is with members. The first scenario that occurred to me is that the SLRU might actually wrap. That seems tough, though: between one checkpoint and the next, vacuum would need to advance oldest_datminmxid by 2^31 MXIDs while generating 2^31 new ones, or something like that. That doesn't seem real plausible. But then it occurred to me that it's probably sufficient to advance the head of the SLRU far enough that TruncateMultiXact thinks that the tail is in the future instead of in the past. I see no reason why that couldn't happen. Then we'd end up leaving some files behind that we should have removed. I'm not sure exactly what problem that would cause; would they just get overwritten on the next pass through the space, or would they cause errors? I have not had time to check.

>> SetMultiXactIdLimit() bracketed certain parts of its
>> logic with if (!InRecovery), but those guards were ineffective because
>> it gets called before InRecovery is set in the first place.
>
> SetTransactionIdLimit() checks InRecovery for the same things, and it is
> called at nearly the same moments as SetMultiXactIdLimit(). Do you have a
> sense of whether it is subject to similar problems as a result?

Well, I think it's pretty weird that those things will get done before beginning recovery, even on an inconsistent cluster, but not during recovery. That is pretty strange. I don't see that it can actually do any worse than emit a few log messages at the start of recovery that won't show up again until the end of recovery, though.

>> 1. Moves the call to DetermineSafeOldestOffset() that appears in
>> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
>> until we're consistent. Also, instead of passing
>> MultiXactState->oldestMultiXactId, pass the newer of that value and
>> the earliest offset that exists on disk. That way, it won't try to
>> read data that's not there.
>
> Perhaps emit a LOG message when we do that, since it's our last opportunity to
> point to the potential data loss?

If the problem is just that somebody's minmxid got set to 1 instead of the real value, I think that there is no data loss, because none of those older values are actually present there. But we could add a LOG message anyway. How do you suggest that we phrase that?

>> + * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
>> + * had bugs that could allow users who reached those release through
>
> s/release/releases/

Fixed.

>> @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
>>  		SimpleLruTruncate(MultiXactOffsetCtl,
>>  						  MultiXactIdToOffsetPage(oldestMXact));
>>
>> +	/* Update oldest-on-disk value in shared memory. */
>> +	earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
>> +	if (earliest < FirstMultiXactId)
>> +		earliest = FirstMultiXactId;
>> +	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
>> +	Assert(MultiXactState->oldestMultiXactOnDiskValid);
>> +	MultiXactState->oldestMultiXactOnDiskValid = earliest;
>
> That last line needs s/Valid//, I presume. Is it okay that
> oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
> Valid indicator remaining true?

Ay yai yay. Yes, s/Valid//.

I'm not sure what you mean about it becoming too old. At least with that fix, it should get updated to exactly the first file that we didn't remove. Isn't that right?

>> +static MultiXactOffset
>> +GetOldestMultiXactOnDisk(void)
>> +{
>
>> +	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
>> +	earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
>> +	if (earliest < FirstMultiXactId)
>> +		earliest = FirstMultiXactId;
>
> SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not
> represent a wrapped state. When the files do represent a wrapped state,
> MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest()
> result is sensitive to readdir() order. This code exists specifically to help
> us make do in the face of wrong catalog and pg_control entries. We may have
> wrapped as a result of those same catalog/pg_control entries, so I think
> this function ought to account for wrap. I haven't given enough thought to
> what exactly it should do.

I can see that there might be an issue there, but I can't quite put my finger on it well enough to say that it definitely is an issue. This code is examining the offsets space rather than the members space, and the protections against offsets wraparound have been there since the original commit of this feature (0ac5ad5134f2769ccbaefec73844f8504c4d6182). To my knowledge we have no concrete evidence that there's ever been a problem in this area. It seems like it might be safer to rejigger that code so that it considers distance-behind-current rather than using the wrapped comparison logic, but I'm reluctant to start rejiggering more things without knowing what specifically I'm fixing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date:
On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund <andres@anarazel.de> wrote:
>> I'm probably biased here, but I think we should finish reviewing,
>> testing, and committing my patch before we embark on designing this.
>
> Probably, yes. I am wondering whether doing this immediately won't end
> up making some things simpler and more robust though.

I'm open to being convinced of that, but as of this moment I'm not seeing any clear-cut evidence that we need to go so far.

>> So far we have no reports of trouble attributable to the lack of the
>> WAL-logging support discussed here, as opposed to several reports of
>> trouble from the status quo within days of release.
>
> The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> era, where we didn't do any truncation during recovery...

Right, but now we're piggybacking on the checkpoint records, and I don't have any evidence that this approach can't be made robust. It's possible that it can't be made robust, but that's not currently clear.

>> By the time we've reached the minimum recovery point, they will have
>> been recreated by the same WAL records that created them in the first
>> place.
>
> I'm not sure that's true. I think we could end up erroneously removing
> files that were included in the base backup. Anyway, let's focus on your
> patch for now.

OK, but I am interested in discussing the other thing too. I just can't piece together the scenario myself - there may well be one. The base backup will begin replay from the checkpoint caused by pg_start_backup() and remove anything that wasn't there at the start of the backup. But all of that stuff should get recreated by the time we reach the minimum recovery point (end of backup).

>> If, in the previous
>> replay, we had wrapped all the way around, some of the stuff we keep
>> may actually already have been overwritten by future WAL records, but
>> they'll be overwritten again. Now, that could mess up our
>> determination of which members to remove, I guess, but I'm not clear
>> that actually matters either: if the offsets space has wrapped around,
>> the members space will certainly have wrapped around as well, so we
>> can remove anything we like at this stage and we're still OK. I agree
>> this is ugly the way it is, but where is the actual bug?
>
> I'm more worried about the cases where we didn't ever actually "badly
> wrap around" (i.e. overwrite needed data); but where that's not clear on
> the standby because the base backup isn't in a consistent state.

I agree. The current patch tries to make it so that we never call find_multixact_start() while in recovery, but it doesn't quite succeed: the call in TruncateMultiXact still happens during recovery, but only once we're sure that the mxact we plan to call it on actually exists on disk. That won't be called until we replay the first checkpoint, but that might still be prior to consistency.

Since I forgot to attach the revised patch with fixes for the points Noah mentioned to that email, here it is attached to this one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date:
Thomas Munro wrote:
> > - There's a third possible problem related to boundary cases in
> > SlruScanDirCbRemoveMembers, but I don't understand that one well
> > enough to explain it. Maybe Thomas can jump in here and explain the
> > concern.
>
> I noticed something in passing which is probably not harmful, and not
> relevant to this bug report, it was just a bit confusing while
> testing:

Another thing I noticed in passing is that SimpleLruDoesPhysicalPageExist uses SLRU_OPEN_FAILED in slru_errcause for the case where lseek() fails, rather than SLRU_SEEK_FAILED. This is probably completely harmless, because that's lseek(..., SEEK_END), but seems rather odd anyway.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Robert Haas wrote:
> > In the process of investigating this, we found a few other things that
> > seem like they may also be bugs:
> >
> > - As noted upthread, replaying an older checkpoint after a newer
> > checkpoint has already happened may lead to similar problems. This
> > may be possible when starting from an online base backup; or when
> > restarting a standby that did not perform a restartpoint when
> > replaying the last checkpoint before the shutdown.
>
> I'm going through this one now, as it's closely related and caused
> issues for us.

FWIW I've spent a rather long while trying to reproduce the issue, but haven't been able to figure it out. Thomas already commented on it: the server notices that a file is missing and, instead of failing, it "reads the file as zeroes". This is because of this hack in slru.c, which is there to cover for a pg_clog replay consideration:

	/*
	 * In a crash-and-restart situation, it's possible for us to receive
	 * commands to set the commit status of transactions whose bits are in
	 * already-truncated segments of the commit log (see notes in
	 * SlruPhysicalWritePage).  Hence, if we are InRecovery, allow the case
	 * where the file doesn't exist, and return zeroes instead.
	 */
	fd = OpenTransientFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
	if (fd < 0)
	{
		if (errno != ENOENT || !InRecovery)
		{
			slru_errcause = SLRU_OPEN_FAILED;
			slru_errno = errno;
			return false;
		}

		ereport(LOG,
				(errmsg("file \"%s\" doesn't exist, reading as zeroes",
						path)));
		MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
		return true;
	}

I was able to cause an actual problem by #ifdefing out the "if errno/InRecovery" line, of course, but how can I be sure that fixing that would also fix my customer's problem? That's no good.

Anyway here's a quick script to almost-reproduce the problem. I constructed many variations of this, trying to find the failing one, to no avail. Note I'm using the pg_burn_multixact() function to create multixacts quickly (this is cheating in itself, but it seems to me that if I'm not able to reproduce the problem with this, I'm not able to do so without it either.)

This script takes an exclusive backup (cp -pr) excluding pg_multixact, then does some multixact stuff (including truncation), then copies pg_multixact. Evidently I'm missing some critical element but I can't see what it is.

Another notable variant was to copy pg_multixact first, then do stuff (create lots of multixacts, then mxact-freeze/checkpoint), then copy the rest of the data dir. No joy either: when replay starts, the missing multixacts are created on the standby and so by the time we reach the checkpoint record the pg_multixact/offsets file has already grown to the necessary size.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Anyway here's a quick script to almost-reproduce the problem.

Meh. Really attached now.

I also wanted to post the error messages we got:

2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG:  entering standby mode
2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG:  restored log file "00000001000073DD000000AD" from archive
2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL:  could not access status of transaction 4624559
2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL:  Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG:  startup process (PID 4782) exited with exit code 1
2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG:  aborting startup due to startup process failure

We pg_xlogdumped the offending segment and see that there is a checkpoint record with oldestMulti=4624559. Curiously, there are no other records with rmgr=MultiXact in that segment.

Note that the file exists (otherwise the error would say "could not open" rather than "could not read"); this is also why errno says "Success" rather than some actual error. This is the code:

	errno = 0;
	if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
	{
		slru_errcause = SLRU_READ_FAILED;
		slru_errno = errno;
		CloseTransientFile(fd);
		return false;
	}

My guess is that the file existed, and perhaps had one or more pages, but the wanted page doesn't exist, so we tried to read but got 0 bytes back. read() returns 0 in this case but doesn't set errno.

I didn't find a way to set things up so that the file exists but is shorter than oldestMulti requires by the time the checkpoint record is replayed.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
> >> SetMultiXactIdLimit() bracketed certain parts of its
> >> logic with if (!InRecovery), but those guards were ineffective because
> >> it gets called before InRecovery is set in the first place.
> >
> > SetTransactionIdLimit() checks InRecovery for the same things, and it is
> > called at nearly the same moments as SetMultiXactIdLimit(). Do you have a
> > sense of whether it is subject to similar problems as a result?
>
> Well, I think it's pretty weird that those things will get done before
> beginning recovery, even on an inconsistent cluster, but not during
> recovery. That is pretty strange. I don't see that it can actually
> do any worse than emit a few log messages at the start of recovery
> that won't show up again until the end of recovery, though.

Granted. Would it be better to update both functions at the same time, and perhaps to make that a master-only change? Does the status quo cause more practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?

> >> 1. Moves the call to DetermineSafeOldestOffset() that appears in
> >> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
> >> until we're consistent. Also, instead of passing
> >> MultiXactState->oldestMultiXactId, pass the newer of that value and
> >> the earliest offset that exists on disk. That way, it won't try to
> >> read data that's not there.
> >
> > Perhaps emit a LOG message when we do that, since it's our last opportunity to
> > point to the potential data loss?
>
> If the problem is just that somebody's minmxid got set to 1 instead of
> the real value, I think that there is no data loss, because none of
> those older values are actually present there. But we could add a LOG
> message anyway. How do you suggest that we phrase that?

There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set to 1. Otherwise, data loss is possible. I don't hope for an actionable message, but we might want a reporter to grep logs for it when we diagnose future reports. Perhaps this:

"missing pg_multixact/members files; begins at MultiXactId %u, expected %u"

For good measure, perhaps emit this when lastCheckpointedOldest > earliest by more than one segment:

"excess pg_multixact/members files; begins at MultiXactId %u, expected %u"

> >> @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
> >>  		SimpleLruTruncate(MultiXactOffsetCtl,
> >>  						  MultiXactIdToOffsetPage(oldestMXact));
> >>
> >> +	/* Update oldest-on-disk value in shared memory. */
> >> +	earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
> >> +	if (earliest < FirstMultiXactId)
> >> +		earliest = FirstMultiXactId;
> >> +	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
> >> +	Assert(MultiXactState->oldestMultiXactOnDiskValid);
> >> +	MultiXactState->oldestMultiXactOnDiskValid = earliest;
> >
> > That last line needs s/Valid//, I presume. Is it okay that
> > oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
> > Valid indicator remaining true?
>
> Ay yai yay. Yes, s/Valid//.
>
> I'm not sure what you mean about it becoming too old. At least with
> that fix, it should get updated to exactly the first file that we
> didn't remove. Isn't that right?

Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk() only in that it never reads or writes the cache. I might expect oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that does hold most of the time. It does not always hold between the start of the quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment. That's because raw_GOMXOD() changes at the instant we unlink the oldest segment, but oldestMultiXactOnDisk changes later. Any backend may call GetOldestMultiXactOnDisk() via SetMultiXactIdLimit(). If a backend does that concurrent with the checkpointer running TruncateMultiXact() and sees a stale oldestMultiXactOnDisk, is that harmless?

> >> +static MultiXactOffset
> >> +GetOldestMultiXactOnDisk(void)
> >> +{
> >
> >> +	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
> >> +	earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
> >> +	if (earliest < FirstMultiXactId)
> >> +		earliest = FirstMultiXactId;
> >
> > SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not
> > represent a wrapped state. When the files do represent a wrapped state,
> > MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest()
> > result is sensitive to readdir() order. This code exists specifically to help
> > us make do in the face of wrong catalog and pg_control entries. We may have
> > wrapped as a result of those same catalog/pg_control entries, so I think
> > this function ought to account for wrap. I haven't given enough thought to
> > what exactly it should do.
>
> I can see that there might be an issue there, but I can't quite put my
> finger on it well enough to say that it definitely is an issue. This
> code is examining the offsets space rather than the members space, and
> the protections against offsets wraparound have been there since the
> original commit of this feature
> (0ac5ad5134f2769ccbaefec73844f8504c4d6182). To my knowledge we have
> no concrete evidence that there's ever been a problem in this area.
> It seems like it might be safer to rejigger that code so that it
> considers distance-behind-current rather than using the wrapped
> comparison logic, but I'm reluctant to start rejiggering more things
> without knowing what specifically I'm fixing.

Anything that could cause the pg_multixact/offsets tail to rotate from being in the past to being in the future poses this risk. (This is the tail from the perspective of files on disk; pg_control, datminmxid, and MultiXactState notions of the tail play no part here.) I had in mind that the pg_upgrade datminmxid=1 bug could be a tool for achieving that, but I've been unsuccessful so far at building a credible thought experiment around it. Near the beginning of your reply, you surmised that this could happen between a VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact(). Another vector is unlink() failure on a segment file. SlruDeleteSegment() currently ignores the unlink() return value; the only harm has been some disk usage. With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is mandatory to forestall the appearance of a wrapped state.

Thanks,
nm
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Andres Freund
Date:
On 2015-06-01 14:22:32 -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund <andres@anarazel.de> wrote:
> > The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> > era, where we didn't do any truncation during recovery...
>
> Right, but now we're piggybacking on the checkpoint records, and I
> don't have any evidence that this approach can't be made robust. It's
> possible that it can't be made robust, but that's not currently clear.

Well, it's possible that it can be made to work without problems. But I think robust is something different. Having to look at slrus, during recovery, to find out what to truncate puts more intelligence/complexity in the recovery path than I'm comfortable with.

> >> By the time we've reached the minimum recovery point, they will have
> >> been recreated by the same WAL records that created them in the first
> >> place.
> >
> > I'm not sure that's true. I think we could end up erroneously removing
> > files that were included in the base backup. Anyway, let's focus on your
> > patch for now.
>
> OK, but I am interested in discussing the other thing too. I just
> can't piece together the scenario myself - there may well be one. The
> base backup will begin replay from the checkpoint caused by
> pg_start_backup() and remove anything that wasn't there at the start
> of the backup. But all of that stuff should get recreated by the time
> we reach the minimum recovery point (end of backup).

I'm not sure if it's reproducibly borked. What about this scenario:

1) pg_start_backup() is called, creates a checkpoint.
2) 2**31 multixacts are created, possibly with several checkpoints inbetween
3) pg_multixact is copied
4) basebackup finishes

Unless I'm missing something this will lead to a crash recovery startup where the first call to TruncateMultiXact() will have MultiXactState->lastCheckpointedOldest wildly inconsistent with the GetOldestMultiXactOnDisk() return value. Possibly leading to truncation being skipped erroneously. Whether that's a problem I'm not yet entirely sure.

But what *definitely* looks wrong to me is that a TruncateMultiXact() in this scenario now (since a couple weeks ago) does a SimpleLruReadPage_ReadOnly() in the members slru via find_multixact_start(). That just won't work acceptably when we're not yet consistent. There might very well be no valid members segment at that point. Am I missing something?

> > I'm more worried about the cases where we didn't ever actually "badly
> > wrap around" (i.e. overwrite needed data); but where that's not clear on
> > the standby because the base backup isn't in a consistent state.
>
> I agree. The current patch tries to make it so that we never call
> find_multixact_start() while in recovery, but it doesn't quite
> succeed: the call in TruncateMultiXact still happens during recovery,
> but only once we're sure that the mxact we plan to call it on actually
> exists on disk. That won't be called until we replay the first
> checkpoint, but that might still be prior to consistency.

It'll pretty much *always* be before we reach consistency, right? It'll be called on the checkpoint created by pg_start_backup().

I don't think the presence check (that's GetOldestMultiXactOnDisk() in TruncateMultiXact(), right?) helps very much. There's no guarantee at all that offsets and members are in any way consistent with each other. Or in themselves for that matter; the copy could very well have been taken in the middle of a write of an slru page.

I think at the very least we'll have to skip this step while not yet consistent. That really sucks, because we'll possibly end up with multixacts that are completely filled by the time we've reached consistency.

Greetings,

Andres Freund
On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch <noah@leadboat.com> wrote: > On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: >> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah@leadboat.com> wrote: >> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: >> >> SetMultiXactIdLimit() bracketed certain parts of its >> >> logic with if (!InRecovery), but those guards were ineffective because >> >> it gets called before InRecovery is set in the first place. >> > >> > SetTransactionIdLimit() checks InRecovery for the same things, and it is >> > called at nearly the same moments as SetMultiXactIdLimit(). Do you have a >> > sense of whether it is subject to similar problems as a result? >> >> Well, I think it's pretty weird that those things will get done before >> beginning recovery, even on an inconsistent cluster, but not during >> recovery. That is pretty strange. I don't see that it can actually >> do any worse than emit a few log messages at the start of recovery >> that won't show up again until the end of recovery, though. > > Granted. Would it be better to update both functions at the same time, and > perhaps to make that a master-only change? Does the status quo cause more > practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()? It does in the case of the call to find_multixact_start(). If that fails, we take the server down for no good reason, as demonstrated by the original report. I'll revert the changes to the other two things in this function that I changed to be protected by did_trim. There's no special reason to think that's a necessary change. >> >> 1. Moves the call to DetermineSafeOldestOffset() that appears in >> >> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this >> >> until we're consistent. Also, instead of passing >> >> MultiXactState->oldestMultiXactId, pass the newer of that value and >> >> the earliest offset that exists on disk. 
That way, it won't try to >> >> read data that's not there. >> > >> > Perhaps emit a LOG message when we do that, since it's our last opportunity to >> > point to the potential data loss? >> >> If the problem is just that somebody minmxid got set to 1 instead of >> the real value, I think that there is no data loss, because none of >> those older values are actually present there. But we could add a LOG >> message anyway. How do you suggest that we phrase that? > > There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set > to 1. Otherwise, data loss is possible. Yes, but in that scenario, the log message you propose wouldn't be triggered. If true_minmxid > 2^31, then the stored minmxid will not precede the files on disk; it will follow it (assuming the older stuff hasn't been truncated, as is likely). So the message would be essentially: LOG: you didn't lose data. but if exactly the opposite of what this message is telling you about had happened, then you would have. DETAIL: Have a nice day. > I don't hope for an actionable > message, but we might want a reporter to grep logs for it when we diagnose > future reports. Perhaps this: > > "missing pg_multixact/members files; begins at MultiXactId %u, expected %u" This seems misleading. In the known failure case, it's not that the pg_multixact files are unexpectedly missing; it's that we incorrectly think that they should still be there. Maybe: oldest MultiXactId on disk %u follows expected oldest MultiXact %u > For good measure, perhaps emit this when lastCheckpointedOldest > earliest by > more than one segment: > > "excess pg_multixact/members files; begins at MultiXactId %u, expected %u" So, this scenario will happen whenever the system was interrupted in the middle of a truncation, or when the system is started from a base backup and a truncation happened after files were copied. I'm wary of giving users the idea that this is an atypical event. Perhaps a message at DEBUG1? 
>> I'm not sure what you mean about it becoming too old. At least with >> that fix, it should get updated to exactly the first file that we >> didn't remove. Isn't that right? > > Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk() > only in that it never reads or writes the cache. I might expect > oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that > does hold most of the time. It does not always hold between the start of the > quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment. > That's because raw_GOMXOD() changes at the instant we unlink the oldest > segment, but oldestMultiXactOnDisk changes later. Any backend may call > GetOldestMultiXactOnDisk() via SetMultiXactIdLimit(). If a backend does that > concurrent with the checkpointer running TruncateMultiXact() and sees a stale > oldestMultiXactOnDisk, is that harmless? As far as I can tell, it's pretty much harmless. I mean, we've already discussed the risk that the head and tail could get too far apart, because really it should be TruncateMultiXact(), not SetMultiXactIdLimit(), that establishes the new stop point. But that problem exists independent of what you're talking about here. >> I can see that there might be an issue there, but I can't quite put my >> finger on it well enough to say that it definitely is an issue. This >> code is examining the offsets space rather than the members space, and >> the protections against offsets wraparound have been there since the >> original commit of this feature >> (0ac5ad5134f2769ccbaefec73844f8504c4d6182). To my knowledge we have >> no concrete evidence that there's ever been a problem in this area. >> It seems like it might be safer to rejigger that code so that it >> considers distance-behind-current rather than using the wrapped >> comparison logic, but I'm reluctant to start rejiggering more things >> without knowing what specifically I'm fixing. 
>
> Anything that could cause the pg_multixact/offsets tail to rotate from being
> in the past to being in the future poses this risk. (This is the tail from
> the perspective of files on disk; pg_control, datminmxid, and MultiXactState
> notions of the tail play no part here.) I had in mind that the pg_upgrade
> datminmxid=1 bug could be a tool for achieving that, but I've been
> unsuccessful so far at building a credible thought experiment around it. Near
> the beginning of your reply, you surmised that this could happen between a
> VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact().
> Another vector is unlink() failure on a segment file. SlruDeleteSegment()
> currently ignores the unlink() return value; the only harm has been some disk
> usage. With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
> mandatory to forestall the appearance of a wrapped state.

I'm having trouble figuring out what to do about this. I mean, the
essential principle of this patch is that if we can't count on
relminmxid, datminmxid, or the control file to be accurate, we can at
least look at what is present on the disk. If we also cannot count on
that to be accurate, we are left without any reliable source of
information. Consider a hypothetical cluster where all our stored
minmxids of whatever form are corrupted (say, all change to 1) and in
addition there are stray files in pg_multixact. I don't think there's
really any way to get ourselves out of trouble in that scenario.

There are some things that we could do that might help a little. For
example, if the control file's oldest-multi value points to a file
that is not on disk, we could advance it a file at a time until it
lands on a file that is present. This would help if the oldest-multi
value is pointing to the future, but everything on disk is in the
past.
Another idea is that we could rely on the "on disk" value only until
MultiXactState->lastCheckpointedOldest actually points to a real file.
After we get out of this kind of trouble, we should never get back
into it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
> I'm having trouble figuring out what to do about this. I mean, the
> essential principle of this patch is that if we can't count on
> relminmxid, datminmxid, or the control file to be accurate, we can at
> least look at what is present on the disk. If we also cannot count on
> that to be accurate, we are left without any reliable source of
> information. Consider a hypothetical cluster where all our stored
> minmxids of whatever form are corrupted (say, all change to 1) and in
> addition there are stray files in pg_multixact. I don't think there's
> really any way to get ourselves out of trouble in that scenario.

If we were to truncate after vacuum, and only on the primary (via WAL
logging), we could, afaics, just rely on all the values to be
recomputed. I mean relminmxid will be recomputed after a vacuum, and
thus, after some time, so will datminmxid and the control file value. We
could just force a value of 1 to always trigger anti-wraparound vacuums
(or wait for that to happen implicitly, to delay the impact?). That
should then fix the problem in a relatively short amount of time?
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund <andres@anarazel.de> wrote:
> But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> this scenario now (since a couple weeks ago) does a
> SimpleLruReadPage_ReadOnly() in the members slru via
> find_multixact_start(). That just won't work acceptably when we're not
> yet consistent. There very well could not be a valid members segment at
> that point? Am I missing something?

Yes: that code isn't new. TruncateMultiXact() called
SimpleLruReadPage_ReadOnly() directly in 9.3.0 and every subsequent
release until 9.3.7/9.4.2. The only thing that's changed is that we've
moved that logic into a function called find_multixact_start() instead
of having it directly inside that function. We did that because we
needed to use the same logic in some other places. The reason why
9.3.7/9.4.2 are causing problems for people that they didn't have
previously is because those new, additional call sites were poorly
chosen and didn't include adequate protection against calling that
function with an invalid input value.

What this patch is about is getting back to the situation that we were
in from 9.3.0 - 9.3.6 and 9.4.0 - 9.4.1, where TruncateMultiXact() did
the thing that you're complaining about here but no one else did.

From my point of view, I think that you are absolutely right to
question what's going on in TruncateMultiXact(). It's kooky, and there
may well be bugs buried there. But I don't think fixing that should be
the priority right now, because we have zero reports of problems
attributable to that logic. I think the priority should be on undoing
the damage that we did in 9.3.7/9.4.2, when we made other places do
the same thing. We started getting trouble reports attributable to
those changes *almost immediately*, which means that whether or not
TruncateMultiXact() is broken, these new call sites definitely are. I
think we really need to fix those new places ASAP.
> I think at the very least we'll have to skip this step while not yet
> consistent. That really sucks, because we'll possibly end up with
> multixacts that are completely filled by the time we've reached
> consistency.

That would be a departure from the behavior of every existing release
that includes this code based on, to my knowledge, zero trouble
reports.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-02 11:29:24 -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund <andres@anarazel.de> wrote:
> > But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> > this scenario now (since a couple weeks ago) does a
> > SimpleLruReadPage_ReadOnly() in the members slru via
> > find_multixact_start(). That just won't work acceptably when we're not
> > yet consistent. There very well could not be a valid members segment at
> > that point? Am I missing something?
>
> Yes: that code isn't new.

Good point.

> TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
> 9.3.0 and every subsequent release until 9.3.7/9.4.2.

But back then TruncateMultiXact() wasn't called during recovery. But
you're right in that it doesn't seem to have produced attributable bug
reports since 9.3.2, where vacuuming during recovery was introduced. So
it indeed doesn't seem as urgent as fixing the new callsites.

> That would be a departure from the behavior of every existing release
> that includes this code based on, to my knowledge, zero trouble
> reports.

On the other hand, we're now at about bug #5 attributable to the odd way
truncation works for multixacts. It's obviously complex and hard to get
right. It makes it harder to cope with the wrong values left in
datminmxid etc. So I'm still wondering whether fixing this for good
isn't the better approach.

Greetings,

Andres Freund
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Tue, Jun 2, 2015 at 11:27 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
>> I'm having trouble figuring out what to do about this. I mean, the
>> essential principle of this patch is that if we can't count on
>> relminmxid, datminmxid, or the control file to be accurate, we can at
>> least look at what is present on the disk. If we also cannot count on
>> that to be accurate, we are left without any reliable source of
>> information. Consider a hypothetical cluster where all our stored
>> minmxids of whatever form are corrupted (say, all change to 1) and in
>> addition there are stray files in pg_multixact. I don't think there's
>> really any way to get ourselves out of trouble in that scenario.
>
> If we were to truncate after vacuum, and only on the primary (via WAL
> logging), we could, afaics, just rely on all the values to be
> recomputed. I mean relminmxid will be recomputed after a vacuum, and
> thus, after some time, will datminmxid and the control file value. We
> could just force a value of 1 to always trigger anti-wraparound vacuums
> (or wait for that to happen implicitly, to delay the impact?). That
> should then fix the problem in a relatively short amount of time?

The exact circumstances under which we're willing to replace a
relminmxid with a newly-computed one that differs are not altogether
clear to me, but there's an "if" statement protecting that logic, so
there are some circumstances in which we'll leave the existing value
intact. If we force non-stop vacuuming in that scenario, autovacuum
will just run like crazy without accomplishing anything, which wouldn't
be good. It would similarly do so when the oldest MXID reference in the
relation is in fact 1, but that value can't be vacuumed away yet. Also,
the database might be really big.
Even if it were true that a full scan of every table would get us out
of this state, describing the time that it would take to do that as
"relatively short" seems to me to be considerably understating the
impact of a full-cluster VACUUM.

With regard to the more general question of WAL-logging this, are you
going to work on that? Are you hoping Alvaro or I will work on that?
Should we draw straws? It seems like somebody needs to do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
> The exact circumstances under which we're willing to replace a
> relminmxid with a newly-computed one that differs are not altogether
> clear to me, but there's an "if" statement protecting that logic, so
> there are some circumstances in which we'll leave the existing value
> intact.

I guess we'd have to change that then.

> It would similarly do so when the oldest MXID reference in the
> relation is in fact 1, but that value can't be vacuumed away yet.

I'd thought of just forcing consumption of one multixact in that
case. Not pretty, but imo acceptable.

> Also, the database might be really big. Even if it were true that a
> full scan of every table would get us out of this state, describing
> the time that it would take to do that as "relatively short" seems to
> me to be considerably understating the impact of a full-cluster
> VACUUM.

Well. We're dealing with a corrupted cluster. Having a way out that's
done automatically, even if it takes a long while, is pretty good
already. In many cases the corruption (i.e. pg_upgrade) happened long
ago, so the table's relminmxid will already have been recomputed. I
think that's acceptable.

> With regard to the more general question of WAL-logging this, are you
> going to work on that? Are you hoping Alvaro or I will work on that?
> Should we draw straws? It seems like somebody needs to do it.

I'm willing to invest the time to develop an initial version, but I'll
need help evaluating it. I don't have many testing resources available
atm, and I'm not going to trust stuff I developed while travelling by
just looking at the code.

Greetings,

Andres Freund
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Tue, Jun 2, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote:
>> That would be a departure from the behavior of every existing release
>> that includes this code based on, to my knowledge, zero trouble
>> reports.
>
> On the other hand we're now at about bug #5 attributable to the odd way
> truncation works for multixacts. It's obviously complex and hard to get
> right. It makes it harder to cope with the wrong values left in
> datminmxid etc. So I'm still wondering whether fixing this for good isn't
> the better approach.

It may well be. But I think we should do something more surgical
first. Perhaps we can justify the pain and risk of making changes to
the WAL format in the back-branches, but let's not do it in a rush.

If we can get this patch to a state where it undoes the damage
inflicted in 9.3.7/9.4.2, then we will be in a state where we have as
much reliability as we had in 9.3.6 plus the protections against
member-space wraparound added in 9.3.7 - which, like the patch I'm
proposing now, were directly motivated by multiple, independent bug
reports. That seems like a good place to get to. If nothing else, it
will buy us some time to figure out what else we want to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
>> The exact circumstances under which we're willing to replace a
>> relminmxid with a newly-computed one that differs are not altogether
>> clear to me, but there's an "if" statement protecting that logic, so
>> there are some circumstances in which we'll leave the existing value
>> intact.
>
> I guess we'd have to change that then.

Yeah, but first we'd need to assess why it's like that. Tom was the
one who installed the current logic, but I haven't been able to fully
understand it.

>> It would similarly do so when the oldest MXID reference in the
>> relation is in fact 1, but that value can't be vacuumed away yet.
>
> I'd thought of just forcing consumption of one multixact in that
> case. Not pretty, but imo acceptable.

What if multixact 1 still has living members?

>> Also, the database might be really big. Even if it were true that a
>> full scan of every table would get us out of this state, describing
>> the time that it would take to do that as "relatively short" seems to
>> me to be considerably understating the impact of a full-cluster
>> VACUUM.
>
> Well. We're dealing with a corrupted cluster. Having a way out that's
> done automatically, even if it takes a long while, is pretty good
> already. In many cases the corruption (i.e. pg_upgrade) happened long
> ago, so the table's relminmxid will already have been recomputed. I
> think that's acceptable.

I'm a long way from being convinced that automated recovery is
possible. There are so many different scenarios here that it's very
difficult to reason generally about what the "right" thing to do is. I
agree it would be nice if we had it, though.

>> With regard to the more general question of WAL-logging this, are you
>> going to work on that? Are you hoping Alvaro or I will work on that?
>> Should we draw straws? It seems like somebody needs to do it.
>
> I'm willing to invest the time to develop an initial version, but I'll
> need help evaluating it. I don't have many testing resources available
> atm, and I'm not going to trust stuff I developed while travelling by
> just looking at the code.

I'm willing to help with that. Hopefully I'm not the only one, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 02, 2015 at 11:16:22AM -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
> > Granted. Would it be better to update both functions at the same time, and
> > perhaps to make that a master-only change? Does the status quo cause more
> > practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?
>
> It does in the case of the call to find_multixact_start(). If that
> fails, we take the server down for no good reason, as demonstrated by
> the original report. I'll revert the changes to the other two things
> in this function that I changed to be protected by did_trim.

Sounds good.

> > There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set
> > to 1. Otherwise, data loss is possible.
>
> Yes, but in that scenario, the log message you propose wouldn't be
> triggered. If true_minmxid > 2^31, then the stored minmxid will not
> precede the files on disk; it will follow it (assuming the older stuff
> hasn't been truncated, as is likely).

Ah, quite right.

> > "missing pg_multixact/members files; begins at MultiXactId %u, expected %u"
>
> This seems misleading. In the known failure case, it's not that the
> pg_multixact files are unexpectedly missing; it's that we incorrectly
> think that they should still be there. Maybe:
>
> oldest MultiXactId on disk %u follows expected oldest MultiXact %u

Your wording is better.

> > For good measure, perhaps emit this when lastCheckpointedOldest > earliest by
> > more than one segment:
> >
> > "excess pg_multixact/members files; begins at MultiXactId %u, expected %u"
>
> So, this scenario will happen whenever the system was interrupted in
> the middle of a truncation, or when the system is started from a base
> backup and a truncation happened after files were copied. I'm wary of
> giving users the idea that this is an atypical event. Perhaps a
> message at DEBUG1?
DEBUG1 works for me, or feel free to leave it out.

> >> I can see that there might be an issue there, but I can't quite put my
> >> finger on it well enough to say that it definitely is an issue. This
> >> code is examining the offsets space rather than the members space, and
> >> the protections against offsets wraparound have been there since the
> >> original commit of this feature
> >> (0ac5ad5134f2769ccbaefec73844f8504c4d6182). To my knowledge we have
> >> no concrete evidence that there's ever been a problem in this area.
> >> It seems like it might be safer to rejigger that code so that it
> >> considers distance-behind-current rather than using the wrapped
> >> comparison logic, but I'm reluctant to start rejiggering more things
> >> without knowing what specifically I'm fixing.
> >
> > Anything that could cause the pg_multixact/offsets tail to rotate from being
> > in the past to being in the future poses this risk. (This is the tail from
> > the perspective of files on disk; pg_control, datminmxid, and MultiXactState
> > notions of the tail play no part here.) I had in mind that the pg_upgrade
> > datminmxid=1 bug could be a tool for achieving that, but I've been
> > unsuccessful so far at building a credible thought experiment around it. Near
> > the beginning of your reply, you surmised that this could happen between a
> > VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact().
> > Another vector is unlink() failure on a segment file. SlruDeleteSegment()
> > currently ignores the unlink() return value; the only harm has been some disk
> > usage. With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
> > mandatory to forestall the appearance of a wrapped state.
>
> I'm having trouble figuring out what to do about this. I mean, the
> essential principle of this patch is that if we can't count on
> relminmxid, datminmxid, or the control file to be accurate, we can at
> least look at what is present on the disk.
> If we also cannot count on
> that to be accurate, we are left without any reliable source of
> information. Consider a hypothetical cluster where all our stored
> minmxids of whatever form are corrupted (say, all change to 1) and in
> addition there are stray files in pg_multixact. I don't think there's
> really any way to get ourselves out of trouble in that scenario.

We could notice the problem and halt. You mentioned above the
possibility to have SlruScanDirCbFindEarliest() check
"distance-behind-current". Suppose it used the current oldestMulti from
pg_control as a reference point and discovered the multixact on disk
(a) most far behind that reference point and (b) most far ahead of that
reference point. If MultiXactIdPrecedes(a, b), we're good; adopt (a) as
the new datminmxid. Account for the possibility that the space isn't
really wrapped, but our reference point was totally wrong. Once we
determine that the space is wrapped, bail. As you discuss downthread,
the mxids actually present in t_max define the ones we need. We could
scan them all and set authoritative datminmxid and pg_control entries.

> There are some things that we could do that might help a little. For
> example, if the control file's oldest-multi value points to a file
> that is not on disk, we could advance it a file at a time until it
> lands on a file that is present. This would help if the oldest-multi
> value is pointing to the future, but everything on disk is in the
> past.

That's worth examining further.

> Another idea is that we could rely on the "on disk" value only until
> MultiXactState->lastCheckpointedOldest actually points to a real file.

When wrapping confuses the "on disk" value, I think any reliance on
that value is a dead end.

> After we get out of this kind of trouble, we should never get back
> into it.

Oh, we excel at that.
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-02 11:49:56 -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
> >> The exact circumstances under which we're willing to replace a
> >> relminmxid with a newly-computed one that differs are not altogether
> >> clear to me, but there's an "if" statement protecting that logic, so
> >> there are some circumstances in which we'll leave the existing value
> >> intact.
> >
> > I guess we'd have to change that then.
>
> Yeah, but first we'd need to assess why it's like that. Tom was the
> one who installed the current logic, but I haven't been able to fully
> understand it.

We're talking about:

    /* Similarly for relminmxid */
    if (MultiXactIdIsValid(minmulti) &&
        pgcform->relminmxid != minmulti &&
        (MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
         MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
    {
        pgcform->relminmxid = minmulti;
        dirty = true;
    }

right? Before that change (78db307bb/87f830e0ce) we only updated
relminmxid if the new value was newer than the old one. That's to avoid
values from going backwards, e.g. when a relation is first VACUUM
FREEZEd and then a normal VACUUM is performed (these values are
unfortunately based on the cutoff values instead of the observed
minima).

The new thing is the

    || MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)

which is there to recognize values from the future, e.g. the 1
erroneously left in place by pg_upgrade. Makes sense?

> >> It would similarly do so when the oldest MXID reference in the
> >> relation is in fact 1, but that value can't be vacuumed away yet.
> >
> > I'd thought of just forcing consumption of one multixact in that
> > case. Not pretty, but imo acceptable.
>
> What if multixact 1 still has living members?
I think we should just trigger the logic if 1 is below the multi
freezing horizon - in which case a) the 1 will automatically be
replaced, because the horizon is newer, and b) it can't have a living
member anyway.

- Andres
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-01 14:22:32 -0400, Robert Haas wrote:
> commit d33b4eb0167f465edb00bd6c0e1bcaa67dd69fe9
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   Fri May 29 14:35:53 2015 -0400
>
>     foo

Hehe!

> diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
> index 9568ff1..aca829d 100644
> --- a/src/backend/access/transam/multixact.c
> +++ b/src/backend/access/transam/multixact.c
> @@ -199,8 +199,9 @@ typedef struct MultiXactStateData
>      MultiXactOffset nextOffset;
>
>      /*
> -     * Oldest multixact that is still on disk. Anything older than this
> -     * should not be consulted. These values are updated by vacuum.
> +     * Oldest multixact that may still be referenced from a relation.
> +     * Anything older than this should not be consulted. These values are
> +     * updated by vacuum.
>      */
>      MultiXactId oldestMultiXactId;
>      Oid oldestMultiXactDB;
> @@ -213,6 +214,18 @@ typedef struct MultiXactStateData
>      */
>      MultiXactId lastCheckpointedOldest;
>
> +    /*
> +     * This is the oldest file that actually exist on the disk. This value
> +     * is initialized by scanning pg_multixact/offsets, and subsequently
> +     * updated each time we complete a truncation. We need a flag to
> +     * indicate whether this has been initialized yet.
> +     */
> +    MultiXactId oldestMultiXactOnDisk;
> +    bool oldestMultiXactOnDiskValid;
> +
> +    /* Has TrimMultiXact been called yet? */
> +    bool didTrimMultiXact;

I'm not really convinced tying things closer to having done trimming is
easier to understand than tying things to recovery having finished.

E.g.

    if (did_trim)
        oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);

imo is harder to understand than if (!InRecovery).

Maybe we could just name it finishedStartup and rename the functions
accordingly?
> @@ -1956,12 +1971,6 @@ StartupMultiXact(void)
>      */
>      pageno = MXOffsetToMemberPage(offset);
>      MultiXactMemberCtl->shared->latest_page_number = pageno;
> -
> -    /*
> -     * compute the oldest member we need to keep around to avoid old member
> -     * data overrun.
> -     */
> -    DetermineSafeOldestOffset(MultiXactState->oldestMultiXactId);
>  }

Maybe it's worthwhile to add a 'NB: At this stage the data directory is
not yet necessarily consistent' to StartupMultiXact's comments, to avoid
reintroducing problems like this?

>      /*
> +     * We can read this without a lock, because it only changes when nothing
> +     * else is running.
> +     */
> +    did_trim = MultiXactState->didTrimMultiXact;

Err, Hot Standby? It might be ok to not lock, but the comment is
definitely wrong. I'm inclined to simply use locking, this doesn't look
sufficiently critical performancewise.

> +static MultiXactOffset
> +GetOldestReferencedOffset(MultiXactId oldestMXact)
> +{
> +    MultiXactId earliest;
> +    MultiXactOffset oldestOffset;
> +
> +    /*
> +     * Because of bugs in early 9.3.X and 9.4.X releases (see comments in
> +     * TrimMultiXact for details), oldest_datminmxid might point to a
> +     * nonexistent multixact. If so, use the oldest one that actually
> +     * exists. Anything before this can't be successfully used anyway.
> +     */
> +    earliest = GetOldestMultiXactOnDisk();
> +    if (MultiXactIdPrecedes(oldestMXact, earliest))
> +        oldestMXact = earliest;

Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
the disk it'll always get one at a segment boundary, right? I'm not sure
that's actually ok; because the value at the beginning of the segment
can very well end up being a 0, as MaybeExtendOffsetSlru() will have
filled the page with zeros.

I think that should be harmless, the worst that can happen is that
oldestOffset erroneously is 0, which should be correct, even though
possibly overly conservative, in these cases.

Greetings,

Andres Freund
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Robert Haas
Date:
On Tue, Jun 2, 2015 at 4:19 PM, Andres Freund <andres@anarazel.de> wrote:
> I'm not really convinced tying things closer to having done trimming is
> easier to understand than tying things to recovery having finished.
>
> E.g.
> if (did_trim)
> oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
> imo is harder to understand than if (!InRecovery).
>
> Maybe we could just name it finishedStartup and rename the functions accordingly?

Basing that particular call site on InRecovery doesn't work, because
InRecovery isn't set early enough. But I'm fine to rename it to
whatever.

> Maybe it's worthwhile to add a 'NB: At this stage the data directory is
> not yet necessarily consistent' to StartupMultiXact's comments, to avoid
> reintroducing problems like this?

Sure.

>>     /*
>> +     * We can read this without a lock, because it only changes when nothing
>> +     * else is running.
>> +     */
>> +    did_trim = MultiXactState->didTrimMultiXact;
>
> Err, Hot Standby? It might be ok to not lock, but the comment is
> definitely wrong. I'm inclined to simply use locking, this doesn't look
> sufficiently critical performancewise.

/me nods. Good point.

> Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
> the disk it'll always get one at a segment boundary, right? I'm not sure
> that's actually ok; because the value at the beginning of the segment
> can very well end up being a 0, as MaybeExtendOffsetSlru() will have
> filled the page with zeros.
>
> I think that should be harmless, the worst that can happen is that
> oldestOffset erroneously is 0, which should be correct, even though
> possibly overly conservative, in these cases.

Uh oh. That seems like a real bad problem for this approach. What
keeps that from being the opposite of too conservative? There's no
"safe" value in a circular numbering space.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
> > the disk it'll always get one at a segment boundary, right? I'm not sure
> > that's actually ok; because the value at the beginning of the segment
> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
> > filled the page with zeros.
> >
> > I think that should be harmless, the worst that can happen is that
> > oldestOffset erroneously is 0, which should be correct, even though
> > possibly overly conservative, in these cases.
>
> Uh oh. That seems like a real bad problem for this approach. What
> keeps that from being the opposite of too conservative? There's no
> "safe" value in a circular numbering space.

I think it *might* (I'm really jetlagged) be fine, because that'll only
happen after an upgrade from < 9.3. And in that case we initialize
nextOffset to 0. That ought to save us?

Greetings,

Andres Freund
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Thomas Munro
Date:
On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> My guess is that the file existed, and perhaps had one or more pages,
> but the wanted page doesn't exist, so we tried to read but got 0 bytes
> back. read() returns 0 in this case but doesn't set errno.
>
> I didn't find a way to set things so that the file exists but is of
> shorter contents than oldestMulti by the time the checkpoint record is
> replayed.

I'm just starting to learn about the recovery machinery, so forgive me
if I'm missing something basic here, but I just don't get this. As I
understand it, offsets/0046 should either have been copied with that
page present in it if it existed before the backup started (apparently
not in this case), or extended to contain it by WAL records that come
after the backup label but before the checkpoint record that references
it (also apparently not in this case). If neither of these things
happened, then that is completely different from the
segment-does-not-exist case, where we read zeroes if in recovery on the
assumption that later WAL records must be about to delete the file.
There is no way that future WAL records will make an existing segment
file shorter! So at this point don't we know that there is something
wrong with the backup itself?

Put another way, if you bring this up under 9.4.1, won't it also be
unable to access multixact 4624559 at this point? Of course it won't
try to do so during recovery like 9.4.2 does, but I'm just trying to
understand how this is supposed to work for 9.4.1 if it needs to access
that multixact for other reasons once normal running is reached (say
you recover up to that checkpoint, and then run
pg_get_multixact_members, or a row has that xmax and its members need
to be looked up by a vacuum or any normal transaction).

In other words, isn't this a base backup that is somehow broken, not at
all like the pg_upgrade corruption case (which involved the specific
case of multixact 1 and an entirely missing segment file), and 9.4.2
just tells you about it sooner than 9.4.1?

For what it's worth, I've also spent a lot of time trying to reproduce
basebackup problems with multixact creation, vacuums and checkpoints
injected at various points between copying the backup label,
pg_multixact, and pg_control. I have so far failed to produce anything
more interesting than the 'reading zeroes' case (see attached
"copy-after-trunction.sh") and a case where the control file points at
a segment that doesn't exist, but it doesn't matter because the backup
label points at a checkpoint from a time when it did; oldestMultiXactId
is updated from there, and everything then proceeds exactly as it
should (see "copy-before-truncation.sh"). I updated my scripts to look
a bit more like your nicely automated example (though mine use a
different trick to create small quantities of multixacts so they run
against unpatched master). I have also been considering a scenario
where multixact ID wraparound occurs during basebackup with some
ordering that causes trouble, but I don't yet see why it would break if
you replay the WAL from the backup label checkpoint (and I think the
repro would take days/weeks to run...)

-- 
Thomas Munro
http://www.enterprisedb.com
Attachments
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date
Thomas Munro wrote:
> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > My guess is that the file existed, and perhaps had one or more pages,
> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > back. read() returns 0 in this case but doesn't set errno.
> >
> > I didn't find a way to set things so that the file exists but is of
> > shorter contents than oldestMulti by the time the checkpoint record is
> > replayed.
>
> I'm just starting to learn about the recovery machinery, so forgive me
> if I'm missing something basic here, but I just don't get this. As I
> understand it, offsets/0046 should either have been copied with that
> page present in it if it existed before the backup started (apparently
> not in this case), or extended to contain it by WAL records that come
> after the backup label but before the checkpoint record that
> references it (also apparently not in this case).

Exactly --- that's the spot at which I am, also. I have had this
spinning in my head for three days now, and tried every single variation
that I could think of, but like you I was unable to reproduce the issue.
However, our customer took a second base backup and it failed in exactly
the same way, modulo some changes to the counters (the file that didn't
exist was 004B rather than 0046). I'm still at a loss as to what the
failure mode is. We must be missing some crucial detail ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Thomas Munro
Date
On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > My guess is that the file existed, and perhaps had one or more pages,
>> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
>> > back. read() returns 0 in this case but doesn't set errno.
>> >
>> > I didn't find a way to set things so that the file exists but is of
>> > shorter contents than oldestMulti by the time the checkpoint record is
>> > replayed.
>>
>> I'm just starting to learn about the recovery machinery, so forgive me
>> if I'm missing something basic here, but I just don't get this. As I
>> understand it, offsets/0046 should either have been copied with that
>> page present in it if it existed before the backup started (apparently
>> not in this case), or extended to contain it by WAL records that come
>> after the backup label but before the checkpoint record that
>> references it (also apparently not in this case).
>
> Exactly --- that's the spot at which I am, also. I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, modulo some changes to the counters (the file that didn't
> exist was 004B rather than 0046). I'm still at a loss as to what the
> failure mode is. We must be missing some crucial detail ...

I have finally reproduced that error! See attached repro shell script.

The conditions are:

1. next multixact == oldest multixact (no active multixacts, pointing past the end)
2. next multixact would be the first item on a new page (multixact % 2048 == 0)
3. the page must not be the first in a segment (or we'd get the read-zeroes case)

That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
vacuum followed by no multixact creations right before your backup
checkpoint). That seems like reasonably low odds... if it happened
twice in a row, maybe I'm missing something here and there is some
other way to get this...

I realise now that this is actually a symptom of a problem spotted by
Noah recently:

http://www.postgresql.org/message-id/20150601045534.GB23587@tornado.leadboat.com

He noticed the problem for segment boundaries, when not in recovery.
In recovery, segment boundaries don't raise an error (the read-zeroes
case applies), but page boundaries do. The fix is probably to do
nothing if they are the same, as we do elsewhere, like in the attached
patch.

-- 
Thomas Munro
http://www.enterprisedb.com
Attachments
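Conditions 2 and 3 in Thomas's message above are plain arithmetic on the 9.4 multixact SLRU layout (an 8 KB page holds 2048 four-byte offset entries; a segment file holds 32 pages). A small sketch (the classification labels are mine, not PostgreSQL's) reproduces the 1/2048 * 31/32 odds he quotes:

```python
OFFSETS_PER_PAGE = 2048    # 8192-byte page / 4-byte MultiXactOffset
PAGES_PER_SEGMENT = 32     # an SLRU segment holds 32 pages

def classify(mxid: int) -> str:
    """Failure mode hit when oldestMulti == nextMulti == mxid at replay."""
    if mxid % OFFSETS_PER_PAGE != 0:
        return "ok"              # the containing page already exists
    page = mxid // OFFSETS_PER_PAGE
    if page % PAGES_PER_SEGMENT == 0:
        return "read-zeroes"     # whole segment missing: tolerated in recovery
    return "error"               # file exists but the page is past EOF

# Out of one full segment's worth of multixact ids, exactly 31 land on
# a page boundary that is not also a segment boundary.
span = OFFSETS_PER_PAGE * PAGES_PER_SEGMENT
errors = sum(classify(m) == "error" for m in range(span))
print(errors)   # 31  ->  31/65536 == 1/2048 * 31/32
```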
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date
On Wed, Jun 3, 2015 at 4:48 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Thomas Munro wrote:
>>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> > My guess is that the file existed, and perhaps had one or more pages,
>>> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
>>> > back. read() returns 0 in this case but doesn't set errno.
>>> >
>>> > I didn't find a way to set things so that the file exists but is of
>>> > shorter contents than oldestMulti by the time the checkpoint record is
>>> > replayed.
>>>
>>> I'm just starting to learn about the recovery machinery, so forgive me
>>> if I'm missing something basic here, but I just don't get this. As I
>>> understand it, offsets/0046 should either have been copied with that
>>> page present in it if it existed before the backup started (apparently
>>> not in this case), or extended to contain it by WAL records that come
>>> after the backup label but before the checkpoint record that
>>> references it (also apparently not in this case).
>>
>> Exactly --- that's the spot at which I am, also. I have had this
>> spinning in my head for three days now, and tried every single variation
>> that I could think of, but like you I was unable to reproduce the issue.
>> However, our customer took a second base backup and it failed in exactly
>> the same way, modulo some changes to the counters (the file that didn't
>> exist was 004B rather than 0046). I'm still at a loss as to what the
>> failure mode is. We must be missing some crucial detail ...
>
> I have finally reproduced that error! See attached repro shell script.
>
> The conditions are:
>
> 1. next multixact == oldest multixact (no active multixacts, pointing
> past the end)
> 2. next multixact would be the first item on a new page (multixact % 2048 == 0)
> 3. the page must not be the first in a segment (or we'd get the
> read-zeroes case)
>
> That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
> vacuum followed by no multixact creations right before your backup
> checkpoint). That seems like reasonably low odds... if it happened
> twice in a row, maybe I'm missing something here and there is some
> other way to get this...
>
> I realise now that this is actually a symptom of a problem spotted by
> Noah recently:
>
> http://www.postgresql.org/message-id/20150601045534.GB23587@tornado.leadboat.com
>
> He noticed the problem for segment boundaries, when not in recovery.
> In recovery, segment boundaries don't raise an error (the read-zeroes
> case applies), but page boundaries do. The fix is probably to do
> nothing if they are the same, as we do elsewhere, like in the attached
> patch.

Actually, we still need to call DetermineSafeOldestOffset in that case.
Otherwise, if someone goes from lots of multixacts to none, the stop
point won't advance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date
On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>> > that's actually ok; because the value at the beginning of the segment
>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>> > filled the page with zeros.
>> >
>> > I think that should be harmless, the worst that can happen is that
>> > oldestOffset erroneously is 0, which should be correct, even though
>> > possibly overly conservative, in these cases.
>>
>> Uh oh. That seems like a real bad problem for this approach. What
>> keeps that from being the opposite of too conservative? There's no
>> "safe" value in a circular numbering space.
>
> I think it *might* (I'm really jetlagged) be fine because that'll only
> happen after an upgrade from < 9.3. And in that case we initialize
> nextOffset to 0. That ought to save us?

That's pretty much betting the farm on the bugs we know about today
being the only ones there are. That seems imprudent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date
Thomas Munro wrote:
> I have finally reproduced that error! See attached repro shell script.
>
> The conditions are:
>
> 1. next multixact == oldest multixact (no active multixacts, pointing
> past the end)
> 2. next multixact would be the first item on a new page (multixact % 2048 == 0)
> 3. the page must not be the first in a segment (or we'd get the
> read-zeroes case)
>
> That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
> vacuum followed by no multixact creations right before your backup
> checkpoint). That seems like reasonably low odds... if it happened
> twice in a row, maybe I'm missing something here and there is some
> other way to get this...

Thanks, this is pretty cool (as far as these things go), but it's not
the scenario I see, in which the complained-about segment is very old
by the time it's truncated away by a checkpoint after freeze. The
segment requested by checkpoint.oldestMulti is 0046 (at offset
140-something thousand -- just to clarify that it's not at a segment
boundary), and the base backup contains segments from 004B to 00D5. My
problem scenario has oldestMulti close to 5 million and nextMulti close
to 14 million.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Andres Freund
Date
On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
> Thomas Munro wrote:
> > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > > My guess is that the file existed, and perhaps had one or more pages,
> > > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > > back. read() returns 0 in this case but doesn't set errno.
> > >
> > > I didn't find a way to set things so that the file exists but is of
> > > shorter contents than oldestMulti by the time the checkpoint record is
> > > replayed.
> >
> > I'm just starting to learn about the recovery machinery, so forgive me
> > if I'm missing something basic here, but I just don't get this. As I
> > understand it, offsets/0046 should either have been copied with that
> > page present in it if it existed before the backup started (apparently
> > not in this case), or extended to contain it by WAL records that come
> > after the backup label but before the checkpoint record that
> > references it (also apparently not in this case).

That's not necessarily the case though, given how the code currently
works. In a bunch of places the SLRUs are accessed *before* having been
made consistent by WAL replay. Especially if several checkpoints/vacuums
happened during the base backup, the assumed state of the data directory
soon after the initial start (i.e. the mxacts the checkpoints refer to)
and the actual state of pg_multixact/ won't necessarily match at all.

> Exactly --- that's the spot at which I am, also. I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, modulo some changes to the counters (the file that didn't
> exist was 004B rather than 0046). I'm still at a loss as to what the
> failure mode is. We must be missing some crucial detail ...

I might have missed it in this already long thread. Could you share a
bunch of details about the case? It'd be very interesting to see the
contents of the backup label (to see where start/end are), the contents
of the initial checkpoint (to see which mxacts we assume to exist at
start) and what the initial contents of pg_multixact are (to match up).

Greetings,

Andres Freund
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date
Andres Freund wrote:
> On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
> > Thomas Munro wrote:
> > > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > > > My guess is that the file existed, and perhaps had one or more pages,
> > > > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > > > back. read() returns 0 in this case but doesn't set errno.
> > > >
> > > > I didn't find a way to set things so that the file exists but is of
> > > > shorter contents than oldestMulti by the time the checkpoint record is
> > > > replayed.
> > >
> > > I'm just starting to learn about the recovery machinery, so forgive me
> > > if I'm missing something basic here, but I just don't get this. As I
> > > understand it, offsets/0046 should either have been copied with that
> > > page present in it if it existed before the backup started (apparently
> > > not in this case), or extended to contain it by WAL records that come
> > > after the backup label but before the checkpoint record that
> > > references it (also apparently not in this case).
>
> That's not necessarily the case though, given how the code currently
> works. In a bunch of places the SLRUs are accessed *before* having been
> made consistent by WAL replay. Especially if several checkpoints/vacuums
> happened during the base backup, the assumed state of the data directory
> soon after the initial start (i.e. the mxacts the checkpoints refer to)
> and the actual state of pg_multixact/ won't necessarily match at all.

Well, the log extract is quite simple; it says that as soon as the
standby starts replaying WAL, it fetches exactly one WAL segment, which
contains exactly one online checkpoint record; this record contains
oldestMulti=4624559, which belongs in offset file 0046. But the
basebackup only contains files starting from 004B onwards. The base
backup takes a long time, so my guess of what happened is that the
backup label points to the start of the WAL stream (obviously), then the
data files are copied to the standby; during this long process, a
multixact truncation happens. So by the time the base backup reaches
the pg_multixact directory, the 0046 file has been removed.

2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive
2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure

One idea I had was: what if the oldestMulti pointed to another multi
earlier in the same 0046 file, so that it is read-as-zeroes (and the
file is created), and then a subsequent multixact truncate tries to read
a later page in the file. In SlruPhysicalReadPage() this would give a
chance for open() to not fail, and then read() can fail as above.
However, we would have an earlier LOG message about "reading as zeroes".

Really, the whole question of how this code goes past the open() failure
in SlruPhysicalReadPage baffles me. I don't see any possible way for
the file to be created ...

> > Exactly --- that's the spot at which I am, also. I have had this
> > spinning in my head for three days now, and tried every single variation
> > that I could think of, but like you I was unable to reproduce the issue.
> > However, our customer took a second base backup and it failed in exactly
> > the same way, modulo some changes to the counters (the file that didn't
> > exist was 004B rather than 0046). I'm still at a loss as to what the
> > failure mode is. We must be missing some crucial detail ...
>
> I might have missed it in this already long thread. Could you share a
> bunch of details about the case? It'd be very interesting to see the
> contents of the backup label (to see where start/end are), the contents
> of the initial checkpoint (to see which mxacts we assume to exist at
> start) and what the initial contents of pg_multixact are (to match up).

pg_xlogdump says about the checkpoint record:

rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 73DD/AD834470, prev 73DD/AD8343B8, bkp: 0000, desc: checkpoint: redo 73DD/4003F648; tli 1; prev tli 1; fpw true; xid 1/3530575042; oid 81303440; multi 14003656; offset 66311341; oldest xid 2530472730 in DB 17081243; oldest multi 4624559 in DB 17081243; oldest running xid 3530551992; online

I don't have the backup label. But a listing of the files in
pg_multixact/offsets on the master has this:

./offsets:
total 35816
drwx------ 2 postgres postgres 4096 May 27 16:25 .
drwx------ 4 postgres postgres 4096 Jan 3 03:32 ..
-rw------- 1 postgres postgres 262144 Apr 20 13:07 004C
-rw------- 1 postgres postgres 262144 Apr 20 14:36 004D
-rw------- 1 postgres postgres 262144 Apr 20 14:39 004E
-rw------- 1 postgres postgres 262144 Apr 20 14:43 004F
-rw------- 1 postgres postgres 262144 Apr 20 15:06 0050
-rw------- 1 postgres postgres 262144 Apr 20 15:08 0051
-rw------- 1 postgres postgres 262144 Apr 20 15:11 0052
-rw------- 1 postgres postgres 262144 Apr 20 15:15 0053
-rw------- 1 postgres postgres 262144 Apr 20 15:20 0054
-rw------- 1 postgres postgres 262144 Apr 20 15:23 0055
-rw------- 1 postgres postgres 262144 Apr 20 15:25 0056
-rw------- 1 postgres postgres 262144 Apr 20 15:28 0057
-rw------- 1 postgres postgres 262144 Apr 20 15:31 0058
-rw------- 1 postgres postgres 262144 Apr 20 15:33 0059
-rw------- 1 postgres postgres 262144 Apr 20 15:38 005A
-rw------- 1 postgres postgres 262144 Apr 20 15:39 005B
-rw------- 1 postgres postgres 262144 Apr 20 15:41 005C
-rw------- 1 postgres postgres 262144 Apr 20 15:42 005D
-rw------- 1 postgres postgres 262144 Apr 20 15:47 005E
-rw------- 1 postgres postgres 262144 Apr 20 16:12 005F
-rw------- 1 postgres postgres 262144 Apr 20 16:41 0060
-rw------- 1 postgres postgres 262144 Apr 20 16:50 0061
-rw------- 1 postgres postgres 262144 Apr 20 17:51 0062
-rw------- 1 postgres postgres 262144 Apr 20 19:23 0063
-rw------- 1 postgres postgres 262144 Apr 20 22:08 0064
-rw------- 1 postgres postgres 262144 Apr 21 07:29 0065
-rw------- 1 postgres postgres 262144 Apr 21 11:43 0066
-rw------- 1 postgres postgres 262144 Apr 21 14:35 0067
-rw------- 1 postgres postgres 262144 Apr 21 16:02 0068
-rw------- 1 postgres postgres 262144 Apr 21 21:51 0069
-rw------- 1 postgres postgres 262144 Apr 22 08:40 006A
-rw------- 1 postgres postgres 262144 Apr 22 12:18 006B
-rw------- 1 postgres postgres 262144 Apr 22 16:22 006C
-rw------- 1 postgres postgres 262144 Apr 23 00:01 006D
-rw------- 1 postgres postgres 262144 Apr 23 09:08 006E
-rw------- 1 postgres postgres 262144 Apr 23 13:28 006F
-rw------- 1 postgres postgres 262144 Apr 23 17:35 0070
-rw------- 1 postgres postgres 262144 Apr 24 06:37 0071
-rw------- 1 postgres postgres 262144 Apr 24 12:27 0072
-rw------- 1 postgres postgres 262144 Apr 24 16:27 0073
-rw------- 1 postgres postgres 262144 Apr 25 06:37 0074
-rw------- 1 postgres postgres 262144 Apr 25 17:18 0075
-rw------- 1 postgres postgres 262144 Apr 26 15:05 0076
-rw------- 1 postgres postgres 262144 Apr 27 07:48 0077
-rw------- 1 postgres postgres 262144 Apr 27 13:11 0078
-rw------- 1 postgres postgres 262144 Apr 27 18:13 0079
-rw------- 1 postgres postgres 262144 Apr 28 08:41 007A
-rw------- 1 postgres postgres 262144 Apr 28 14:36 007B
-rw------- 1 postgres postgres 262144 Apr 29 00:34 007C
-rw------- 1 postgres postgres 262144 Apr 29 09:44 007D
-rw------- 1 postgres postgres 262144 Apr 29 14:49 007E
-rw------- 1 postgres postgres 262144 Apr 29 19:53 007F
-rw------- 1 postgres postgres 262144 Apr 30 08:43 0080
-rw------- 1 postgres postgres 262144 Apr 30 12:41 0081
-rw------- 1 postgres postgres 262144 Apr 30 15:16 0082
-rw------- 1 postgres postgres 262144 Apr 30 15:21 0083
-rw------- 1 postgres postgres 262144 Apr 30 17:15 0084
-rw------- 1 postgres postgres 262144 May 1 07:45 0085
-rw------- 1 postgres postgres 262144 May 1 17:06 0086
-rw------- 1 postgres postgres 262144 May 2 08:37 0087
-rw------- 1 postgres postgres 262144 May 3 00:14 0088
-rw------- 1 postgres postgres 262144 May 3 18:46 0089
-rw------- 1 postgres postgres 262144 May 4 06:33 008A
-rw------- 1 postgres postgres 262144 May 4 09:29 008B
-rw------- 1 postgres postgres 262144 May 4 13:42 008C
-rw------- 1 postgres postgres 262144 May 4 18:08 008D
-rw------- 1 postgres postgres 262144 May 5 06:56 008E
-rw------- 1 postgres postgres 262144 May 5 11:06 008F
-rw------- 1 postgres postgres 262144 May 5 14:47 0090
-rw------- 1 postgres postgres 262144 May 5 19:39 0091
-rw------- 1 postgres postgres 262144 May 6 07:36 0092
-rw------- 1 postgres postgres 262144 May 6 10:30 0093
-rw------- 1 postgres postgres 262144 May 6 15:23 0094
-rw------- 1 postgres postgres 262144 May 6 22:18 0095
-rw------- 1 postgres postgres 262144 May 7 08:22 0096
-rw------- 1 postgres postgres 262144 May 7 11:31 0097
-rw------- 1 postgres postgres 262144 May 7 14:18 0098
-rw------- 1 postgres postgres 262144 May 7 16:34 0099
-rw------- 1 postgres postgres 262144 May 7 18:39 009A
-rw------- 1 postgres postgres 262144 May 7 21:42 009B
-rw------- 1 postgres postgres 262144 May 8 07:57 009C
-rw------- 1 postgres postgres 262144 May 8 12:34 009D
-rw------- 1 postgres postgres 262144 May 8 17:53 009E
-rw------- 1 postgres postgres 262144 May 9 07:22 009F
-rw------- 1 postgres postgres 262144 May 9 16:42 00A0
-rw------- 1 postgres postgres 262144 May 10 08:43 00A1
-rw------- 1 postgres postgres 262144 May 10 19:23 00A2
-rw------- 1 postgres postgres 262144 May 11 08:20 00A3
-rw------- 1 postgres postgres 262144 May 11 08:48 00A4
-rw------- 1 postgres postgres 262144 May 11 12:23 00A5
-rw------- 1 postgres postgres 262144 May 11 15:36 00A6
-rw------- 1 postgres postgres 262144 May 11 21:24 00A7
-rw------- 1 postgres postgres 262144 May 12 08:15 00A8
-rw------- 1 postgres postgres 262144 May 12 09:20 00A9
-rw------- 1 postgres postgres 262144 May 12 09:42 00AA
-rw------- 1 postgres postgres 262144 May 12 11:04 00AB
-rw------- 1 postgres postgres 262144 May 12 11:06 00AC
-rw------- 1 postgres postgres 262144 May 12 11:09 00AD
-rw------- 1 postgres postgres 262144 May 12 11:26 00AE
-rw------- 1 postgres postgres 262144 May 12 11:26 00AF
-rw------- 1 postgres postgres 262144 May 12 11:31 00B0
-rw------- 1 postgres postgres 262144 May 12 12:06 00B1
-rw------- 1 postgres postgres 262144 May 12 12:37 00B2
-rw------- 1 postgres postgres 262144 May 12 13:18 00B3
-rw------- 1 postgres postgres 262144 May 12 14:34 00B4
-rw------- 1 postgres postgres 262144 May 12 14:29 00B5
-rw------- 1 postgres postgres 262144 May 12 14:44 00B6
-rw------- 1 postgres postgres 262144 May 12 15:16 00B7
-rw------- 1 postgres postgres 262144 May 12 15:26 00B8
-rw------- 1 postgres postgres 262144 May 12 15:35 00B9
-rw------- 1 postgres postgres 262144 May 12 15:47 00BA
-rw------- 1 postgres postgres 262144 May 12 16:02 00BB
-rw------- 1 postgres postgres 262144 May 12 16:18 00BC
-rw------- 1 postgres postgres 262144 May 12 17:00 00BD
-rw------- 1 postgres postgres 262144 May 12 18:48 00BE
-rw------- 1 postgres postgres 262144 May 12 19:59 00BF
-rw------- 1 postgres postgres 262144 May 12 21:39 00C0
-rw------- 1 postgres postgres 262144 May 13 05:31 00C1
-rw------- 1 postgres postgres 262144 May 13 07:35 00C2
-rw------- 1 postgres postgres 262144 May 13 08:16 00C3
-rw------- 1 postgres postgres 262144 May 13 08:30 00C4
-rw------- 1 postgres postgres 262144 May 13 09:00 00C5
-rw------- 1 postgres postgres 262144 May 13 09:19 00C6
-rw------- 1 postgres postgres 262144 May 13 09:41 00C7
-rw------- 1 postgres postgres 262144 May 13 10:03 00C8
-rw------- 1 postgres postgres 262144 May 13 10:17 00C9
-rw------- 1 postgres postgres 262144 May 13 10:40 00CA
-rw------- 1 postgres postgres 262144 May 13 11:05 00CB
-rw------- 1 postgres postgres 262144 May 13 11:28 00CC
-rw------- 1 postgres postgres 262144 May 13 12:02 00CD
-rw------- 1 postgres postgres 262144 May 13 12:58 00CE
-rw------- 1 postgres postgres 262144 May 13 13:06 00CF
-rw------- 1 postgres postgres 262144 May 13 13:12 00D0
-rw------- 1 postgres postgres 262144 May 13 13:21 00D1
-rw------- 1 postgres postgres 262144 May 13 13:31 00D2
-rw------- 1 postgres postgres 262144 May 13 13:40 00D3
-rw------- 1 postgres postgres 262144 May 13 13:53 00D4
-rw------- 1 postgres postgres 188416 May 27 16:40 00D5

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
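The figures in the checkpoint record and the directory listing above can be cross-checked against the 9.4 SLRU layout (2048 four-byte offset entries per 8 KB page, 32 pages per segment; offsets segment files are named with four uppercase hex digits). A small sketch of the mapping, assuming that layout:

```python
OFFSETS_PER_PAGE = 2048    # 8192-byte page / 4-byte MultiXactOffset
PAGES_PER_SEGMENT = 32
PAGE_SIZE = 8192

def segment_name(mxid: int) -> str:
    """Name of the pg_multixact/offsets segment file holding this id."""
    return "%04X" % (mxid // (OFFSETS_PER_PAGE * PAGES_PER_SEGMENT))

def page_read_offset(mxid: int) -> int:
    """Byte offset of the containing page within its segment file."""
    page_in_segment = (mxid // OFFSETS_PER_PAGE) % PAGES_PER_SEGMENT
    return page_in_segment * PAGE_SIZE

print(segment_name(4624559))      # '0046' -> the file missing from the backup
print(segment_name(14003656))     # '00D5' -> the newest file in the listing
print(page_read_offset(4624559))  # 147456 -> the offset in the DETAIL message
```

So oldestMulti 4624559 lands in segment 0046 at page offset 147456, matching the FATAL/DETAIL log lines, while nextMulti 14003656 lands in 00D5, the last file on the master.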
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date
Alvaro Herrera wrote:
> Really, the whole question of how this code goes past the open() failure
> in SlruPhysicalReadPage baffles me. I don't see any possible way for
> the file to be created ...

Hmm, the checkpointer can call TruncateMultiXact when in recovery, on
restartpoints. I wonder if in recovery this could trigger file creation.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Andres Freund
Date
On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > That's not necessarily the case though, given how the code currently
> > works. In a bunch of places the SLRUs are accessed *before* having been
> > made consistent by WAL replay. Especially if several checkpoints/vacuums
> > happened during the base backup, the assumed state of the data directory
> > soon after the initial start (i.e. the mxacts the checkpoints refer to)
> > and the actual state of pg_multixact/ won't necessarily match at all.
>
> Well, the log extract is quite simple; it says that as soon as the
> standby starts replaying WAL, it fetches exactly one WAL segment, which
> contains exactly one online checkpoint record; this record contains
> oldestMulti=4624559, which belongs in offset file 0046. But the
> basebackup only contains files starting from 004B onwards. The base
> backup takes a long time, so my guess of what happened is that the
> backup label points to the start of the WAL stream (obviously), then the
> data files are copied to the standby; during this long process, a
> multixact truncation happens. So by the time the base backup reaches
> the pg_multixact directory, the 0046 file has been removed.

Yes. That's precisely what I was concerned about above and elsewhere in
the thread. It's simply not ok to access an SLRU while not yet
consistent...

> One idea I had was: what if the oldestMulti pointed to another multi
> earlier in the same 0046 file, so that it is read-as-zeroes (and the
> file is created), and then a subsequent multixact truncate tries to read
> a later page in the file. In SlruPhysicalReadPage() this would give a
> chance for open() to not fail, and then read() can fail as above.
> However, we would have an earlier LOG message about "reading as zeroes".
>
> Really, the whole question of how this code goes past the open() failure
> in SlruPhysicalReadPage baffles me. I don't see any possible way for
> the file to be created ...

Wouldn't a previous WAL record zeroing another part of that segment
explain this? A zero-sized segment pretty much would lead to this
error, right? Or were you able to check how things look after the
failure?

> 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive
> 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
> 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
> 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
> 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure

From this it's not entirely clear where this error was triggered from.

Greetings,

Andres Freund
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date
Andres Freund wrote: > On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote: > > One idea I had was: what if the oldestMulti pointed to another multi > > earlier in the same 0046 file, so that it is read-as-zeroes (and the > > file is created), and then a subsequent multixact truncate tries to read > > a later page in the file. In SlruPhysicalReadPage() this would give a > > change for open() to not fail, and then read() can fail as above. > > However, we would have an earlier LOG message about "reading as zeroes". > > > > Really, the whole question of how this code goes past the open() failure > > in SlruPhysicalReadPage baffles me. I don't see any possible way for > > the file to be created ... > > Wouldn't a previous WAL record zeroing another part of that segment > explain this? A zero sized segment pretty much would lead to this error, > right? Or were you able to check how things look after the failure? But why would there be a previous WAL record zeroing another part of that segment? Note that this segment is very old -- hasn't been written in quite a while, it's certainly not in slru buffers anymore. > > 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode > > 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive > > 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559 > > 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset147456: Success. > > 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1 > > 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure > > From this isn't not entirely clear where this error was triggered from. 
Well, reading code, it seems reasonable to assume that replay of the checkpoint record I mentioned leads to that error message when the file exists but is not long enough to contain the given offset. There are no MultiXact WAL records in the segment. Also note that there's no other "restored log file" message after the "entering standby mode" message. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date:
On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund <andres@anarazel.de> wrote: >>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning >>> > the disk it'll always get one at a segment boundary, right? I'm not sure >>> > that's actually ok; because the value at the beginning of the segment >>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have >>> > filled the page with zeros. >>> > >>> > I think that should be harmless, the worst that can happen is that >>> > oldestOffset erroneously is 0, which should be correct, even though >>> > possibly overly conservative, in these cases. >>> >>> Uh oh. That seems like a real bad problem for this approach. What >>> keeps that from being the opposite of too conservative? There's no >>> "safe" value in a circular numbering space. >> >> I think it *might* (I'm really jetlagged) be fine because that'll only >> happen after an upgrade from < 9.3. And in that case we initialize >> nextOffset to 0. That ought to save us? > > That's pretty much betting the farm on the bugs we know about today > being the only ones there are. That seems imprudent. So here's a patch taking a different approach. In this approach, if the multixact whose members we want to look up doesn't exist, we don't use a later one (that might or might not be valid). Instead, we attempt to cope with the unknown. That means: 1. In TruncateMultiXact(), we don't truncate. 2. If setting the offset stop limit (the point where we refuse to create new multixact space), we don't arm the stop point. This means that if you're in this situation, you run without member wraparound protection until it's corrected. A message gets logged once per checkpoint telling you that you have this problem, and another message gets logged when things get straightened out and the guards are enabled. 3. 
If setting the vacuum force point, we assume that it's appropriate to immediately force vacuum. I've only tested this very lightly - this is just to see what you and Noah and others think of the approach. As compared with the previous approach, it has the advantage of making minimal assumptions about the sanity of what's on disk. It has the disadvantage that, for some people, the member-wraparound guard won't be enabled at startup -- but note that those people can't start 9.3.7/9.4.2 *at all*, so currently they are either running without member wraparound protection anyway (if they haven't upgraded to those releases) or they're down entirely. Another disadvantage is that we'll be triggering what may be quite a bit of autovacuum activity for some people, which could be painful. On the plus side, they'll hopefully end up with sane relminmxid and datminmxid guards afterwards. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date:
Robert Haas wrote: > So here's a patch taking a different approach. I tried to apply this to 9.3 but it's messy because of pgindent. Would anyone have a problem with me backpatching a pgindent run of multixact.c? Also, you have a new function SlruPageExists, but we already have SimpleLruDoesPhysicalPageExist. No need for two copies of pretty much the same code, I think. Your code uses fstat() instead of lseek(.., SEEK_END) but the exact technique used is probably not relevant. I think I like this approach better than your other patch, FWIW -- mainly because it seems simpler. Will review. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Thomas Munro
Date:
On Mon, Jun 1, 2015 at 4:55 PM, Noah Misch <noah@leadboat.com> wrote: > While testing this (with inconsistent-multixact-fix-master.patch applied, > FWIW), I noticed a nearby bug with a similar symptom. TruncateMultiXact() > omits the nextMXact==oldestMXact special case found in each other > find_multixact_start() caller, so it reads the offset of a not-yet-created > MultiXactId. The usual outcome is to get rangeStart==0, so we truncate less > than we could. This can't make us truncate excessively, because > nextMXact==oldestMXact implies no table contains any mxid. If nextMXact > happens to be the first of a segment, an error is possible. Procedure: > > 1. Make a fresh cluster. > 2. UPDATE pg_database SET datallowconn = true > 3. Consume precisely 131071 mxids. Number of offsets per mxid is unimportant. > 4. vacuumdb --freeze --all > > Expected state after those steps: > $ pg_controldata | grep NextMultiXactId > Latest checkpoint's NextMultiXactId: 131072 > > Checkpoint will fail like this: > 26699 2015-05-31 17:22:33.134 GMT LOG: statement: checkpoint > 26661 2015-05-31 17:22:33.134 GMT DEBUG: performing replication slot checkpoint > 26661 2015-05-31 17:22:33.136 GMT ERROR: could not access status of transaction 131072 > 26661 2015-05-31 17:22:33.136 GMT DETAIL: Could not open file "pg_multixact/offsets/0002": No such file or directory. > 26699 2015-05-31 17:22:33.234 GMT ERROR: checkpoint request failed > 26699 2015-05-31 17:22:33.234 GMT HINT: Consult recent messages in the server log for details. > 26699 2015-05-31 17:22:33.234 GMT STATEMENT: checkpoint > > This does not block startup, and creating one mxid hides the problem again. > Thus, it is not a top-priority bug like some other parts of this thread. I > mention it today mostly so it doesn't surprise hackers testing other fixes. Thanks. As mentioned elsewhere in the thread, I discovered that the same problem exists for page boundaries, with a different error message. 
I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1 and master with the same results: FATAL: could not access status of transaction 2048 DETAIL: Could not read from file "pg_multixact/offsets/0000" at offset 8192: Undefined error: 0. FATAL: could not access status of transaction 131072 DETAIL: Could not open file "pg_multixact/offsets/0002": No such file or directory. But, yeah, this isn't the bug we're looking for. -- Thomas Munro http://www.enterprisedb.com
Attachments
On Wed, Jun 03, 2015 at 04:53:46PM -0400, Robert Haas wrote: > So here's a patch taking a different approach. In this approach, if > the multixact whose members we want to look up doesn't exist, we don't > use a later one (that might or might not be valid). Instead, we > attempt to cope with the unknown. That means: > > 1. In TruncateMultiXact(), we don't truncate. I like that change a lot. It's much easier to seek forgiveness for wasting <= 28 GiB of disk than for deleting visibility information wrongly. > 2. If setting the offset stop limit (the point where we refuse to > create new multixact space), we don't arm the stop point. This means > that if you're in this situation, you run without member wraparound > protection until it's corrected. A message gets logged once per > checkpoint telling you that you have this problem, and another message > gets logged when things get straightened out and the guards are > enabled. > > 3. If setting the vacuum force point, we assume that it's appropriate > to immediately force vacuum. Those seem reasonable, too. > I've only tested this very lightly - this is just to see what you and > Noah and others think of the approach. As compared with the previous > approach, it has the advantage of making minimal assumptions about the > sanity of what's on disk. It has the disadvantage that, for some > people, the member-wraparound guard won't be enabled at startup -- but > note that those people can't start 9.3.7/9.4.2 *at all*, so currently > they are either running without member wraparound protection anyway > (if they haven't upgraded to those releases) or they're down entirely. That disadvantage is negligible, considering. > Another disadvantage is that we'll be triggering what may be quite a > bit of autovacuum activity for some people, which could be painful. > On the plus side, they'll hopefully end up with sane relminmxid and > datminmxid guards afterwards. 
That sounds good so long as each table requires just one successful emergency autovacuum. I'm not seeing code to ensure that the launched autovacuum will indeed perform a full-table scan and update relminmxid; is it there? For sites that can't tolerate an autovacuum storm, what alternative can we provide? Is "SET vacuum_multixact_freeze_table_age = 0; VACUUM <table>" of every table, done before applying the minor update, sufficient? > static void > -DetermineSafeOldestOffset(MultiXactId oldestMXact) > +DetermineSafeOldestOffset(MultiXactOffset oldestMXact) Leftover change from an earlier iteration? The values passed continue to be MultiXactId values. > /* move back to start of the corresponding segment */ > - oldestOffset -= oldestOffset % > - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT); > + offsetStopLimit = oldestOffset - (oldestOffset % > + (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)); > + /* always leave one segment before the wraparound point */ > + offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT); > + > + /* if nothing has changed, we're done */ > + if (prevOffsetStopLimitKnown && offsetStopLimit == prevOffsetStopLimit) > + return; > > LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE); > - /* always leave one segment before the wraparound point */ > - MultiXactState->offsetStopLimit = oldestOffset - > - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT); > + MultiXactState->offsetStopLimit = oldestOffset; That last line needs s/oldestOffset/offsetStopLimit/, I presume. Thanks, nm
On Thu, Jun 4, 2015 at 2:42 AM, Noah Misch <noah@leadboat.com> wrote: > I like that change a lot. It's much easier to seek forgiveness for wasting <= > 28 GiB of disk than for deleting visibility information wrongly. I'm glad you like it. I concur. >> 2. If setting the offset stop limit (the point where we refuse to >> create new multixact space), we don't arm the stop point. This means >> that if you're in this situation, you run without member wraparound >> protection until it's corrected. A message gets logged once per >> checkpoint telling you that you have this problem, and another message >> gets logged when things get straightened out and the guards are >> enabled. >> >> 3. If setting the vacuum force point, we assume that it's appropriate >> to immediately force vacuum. > > Those seem reasonable, too. Cool. >> I've only tested this very lightly - this is just to see what you and >> Noah and others think of the approach. As compared with the previous >> approach, it has the advantage of making minimal assumptions about the >> sanity of what's on disk. It has the disadvantage that, for some >> people, the member-wraparound guard won't be enabled at startup -- but >> note that those people can't start 9.3.7/9.4.2 *at all*, so currently >> they are either running without member wraparound protection anyway >> (if they haven't upgraded to those releases) or they're down entirely. > > That disadvantage is negligible, considering. All right. >> Another disadvantage is that we'll be triggering what may be quite a >> bit of autovacuum activity for some people, which could be painful. >> On the plus side, they'll hopefully end up with sane relminmxid and >> datminmxid guards afterwards. > > That sounds good so long as each table requires just one successful emergency > autovacuum. I'm not seeing code to ensure that the launched autovacuum will > indeed perform a full-table scan and update relminmxid; is it there? No. Oops. 
> For sites that can't tolerate an autovacuum storm, what alternative can we > provide? Is "SET vacuum_multixact_freeze_table_age = 0; VACUUM <table>" of > every table, done before applying the minor update, sufficient? I don't know. In practical terms, they probably need to ensure that if pg_multixact/offsets/0000 does not exist, no relations have relminmxid = 1 and no remaining databases have datminmxid = 1. Exactly what it will take to get there is possibly dependent on which minor release you are running; on current minor releases, I am hopeful that what you propose is sufficient. >> static void >> -DetermineSafeOldestOffset(MultiXactId oldestMXact) >> +DetermineSafeOldestOffset(MultiXactOffset oldestMXact) > > Leftover change from an earlier iteration? The values passed continue to be > MultiXactId values. Oopsie. >> /* move back to start of the corresponding segment */ >> - oldestOffset -= oldestOffset % >> - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT); >> + offsetStopLimit = oldestOffset - (oldestOffset % >> + (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)); >> + /* always leave one segment before the wraparound point */ >> + offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT); >> + >> + /* if nothing has changed, we're done */ >> + if (prevOffsetStopLimitKnown && offsetStopLimit == prevOffsetStopLimit) >> + return; >> >> LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE); >> - /* always leave one segment before the wraparound point */ >> - MultiXactState->offsetStopLimit = oldestOffset - >> - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT); >> + MultiXactState->offsetStopLimit = oldestOffset; > > That last line needs s/oldestOffset/offsetStopLimit/, I presume. Another oops. Thanks for the review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Thanks for the review. Here's a new version. I've fixed the things Alvaro and Noah noted, and some compiler warnings about set but unused variables. I also tested it, and it doesn't quite work as hoped. If started on a cluster where oldestMultiXid is incorrectly set to 1, it starts up and indicates that the member wraparound guards are disabled. But even after everything is fixed, they don't get enabled until after the next full restart. I think that's because TruncateMultiXact() bails out too early, without calling DetermineSafeOldestOffset. My attempt at a quick fix for that problem didn't work out, so I'm posting this version for now to facilitate further review and testing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Andres Freund
Date:
Hi, On 2015-06-04 12:57:42 -0400, Robert Haas wrote: > + /* > + * Do we need an emergency autovacuum? If we're not sure, assume yes. > + */ > + return !oldestOffsetKnown || > + (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD); I think without teaching autovac about those rules, this might just lead to lots of autovac processes starting without knowing they should do something? They know about autovacuum_multixact_freeze_age, but they know neither about !oldestOffsetKnown nor, afaics, about pending offset wraparounds? > -static MultiXactOffset > -find_multixact_start(MultiXactId multi) > +static bool > +find_multixact_start(MultiXactId multi, MultiXactOffset *result) > { > MultiXactOffset offset; > int pageno; > @@ -2630,6 +2741,9 @@ find_multixact_start(MultiXactId multi) > pageno = MultiXactIdToOffsetPage(multi); > entryno = MultiXactIdToOffsetEntry(multi); > > + if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno)) > + return false; > + > /* lock is acquired by SimpleLruReadPage_ReadOnly */ > slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi); > offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno]; > @@ -2642,25 +2756,31 @@ find_multixact_start(MultiXactId multi) > I think it'd be a good idea to also return false in case of a InvalidMultiXactId - that'll be returned if the page has been zeroed. Andres
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Robert Haas
Date:
On Thu, Jun 4, 2015 at 1:27 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-06-04 12:57:42 -0400, Robert Haas wrote: >> + /* >> + * Do we need an emergency autovacuum? If we're not sure, assume yes. >> + */ >> + return !oldestOffsetKnown || >> + (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD); > > I think without teaching autovac about those rules, this might just lead > to lots of autovac processes starting without knowing they should do > something? They know about autovacuum_multixact_freeze_age, but they > know neither about !oldestOffsetKnown nor, afaics, about pending offset > wraparounds? You're right, but that's why the latest patch has changes in MultiXactMemberFreezeThreshold. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Alvaro Herrera
Date:
Alvaro Herrera wrote: > Robert Haas wrote: > > > So here's a patch taking a different approach. > > I tried to apply this to 9.3 but it's messy because of pgindent. Anyone > would have a problem with me backpatching a pgindent run of multixact.c? Done. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jun 4, 2015 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Thanks for the review. > > Here's a new version. I've fixed the things Alvaro and Noah noted, > and some compiler warnings about set but unused variables. > > I also tested it, and it doesn't quite work as hoped. If started on a > cluster where oldestMultiXid is incorrectly set to 1, it starts up and > indicates that the member wraparound guards are disabled. But even > after everything is fixed, they don't get enabled until after the next > full restart. I think that's because TruncateMultiXact() bails out > too early, without calling DetermineSafeOldestOffset. > > My attempt at a quick fix for that problem didn't work out, so I'm > posting this version for now to facilitate further review and testing. Here's a new version with some more fixes and improvements: - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset when the oldest offset became known if the now-known value happened to be zero. Fixed. - SetOffsetVacuumLimit now logs useful information at the DEBUG1 level, so that you can see that it's doing what it's supposed to. - TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the offsetStopLimit even if it can't truncate anything. This seems useless, but it's not, because it may be that the last checkpoint advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real value, and now we can actually set offsetStopLimit properly. - TruncateMultiXact no longer calls find_multixact_start when there are no remaining multixacts. This is actually a completely separate bug that goes all the way back to 9.3.0 and can potentially cause TruncateMultiXact to remove every file in pg_multixact/offsets. Restarting the cluster becomes impossible because TrimMultiXact barfs. 
- TruncateMultiXact now logs a message if the oldest multixact does not precede the earliest one on disk and is not equal to the next multixact and yet does not exist. The value of the log message is that it discovered the bug mentioned in the previous line, so I think it's earning its keep. With this version, I'm able to see that when you start up a 9.3.latest+this patch with a cluster that has a bogus value of 1 in relminmxid, datminmxid, and the control file, autovacuum vacuums everything in sight, all the values get set back to the right thing, and the next checkpoint enables the member-wraparound guards. This works with both autovacuum=on and autovacuum=off; the emergency mechanism kicks in as intended. We'll want to warn people with big databases who upgrade to 9.3.0 - 9.3.4 via pg_upgrade that they may want to pre-vacuum those tables before upgrading to avoid a vacuum storm. But generally I'm pretty happy with this: forcing those values to get fixed so that we can guard against member-space wraparound seems like the right thing to do. So, to summarize, this patch does the following: - Fixes the failure-to-start problems introduced in 9.4.2 in complicated pg_upgrade scenarios. - Prevents the new calls to find_multixact_start we added in 9.4.2 from happening during recovery, where they can only create failure scenarios. The call in TruncateMultiXact that has been there all along is not eliminated, but now handles failure more gracefully. - Fixes possible incorrect removal of every single pg_multixact/offsets file when no multixacts exist; one file should be kept. - Forces aggressive autovacuuming when the control file's oldestMultiXid doesn't point to a valid MultiXact and enables member wraparound at the next checkpoint following the correction of that problem. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Thu, Jun 4, 2015 at 5:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: > - Forces aggressive autovacuuming when the control file's > oldestMultiXid doesn't point to a valid MultiXact and enables member > wraparound at the next checkpoint following the correction of that > problem. Err, enables member wraparound *protection* at the next checkpoint, not the wraparound itself. It's worth noting that every startup will now include one of the following two messages: LOG: MultiXact member wraparound protections are now enabled Or: LOG: MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk ...where %u is probably 1 If you get the second one, you'll get the first one later after vacuum has done its thing and a checkpoint has happened. This is, obviously, some log chatter for people who don't have a problem and never have, but I think it's worth emitting the first message at startup even when there's no problem, so that people don't have to make inferences from the absence of a message. We can tell people very simply that (1) if they see the first message, everything is fine; (2) if they see the second message, autovacuum is going to clean things up and they will eventually see the first message; and (3) if they see neither message, they haven't upgraded to a fixed version yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Thomas Munro
Date:
On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Here's a new version with some more fixes and improvements: > > - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset > when the oldest offset became known if the now-known value happened to > be zero. Fixed. > > - SetOffsetVacuumLimit now logs useful information at the DEBUG1 > level, so that you can see that it's doing what it's supposed to. > > - TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the > offsetStopLimit even if it can't truncate anything. This seems > useless, but it's not, because it may be that the last checkpoint > advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real > value, and now we can actually set offsetStopLimit properly. > > - TruncateMultiXact no longer calls find_multixact_start when there > are no remaining multixacts. This is actually a completely separate > bug that goes all the way back to 9.3.0 and can potentially cause > TruncateMultiXact to remove every file in pg_multixact/offsets. > Restarting the cluster becomes impossible because TrimMultiXact barfs. > > - TruncateMultiXact now logs a message if the oldest multixact does > not precede the earliest one on disk and is not equal to the next > multixact and yet does not exist. The value of the log message is > that it discovered the bug mentioned in the previous line, so I think > it's earning its keep. > > With this version, I'm able to see that when you start up a > 9.3.latest+this patch with a cluster that has a bogus value of 1 in > relminmxid, datminmxid, and the control file, autovacuum vacuums > everything in sight, all the values get set back to the right thing, > and the next checkpoint enables the member-wraparound guards. This > works with both autovacuum=on and autovacuum=off; the emergency > mechanism kicks in as intended. 
We'll want to warn people with big > databases who upgrade to 9.3.0 - 9.3.4 via pg_upgrade that they may > want to pre-vacuum those tables before upgrading to avoid a vacuum > storm. But generally I'm pretty happy with this: forcing those values > to get fixed so that we can guard against member-space wraparound > seems like the right thing to do. > > So, to summarize, this patch does the following: > > - Fixes the failure-to-start problems introduced in 9.4.2 in > complicated pg_upgrade scenarios. > - Prevents the new calls to find_multixact_start we added in 9.4.2 > from happening during recovery, where they can only create failure > scenarios. The call in TruncateMultiXact that has been there all > along is not eliminated, but now handles failure more gracefully. > - Fixes possible incorrect removal of every single > pg_multixact/offsets file when no multixacts exist; one file should be > kept. > - Forces aggressive autovacuuming when the control file's > oldestMultiXid doesn't point to a valid MultiXact and enables member > wraparound at the next checkpoint following the correction of that > problem. With this patch, when I run the script "checkpoint-segment-boundary.sh" from http://www.postgresql.org/message-id/CAEepm=1_KbHGbmPVmkUGE5qTP+B4efoCJYS0unGo-Mc5NV=UDg@mail.gmail.com I see the following during shutdown checkpoint: LOG: could not truncate directory "pg_multixact/offsets": apparent wraparound That message comes from SimpleLruTruncate. -- Thomas Munro http://www.enterprisedb.com
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Thomas Munro
Date:
On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Here's a new version with some more fixes and improvements: >> >> - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset >> when the oldest offset became known if the now-known value happened to >> be zero. Fixed. >> >> - SetOffsetVacuumLimit now logs useful information at the DEBUG1 >> level, so that you can see that it's doing what it's supposed to. >> >> - TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the >> offsetStopLimit even if it can't truncate anything. This seems >> useless, but it's not, because it may be that the last checkpoint >> advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real >> value, and now we can actually set offsetStopLimit properly. >> >> - TruncateMultiXact no longer calls find_multixact_start when there >> are no remaining multixacts. This is actually a completely separate >> bug that goes all the way back to 9.3.0 and can potentially cause >> TruncateMultiXact to remove every file in pg_multixact/offsets. >> Restarting the cluster becomes impossible because TrimMultiXact barfs. >> >> - TruncateMultiXact now logs a message if the oldest multixact does >> not precede the earliest one on disk and is not equal to the next >> multixact and yet does not exist. The value of the log message is >> that it discovered the bug mentioned in the previous line, so I think >> it's earning its keep. >> >> With this version, I'm able to see that when you start up a >> 9.3.latest+this patch with a cluster that has a bogus value of 1 in >> relminmxid, datminmxid, and the control file, autovacuum vacuums >> everything in sight, all the values get set back to the right thing, >> and the next checkpoint enables the member-wraparound guards. This >> works with both autovacuum=on and autovacuum=off; the emergency >> mechanism kicks in as intended. 
We'll want to warn people with big >> databases who upgrade to 9.3.0 - 9.3.4 via pg_upgrade that they may >> want to pre-vacuum those tables before upgrading to avoid a vacuum >> storm. But generally I'm pretty happy with this: forcing those values >> to get fixed so that we can guard against member-space wraparound >> seems like the right thing to do. >> >> So, to summarize, this patch does the following: >> >> - Fixes the failure-to-start problems introduced in 9.4.2 in >> complicated pg_upgrade scenarios. >> - Prevents the new calls to find_multixact_start we added in 9.4.2 >> from happening during recovery, where they can only create failure >> scenarios. The call in TruncateMultiXact that has been there all >> along is not eliminated, but now handles failure more gracefully. >> - Fixes possible incorrect removal of every single >> pg_multixact/offsets file when no multixacts exist; one file should be >> kept. >> - Forces aggressive autovacuuming when the control file's >> oldestMultiXid doesn't point to a valid MultiXact and enables member >> wraparound at the next checkpoint following the correction of that >> problem. > > With this patch, when I run the script > "checkpoint-segment-boundary.sh" from > http://www.postgresql.org/message-id/CAEepm=1_KbHGbmPVmkUGE5qTP+B4efoCJYS0unGo-Mc5NV=UDg@mail.gmail.com > I see the following during shutdown checkpoint: > > LOG: could not truncate directory "pg_multixact/offsets": apparent wraparound > > That message comes from SimpleLruTruncate. Suggested patch attached. -- Thomas Munro http://www.enterprisedb.com
Attachments
On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: > Here's a new version with some more fixes and improvements: I read through this version and found nothing to change. I encourage other hackers to study the patch, though. The surrounding code is challenging. > With this version, I'm able to see that when you start up a > 9.3.latest+this patch with a cluster that has a bogus value of 1 in > relminmxid, datminmxid, and the control file, autovacuum vacuums > everything in sight, all the values get set back to the right thing, > and the next checkpoint enables the member-wraparound guards. Nice.
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From
Thomas Munro
Date:
On Fri, Jun 5, 2015 at 1:47 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Here's a new version with some more fixes and improvements: >>> [...] >> >> With this patch, when I run the script >> "checkpoint-segment-boundary.sh" from >> http://www.postgresql.org/message-id/CAEepm=1_KbHGbmPVmkUGE5qTP+B4efoCJYS0unGo-Mc5NV=UDg@mail.gmail.com >> I see the following during shutdown checkpoint: >> >> LOG: could not truncate directory "pg_multixact/offsets": apparent wraparound >> >> That message comes from SimpleLruTruncate. > > Suggested patch attached. Is it a problem that we don't drop/forget page buffers from the members SLRU (unlike SimpleLruTruncate, which is used for the offsets SLRU)? I may be missing something but it seems to me that it isn't, because (1) CheckPointMultiXact is called to flush any dirty pages to disk before TruncateMultiXact is called and (2) no pages older than the one holding the oldest offset should be dirtied after CheckPointMultiXact runs (member space is 'append only', at least until it is recycled), so any pages in the SLRU whose underlying file has been truncated should just naturally fall out of the LRU slots. So they can't create problems by being written to disk after the unlink. -- Thomas Munro http://www.enterprisedb.com
On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch <noah@leadboat.com> wrote: > On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: >> Here's a new version with some more fixes and improvements: > > I read through this version and found nothing to change. I encourage other > hackers to study the patch, though. The surrounding code is challenging. Andres tested this and discovered that my changes to find_multixact_start() were far more creative than intended. Committed and back-patched with a trivial fix for that stupidity and a novel-length explanation of the changes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch <noah@leadboat.com> wrote: >> I read through this version and found nothing to change. I encourage other >> hackers to study the patch, though. The surrounding code is challenging. > Andres tested this and discovered that my changes to > find_multixact_start() were far more creative than intended. > Committed and back-patched with a trivial fix for that stupidity and a > novel-length explanation of the changes. So where are we on this? Are we ready to schedule a new set of back-branch releases? If not, what issues remain to be looked at? regards, tom lane
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-05 11:43:45 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch <noah@leadboat.com> wrote: > >> I read through this version and found nothing to change. I encourage other > >> hackers to study the patch, though. The surrounding code is challenging. > > > Andres tested this and discovered that my changes to > > find_multixact_start() were far more creative than intended. > > Committed and back-patched with a trivial fix for that stupidity and a > > novel-length explanation of the changes. > > So where are we on this? Are we ready to schedule a new set of > back-branch releases? If not, what issues remain to be looked at? We're currently still doing bad things while the database is in an inconsistent state (i.e. read from SLRUs and truncate based on the results of that). It's quite easy to reproduce base backup startup failures. On the other hand, that's not new. And the fix requires, afaics, a new type of WAL record (issued very infrequently). I'll post a first version of the patch, rebased on top of Robert's commit, tonight or tomorrow. I guess we can then decide what we'd like to do.
On Fri, Jun 5, 2015 at 12:00 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-06-05 11:43:45 -0400, Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >> > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch <noah@leadboat.com> wrote: >> >> I read through this version and found nothing to change. I encourage other >> >> hackers to study the patch, though. The surrounding code is challenging. >> >> > Andres tested this and discovered that my changes to >> > find_multixact_start() were far more creative than intended. >> > Committed and back-patched with a trivial fix for that stupidity and a >> > novel-length explanation of the changes. >> >> So where are we on this? Are we ready to schedule a new set of >> back-branch releases? If not, what issues remain to be looked at? > > We're currently still doing bad things while the database is in an > inconsistent state (i.e. read from SLRUs and truncate based on the > results of that). It's quite easy to reproduce base backup startup > failures. > > On the other hand, that's not new. And the fix requires, afaics, a new > type of WAL record (issued very infrequently). I'll post a first version > of the patch, rebased on top of Robert's commit, tonight or tomorrow. I > guess we can then decide what we'd like to do. There are at least two other known issues that seem like they should be fixed before we release: 1. The problem that we might truncate an SLRU members page away when it's in the buffers, but not drop it from the buffers, leading to a failure when we try to write it later. 2. Thomas's bug fix for another longstanding bug that occurs when you run his checkpoint-segment-boundary.sh script. I think we might want to try to fix one or both of those before cutting a new release. I'm less sold on the idea of installing WAL-logging in this minor release. That probably needs to be done, but right now we've got stuff that worked in early 9.3.X releases and is now broken, and I'm in favor of fixing that first.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jun 5, 2015 at 12:00 PM, Andres Freund <andres@anarazel.de> wrote: >> On 2015-06-05 11:43:45 -0400, Tom Lane wrote: >>> So where are we on this? Are we ready to schedule a new set of >>> back-branch releases? If not, what issues remain to be looked at? >> We're currently still doing bad things while the database is in an >> inconsistent state (i.e. read from SLRUs and truncate based on the >> results of that). It's quite easy to reproduce base backup startup >> failures. >> >> On the other hand, that's not new. And the fix requires, afaics, a new >> type of WAL record (issued very infrequently). I'll post a first version >> of the patch, rebased on top of Robert's commit, tonight or tomorrow. I >> guess we can then decide what we'd like to do. > There are at least two other known issues that seem like they should > be fixed before we release: > 1. The problem that we might truncate an SLRU members page away when > it's in the buffers, but not drop it from the buffers, leading to a > failure when we try to write it later. > 2. Thomas's bug fix for another longstanding bug that occurs when you > run his checkpoint-segment-boundary.sh script. > I think we might want to try to fix one or both of those before > cutting a new release. I'm less sold on the idea of installing > WAL-logging in this minor release. That probably needs to be done, > but right now we've got stuff that worked in early 9.3.X releases and > is now broken, and I'm in favor of fixing that first. Okay, but if we're not committing today to a release wrap on Monday, I don't see it happening till after PGCon. regards, tom lane
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > There are at least two other known issues that seem like they should > > be fixed before we release: > > > 1. The problem that we might truncate an SLRU members page away when > > it's in the buffers, but not drop it from the buffers, leading to a > > failure when we try to write it later. > > > 2. Thomas's bug fix for another longstanding bug that occurs when you > > run his checkpoint-segment-boundary.sh script. > > > I think we might want to try to fix one or both of those before > > cutting a new release. I'm less sold on the idea of installing > > WAL-logging in this minor release. That probably needs to be done, > > but right now we've got stuff that worked in early 9.3.X releases and > > is now broken, and I'm in favor of fixing that first. > > Okay, but if we're not committing today to a release wrap on Monday, > I don't see it happening till after PGCon. In that case, I think we should get a release out next week. The current situation is rather badly broken and dangerous, and the above two bugs are nowhere near as problematic. If we can get fixes for these over the weekend, that would be an additional bonus. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-05 14:33:12 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > 1. The problem that we might truncate an SLRU members page away when > > it's in the buffers, but not drop it from the buffers, leading to a > > failure when we try to write it later. I've got a fix for this, and about three other issues I found during development of the new truncation codepath. I'll commit the fix tomorrow. > > I think we might want to try to fix one or both of those before > > cutting a new release. I'm less sold on the idea of installing > > WAL-logging in this minor release. That probably needs to be done, > > but right now we've got stuff that worked in early 9.3.X releases and > > is now broken, and I'm in favor of fixing that first. I've implemented this, and so far it removes more code than it adds. It's imo also a pretty clear win in how understandable the code is. The remaining work, besides testing, is primarily going over lots of comments and updating them. Some of them are outdated by the patch, and some already were. Will post tonight, together with the other fixes, after I get back from climbing. My gut feeling right now is that it's a significant improvement, and that it'll be reasonable to include it. But I'd definitely like some independent testing for it, and I'm not sure if that's doable in time for the wrap. > Okay, but if we're not committing today to a release wrap on Monday, > I don't see it happening till after PGCon. I wonder if, with all the recent, err, training, we could wrap it on Tuesday instead. Independent of the truncation rework going in or not, an additional work day to go over all the changes and do some more testing would be good from my POV. Greetings, Andres Freund
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Robert Haas wrote: > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch <noah@leadboat.com> wrote: > > On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: > >> Here's a new version with some more fixes and improvements: > > > > I read through this version and found nothing to change. I encourage other > > hackers to study the patch, though. The surrounding code is challenging. > > Andres tested this and discovered that my changes to > find_multixact_start() were far more creative than intended. > Committed and back-patched with a trivial fix for that stupidity and a > novel-length explanation of the changes. I think novel-length is fine. The bug itself is pretty complicated, and so is the solution. Many thanks for working through this. FWIW I tested with the (attached) reproducer script(*) for my customer's problem, and it works fine now where it failed before. One thing which surprised me a bit, but in hindsight should have been pretty obvious, is that the "multixact member protections are fully armed" message is only printed once the standby gets out of recovery, instead of when it reaches consistent state or some such earlier point. (*) Actually the script cheats to get past an issue, which I couldn't actually figure out, that a file can't be "seeked"; I just do a "touch" to create an empty file there, which causes the same error situation as on my customer's log. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Steve Kehlet
Date:
On Fri, Jun 5, 2015 at 11:47 AM Andres Freund <andres@anarazel.de> wrote:
But I'd definitely like some
independent testing for it, and I'm not sure if that's doable in time
for the wrap.
I'd be happy to test on my database that was broken, for however much that helps. It's a VM so I can easily revert back as needed. I'm just losing track of all the patches, and what's committed and what I need to manually apply :-). I was about to test what's on REL9_4_STABLE. Let me know if I should do this.
Thanks so much everyone.
On Fri, Jun 5, 2015 at 2:47 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-06-05 14:33:12 -0400, Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >> > 1. The problem that we might truncate an SLRU members page away when >> > it's in the buffers, but not drop it from the buffers, leading to a >> > failure when we try to write it later. > > I've got a fix for this, and about three other issues I found during > development of the new truncation codepath. > > I'll commit the fix tomorrow. OK. Then I think we should release next week, so we get the fixes we have out before PGCon. The current situation is not good. >> > I think we might want to try to fix one or both of those before >> > cutting a new release. I'm less sold on the idea of installing >> > WAL-logging in this minor release. That probably needs to be done, >> > but right now we've got stuff that worked in early 9.3.X releases and >> > is now broken, and I'm in favor of fixing that first. > > I've implemented this, and so far it removes more code than it > adds. It's imo also a pretty clear win in how understandable the code > is. The remaining work, besides testing, is primarily going over lots > of comments and updating them. Some of them are outdated by the patch, > and some already were. > > Will post tonight, together with the other fixes, after I get back from > climbing. > > My gut feeling right now is that it's a significant improvement, and > that it'll be reasonable to include it. But I'd definitely like some > independent testing for it, and I'm not sure if that's doable in time > for the wrap. I think we would be foolish to rush that part into the tree. We probably got here in the first place by rushing the last round of fixes too much; let's try not to double down on that mistake. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jun 5, 2015 at 2:36 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: > >> > There are at least two other known issues that seem like they should >> > be fixed before we release: >> >> > 1. The problem that we might truncate an SLRU members page away when >> > it's in the buffers, but not drop it from the buffers, leading to a >> > failure when we try to write it later. >> >> > 2. Thomas's bug fix for another longstanding but that occurs when you >> > run his checkpoint-segment-boundary.sh script. >> >> > I think we might want to try to fix one or both of those before >> > cutting a new release. I'm less sold on the idea of installing >> > WAL-logging in this minor release. That probably needs to be done, >> > but right now we've got stuff that worked in early 9.3.X release and >> > is now broken, and I'm in favor of fixing that first. >> >> Okay, but if we're not committing today to a release wrap on Monday, >> I don't see it happening till after PGCon. > > In that case, I think we should get a release out next week. The > current situation is rather badly broken and dangerous, and the above > two bugs are nowhere as problematic. If we can get fixes for these over > the weekend, that would be additional bonus. Yeah, I think I agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 5, 2015 at 2:47 PM, Andres Freund <andres@anarazel.de> wrote: >> On 2015-06-05 14:33:12 -0400, Tom Lane wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>> > 1. The problem that we might truncate an SLRU members page away when >>> > it's in the buffers, but not drop it from the buffers, leading to a >>> > failure when we try to write it later. >> I've got a fix for this, and about three other issues I found during development of the new truncation codepath. >> I'll commit the fix tomorrow. > OK. Then I think we should release next week, so we get the fixes we have out before PGCon. The current situation is not good. >>> > I think we might want to try to fix one or both of those before cutting a new release. I'm less sold on the idea of installing WAL-logging in this minor release. That probably needs to be done, but right now we've got stuff that worked in early 9.3.X releases and is now broken, and I'm in favor of fixing that first. >> I've implemented this, and so far it removes more code than it adds. It's imo also a pretty clear win in how understandable the code is. The remaining work, besides testing, is primarily going over lots of comments and updating them. Some of them are outdated by the patch, and some already were. >> Will post tonight, together with the other fixes, after I get back from climbing. >> My gut feeling right now is that it's a significant improvement, and that it'll be reasonable to include it. But I'd definitely like some independent testing for it, and I'm not sure if that's doable in time for the wrap. > I think we would be foolish to rush that part into the tree. We probably got here in the first place by rushing the last round of fixes too much; let's try not to double down on that mistake.
My problem with that approach is that I think the code has gotten significantly more complex in the last few weeks. I have very little trust that the interactions between vacuum, the deferred truncations in the checkpointer, the state management in shared memory, and recovery are correct. There are just too many non-local subtleties here. I don't know what the right thing to do here is. --- Please excuse brevity and formatting - I am writing this on my mobile phone.
Andres Freund <andres@anarazel.de> writes: > On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas <robertmhaas@gmail.com> wrote: >> I think we would be foolish to rush that part into the tree. We >> probably got here in the first place by rushing the last round of >> fixes too much; let's try not to double down on that mistake. > My problem with that approach is that I think the code has gotten significantly more complex in the last few weeks. I have very little trust that the interactions between vacuum, the deferred truncations in the checkpointer, the state management in shared memory and recovery are correct. There are just too many non-local subtleties here. > I don't know what the right thing to do here is. My gut feeling is that rushing to make a release date is the wrong thing. If we have confidence that we can ship something on Monday that is materially more trustworthy than the current releases, then let's aim to do that; but let's ship only patches we are confident in. We can do another set of releases later that incorporate additional fixes. (As some wise man once said, there's always another bug.) If what you're saying is that you don't trust the already-committed patch very much, then maybe we'd better hold off another couple weeks for more review and testing. regards, tom lane
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
"Joshua D. Drake"
Date:
On 06/05/2015 01:56 PM, Tom Lane wrote: > If we have confidence that we can ship something on Monday that is > materially more trustworthy than the current releases, then let's aim to > do that; but let's ship only patches we are confident in. We can do > another set of releases later that incorporate additional fixes. (As some > wise man once said, there's always another bug.) > > If what you're saying is that you don't trust the already-committed patch > very much, then maybe we'd better hold off another couple weeks for more > review and testing. > > regards, tom lane > I believe there are likely quite a few parties willing to help test, if we knew how? Sincerely, jD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Joshua D. Drake wrote: > I believe there are likely quite a few parties willing to help test, if we > knew how? The code involved is related to checkpoints, pg_basebackups that take a long time to run, and multixact freezing and truncation. If you can set up test servers that eat lots of multixacts(*), then have many multixact freezes and truncations occur, that would probably hit the right spots. (You can set very frequent freezing by lowering the vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings. Perhaps changing multixact_freeze_max_age would lead to other interesting results too. Truncation occurs during checkpoint, some time after freezing, so it's probably good that those are frequent too.) Also, pg_upgrade prior to 9.3.4 is able to produce a database with invalid oldestMulti=1, if you start from a 9.2-or-earlier database that has already consumed some number of multis. It would be good to test starting from those, too, just to make sure the mechanism that deals with that is good. There are at least two variations: those that have nextMulti larger than 65k but less than 2 billion, and those that have nextMulti closer to 4 billion. (I think a 9.2 database with nextMulti less than 65k is uninteresting, because the resulting oldestMulti=1 is the correct value there.) (*) Thomas Munro posted a sample program that does that; I believe with minimal changes you could turn it into an infinite loop instead of a pre-set number of iterations. Also, perhaps it's possible to come up with programs that consume multixacts even faster than that, and that create larger multixacts too. All variations are useful. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
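The aggressive-freezing setup Alvaro describes can be sketched as a configuration fragment for a throwaway test cluster. The data directory path and every value below are illustrative assumptions for testing only, not tuning advice:

```shell
# Sketch only: write aggressive multixact-freezing settings into a throwaway
# test cluster's postgresql.conf. Path and values are assumptions chosen for
# illustration, not recommendations for production.
PGDATA=/tmp/mxtest-data
mkdir -p "$PGDATA"
cat >> "$PGDATA/postgresql.conf" <<'EOF'
# freeze multixacts as soon as possible
vacuum_multixact_freeze_min_age = 0
vacuum_multixact_freeze_table_age = 0
# trigger anti-wraparound autovacuum far earlier than the default
autovacuum_multixact_freeze_max_age = 10000000
# truncation happens at checkpoint time, so make checkpoints frequent too
checkpoint_timeout = 30s
EOF
```

Pointing an initdb'd scratch cluster at settings like these, then running a multixact-eating workload such as Thomas's program, should exercise the freeze/truncate paths frequently.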
On Fri, Jun 5, 2015 at 4:40 PM, Andres Freund <andres@anarazel.de> wrote: >>I think we would be foolish to rush that part into the tree. We >>probably got here in the first place by rushing the last round of >>fixes too much; let's try not to double down on that mistake. > > My problem with that approach is that I think the code has gotten significantly more complex in the least few weeks. Ihave very little trust that the interactions between vacuum, the deferred truncations in the checkpointer, the state managementin shared memory and recovery are correct. There's just too many non-local subtleties here. > > I don't know what the right thing to do here is. That may be true, but we don't need to get to perfect to be better than 9.4.2 and 9.4.3, where some people can't start the database. I will grant you that, if the patch I committed today introduces some regression that is even worse, life will suck. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-05 20:47:33 +0200, Andres Freund wrote: > On 2015-06-05 14:33:12 -0400, Tom Lane wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > 1. The problem that we might truncate an SLRU members page away when > > > it's in the buffers, but not drop it from the buffers, leading to a > > > failure when we try to write it later. > > I've got a fix for this, and about three other issues I found during > development of the new truncation codepath. > > I'll commit the fix tomorrow. I've looked through multixact.c/slru.c and afaics there currently is, as observed by Thomas, no codepath that exercises the broken behaviour. Due to the way checkpoints and SLRU truncation are linked, "problematic" pages will have been flushed beforehand. I think we should fix this either way as it seems like a bad trap, but I'd rather commit it after the next minor releases are out.
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-05 16:56:18 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas <robertmhaas@gmail.com> wrote: > >> I think we would be foolish to rush that part into the tree. We > >> probably got here in the first place by rushing the last round of > >> fixes too much; let's try not to double down on that mistake. > > > My problem with that approach is that I think the code has gotten significantly more complex in the last few weeks. I have very little trust that the interactions between vacuum, the deferred truncations in the checkpointer, the state management in shared memory and recovery are correct. There are just too many non-local subtleties here. > > > I don't know what the right thing to do here is. > > My gut feeling is that rushing to make a release date is the wrong thing. > > If we have confidence that we can ship something on Monday that is > materially more trustworthy than the current releases, then let's aim to > do that; but let's ship only patches we are confident in. We can do > another set of releases later that incorporate additional fixes. (As some > wise man once said, there's always another bug.) I've tortured hardware a fair bit with HEAD. So far it looks much better than 9.4.2+ et al. I've noticed a bunch of, to me at least, new issues: 1) the autovacuum trigger logic isn't perfect yet. I.e. especially with autovacuum=off you can get into situations where emergency vacuums aren't started when necessary. This is particularly likely to happen if either very large multixacts are used, or if the server has been shut down while emergency autovacuums were happening. No corruption ensues, but it's not easy to get out of. 2) I've managed to corrupt a cluster when a standby performed restartpoints less frequently than the master performed checkpoints. Because truncations happen in the checkpointer it's not that hard to end up with entirely full multixact slrus.
This is a problem on several fronts. We can IIUC end up truncating away the wrong data, and we can be in a bad state upon promotion. None of that is new. 3) It's really confusing that truncation (and thus the limits in shared memory) happens in checkpoints. If you hit a limit and manually do all the necessary vacuums you'll see a "good" limit in pg_database.datminmxid, but you'll still run into the error. You manually have to force a checkpoint for the truncation to actually happen. That's particularly problematic because larger installations, where I presume wraparound issues are more likely, often have a large checkpoint_timeout setting. Since none of these are really new, I don't think they should prevent us from doing a back branch release. While I'm still not convinced we're better off with 9.4.4 than with 9.4.1, we're certainly better off than with 9.4.[23] et al. If we want to go ahead with the release I plan to do a bit more testing today and tomorrow. If not I'm first going to continue working on fixing the above. I've a "good" fix for 1). I'm not 100% sure I'll feel confident with pushing if we wrap today. I am wondering if we shouldn't at least apply the portion that unconditionally sends a signal in the ERROR case. That's still an improvement. One more thing: Our testing infrastructure sucks. Without writing C code it's basically impossible to test wraparounds and such. Even if not particularly useful for non-devs, I really think we should have functions for burning xids/multixacts in core. Or at least in some extension.
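The manual escape route described in point 3 can be sketched as a SQL script to feed to psql: freezing alone advances pg_database.datminmxid, but truncation (and the limits in shared memory) only moves at the next checkpoint, so one has to be forced. Running it requires a live server and superuser access, which are assumptions here; the snippet itself only prepares the script:

```shell
# Sketch of the manual recovery sequence from point 3. Actually running it
# needs a reachable server (assumed); this just writes the SQL script.
cat > /tmp/multixact_recover.sql <<'EOF'
VACUUM FREEZE;   -- advances datminmxid; must be run in every database
CHECKPOINT;      -- truncation happens here, not at VACUUM time
SELECT datname, datminmxid FROM pg_database;  -- confirm the limits advanced
EOF
# then, for each database: psql -f /tmp/multixact_recover.sql <dbname>
```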
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-08 15:15:04 +0200, Andres Freund wrote: > 1) the autovacuum trigger logic isn't perfect yet. I.e. especially with > autovacuum=off you can get into situations where emergency vacuums > aren't started when necessary. This is particularly likely to happen > if either very large multixacts are used, or if the server has been > shut down while emergency autovacuums were happening. No corruption > ensues, but it's not easy to get out of. A first version to address this problem can be found appended to this email. Basically it does: * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD members are used, signal autovacuum once per members segment * For both members and offsets, once hitting the hard limits, signal autovacuum every time. Otherwise we lose the information when restarting the database, or when autovac is killed. I ran into this a bunch of times while testing. Regards, Andres
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Andres Freund wrote: > A first version to address this problem can be found appended to this > email. > > Basically it does: > * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD members are used, signal > autovacuum once per members segment > * For both members and offsets, once hitting the hard limits, signal > autovacuum every time. Otherwise we lose the information when > restarting the database, or when autovac is killed. I ran into this a > bunch of times while testing. I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER only causes things to happen (i.e. a new worker to be started) when autovacuum is disabled. If autovacuum is enabled, postmaster receives the signal and doesn't do anything about it, because the launcher is already running. Of course, regularly scheduled autovac workers will eventually start running, but perhaps this is not good enough. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Andres Freund wrote: >> A first version to address this problem can be found appended to this email. >> Basically it does: >> * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD members are used, signal autovacuum once per members segment >> * For both members and offsets, once hitting the hard limits, signal autovacuum every time. Otherwise we lose the information when restarting the database, or when autovac is killed. I ran into this a bunch of times while testing. > I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER only causes things to happen (i.e. a new worker to be started) when autovacuum is disabled. If autovacuum is enabled, postmaster receives the signal and doesn't do anything about it, because the launcher is already running. Of course, regularly scheduled autovac workers will eventually start running, but perhaps this is not good enough. Well that's just the same for the plain xid precedent? I'd not mind improving further, but that seems like a separate thing. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Andres Freund wrote: > On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER > > only causes things to happen (i.e. a new worker to be started) when > > autovacuum is disabled. If autovacuum is enabled, postmaster > > receives the signal and doesn't do anything about it, because the > > launcher is already running. Of course, regularly scheduled autovac > > workers will eventually start running, but perhaps this is not good > > enough. > > Well that's just the same for the plain xid precedent? I'd not mind > improving further, but that seems like a separate thing. Sure. I'm just concerned that we might be putting excessive trust on emergency workers being launched at a high pace. With normally configured systems (naptime=1min) it shouldn't be a problem, but we have seen systems with naptime set to one hour or so, and those might feel some pain; and it would get worse the more databases you have, because people might feel the need to space the autovac runs even more. (My personal alarm bells go off when I see autovac_naptime=15min or more, but apparently not everybody sees things that way.) > --- > Please excuse brevity and formatting - I am writing this on my mobile phone. I wonder if these notices are useful at all. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 8, 2015 at 1:23 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Andres Freund wrote:
>> On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER
>> > only causes things to happen (i.e. a new worker to be started) when
>> > autovacuum is disabled. If autovacuum is enabled, postmaster
>> > receives the signal and doesn't do anything about it, because the
>> > launcher is already running. Of course, regularly scheduled autovac
>> > workers will eventually start running, but perhaps this is not good
>> > enough.
>>
>> Well, that's just the same as with the plain-xid precedent? I'd not mind
>> improving further, but that seems like a separate thing.
>
> Sure. My concern is just that we might be putting excessive trust in
> emergency workers being launched at a high pace. With normally
> configured systems (naptime=1min) it shouldn't be a problem, but we have
> seen systems with naptime set to one hour or so, and those might feel
> some pain; and it would get worse the more databases you have, because
> people might feel the need to space the autovac runs even more.
>
> (My personal alarm bells go off when I see autovac_naptime=15min or
> more, but apparently not everybody sees things that way.)

Uh, I'd echo that sentiment if you did s/15min/1min/

I think Andres's patch is just improving the existing mechanism so that it's reliable, and you're proposing something notably different which might be better, but which is really a different proposal altogether.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Andres Freund
Date:
On 2015-06-08 14:23:32 -0300, Alvaro Herrera wrote:
> Sure. My concern is just that we might be putting excessive trust in
> emergency workers being launched at a high pace.

I'm not sure what to do about that. I mean, it'd not be hard to simply ignore naptime upon wraparound, but I'm not sure that'd be well received.

> (My personal alarm bells go off when I see autovac_naptime=15min or
> more, but apparently not everybody sees things that way.)

Understandably so. I'd be alarmed at much lower values than that, actually.

> > ---
> > Please excuse brevity and formatting - I am writing this on my mobile phone.
>
> I wonder if these notices are useful at all.

I only know that I'm less annoyed at reading an untrimmed/badly wrapped email if it's sent from a mobile phone, where it's hard to impossible to write a well-formatted email, than when it's sent from a full desktop. That's why I added the notice...

Andres
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Robert Haas wrote:
> On Mon, Jun 8, 2015 at 1:23 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > (My personal alarm bells go off when I see autovac_naptime=15min or
> > more, but apparently not everybody sees things that way.)
>
> Uh, I'd echo that sentiment if you did s/15min/1min/

Yeah, well, that too, I guess.

> I think Andres's patch is just improving the existing mechanism so
> that it's reliable, and you're proposing something notably different
> which might be better, but which is really a different proposal
> altogether.

Fair enough.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 08, 2015 at 03:15:04PM +0200, Andres Freund wrote:
> One more thing:
> Our testing infrastructure sucks. Without writing C code it's basically
> impossible to test wraparounds and such. Even if not particularly useful
> for non-devs, I really think we should have functions for creating/burning
> xids/multixacts in core. Or at least in some extension.

+1. This keeps coming up, so it's worth maintaining a verified and speedy implementation.
On Wed, Jun 10, 2015 at 7:16 PM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Jun 08, 2015 at 03:15:04PM +0200, Andres Freund wrote:
> > One more thing:
> > Our testing infrastructure sucks. Without writing C code it's basically
> > impossible to test wraparounds and such. Even if not particularly useful
> > for non-devs, I really think we should have functions for creating/burning
> > xids/multixacts in core. Or at least in some extension.
>
> +1. This keeps coming up, so it's worth maintaining a verified and speedy
> implementation.
+1 from me as well.
Also, I've pretty much given up on testing this area myself, because of the issue pointed out here:
I think this is the same issue as part of Andres' point 1.
It is pretty frustrating and futile to test wraparound when the database doesn't live long enough to wrap around under the high-stress conditions.
I had thought that all changes to ShmemVariableCache except nextXid should be WAL-logged at the time they occur, not just at the next checkpoint. But that wouldn't fix the problem, as the change to ShmemVariableCache has to be transactional with the change to pg_database. So it would have to be WAL-logged inside the commit record of any transaction which changes pg_database.
Cheers,
Jeff
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Steve Kehlet
Date:
Just wanted to report that I rolled back my VM to where it was with 9.4.2 installed and it wouldn't start. I installed 9.4.4 and now it starts up just fine:
> 2015-06-12 16:05:58 PDT [6453]: [1-1] LOG: database system was shut down at 2015-05-27 13:12:55 PDT
> 2015-06-12 16:05:58 PDT [6453]: [2-1] LOG: MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact 1 does not exist on disk
> 2015-06-12 16:05:58 PDT [6457]: [1-1] LOG: autovacuum launcher started
> 2015-06-12 16:05:58 PDT [6452]: [1-1] LOG: database system is ready to accept connections
> done
> server started
And this is showing up in my serverlog periodically as the emergency autovacuums are running:
> 2015-06-12 16:13:44 PDT [6454]: [1-1] LOG: MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact 1 does not exist on disk
Thank you Robert and all involved for the resolution to this.
> With the fixes introduced in this release, such a situation will result in immediate emergency autovacuuming until a correct oldestMultiXid value can be determined
Okay, I notice these vacuums are of the "to prevent wraparound" type (like VACUUM FREEZE), which do hold locks preventing ALTER TABLEs and such. Good to know; we'll plan our software updates accordingly.
Is there any risk until these autovacuums finish?
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Andres Freund wrote:
> A first version to address this problem can be found appended to this
> email.
>
> Basically it does:
> * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
>   autovacuum once per members segment
> * For both members and offsets, once hitting the hard limits, signal
>   autovacuum every time. Otherwise we lose the information when
>   restarting the database, or when autovac is killed. I ran into this a
>   bunch of times while testing.

Sounds reasonable.

I see another hole in this area. See do_start_worker() -- there we only consider the offsets limit to determine a database to be in an almost-wrapped-around state (needing emergency attention). If the database in members trouble has no pgstat entry, it might get completely ignored.

I think the way to close this hole is to call find_multixact_start() in the autovac launcher for the database with the oldest datminmxid, to determine whether we need to activate emergency mode for it. (Maybe instead of having this logic in autovacuum, it should be a new function that receives the database's datminmulti and returns a boolean indicating whether the database is in trouble or not.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 12, 2015 at 7:27 PM, Steve Kehlet <steve.kehlet@gmail.com> wrote:
> Just wanted to report that I rolled back my VM to where it was with 9.4.2
> installed and it wouldn't start. I installed 9.4.4 and now it starts up just
> fine:
>
>> 2015-06-12 16:05:58 PDT [6453]: [1-1] LOG: database system was shut down
>> at 2015-05-27 13:12:55 PDT
>> 2015-06-12 16:05:58 PDT [6453]: [2-1] LOG: MultiXact member wraparound
>> protections are disabled because oldest checkpointed MultiXact 1 does not
>> exist on disk
>> 2015-06-12 16:05:58 PDT [6457]: [1-1] LOG: autovacuum launcher started
>> 2015-06-12 16:05:58 PDT [6452]: [1-1] LOG: database system is ready to
>> accept connections
>> done
>> server started
>
> And this is showing up in my serverlog periodically as the emergency
> autovacuums are running:
>
>> 2015-06-12 16:13:44 PDT [6454]: [1-1] LOG: MultiXact member wraparound
>> protections are disabled because oldest checkpointed MultiXact 1 does not
>> exist on disk
>
> Thank you Robert and all involved for the resolution to this.
>
>> With the fixes introduced in this release, such a situation will result in
>> immediate emergency autovacuuming until a correct oldestMultiXid value can
>> be determined
>
> Okay, I notice these vacuums are of the "to prevent wraparound" type (like
> VACUUM FREEZE), that do hold locks preventing ALTER TABLEs and such. Good to
> know, we'll plan our software updates accordingly.
>
> Is there any risk until these autovacuums finish?

As long as you see only a modest number of files in pg_multixact/members, you're OK. But in theory, until that emergency autovacuuming finishes, there's nothing keeping that directory from wrapping around.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> I see another hole in this area. See do_start_worker() -- there we only
> consider the offsets limit to determine a database to be in
> almost-wrapped-around state (causing emergency attention). If the
> database in members trouble has no pgstat entry, it might get completely
> ignored.

For the record -- it was pointed out to me that this was actually fixed by 53bb309d2. \o/

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Alvaro Herrera
Date:
Thomas Munro wrote:
> Thanks. As mentioned elsewhere in the thread, I discovered that the
> same problem exists for page boundaries, with a different error
> message. I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
> and master with the same results:
>
> FATAL: could not access status of transaction 2048
> DETAIL: Could not read from file "pg_multixact/offsets/0000" at
> offset 8192: Undefined error: 0.
>
> FATAL: could not access status of transaction 131072
> DETAIL: Could not open file "pg_multixact/offsets/0002": No such file
> or directory.

So I checked this bug against current master, because it's claimed to be closed. The first script doesn't emit a message at all; the second script does emit a message:

LOG: could not truncate directory "pg_multixact/offsets": apparent wraparound

If you start and stop again, there's no more noise in the logs. That's pretty innocuous -- great.

But then I modified your script to do two segments instead of one. Then after the second cycle is done, start the server and stop it again. The end result is a bit surprising: you end up with no files in pg_multixact/offsets at all!

I applied Andres' patch to WAL-log truncations and reran the test, and the files do not automatically disappear anymore. So after that patch we might be fine, although I'd like to actually understand the failure mode here to see whether it's actually fixed.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
From:
Thomas Munro
Date:
On Wed, Jun 17, 2015 at 6:58 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>
>> Thanks. As mentioned elsewhere in the thread, I discovered that the
>> same problem exists for page boundaries, with a different error
>> message. I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
>> and master with the same results:
>>
>> FATAL: could not access status of transaction 2048
>> DETAIL: Could not read from file "pg_multixact/offsets/0000" at
>> offset 8192: Undefined error: 0.
>>
>> FATAL: could not access status of transaction 131072
>> DETAIL: Could not open file "pg_multixact/offsets/0002": No such file
>> or directory.
>
> So I checked this bug against current master, because it's claimed to be
> closed. The first script doesn't emit a message at all; the second
> script does emit a message:
>
> LOG: could not truncate directory "pg_multixact/offsets": apparent wraparound
>
> If you start and stop again, there's no more noise in the logs. That's
> pretty innocuous -- great.

Right, I included a fix for this in https://commitfest.postgresql.org/5/265/ which handles both pg_subtrans and pg_multixact, since it was lost in the noise in this thread... Hopefully someone can review that.

> But then I modified your script to do two segments instead of one. Then
> after the second cycle is done, start the server and stop it again. The
> end result is a bit surprising: you end up with no files in
> pg_multixact/offsets at all!

Ouch. I see why: latest_page_number gets initialised to a different value when you restart (computed from the oldest multixact ID, whereas during normal running it remembers the last created page number), so in this case (next == oldest, next % 2048 == 0), restarting the server moves latest_page_number forwards by one, so SimpleLruTruncate no longer bails out with the above error message and it happily deletes all files. That is conceptually OK (there are no multixacts, so no files should be OK), but see below...
Applying the patch linked above prevents this problem (it always keeps at least one multixact and therefore at least one page and therefore at least one segment, because it steps back one multixact to avoid boundary problems when oldest == next).

As for whether it's actually OK to have no files in pg_multixact/offsets: it seems that if you restart *twice* after running checkpoint-segment-boundary.sh, you finish up with earliest = 4294965248 in TruncateMultiXact, because this code assumes that there was at least one file found and then proceeds to assign (-1 * 2048) to earliest (which is unsigned).

    trunc.earliestExistingPage = -1;
    SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
    earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
    if (earliest < FirstMultiXactId)
        earliest = FirstMultiXactId;

I think this should bail out if earliestExistingPage is still -1 after the call to SlruScanDirectory.

--
Thomas Munro
http://www.enterprisedb.com