Thread: multixacts woes
My colleague Thomas Munro and I have been working with Alvaro, and also with Kevin and Amit, to fix bug #12990, a multixact-related data corruption bug.

I somehow did not realize until very recently that we actually use two SLRUs to keep track of multixacts: one for the multixacts themselves (pg_multixacts/offsets) and one for the members (pg_multixacts/members). Confusingly, members are sometimes called offsets, and offsets are sometimes called IDs, or multixacts. If either of these SLRUs wraps around, we get data loss. This comment in multixact.c explains it well:

    /*
     * Since multixacts wrap differently from transaction IDs, this logic is
     * not entirely correct: in some scenarios we could go for longer than 2
     * billion multixacts without seeing any data loss, and in some others we
     * could get in trouble before that if the new pg_multixact/members data
     * stomps on the previous cycle's data.  For lack of a better mechanism we
     * use the same logic as for transaction IDs, that is, start taking action
     * halfway around the oldest potentially-existing multixact.
     */
    multiWrapLimit = oldest_datminmxid + (MaxMultiXactId >> 1);
    if (multiWrapLimit < FirstMultiXactId)
        multiWrapLimit += FirstMultiXactId;

Apparently, we have been hanging our hat since the release of 9.3.0 on the theory that the average multixact won't ever have more than two members, and therefore the members SLRU won't overwrite itself and corrupt data. This is not good enough: we need to prevent multixact IDs from wrapping around, and we separately need to prevent multixact members from wrapping around, and the current code was conflating those things in a way that simply didn't work.

Recent commits by Alvaro and by me have mostly fixed this, but there are a few loose ends:

1. I believe that there is still a narrow race condition that can cause the multixact code to go crazy and delete all of its data when operating very near the threshold for member space exhaustion.
See http://www.postgresql.org/message-id/CA+TgmoZiHwybETx8NZzPtoSjprg2Kcr-NaWGajkzcLcbVJ1pKQ@mail.gmail.com for the scenario and proposed fix.

2. We have some logic that causes autovacuum to run in spite of autovacuum=off when wraparound threatens. My commit 53bb309d2d5a9432d2602c93ed18e58bd2924e15 provided most of the anti-wraparound protections for multixact members that exist for multixact IDs and for regular XIDs, but this remains an outstanding issue. I believe I know how to fix this, and will work up an appropriate patch based on some of Thomas's earlier work.

3. It seems to me that there is a danger that some users could see extremely frequent anti-mxid-member-wraparound vacuums as a result of this work. Granted, that beats data corruption or errors, but it could still be pretty bad. The default value of autovacuum_multixact_freeze_max_age is 400000000. Anti-mxid-member-wraparound vacuums kick in when you exceed 25% of the addressable space, or 1073741824 total members. So, if your typical multixact has more than 1073741824/400000000 = ~2.68 members, you're going to see more autovacuum activity as a result of this change. We're effectively capping autovacuum_multixact_freeze_max_age at 1073741824/(average size of your multixacts). If your multixacts have just a couple of members (like 3 or 4), this is probably not such a big deal. If your multixacts typically run to 50 or so members, your effective freeze age is going to drop from 400m to ~21.4m. At that point, I think it's possible that relminmxid advancement might start to force full-table scans more often than would be required for relfrozenxid advancement. If so, that may be a problem for some users.

What can we do about this? Alvaro proposed back-porting his fix for bug #8470, which avoids locking a row if a parent subtransaction already has the same lock. Alvaro tells me (via chat) that on some workloads this can dramatically reduce multixact size, which is certainly appealing.
But the fix looks fairly invasive - it changes the return value of HeapTupleSatisfiesUpdate in certain cases, for example - and I'm not sure it's been thoroughly code-reviewed by anyone, so I'm a little nervous about the idea of back-porting it at this point. I am inclined to think it would be better to release the fixes we have - after handling items 1 and 2 - and then come back to this issue. Another thing to consider here is that if the high rate of multixact consumption is organic rather than induced by lots of subtransactions of the same parent locking the same tuple, this fix won't help.

Another thought that occurs to me is that if we had a freeze map, it would radically decrease the severity of this problem, because freezing would become vastly cheaper. I wonder if we ought to try to get that into 9.5, even if it means holding up 9.5. Quite aside from multixacts, repeated wraparound autovacuuming of static data is a progressively more serious problem as data set sizes and transaction volumes increase. The possibility that multixact freezing may in some scenarios exacerbate that problem is just icing on the cake. The fundamental problem is that a 32-bit address space just isn't that big on modern hardware, and the problem is worse for multixact members than it is for multixact IDs, because a given multixact consumes only one multixact ID, but as many slots in the multixact member space as it has members.

Thoughts, advice, etc. are most welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
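[Editor's note: the capping arithmetic in item 3 above can be put in code as follows. This is an illustrative, standalone sketch using the constants quoted in the thread, not server code; the function name is invented.]

```c
#include <stdint.h>

/* 2^32 addressable member slots; member-wraparound autovacuum kicks in
 * at 25% of that, i.e. 1073741824 members. */
#define MAX_MEMBERS           (UINT64_C(1) << 32)
#define MEMBER_SAFE_THRESHOLD (MAX_MEMBERS / 4)

/*
 * Effective autovacuum_multixact_freeze_max_age once member-space
 * pressure is taken into account: the configured value, capped at
 * threshold / (average members per multixact).
 */
static uint64_t
effective_freeze_max_age(uint64_t configured_age, uint64_t avg_members)
{
    uint64_t cap = MEMBER_SAFE_THRESHOLD / avg_members;

    return configured_age < cap ? configured_age : cap;
}
```

With the default of 400000000 and ~2.68 or fewer members per multixact, the cap never bites; at 50 members per multixact the effective age drops to 1073741824 / 50 = 21474836, the ~21.4m figure above.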
Hi,

On 2015-05-08 14:15:44 -0400, Robert Haas wrote:
> Apparently, we have been hanging our hat since the release of 9.3.0 on
> the theory that the average multixact won't ever have more than two
> members, and therefore the members SLRU won't overwrite itself and
> corrupt data.

It's essentially a much older problem - it has essentially existed since multixacts were introduced (8.1?). The consequences of it were much lower before 9.3 though.

> 3. It seems to me that there is a danger that some users could see
> extremely frequent anti-mxid-member-wraparound vacuums as a result of
> this work. Granted, that beats data corruption or errors, but it
> could still be pretty bad.

It's certainly possible to have workloads triggering that, but I think it's relatively uncommon. In most cases I've checked, the multixact consumption rate is much lower than the xid consumption rate. There are some exceptions, but often that's pretty bad code.

> At that
> point, I think it's possible that relminmxid advancement might start
> to force full-table scans more often than would be required for
> relfrozenxid advancement. If so, that may be a problem for some
> users.

I think it's the best we can do right now.

> What can we do about this? Alvaro proposed back-porting his fix for
> bug #8470, which avoids locking a row if a parent subtransaction
> already has the same lock. Alvaro tells me (via chat) that on some
> workloads this can dramatically reduce multixact size, which is
> certainly appealing. But the fix looks fairly invasive - it changes
> the return value of HeapTupleSatisfiesUpdate in certain cases, for
> example - and I'm not sure it's been thoroughly code-reviewed by
> anyone, so I'm a little nervous about the idea of back-porting it at
> this point. I am inclined to think it would be better to release the
> fixes we have - after handling items 1 and 2 - and then come back to
> this issue.
> Another thing to consider here is that if the high rate
> of multixact consumption is organic rather than induced by lots of
> subtransactions of the same parent locking the same tuple, this fix
> won't help.

I'm not inclined to backport it at this stage. Maybe if we get some field reports about too many anti-wraparound vacuums due to this, *and* the code has been tested in 9.5.

> Another thought that occurs to me is that if we had a freeze map, it
> would radically decrease the severity of this problem, because
> freezing would become vastly cheaper. I wonder if we ought to try to
> get that into 9.5, even if it means holding up 9.5

I think that's not realistic. Doing this right isn't easy. And doing it wrong can lead to quite bad results, i.e. data corruption. Doing it under the pressure of delaying a release further and further seems like a recipe for disaster.

> Quite aside from multixacts, repeated wraparound autovacuuming of
> static data is a progressively more serious problem as data set sizes
> and transaction volumes increase.

Yes. Agreed.

> The possibility that multixact freezing may in some
> scenarios exacerbate that problem is just icing on the cake. The
> fundamental problem is that a 32-bit address space just isn't that big
> on modern hardware, and the problem is worse for multixact members
> than it is for multixact IDs, because a given multixact consumes
> only one multixact ID, but as many slots in the multixact member
> space as it has members.

FWIW, I intend to either work on this myself, or help whoever seriously tackles this, in the next cycle.

Greetings,

Andres Freund
On Fri, May 8, 2015 at 2:27 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-05-08 14:15:44 -0400, Robert Haas wrote:
>> Apparently, we have been hanging our hat since the release of 9.3.0 on
>> the theory that the average multixact won't ever have more than two
>> members, and therefore the members SLRU won't overwrite itself and
>> corrupt data.
>
> It's essentially a much older problem - it has essentially existed since
> multixacts were introduced (8.1?). The consequences of it were much
> lower before 9.3 though.

OK, I wasn't aware of that. What exactly were the consequences before 9.3?

> I'm not inclined to backport it at this stage. Maybe if we get some
> field reports about too many anti-wraparound vacuums due to this, *and*
> the code has been tested in 9.5.

That was about what I was thinking, too.

>> Another thought that occurs to me is that if we had a freeze map, it
>> would radically decrease the severity of this problem, because
>> freezing would become vastly cheaper. I wonder if we ought to try to
>> get that into 9.5, even if it means holding up 9.5
>
> I think that's not realistic. Doing this right isn't easy. And doing it
> wrong can lead to quite bad results, i.e. data corruption. Doing it
> under the pressure of delaying a release further and further seems like
> a recipe for disaster.

Those are certainly good things to worry about.

> FWIW, I intend to either work on this myself, or help whoever seriously
> tackles this, in the next cycle.

That would be great. I'll investigate what resources EnterpriseDB can commit to this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-05-08 14:32:14 -0400, Robert Haas wrote:
> On Fri, May 8, 2015 at 2:27 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-05-08 14:15:44 -0400, Robert Haas wrote:
> >> Apparently, we have been hanging our hat since the release of 9.3.0 on
> >> the theory that the average multixact won't ever have more than two
> >> members, and therefore the members SLRU won't overwrite itself and
> >> corrupt data.
> >
> > It's essentially a much older problem - it has essentially existed since
> > multixacts were introduced (8.1?). The consequences of it were much
> > lower before 9.3 though.
>
> OK, I wasn't aware of that. What exactly were the consequences before 9.3?

I think just problems when locking a row. That's obviously much less bad than problems when reading a row.

> > FWIW, I intend to either work on this myself, or help whoever seriously
> > tackles this, in the next cycle.
>
> That would be great.

With "this" I mean freeze avoidance. While I obviously, having proposed it as well at some point, think that freeze maps are a possible solution, I'm not yet sure that it's the best solution.

> I'll investigate what resources EnterpriseDB can commit to this.

Cool.

Greetings,

Andres Freund
On 05/08/2015 11:27 AM, Andres Freund wrote:
> Hi,
>
> On 2015-05-08 14:15:44 -0400, Robert Haas wrote:
>> 3. It seems to me that there is a danger that some users could see
>> extremely frequent anti-mxid-member-wraparound vacuums as a result of
>> this work. Granted, that beats data corruption or errors, but it
>> could still be pretty bad.
>
> It's certainly possible to have workloads triggering that, but I think
> it's relatively uncommon. In most cases I've checked, the multixact
> consumption rate is much lower than the xid consumption rate. There are
> some exceptions, but often that's pretty bad code.

I have a couple workloads in my pool which do consume mxids faster than xids, due to (I think) exceptional numbers of FK conflicts. It's definitely unusual, though, and I'm sure they'd rather have corruption protection and endure some more vacuums.

If we do this, though, it might be worthwhile to backport the multixact age function, so that affected users can check and schedule mxact wraparound vacuums themselves, something you currently can't do on 9.3.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 2015-05-08 12:57:17 -0700, Josh Berkus wrote:
> I have a couple workloads in my pool which do consume mxids faster than
> xids, due to (I think) exceptional numbers of FK conflicts. It's
> definitely unusual, though, and I'm sure they'd rather have corruption
> protection and endure some more vacuums. If we do this, though, it
> might be worthwhile to backport the multixact age function, so that
> affected users can check and schedule mxact wraparound vacuums
> themselves, something you currently can't do on 9.3.

That's not particularly realistic due to the requirement of an initdb to change the catalog.

Greetings,

Andres Freund
Josh Berkus wrote:
> I have a couple workloads in my pool which do consume mxids faster than
> xids, due to (I think) exceptional numbers of FK conflicts. It's
> definitely unusual, though, and I'm sure they'd rather have corruption
> protection and endure some more vacuums. If we do this, though, it
> might be worthwhile to backport the multixact age function, so that
> affected users can check and schedule mxact wraparound vacuums
> themselves, something you currently can't do on 9.3.

Backporting that is difficult in core, but you can do it with an extension without too much trouble. Also, the multixact age function does not give you the "oldest member," which is what you need to properly monitor the whole of this; you can add that to an extension too.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas wrote:
> My colleague Thomas Munro and I have been working with Alvaro, and
> also with Kevin and Amit, to fix bug #12990, a multixact-related data
> corruption bug.

Thanks for this great summary of the situation.

> 1. I believe that there is still a narrow race condition that can cause
> the multixact code to go crazy and delete all of its data when
> operating very near the threshold for member space exhaustion. See
> http://www.postgresql.org/message-id/CA+TgmoZiHwybETx8NZzPtoSjprg2Kcr-NaWGajkzcLcbVJ1pKQ@mail.gmail.com
> for the scenario and proposed fix.

I agree that there is a problem here.

> 2. We have some logic that causes autovacuum to run in spite of
> autovacuum=off when wraparound threatens. My commit
> 53bb309d2d5a9432d2602c93ed18e58bd2924e15 provided most of the
> anti-wraparound protections for multixact members that exist for
> multixact IDs and for regular XIDs, but this remains an outstanding
> issue. I believe I know how to fix this, and will work up an
> appropriate patch based on some of Thomas's earlier work.

I believe autovacuum=off is fortunately uncommon, but certainly getting this issue fixed is a good idea.

> 3. It seems to me that there is a danger that some users could see
> extremely frequent anti-mxid-member-wraparound vacuums as a result of
> this work.

I agree with the idea that the long-term solution to this issue is to make the freeze process cheaper. I don't have any good ideas on how to make this less severe in the interim. You say the fix for #8470 is not tested thoroughly enough to back-patch it just yet, and I can get behind that; so let's wait until 9.5 has been tested a bit more.
Another avenue not mentioned and possibly worth exploring is making some more use of the multixact cache, and reusing multixacts that were previously issued and have the same effects as the one you're interested in: for instance, if you want a multixact with locking members (10,20,30) and you have one for (5,10,20,30) but transaction 5 has finished, then essentially both have the same semantics (because locks don't have any effect once the transaction has finished), so we can use it instead of creating a new one. I have no idea how to implement this; obviously, having to run TransactionIdIsCurrentTransactionId for each member of each multixact in the cache each time you want to create a new multixact is not very reasonable.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
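[Editor's note: a rough sketch of the equivalence test Alvaro describes. The helper names and the representation of "dead" transactions as an explicit list are hypothetical; in the server, the liveness check would involve the transaction-status machinery, which is exactly the cost that makes the idea expensive.]

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical helper: is x present in the given set? */
static bool
xid_in(TransactionId x, const TransactionId *set, int n)
{
    for (int i = 0; i < n; i++)
        if (set[i] == x)
            return true;
    return false;
}

/*
 * Does cached multixact 'have' (sorted, nhave members) cover exactly the
 * wanted set 'want' (sorted, nwant members), once members whose
 * transactions have finished ('dead') are ignored?
 */
static bool
cached_multixact_usable(const TransactionId *have, int nhave,
                        const TransactionId *want, int nwant,
                        const TransactionId *dead, int ndead)
{
    int j = 0;

    for (int i = 0; i < nhave; i++)
    {
        if (j < nwant && have[i] == want[j])
            j++;                        /* wanted member is present */
        else if (!xid_in(have[i], dead, ndead))
            return false;               /* live member we did not ask for */
        /* dead, unwanted member: its lock has no effect, ignore it */
    }
    return j == nwant;                  /* every wanted member was found */
}
```

For Alvaro's example, have = (5,10,20,30) and want = (10,20,30) with transaction 5 finished yields true; if 5 were still running, it would yield false.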
On Fri, May 8, 2015 at 5:39 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> 1. I believe that there is still a narrow race condition that can cause
>> the multixact code to go crazy and delete all of its data when
>> operating very near the threshold for member space exhaustion. See
>> http://www.postgresql.org/message-id/CA+TgmoZiHwybETx8NZzPtoSjprg2Kcr-NaWGajkzcLcbVJ1pKQ@mail.gmail.com
>> for the scenario and proposed fix.
>
> I agree that there is a problem here.

OK, I'm glad we now agree on that, since it seemed like you were initially unconvinced.

>> 2. We have some logic that causes autovacuum to run in spite of
>> autovacuum=off when wraparound threatens. My commit
>> 53bb309d2d5a9432d2602c93ed18e58bd2924e15 provided most of the
>> anti-wraparound protections for multixact members that exist for
>> multixact IDs and for regular XIDs, but this remains an outstanding
>> issue. I believe I know how to fix this, and will work up an
>> appropriate patch based on some of Thomas's earlier work.
>
> I believe autovacuum=off is fortunately uncommon, but certainly getting
> this issue fixed is a good idea.

Right.

>> 3. It seems to me that there is a danger that some users could see
>> extremely frequent anti-mxid-member-wraparound vacuums as a result of
>> this work.
>
> I agree with the idea that the long-term solution to this issue is to
> make the freeze process cheaper. I don't have any good ideas on how to
> make this less severe in the interim. You say the fix for #8470 is not
> tested thoroughly enough to back-patch it just yet, and I can get behind
> that; so let's wait until 9.5 has been tested a bit more.

Sounds good.
> Another avenue not mentioned and possibly worth exploring is making some
> more use of the multixact cache, and reuse multixacts that were
> previously issued and have the same effects as the one you're interested
> in: for instance, if you want a multixact with locking members
> (10,20,30) and you have one for (5,10,20,30) but transaction 5 has
> finished, then essentially both have the same semantics (because locks
> don't have any effect once the transaction has finished) so we can use it
> instead of creating a new one. I have no idea how to implement this;
> obviously, having to run TransactionIdIsCurrentTransactionId for each
> member on each multixact in the cache each time you want to create a new
> multixact is not very reasonable.

This sounds to me like it's probably too clever.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 05/08/2015 09:57 PM, Josh Berkus wrote:
> [snip]
>> It's certainly possible to have workloads triggering that, but I think
>> it's relatively uncommon. In most cases I've checked, the multixact
>> consumption rate is much lower than the xid consumption rate. There are
>> some exceptions, but often that's pretty bad code.
>
> I have a couple workloads in my pool which do consume mxids faster than
> xids, due to (I think) exceptional numbers of FK conflicts. It's
> definitely unusual, though, and I'm sure they'd rather have corruption
> protection and endure some more vacuums.

I've seen corruption happen recently with OpenBravo on PostgreSQL 9.3.6 (Debian; binaries upgraded from 9.3.2) in a cluster pg_upgraded from 9.2.4 (albeit with quite insufficient autovacuum / poorly configured Postgres).

I fear that this might be more widespread than we thought, depending on the exact workload/activity pattern. If it would help, I can try to get hold of a copy of the cluster in question (if the customer keeps any copy at all).

> If we do this, though, it
> might be worthwhile to backport the multixact age function, so that
> affected users can check and schedule mxact wraparound vacuums
> themselves, something you currently can't do on 9.3.

Thanks,

J.L.
On Fri, May 08, 2015 at 02:15:44PM -0400, Robert Haas wrote:
> My colleague Thomas Munro and I have been working with Alvaro, and
> also with Kevin and Amit, to fix bug #12990, a multixact-related data
> corruption bug.

Thanks Alvaro, Amit, Kevin, Robert and Thomas for mobilizing to get this fixed.

> 1. I believe that there is still a narrow race condition that can cause
> the multixact code to go crazy and delete all of its data when
> operating very near the threshold for member space exhaustion. See
> http://www.postgresql.org/message-id/CA+TgmoZiHwybETx8NZzPtoSjprg2Kcr-NaWGajkzcLcbVJ1pKQ@mail.gmail.com
> for the scenario and proposed fix.

For anyone else following along, Thomas's subsequent test verified this threat beyond reasonable doubt:
http://www.postgresql.org/message-id/CAEepm=3C32VPJLOo45y0c3-3KWXNV2xM4jaPTSVjCRD2VG0Qgg@mail.gmail.com

> 2. We have some logic that causes autovacuum to run in spite of
> autovacuum=off when wraparound threatens. My commit
> 53bb309d2d5a9432d2602c93ed18e58bd2924e15 provided most of the
> anti-wraparound protections for multixact members that exist for
> multixact IDs and for regular XIDs, but this remains an outstanding
> issue. I believe I know how to fix this, and will work up an
> appropriate patch based on some of Thomas's earlier work.

That would be good to have, and its implementation should be self-contained.

> 3. It seems to me that there is a danger that some users could see
> extremely frequent anti-mxid-member-wraparound vacuums as a result of
> this work. Granted, that beats data corruption or errors, but it
> could still be pretty bad. The default value of
> autovacuum_multixact_freeze_max_age is 400000000.
> Anti-mxid-member-wraparound vacuums kick in when you exceed 25% of the
> addressable space, or 1073741824 total members. So, if your typical
> multixact has more than 1073741824/400000000 = ~2.68 members, you're
> going to see more autovacuum activity as a result of this change.
> We're effectively capping autovacuum_multixact_freeze_max_age at
> 1073741824/(average size of your multixacts). If your multixacts have
> just a couple of members (like 3 or 4), this is probably not such a big
> deal. If your multixacts typically run to 50 or so members, your
> effective freeze age is going to drop from 400m to ~21.4m. At that
> point, I think it's possible that relminmxid advancement might start
> to force full-table scans more often than would be required for
> relfrozenxid advancement. If so, that may be a problem for some
> users.

I don't know whether this deserves prompt remediation, but if it does, I would look no further than the hard-coded 25% figure. We permit users to operate close to XID wraparound design limits. GUC maximums force an anti-wraparound vacuum at no later than 93.1% of design capacity. XID assignment warns at 99.5%, then stops at 99.95%. PostgreSQL mandates a larger cushion for pg_multixact/offsets, with anti-wraparound VACUUM by 46.6% and a stop at 50.0%. Commit 53bb309d2d5a9432d2602c93ed18e58bd2924e15 introduced the bulkiest mandatory cushion yet, an anti-wraparound vacuum when pg_multixact/members is just 25% full.

The pgsql-bugs thread driving that patch did reject making it GUC-controlled, essentially on the expectation that 25% should be adequate for everyone:

http://www.postgresql.org/message-id/CA+Tgmoap6-o_5ESu5X2mBRVht_F+KNoY+oO12OvV_WekSA=ezQ@mail.gmail.com
http://www.postgresql.org/message-id/20150506143418.GT2523@alvh.no-ip.org
http://www.postgresql.org/message-id/1570859840.1241196.1430928954257.JavaMail.yahoo@mail.yahoo.com

> What can we do about this? Alvaro proposed back-porting his fix for
> bug #8470, which avoids locking a row if a parent subtransaction
> already has the same lock.

Like Andres and yourself, I would not back-patch it.
> Another thought that occurs to me is that if we had a freeze map, it
> would radically decrease the severity of this problem, because
> freezing would become vastly cheaper. I wonder if we ought to try to
> get that into 9.5, even if it means holding up 9.5.

Declaring that a release will wait for a particular feature has consistently ended badly for PostgreSQL, and this feature is just in the planning stages. If folks are ready to hit the ground running on the project, I suggest they do so; a non-WIP submission to the first 9.6 CF would be a big accomplishment. The time to contemplate slipping it into 9.5 comes after the patch is done.

If these aggressive ideas earn more than passing consideration, the 25% threshold should become user-controllable after all.
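[Editor's note: the percentages Noah cites can be reproduced from the GUC maximum of 2000000000 and the address-space sizes discussed upthread. The constants below are an illustrative reading of those figures, not server code.]

```c
#include <stdint.h>

/* XIDs are compared modulo 2^31, so that is their usable design span;
 * the multixact ID and member spaces are addressed with 32 bits. */
#define XID_SPAN    (UINT64_C(1) << 31)
#define MXACT_SPAN  (UINT64_C(1) << 32)

/* Percentage of a span consumed. */
static double
pct(uint64_t used, uint64_t span)
{
    return 100.0 * (double) used / (double) span;
}
```

With these, pct(2000000000, XID_SPAN) ≈ 93.1 (the autovacuum_freeze_max_age maximum), pct(XID_SPAN - 10000000, XID_SPAN) ≈ 99.5 and pct(XID_SPAN - 1000000, XID_SPAN) ≈ 99.95 (the warn and stop cushions), pct(2000000000, MXACT_SPAN) ≈ 46.6 for pg_multixact/offsets, and pct(MXACT_SPAN / 4, MXACT_SPAN) = 25.0 for members.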
On 05/10/2015 10:30 AM, Robert Haas wrote:
>>> 2. We have some logic that causes autovacuum to run in spite of
>>> autovacuum=off when wraparound threatens. My commit
>>> 53bb309d2d5a9432d2602c93ed18e58bd2924e15 provided most of the
>>> anti-wraparound protections for multixact members that exist for
>>> multixact IDs and for regular XIDs, but this remains an outstanding
>>> issue. I believe I know how to fix this, and will work up an
>>> appropriate patch based on some of Thomas's earlier work.
>>
>> I believe autovacuum=off is fortunately uncommon, but certainly getting
>> this issue fixed is a good idea.
>
> Right.

I suspect it's quite a bit more common than many people imagine.

cheers

andrew
On 5/8/15 1:15 PM, Robert Haas wrote:
> I somehow did not realize until very recently that we
> actually use two SLRUs to keep track of multixacts: one for the
> multixacts themselves (pg_multixacts/offsets) and one for the members
> (pg_multixacts/members). Confusingly, members are sometimes called
> offsets, and offsets are sometimes called IDs, or multixacts.

FWIW, since I had to re-read this bit...

 * We use two SLRU areas, one for storing the offsets at which the data
 * starts for each MultiXactId in the other one.  This trick allows us to
 * store variable length arrays of TransactionIds.

Another way this could be 'fixed' would be to bump MultiXactOffset (but NOT MultiXactId) to uint64. That would increase the number of total members we could keep by a factor of 2^32. At that point wraparound wouldn't even be possible, because you can't have more than 2^31 members in an MXID (and there can only be 2^31 MXIDs). It may not be a trivial change though, because SLRUs are currently capped at 2^32 pages.

This probably isn't a good long-term solution, but it would eliminate the risk of really frequent freeze vacuums. It sounds like Josh at least knows some people that could cause big problems for.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
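[Editor's note: a back-of-the-envelope check of Jim's claim, using hypothetical names; this is arithmetic, not a patch.]

```c
#include <stdint.h>

/* At most 2^31 MultiXactIds can exist at once, each with at most 2^31
 * members, so the worst-case total member count is 2^62. */
#define MAX_MXIDS            (UINT64_C(1) << 31)
#define MAX_MEMBERS_PER_MXID (UINT64_C(1) << 31)

/* floor(log2(n)): how many bits an offset needs to index n slots. */
static unsigned
floor_log2(uint64_t n)
{
    unsigned b = 0;

    while (n > 1)
    {
        n >>= 1;
        b++;
    }
    return b;
}
```

floor_log2(MAX_MXIDS * MAX_MEMBERS_PER_MXID) is 62, comfortably within 64 bits, which is why a uint64 MultiXactOffset could never wrap; the 2^32-page SLRU cap Jim mentions is a separate obstacle.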
On Sun, May 10, 2015 at 1:40 PM, Noah Misch <noah@leadboat.com> wrote:
> I don't know whether this deserves prompt remediation, but if it does, I would
> look no further than the hard-coded 25% figure. We permit users to operate
> close to XID wraparound design limits. GUC maximums force an anti-wraparound
> vacuum at no later than 93.1% of design capacity. XID assignment warns at
> 99.5%, then stops at 99.95%. PostgreSQL mandates a larger cushion for
> pg_multixact/offsets, with anti-wraparound VACUUM by 46.6% and a stop at
> 50.0%. Commit 53bb309d2d5a9432d2602c93ed18e58bd2924e15 introduced the
> bulkiest mandatory cushion yet, an anti-wraparound vacuum when
> pg_multixact/members is just 25% full.

That's certainly one possible approach. I had discounted it because you can't really get more than a small multiple out of it, but getting 2-3x more room might indeed be enough to help some people quite a bit. Just raising the threshold from 25% to say 40% would buy back a healthy amount. Or, as you suggest, we could just add a GUC.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, May 10, 2015 at 09:17:58PM -0400, Robert Haas wrote:
> On Sun, May 10, 2015 at 1:40 PM, Noah Misch <noah@leadboat.com> wrote:
> > I don't know whether this deserves prompt remediation, but if it does, I would
> > look no further than the hard-coded 25% figure. We permit users to operate
> > close to XID wraparound design limits. GUC maximums force an anti-wraparound
> > vacuum at no later than 93.1% of design capacity. XID assignment warns at
> > 99.5%, then stops at 99.95%. PostgreSQL mandates a larger cushion for
> > pg_multixact/offsets, with anti-wraparound VACUUM by 46.6% and a stop at
> > 50.0%. Commit 53bb309d2d5a9432d2602c93ed18e58bd2924e15 introduced the
> > bulkiest mandatory cushion yet, an anti-wraparound vacuum when
> > pg_multixact/members is just 25% full.
>
> That's certainly one possible approach. I had discounted it because
> you can't really get more than a small multiple out of it, but getting
> 2-3x more room might indeed be enough to help some people quite a bit.
> Just raising the threshold from 25% to say 40% would buy back a
> healthy amount.

Right. It's fair to assume that the new VACUUM burden would be discountable at a 90+% threshold, because the installations that could possibly find it expensive are precisely those experiencing corruption today. These reports took eighteen months to appear, whereas some corruption originating in commit 0ac5ad5 saw reports within three months. Therefore, sites burning pg_multixact/members proportionally faster than both pg_multixact/offsets and XIDs must be unusual. Bottom line: if we do need to reduce VACUUM burden caused by the commits you cited upthread, we almost certainly don't need more than a 4x improvement.
On Mon, May 11, 2015 at 12:56 AM, Noah Misch <noah@leadboat.com> wrote:
> On Sun, May 10, 2015 at 09:17:58PM -0400, Robert Haas wrote:
>> On Sun, May 10, 2015 at 1:40 PM, Noah Misch <noah@leadboat.com> wrote:
>> > I don't know whether this deserves prompt remediation, but if it does, I would
>> > look no further than the hard-coded 25% figure. We permit users to operate
>> > close to XID wraparound design limits. GUC maximums force an anti-wraparound
>> > vacuum at no later than 93.1% of design capacity. XID assignment warns at
>> > 99.5%, then stops at 99.95%. PostgreSQL mandates a larger cushion for
>> > pg_multixact/offsets, with anti-wraparound VACUUM by 46.6% and a stop at
>> > 50.0%. Commit 53bb309d2d5a9432d2602c93ed18e58bd2924e15 introduced the
>> > bulkiest mandatory cushion yet, an anti-wraparound vacuum when
>> > pg_multixact/members is just 25% full.
>>
>> That's certainly one possible approach. I had discounted it because
>> you can't really get more than a small multiple out of it, but getting
>> 2-3x more room might indeed be enough to help some people quite a bit.
>> Just raising the threshold from 25% to say 40% would buy back a
>> healthy amount.
>
> Right. It's fair to assume that the new VACUUM burden would be discountable
> at a 90+% threshold, because the installations that could possibly find it
> expensive are precisely those experiencing corruption today. These reports
> took eighteen months to appear, whereas some corruption originating in commit
> 0ac5ad5 saw reports within three months. Therefore, sites burning
> pg_multixact/members proportionally faster than both pg_multixact/offsets and
> XIDs must be unusual. Bottom line: if we do need to reduce VACUUM burden
> caused by the commits you cited upthread, we almost certainly don't need more
> than a 4x improvement.

I looked into the approach of adding a GUC called autovacuum_multixact_freeze_max_members to set the threshold.
I thought to declare it this way:

 	{
+		{"autovacuum_multixact_freeze_max_members", PGC_POSTMASTER, AUTOVACUUM,
+			gettext_noop("# of multixact members at which autovacuum is forced to prevent multixact member wraparound."),
+			NULL
+		},
+		&autovacuum_multixact_freeze_max_members,
+		2000000000, 10000000, 4000000000,
+		NULL, NULL, NULL
+	},

Regrettably, I think that's not going to work, because 4000000000
overflows int. We will evidently need to denote this GUC in some other
units, unless we want to introduce config_int64.

Given your concerns, and the need to get a fix for this out the door
quickly, what I'm inclined to do for the present is go bump the
threshold from 25% of MaxMultiXact to 50% of MaxMultiXact without
changing anything else. Your analysis shows that this is more in line
with the existing policy for multixact IDs than what I did, and it
will reduce the threat of frequent wraparound scans. Now, it will also
increase the chances of somebody hitting the wall before autovacuum
can bail them out. But maybe not that much. If we need 75% of the
multixact member space to complete one cycle of anti-wraparound
vacuums, we're actually very close to the point where the system just
cannot work. If that's one big table, we're done.

Also, if somebody does have a workload where the auto-clamping doesn't
provide them with enough headroom, they can still improve things by
reducing autovacuum_multixact_freeze_max_age to a value less than the
value to which we're auto-clamping it. If they need an effective value
of less than 10 million they are out of luck, but if that is the case
then there is a good chance that they are hosed anyway - an
anti-wraparound vacuum every 10 million multixacts sounds awfully
painful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
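[Editor's note: the overflow problem and the "other units" workaround can be sketched numerically. This is illustrative only; the unit scaling shown is a hypothetical example, not committed code.]

```python
# Why a GUC maximum of 4000000000 cannot live in an int-valued GUC,
# and how denominating the GUC in coarser units would keep it in range.

INT32_MAX = 2**31 - 1                 # 2147483647: ceiling for a C int GUC

desired_max = 4_000_000_000
print(desired_max > INT32_MAX)        # True: the proposed maximum overflows int

# Hypothetical workaround: store the GUC in millions of members.
max_in_millions = desired_max // 1_000_000          # 4000
default_in_millions = 2_000_000_000 // 1_000_000    # 2000
min_in_millions = 10_000_000 // 1_000_000           # 10
print(all(v <= INT32_MAX
          for v in (max_in_millions, default_in_millions, min_in_millions)))
# True: every bound now fits comfortably in an int
```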
On Mon, May 11, 2015 at 08:29:05AM -0400, Robert Haas wrote:
> Given your concerns, and the need to get a fix for this out the door
> quickly, what I'm inclined to do for the present is go bump the
> threshold from 25% of MaxMultiXact to 50% of MaxMultiXact without
> changing anything else.

+1

> Your analysis shows that this is more in line
> with the existing policy for multixact IDs than what I did, and it
> will reduce the threat of frequent wraparound scans. Now, it will
> also increase the chances of somebody hitting the wall before
> autovacuum can bail them out. But maybe not that much. If we need
> 75% of the multixact member space to complete one cycle of
> anti-wraparound vacuums, we're actually very close to the point where
> the system just cannot work. If that's one big table, we're done.

Agreed.
On Mon, May 11, 2015 at 10:11 AM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, May 11, 2015 at 08:29:05AM -0400, Robert Haas wrote:
>> Given your concerns, and the need to get a fix for this out the door
>> quickly, what I'm inclined to do for the present is go bump the
>> threshold from 25% of MaxMultiXact to 50% of MaxMultiXact without
>> changing anything else.
>
> +1
>
>> Your analysis shows that this is more in line
>> with the existing policy for multixact IDs than what I did, and it
>> will reduce the threat of frequent wraparound scans. Now, it will
>> also increase the chances of somebody hitting the wall before
>> autovacuum can bail them out. But maybe not that much. If we need
>> 75% of the multixact member space to complete one cycle of
>> anti-wraparound vacuums, we're actually very close to the point where
>> the system just cannot work. If that's one big table, we're done.
>
> Agreed.

OK, I have made this change. Barring further trouble reports, this
completes the multixact work I plan to do for the next release. Here
is what is outstanding:

1. We might want to introduce a GUC to control the point at which
member offset utilization begins clamping
autovacuum_multixact_freeze_max_age. It doesn't seem wise to do
anything about this before pushing a minor release out. It's not
entirely trivial, and it may be helpful to learn more about how the
changes already made work out in practice before proceeding. Also, we
might not back-patch this anyway.

2. The recent changes adjust things - for good reason - so that the
safe threshold for multixact member creation is advanced only at
checkpoint time. This means it's theoretically possible to have a
situation where autovacuum has done all it can, but because no
checkpoint has happened yet, the user can't create any more
multixacts.
Thanks to some good work by Thomas, autovacuum will realize this and
avoid spinning uselessly over every table in the system, which is
good, but you're still stuck with errors until the next checkpoint.
Essentially, we're hoping that autovacuum will clean things up far
enough in advance of hitting the threshold where we have to throw an
error that a checkpoint will intervene before the error starts
happening. It's possible we could improve this further, but I think it
would be unwise to mess with it right now. It may be that there is no
real-world problem here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
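[Editor's note: the clamping discussed in item 1, and earlier in the thread, can be modeled with a few lines of arithmetic. This is a back-of-the-envelope sketch of the behavior described in the thread, not code from the patches.]

```python
# Model of how member-space usage effectively clamps
# autovacuum_multixact_freeze_max_age: anti-wraparound vacuums must kick in
# once (member budget / average multixact size) falls below the configured
# freeze age.

def effective_freeze_age(configured, avg_members_per_mxid, member_budget):
    return min(configured, member_budget // avg_members_per_mxid)

CONFIGURED = 400_000_000        # default autovacuum_multixact_freeze_max_age
BUDGET_25 = 2**32 // 4          # original 25% member-space threshold
BUDGET_50 = 2**32 // 2          # threshold after the change made in this thread

print(effective_freeze_age(CONFIGURED, 2, BUDGET_25))   # 400000000: no clamp
print(effective_freeze_age(CONFIGURED, 50, BUDGET_25))  # 21474836: ~21.4M, heavy clamp
print(effective_freeze_age(CONFIGURED, 50, BUDGET_50))  # 42949672: 2x the headroom
```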
On 05/11/2015 09:54 AM, Robert Haas wrote:
> OK, I have made this change. Barring further trouble reports, this
> completes the multixact work I plan to do for the next release. Here
> is what is outstanding:
>
> 1. We might want to introduce a GUC to control the point at which
> member offset utilization begins clamping
> autovacuum_multixact_freeze_max_age. It doesn't seem wise to do
> anything about this before pushing a minor release out. It's not
> entirely trivial, and it may be helpful to learn more about how the
> changes already made work out in practice before proceeding. Also, we
> might not back-patch this anyway.

-1 on back-patching a new GUC. People don't know what to do with the
existing multixact GUCs, and without an age(multixact) function
built-in, any adjustments a user tries to make are likely to do more
harm than good.

In terms of adding a new GUC in 9.5: can't we take a stab at
auto-tuning this instead of adding a new GUC? We already have a bunch
of freezing GUCs which fewer than 1% of our user base has any idea how
to set.

> 2. The recent changes adjust things - for good reason - so that the
> safe threshold for multixact member creation is advanced only at
> checkpoint time. This means it's theoretically possible to have a
> situation where autovacuum has done all it can, but because no
> checkpoint has happened yet, the user can't create any more
> multixacts. Thanks to some good work by Thomas, autovacuum will
> realize this and avoid spinning uselessly over every table in the
> system, which is good, but you're still stuck with errors until the
> next checkpoint. Essentially, we're hoping that autovacuum will clean
> things up far enough in advance of hitting the threshold where we have
> to throw an error that a checkpoint will intervene before the error
> starts happening. It's possible we could improve this further, but I
> think it would be unwise to mess with it right now. It may be that
> there is no real-world problem here.
Given that our longest possible checkpoint timeout is an hour, is it
even hypothetically possible that we would hit a limit in that time?
How many mxact members are we talking about?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
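[Editor's note: a rough answer to Josh's question, as arithmetic rather than measurement. This assumes the worst case of an entire half of the member space standing between the 50% vacuum threshold and the hard stop, and a one-hour checkpoint interval; real workloads would have far less headroom left when the question matters.]

```python
# How fast would a workload have to create multixact members to burn
# through half the 2**32 member space within one hour?

MEMBER_SPAN = 2**32
headroom = MEMBER_SPAN // 2     # members between 50% threshold and wraparound
seconds = 3600                  # longest checkpoint_timeout

rate = headroom // seconds      # members per second required
print(rate)                     # 596523
```

So a workload would need to generate roughly 600k multixact members per second, sustained for the full hour, to exhaust that much headroom between checkpoints; the practical risk is when far less headroom remains at the time autovacuum finishes.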
Josh Berkus wrote:
> In terms of adding a new GUC in 9.5: can't we take a stab at auto-tuning
> this instead of adding a new GUC? We already have a bunch of freezing
> GUCs which fewer than 1% of our user base has any idea how to set.

If you have development resources to pour onto 9.5, I think it would
be better spent changing multixact usage tracking so that oldestOffset
is included in pg_control; also make pg_multixact truncation be
WAL-logged. With those changes, the need for a lot of pretty
complicated code would go away. The fact that truncation is done by
both vacuum and checkpoint causes a lot of the mess we were in (and
from which Robert and Thomas took us --- thanks guys!). Such a change
is the first step towards auto-tuning, I think.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas wrote:
> OK, I have made this change. Barring further trouble reports, this
> completes the multixact work I plan to do for the next release.

Many thanks for all the effort here -- much appreciated.

> 2. The recent changes adjust things - for good reason - so that the
> safe threshold for multixact member creation is advanced only at
> checkpoint time. This means it's theoretically possible to have a
> situation where autovacuum has done all it can, but because no
> checkpoint has happened yet, the user can't create any more
> multixacts. Thanks to some good work by Thomas, autovacuum will
> realize this and avoid spinning uselessly over every table in the
> system, which is good, but you're still stuck with errors until the
> next checkpoint. Essentially, we're hoping that autovacuum will clean
> things up far enough in advance of hitting the threshold where we have
> to throw an error that a checkpoint will intervene before the error
> starts happening. It's possible we could improve this further, but I
> think it would be unwise to mess with it right now. It may be that
> there is no real-world problem here.

See my response to Josh. I think much of the current rube-goldbergian
design is due to the fact that pg_control cannot be changed in back
branches. Going forward, I think a better plan is to include more info
in pg_control, WAL-log more operations, remove checkpoint from the
loop and have everything happen at vacuum time.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 05/11/2015 10:24 AM, Josh Berkus wrote:
> In terms of adding a new GUC in 9.5: can't we take a stab at auto-tuning
> this instead of adding a new GUC? We already have a bunch of freezing
> GUCs which fewer than 1% of our user base has any idea how to set.

That is a documentation problem, not a user problem. Although I agree
that yet another GUC for an obscure "feature" that should be
internally intelligent is likely the wrong direction.

JD

-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.