Thread: Incomplete freezing when truncating a relation during vacuum
Hi,

A customer of ours reported that some columns were missing from pg_attribute. Investigation showed that they were visible to sequential but not index scans. Looking closer, the page with the missing attributes is marked as all-visible, but the xids on the individual rows were more than 2^31 in the past - so, effectively in the future and invisible.

pg_class.relfrozenxid and pg_database.datfrozenxid looked perfectly normal, not indicating any wraparound, and were well past the xid in the particular rows. So evidently something around freezing has gone wrong: we haven't frozen each row, but updated relfrozenxid regardless.

A longer period of staring revealed a likely reason, in lazy_vacuum_rel():

    /* Do the vacuuming */
    lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);
    ...
    if (whatever)
        lazy_truncate_heap(onerel, vacrelstats);
    ...
    new_frozen_xid = FreezeLimit;
    if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
        new_frozen_xid = InvalidTransactionId;

but lazy_truncate_heap() does, after it's finished truncating:

    vacrelstats->rel_pages = new_rel_pages;

Which means we might consider a partial vacuum as a vacuum that has frozen all old rows, if just enough pages have been truncated away.

This seems to be the case since b4b6923e03f4d29636a94f6f4cc2f5cf6298b8c8. I suggest we go back to using scan_all to determine whether we can set new_frozen_xid. That's a slight pessimization when we scan a relation fully without explicitly scanning it in its entirety, but given this isn't the first bug around scanned_pages/rel_pages I'd rather go that way. The aforementioned commit wasn't primarily concerned with that. Alternatively we could just compute new_frozen_xid et al. before the lazy_truncate_heap() call.

This is somewhat nasty :(. I am not sure how we could fix up the resulting corruption. In some cases we can check for the page-level all-visible bit and fix up the individual xids.
But it's not guaranteed, although likely, that the page-level all-visible bit has been set...

Comments?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 26, 2013 at 7:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> This is somewhat nasty :(.

Yeah, that's bad. Real bad.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-11-26 13:32:44 +0100, Andres Freund wrote:
> A longer period of staring revealed a likely reason, in lazy_vacuum_rel():
>     /* Do the vacuuming */
>     lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);
>     ...
>     if (whatever)
>         lazy_truncate_heap(onerel, vacrelstats);
>     ...
>     new_frozen_xid = FreezeLimit;
>     if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
>         new_frozen_xid = InvalidTransactionId;
> but lazy_truncate_heap() does, after it's finished truncating:
>     vacrelstats->rel_pages = new_rel_pages;
>
> Which means, we might consider a partial vacuum as a vacuum that has
> frozen all old rows if just enough pages have been truncated away.

repro.sql is a reproducer for the problem.

> This seems to be the case since
> b4b6923e03f4d29636a94f6f4cc2f5cf6298b8c8. I suggest we go back to using
> scan_all to determine whether we can set new_frozen_xid. That's a slight
> pessimization when we scan a relation fully without explicitly scanning
> it in its entirety, but given this isn't the first bug around
> scanned_pages/rel_pages I'd rather go that way. The aforementioned
> commit wasn't primarily concerned with that.
> Alternatively we could just compute new_frozen_xid et al before the
> lazy_truncate_heap.

I've gone for the latter in this preliminary patch. Not increasing relfrozenxid after an initial data load seems like a bit of a shame.

I wonder if we should just do scan_all || vacrelstats->scanned_pages < vacrelstats->rel_pages?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On 11/27/13 01:21, Andres Freund wrote:
> On 2013-11-26 13:32:44 +0100, Andres Freund wrote:
>> This seems to be the case since
>> b4b6923e03f4d29636a94f6f4cc2f5cf6298b8c8. I suggest we go back to using
>> scan_all to determine whether we can set new_frozen_xid. That's a slight
>> pessimization when we scan a relation fully without explicitly scanning
>> it in its entirety, but given this isn't the first bug around
>> scanned_pages/rel_pages I'd rather go that way. The aforementioned
>> commit wasn't primarily concerned with that.
>> Alternatively we could just compute new_frozen_xid et al before the
>> lazy_truncate_heap.
>
> I've gone for the latter in this preliminary patch. Not increasing
> relfrozenxid after an initial data load seems like a bit of a shame.
>
> I wonder if we should just do scan_all || vacrelstats->scanned_pages <
> vacrelstats->rel_pages?

Hmm, you did (scan_all || vacrelstats->scanned_pages < vacrelstats->rel_pages) for relminmxid, and just (vacrelstats->scanned_pages < vacrelstats->rel_pages) for relfrozenxid. That was probably not what you meant to do; the thing you did for relfrozenxid looks good to me.

Does the attached look correct to you?

- Heikki
Attachments
On 2013-11-27 11:01:55 +0200, Heikki Linnakangas wrote:
> On 11/27/13 01:21, Andres Freund wrote:
> > On 2013-11-26 13:32:44 +0100, Andres Freund wrote:
> > > This seems to be the case since
> > > b4b6923e03f4d29636a94f6f4cc2f5cf6298b8c8. I suggest we go back to using
> > > scan_all to determine whether we can set new_frozen_xid. That's a slight
> > > pessimization when we scan a relation fully without explicitly scanning
> > > it in its entirety, but given this isn't the first bug around
> > > scanned_pages/rel_pages I'd rather go that way. The aforementioned
> > > commit wasn't primarily concerned with that.
> > > Alternatively we could just compute new_frozen_xid et al before the
> > > lazy_truncate_heap.
> >
> > I've gone for the latter in this preliminary patch. Not increasing
> > relfrozenxid after an initial data load seems like a bit of a shame.
> >
> > I wonder if we should just do scan_all || vacrelstats->scanned_pages <
> > vacrelstats->rel_pages?
>
> Hmm, you did (scan_all || vacrelstats->scanned_pages <
> vacrelstats->rel_pages) for relminmxid, and just (vacrelstats->scanned_pages
> < vacrelstats->rel_pages) for relfrozenxid. That was probably not what you
> meant to do, the thing you did for relfrozenxid looks good to me.

I said it's a preliminary patch ;) - really, I wasn't sure which of the two to go for.

> Does the attached look correct to you?

Looks good.

I wonder if we need to integrate any mitigating logic? Currently the corruption may only become apparent long after it occurred; that's pretty bad. And instructing people to run a vacuum after the upgrade will cause the corrupted data to be lost if the rows are already 2^31 xids in the past.

But integrating logic to fix things into heap_page_prune() looks somewhat ugly as well. Afaics the likelihood of the issue occurring on non-all-visible pages is pretty low, since they'd need to be skipped due to lock contention repeatedly.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11/27/13 11:15, Andres Freund wrote:
> On 2013-11-27 11:01:55 +0200, Heikki Linnakangas wrote:
>> On 11/27/13 01:21, Andres Freund wrote:
>>> On 2013-11-26 13:32:44 +0100, Andres Freund wrote:
>>>> This seems to be the case since
>>>> b4b6923e03f4d29636a94f6f4cc2f5cf6298b8c8. I suggest we go back to using
>>>> scan_all to determine whether we can set new_frozen_xid. That's a slight
>>>> pessimization when we scan a relation fully without explicitly scanning
>>>> it in its entirety, but given this isn't the first bug around
>>>> scanned_pages/rel_pages I'd rather go that way. The aforementioned
>>>> commit wasn't primarily concerned with that.
>>>> Alternatively we could just compute new_frozen_xid et al before the
>>>> lazy_truncate_heap.
>>>
>>> I've gone for the latter in this preliminary patch. Not increasing
>>> relfrozenxid after an initial data load seems like a bit of a shame.
>>>
>>> I wonder if we should just do scan_all || vacrelstats->scanned_pages <
>>> vacrelstats->rel_pages?
>>
>> Hmm, you did (scan_all || vacrelstats->scanned_pages <
>> vacrelstats->rel_pages) for relminmxid, and just (vacrelstats->scanned_pages
>> < vacrelstats->rel_pages) for relfrozenxid. That was probably not what you
>> meant to do, the thing you did for relfrozenxid looks good to me.
>
> I said it's a preliminary patch ;) - really, I wasn't sure which of the two
> to go for.
>
>> Does the attached look correct to you?
>
> Looks good.

Ok, committed and backpatched that.

> I wonder if we need to integrate any mitigating logic? Currently the
> corruption may only become apparent long after it occurred, that's
> pretty bad. And instructing people to run a vacuum after the upgrade will
> cause the corrupted data to be lost if the rows are already 2^31 xids in
> the past.

Ugh :-(. Running vacuum after the upgrade is the right thing to do to prevent further damage, but you're right that it will cause any already-wrapped-around data to be lost forever. Nasty.

> But integrating logic to fix things into heap_page_prune() looks
> somewhat ugly as well.

I think any mitigating logic we might add should go into vacuum. It should be possible for a DBA to run a command and, after it's finished, be confident that the data is safe. That means vacuum.

> Afaics the likelihood of the issue occurring on non-all-visible pages is
> pretty low, since they'd need to be skipped due to lock contention
> repeatedly.

Hmm. If a page has its visibility-map flag set, but contains a tuple that appears to be dead because you've wrapped around, vacuum will give a warning: "page containing dead tuples is marked as all-visible in relation \"%s\" page %u". So I think if a manual VACUUM FREEZE passes without giving that warning, that vacuum hasn't destroyed any data.

So we could advise users to take a physical backup of the data directory, and run a manual VACUUM FREEZE on all databases. If it doesn't give a warning, you're safe from that point onwards. If it does, you'll want to recover from an older backup, or try to manually salvage just the lost rows from the backup, and re-index. Ugly, but hopefully rare in practice.

Unfortunately that doesn't mean that you haven't already lost some data. Wrap-around could've already happened, and vacuum might already have run and removed some rows. You'll want to examine your logs and grep for that warning.

- Heikki
On 2013-11-27 13:56:58 +0200, Heikki Linnakangas wrote:
> Ok, committed and backpatched that.

Thanks.

> > I wonder if we need to integrate any mitigating logic? Currently the
> > corruption may only become apparent long after it occurred, that's
> > pretty bad. And instructing people to run a vacuum after the upgrade
> > will cause the corrupted data to be lost if the rows are already 2^31
> > xids in the past.
>
> Ugh :-(. Running vacuum after the upgrade is the right thing to do to
> prevent further damage, but you're right that it will cause any
> already-wrapped-around data to be lost forever. Nasty.

> > But integrating logic to fix things into heap_page_prune() looks
> > somewhat ugly as well.
>
> I think any mitigating logic we might add should go into vacuum. It should
> be possible for a DBA to run a command, and after it's finished, be
> confident that you're safe. That means vacuum.

Well, heap_page_prune() is the first thing that's executed by lazy_scan_heap(); that's why I was talking about it. So anything we do needs to happen in there or before.

> > Afaics the likelihood of the issue occurring on non-all-visible pages is
> > pretty low, since they'd need to be skipped due to lock contention
> > repeatedly.

> Hmm. If a page has its visibility-map flag set, but contains a tuple that
> appears to be dead because you've wrapped around, vacuum will give a
> warning: "page containing dead tuples is marked as all-visible in relation
> \"%s\" page %u".

I don't think this warning is likely to be hit as the code stands - heap_page_prune() et al. will have removed all dead tuples already, right? And so has_dead_tuples won't be set.

Independent from this, ISTM we should add an

    else if (PageIsAllVisible(page) && all_visible)

to those checks.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11/27/13 14:11, Andres Freund wrote:
> On 2013-11-27 13:56:58 +0200, Heikki Linnakangas wrote:
>>> Afaics the likelihood of the issue occurring on non-all-visible pages is
>>> pretty low, since they'd need to be skipped due to lock contention
>>> repeatedly.
>
>> Hmm. If a page has its visibility-map flag set, but contains a tuple that
>> appears to be dead because you've wrapped around, vacuum will give a
>> warning: "page containing dead tuples is marked as all-visible in relation
>> \"%s\" page %u".
>
> I don't think this warning is likely to be hit as the code stands -
> heap_page_prune() et al. will have removed all dead tuples already,
> right? And so has_dead_tuples won't be set.

It might be a good idea to add such a warning to heap_page_prune(). Or also emit the warning in lazy_scan_heap() if heap_page_prune() returned > 0.

> Independent from this, ISTM we should add an
>     else if (PageIsAllVisible(page) && all_visible)
> to those checks.

Can you elaborate, where should that be added?

- Heikki
On 2013-11-27 14:45:25 +0200, Heikki Linnakangas wrote:
> On 11/27/13 14:11, Andres Freund wrote:
> > I don't think this warning is likely to be hit as the code stands -
> > heap_page_prune() et al. will have removed all dead tuples already,
> > right? And so has_dead_tuples won't be set.
>
> It might be a good idea to add such a warning to heap_page_prune(). Or also
> emit the warning in lazy_scan_heap() if heap_page_prune() returned > 0.

> > Independent from this, ISTM we should add an
> >     else if (PageIsAllVisible(page) && all_visible)
> > to those checks.
>
> Can you elaborate, where should that be added?

I was thinking of adding such a warning below the existing

    elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u", ...)

but we cannot warn against that because GetOldestXmin() can go backwards...

I think it's probably sufficient to set has_dead_tuples = true in the ItemIdIsDead() branch in lazy_scan_heap(). That should catch relevant actions from heap_page_prune().

Besides not warning against deletions from heap_page_prune(), the current warning logic is also buggy for tables without indexes...

    /*
     * If there are no indexes then we can vacuum the page right now
     * instead of doing a second scan.
     */
    if (nindexes == 0 && vacrelstats->num_dead_tuples > 0)
    {
        /* Remove tuples from heap */
        lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
        has_dead_tuples = false;

That happens before the else if (PageIsAllVisible(page) && has_dead_tuples) check.

With regard to fixing things up, ISTM the best bet is heap_prune_chain() so far. That's executed by vacuum and by opportunistic pruning, and we know we have the appropriate locks there. Looks relatively easy to fix up things there. Not sure if there are any possible routes to WAL-log this other than using log_newpage()?

I am really not sure what the best course of action is :(

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
How would you characterize the chances of this happening with default *vacuum_freeze_*_age settings? Offhand, it seems you would need to encounter this bug during each of ~10 generations of autovacuum_freeze_max_age before the old rows actually become invisible. On Wed, Nov 27, 2013 at 02:14:53PM +0100, Andres Freund wrote: > With regard to fixing things up, ISTM the best bet is heap_prune_chain() > so far. That's executed b vacuum and by opportunistic pruning and we > know we have the appropriate locks there. Looks relatively easy to fix > up things there. Not sure if there are any possible routes to WAL log > this but using log_newpage()? > I am really not sure what the best course of action is :( Maximizing detection is valuable, and the prognosis for automated repair is poor. I would want a way to extract tuples having xmin outside the range of CLOG that are marked HEAP_XMIN_COMMITTED or appear on an all-visible page. At first, I supposed we could offer a tool to blindly freeze such tuples. However, there's no guarantee that they are in harmony with recent changes to the database; transactions that wrongly considered those tuples invisible may have made decisions incompatible with their existence. For example, reviving such a tuple could violate a UNIQUE constraint if the user had already replaced the missing row manually. A module that offers "SELECT * FROM rows_wrongly_invisible('anytable')" would aid manual cleanup efforts. freeze_if_wrongly_invisible(tid) would be useful, too. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 2013-11-27 14:53:27 -0500, Noah Misch wrote:
> How would you characterize the chances of this happening with default
> *vacuum_freeze_*_age settings? Offhand, it seems you would need to encounter
> this bug during each of ~10 generations of autovacuum_freeze_max_age before
> the old rows actually become invisible.

I think realistically, to actually trigger the bug, it needs to happen quite a bit more often. But in some workloads it's pretty darn easy to hit. E.g. if significant parts of the table are regularly deleted, lots, if not most, of your vacuums will spuriously increase relfrozenxid above the actual value. Each time only by a small amount, but due to that small increase there will never be an actual full-table vacuum, since freeze_table_age will never even remotely be reached.

The client that made me look into the issue noticed problems on pg_attribute - presumably because of temporary table usage primarily affecting the tail end of pg_attribute.

> On Wed, Nov 27, 2013 at 02:14:53PM +0100, Andres Freund wrote:
> > With regard to fixing things up, ISTM the best bet is heap_prune_chain()
> > so far. That's executed by vacuum and by opportunistic pruning and we
> > know we have the appropriate locks there. Looks relatively easy to fix
> > up things there. Not sure if there are any possible routes to WAL-log
> > this other than using log_newpage()?
> > I am really not sure what the best course of action is :(
>
> Maximizing detection is valuable, and the prognosis for automated repair is
> poor. I would want a way to extract tuples having xmin outside the range of
> CLOG that are marked HEAP_XMIN_COMMITTED or appear on an all-visible
> page.

I think the likelihood of the problem affecting !all-visible pages is close to zero. Each vacuum will try to clean those, so they surely will get vacuumed at some point. I think the only way that could happen is if the ConditionalLockBufferForCleanup() fails in each vacuum. And that seems a bit unlikely.

> At first, I supposed we could offer a tool to blindly freeze such tuples.
> However, there's no guarantee that they are in harmony with recent changes to
> the database; transactions that wrongly considered those tuples invisible may
> have made decisions incompatible with their existence. For example, reviving
> such a tuple could violate a UNIQUE constraint if the user had already
> replaced the missing row manually.

Good point, although since they are all on all-visible pages, sequential scans will currently already find those. It's primarily index scans that won't. So it's not really reviving them...

The primary reason why I think it might be a good idea to "revive" automatically is that an eventual full-table/freeze vacuum will currently delete them, which seems bad.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 27, 2013 at 10:43:05PM +0100, Andres Freund wrote:
> On 2013-11-27 14:53:27 -0500, Noah Misch wrote:
> > How would you characterize the chances of this happening with default
> > *vacuum_freeze_*_age settings? Offhand, it seems you would need to encounter
> > this bug during each of ~10 generations of autovacuum_freeze_max_age before
> > the old rows actually become invisible.
>
> I think realistically, to actually trigger the bug, it needs to happen
> quite a bit more often. But in some workloads it's pretty darn easy to
> hit. E.g. if significant parts of the table are regularly deleted, lots,
> if not most, of your vacuums will spuriously increase relfrozenxid above
> the actual value. Each time only by a small amount, but due to that
> small increase there never will be an actual full table vacuum since
> freeze_table_age will never even remotely be reached.

That makes sense.

> > Maximizing detection is valuable, and the prognosis for automated repair is
> > poor. I would want a way to extract tuples having xmin outside the range of
> > CLOG that are marked HEAP_XMIN_COMMITTED or appear on an all-visible
> > page.
>
> I think the likelihood of the problem affecting !all-visible pages is
> close to zero. Each vacuum will try to clean those, so they surely will
> get vacuumed at some point. I think the only way that could happen is if
> the ConditionalLockBufferForCleanup() fails in each vacuum. And that
> seems a bit unlikely.

The page could have sat all-visible (through multiple XID epochs, let's say) until a recent UPDATE.

> > At first, I supposed we could offer a tool to blindly freeze such tuples.
> > However, there's no guarantee that they are in harmony with recent changes to
> > the database; transactions that wrongly considered those tuples invisible may
> > have made decisions incompatible with their existence. For example, reviving
> > such a tuple could violate a UNIQUE constraint if the user had already
> > replaced the missing row manually.
>
> Good point, although since they are all on all-visible pages sequential
> scans will currently already find those. It's primarily index scans that
> won't. So it's not really reviving them...

True. Since a dump/reload of the database would already get the duplicate key violation, the revival is not making anything clearly worse. And if we hope for manual repair, many DBAs just won't do that at all.

> The primary reason why I think it might be a good idea to "revive"
> automatically is, that an eventual full-table/freeze vacuum will
> currently delete them which seems bad.

Will it? When the page became all-visible, the tuples were all hinted. They will never be considered dead. Every 2B transactions, they will alternate between live and not-yet-committed.

--
Noah Misch
EnterpriseDB                      http://www.enterprisedb.com
On 2013-11-27 18:18:02 -0500, Noah Misch wrote:
> > I think the likelihood of the problem affecting !all-visible pages is
> > close to zero. Each vacuum will try to clean those, so they surely will
> > get vacuumed at some point. I think the only way that could happen is if
> > the ConditionalLockBufferForCleanup() fails in each vacuum. And that
> > seems a bit unlikely.
>
> The page could have sat all-visible (through multiple XID epochs, let's say)
> until a recent UPDATE.

Good point.

> > > At first, I supposed we could offer a tool to blindly freeze such tuples.
> > > However, there's no guarantee that they are in harmony with recent changes to
> > > the database; transactions that wrongly considered those tuples invisible may
> > > have made decisions incompatible with their existence. For example, reviving
> > > such a tuple could violate a UNIQUE constraint if the user had already
> > > replaced the missing row manually.
> >
> > Good point, although since they are all on all-visible pages sequential
> > scans will currently already find those. It's primarily index scans that
> > won't. So it's not really reviving them...
>
> True. Since a dump/reload of the database would already get the duplicate key
> violation, the revival is not making anything clearly worse. And if we hope
> for manual repair, many DBAs just won't do that at all.

Especially if it involves compiling C code...

> > The primary reason why I think it might be a good idea to "revive"
> > automatically is, that an eventual full-table/freeze vacuum will
> > currently delete them which seems bad.
>
> Will it? When the page became all-visible, the tuples were all hinted. They
> will never be considered dead. Every 2B transactions, they will alternate
> between live and not-yet-committed.

Good point again. And pretty damn good luck.

Although 9.3+ multixact rows look like they could return HEAPTUPLE_DEAD, since we'll do a TransactionIdDidAbort() (via GetUpdateXid()) and TransactionIdDidCommit() in there, and we don't set XMAX_COMMITTED hint bits for XMAX_IS_MULTI rows.

As an additional problem, once multixact.c has pruned the old multis away we'll get errors from there on. But that's less likely to affect many rows.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> > On Wed, Nov 27, 2013 at 02:14:53PM +0100, Andres Freund wrote:
> > > With regard to fixing things up, ISTM the best bet is heap_prune_chain()
> > > so far. That's executed by vacuum and by opportunistic pruning and we
> > > know we have the appropriate locks there. Looks relatively easy to fix
> > > up things there. Not sure if there are any possible routes to WAL-log
> > > this other than using log_newpage()?
> > > I am really not sure what the best course of action is :(

Based on subsequent thread discussion, the plan you outline sounds reasonable. Here is a sketch of the specific semantics of that fixup. If a HEAPTUPLE_LIVE tuple has XIDs older than the current relfrozenxid/relminmxid of its relation or newer than ReadNewTransactionId()/ReadNextMultiXactId(), freeze those XIDs. Do likewise for HEAPTUPLE_DELETE_IN_PROGRESS, ensuring a proper xmin if the in-progress deleter aborts. Using log_newpage_buffer() seems fine; there's no need to optimize performance there. (All the better if we can, with minimal hacks, convince heap_freeze_tuple() itself to log the right changes.)

I am wary about the performance loss of doing these checks in every heap_prune_chain() call, for all time. If it's measurable, can we shed the overhead once corrections are done? Maybe bump the page layout version and skip the checks for v5 pages? (Ugh.)

Time is tight to finalize this, but it would be best to get this into next week's release. That way, the announcement, fix, and mitigating code pertaining to this data loss bug all land in the same release. If necessary, I think it would be worth delaying the release, or issuing a new release a week or two later, to closely align those events. That being said, I'm prepared to review a patch in this area over the weekend.

--
Noah Misch
EnterpriseDB                      http://www.enterprisedb.com
Hi Noah,

On 2013-11-30 00:40:06 -0500, Noah Misch wrote:
> > > On Wed, Nov 27, 2013 at 02:14:53PM +0100, Andres Freund wrote:
> > > > With regard to fixing things up, ISTM the best bet is heap_prune_chain()
> > > > so far. That's executed by vacuum and by opportunistic pruning and we
> > > > know we have the appropriate locks there. Looks relatively easy to fix
> > > > up things there. Not sure if there are any possible routes to WAL-log
> > > > this other than using log_newpage()?
> > > > I am really not sure what the best course of action is :(
>
> Based on subsequent thread discussion, the plan you outline sounds reasonable.
> Here is a sketch of the specific semantics of that fixup. If a HEAPTUPLE_LIVE
> tuple has XIDs older than the current relfrozenxid/relminmxid of its relation
> or newer than ReadNewTransactionId()/ReadNextMultiXactId(), freeze those XIDs.
> Do likewise for HEAPTUPLE_DELETE_IN_PROGRESS, ensuring a proper xmin if the
> in-progress deleter aborts. Using log_newpage_buffer() seems fine; there's no
> need to optimize performance there.

We'd need to decide what to do with xmax values; they'd likely need to be treated differently.

The problem with log_newpage_buffer() is that we'd quite possibly issue one such call per item on a page. And that might become quite expensive. Logging ~1.5MB per 8k page in the worst case sounds a bit scary.

> (All the better if we can, with minimal
> hacks, convince heap_freeze_tuple() itself to log the right changes.)

That likely comes too late - we've already pruned the page and might have made wrong decisions there. Also, heap_freeze_tuple() is run on both the primary and standbys. I think our xl_heap_freeze format, which relies on running heap_freeze_tuple() during recovery, is a terrible idea, but we can't change that right now.

> Time is tight to finalize this, but it would be best to get this into next
> week's release. That way, the announcement, fix, and mitigating code
> pertaining to this data loss bug all land in the same release. If necessary,
> I think it would be worth delaying the release, or issuing a new release a
> week or two later, to closely align those events. That being said, I'm
> prepared to review a patch in this area over the weekend.

I don't think I currently have the energy/brainpower/time to develop such a fix in a suitable quality till Monday. I've worked pretty hard on trying to fix the host of multixact data corruption bugs over the last days, and developing a solution that I'd be happy to put into such critical paths is certainly several days' worth of work.

I am not sure if it's a good idea to delay the release because of this; there are so many other critical issues that that seems like a bad tradeoff. That said, if somebody else is taking the lead I am certainly willing to help in detail with review and testing.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > I am not sure if it's a good idea to delay the release because of this, > there are so many other critical issues that that seems like a bad > tradeoff. Indeed. We already said that this release was being done *now* because of the replication bug, and I see no reason to change that. The more last-minute stuff we try to cram in, the bigger risk of (another) mistake. We've already taken a credibility hit from introducing a new bug into the last round of update releases; let's please not take a risk of doing that again. regards, tom lane
On 2013-11-30 11:18:09 -0500, Tom Lane wrote:
> We've already taken a credibility hit from introducing a new
> bug into the last round of update releases; let's please not take a
> risk of doing that again.

On that front: I'd love for somebody else to look at the revised 9.3 freezing logic. It's way too complicated, and there are quite a few subtleties about freezing that are hard to get right, as evidenced by 9.3.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Nov 30, 2013 at 05:22:04PM +0100, Andres Freund wrote: > On that front: I'd love for somebody else to look at the revised 9.3 > freezing logic. Do you speak of the changes to xmax freezing arising from the FK lock optimization? -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 2013-11-30 11:18:09 -0500, Tom Lane wrote:
> Indeed. We already said that this release was being done *now* because
> of the replication bug, and I see no reason to change that.

FWIW, I think the two other data-corrupting bugs, "incomplete freezing due to truncation" (all branches) and freezing overall (in 9.3), are at least as bad, because they take effect on the primary. Not saying that because of my involvement, but because I think they need to be presented at least as prominently in the release notes.

The bugs themselves are all fixed in the relevant branches, but I do think we need to talk about how to detect and fix possible corruption.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > FWIW, I think the two other data corrupting bugs, "incomplete freezing > due to truncation" (all branches) and freezing overall (in 9.3), are at > least as bad because they take effect on the primary. > Not saying that because of my involvement, but because I think they need > to be presented at least as prominently in the release notes. > The bugs themselves are all fixed in the relevant branches, but I do > think we need to talk about how to detect and fix possible > corruption. I was planning to draft up the release notes today. Can you propose text about the above? regards, tom lane
On 2013-11-30 11:40:36 -0500, Noah Misch wrote: > On Sat, Nov 30, 2013 at 05:22:04PM +0100, Andres Freund wrote: > > On that front: I'd love for somebody else to look at the revised 9.3 > > freezing logic. > > Do you speak of the changes to xmax freezing arising from the FK lock > optimization? Yes. There had been several major bugs in 9.3+ around freezing:

* Old "updater" xids in multixacts weren't subjected to the new xmin horizon => inaccessible or duplicated rows.
* If a multi was too old, we simply removed it, even if it contained a committed xmax => duplicate rows.
* If HEAP_XMAX_INVALID was set, heap_freeze_tuple() and heap_tuple_needs_freeze() didn't do anything about xmax. That's completely not okay, since the hint bit might not have been set on the standby => diverging standby and primary, with unfrozen rows remaining on the standby, and missing or duplicate rows there.

The fixed (2393c7d102368717283d7121a6ea8164e902b011) heap_freeze_tuple() and heap_tuple_needs_freeze() look much safer to me, but there's lots of complexity there. I don't see any alternative to the complexity until we change the format of xl_heap_freeze, but some more eyes than Alvaro's and mine definitely would be good. 71ad5a8475b4df896a7baa71e6dd3c455eebae99 also touches quite some intricate code paths and fixes a good amount of bugs, so it's definitely also worthy of further inspection. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-11-30 11:50:57 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > FWIW, I think the two other data corrupting bugs, "incomplete freezing > > due to truncation" (all branches) and freezing overall (in 9.3), are at > > least as bad because they take effect on the primary. > > Not saying that because of my involvement, but because I think they need > > to be presented at least as prominent in the release notes. > > They bugs themselves are all fixed in the relevant branches, but I do > > think we need to talk about about how to detect and fix possible > > corruption. > > I was planning to draft up the release notes today. Can you propose > text about the above? I can, but it will be a couple of hours before I can give it serious thought (starving and insanity being serious perils otherwise ;)). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Nov 30, 2013 at 05:00:58PM +0100, Andres Freund wrote: > The problem with log_newpage_buffer() is that we'd quite possibly issue > one such call per item on a page. And that might become quite > expensive. Logging ~1.5MB per 8k page in the worst case sounds a bit > scary. I had in mind issuing at most one call per page. heap_page_prune() has a structure conducive to that. > On 2013-11-30 00:40:06 -0500, Noah Misch wrote: > > Time is tight to finalize this, but it would be best to get this into next > > week's release. That way, the announcement, fix, and mitigating code > > pertaining to this data loss bug all land in the same release. If necessary, > > I think it would be worth delaying the release, or issuing a new release a > > week or two later, to closely align those events. > I am not sure if it's a good idea to delay the release because of this, > there are so many other critical issues that that seems like a bad > tradeoff. Fair enough; I'll drop that proposal. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 2013-11-30 12:22:16 -0500, Noah Misch wrote: > On Sat, Nov 30, 2013 at 05:00:58PM +0100, Andres Freund wrote: > > The problem with log_newpage_buffer() is that we'd quite possibly issue > > one such call per item on a page. And that might become quite > > expensive. Logging ~1.5MB per 8k page in the worst case sounds a bit > > scary. > > I had in mind issuing at most one call per page. heap_page_prune() has a > structure conducive to that. That, at least as far as I can imagine, would make the fix quite complicated though. In the first phase of heap_page_prune() we aren't in a critical section and cannot modify the buffer yet, so we would have to make all the involved code cope with the unfixed xids and hint bits. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2013-11-30 11:50:57 -0500, Tom Lane wrote: >> I was planning to draft up the release notes today. Can you propose >> text about the above? > I can, but it will be a couple of hours before I can give it serious > thought (starving and insanity being serious perils otherwise ;)). Sure. I can wait till tomorrow as far as this aspect is concerned. regards, tom lane
On 2013-11-30 11:50:57 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > FWIW, I think the two other data corrupting bugs, "incomplete freezing > > due to truncation" (all branches) and freezing overall (in 9.3), are at > > least as bad because they take effect on the primary. > > Not saying that because of my involvement, but because I think they need > > to be presented at least as prominently in the release notes. > > The bugs themselves are all fixed in the relevant branches, but I do > > think we need to talk about how to detect and fix possible > > corruption. > > I was planning to draft up the release notes today. Can you propose > text about the above?

* Fix possible data corruption due to incomplete vacuuming (Andres Freund, Heikki Linnakangas)

  Due to this bug, (auto-)vacuum could sometimes mistakenly treat a partial vacuum as a full-table vacuum, increasing relfrozenxid as a result. This could happen if it managed to truncate the tail end of the table due to dead space. Possible consequences are:

  * Errors like "could not access status of transaction XXX" when accessing such rows.
  * Vanishing rows after more than 2^31 transactions have passed.

  Tables in which some parts change only infrequently while others change heavily are more likely to be affected.

  It is recommended to perform a VACUUM of all tables in all databases while having vacuum_freeze_table_age set to zero. This will fix latent corruption but will not be able to fix all errors.

  To detect whether a database is possibly affected, check whether either "SELECT txid_current() < 2^31" returns 'f' or a VACUUM of all tables with vacuum_freeze_table_age set to zero returns errors. If neither is the case, the database is safe after performing the VACUUM.

  If you think you are suffering from this corruption, please contact the pgsql-hackers mailing list or your service provider; data is likely to be recoverable.

  Users updating from 9.0.4/8.4.8 or earlier are not affected.
All branches. Commit: 82b43f7df2036d06b4410721f77512969846b6d0

* Fix possible data corruption due to several bugs around vacuuming [in the 9.3 foreign key locks feature] (Andres Freund, Alvaro Herrera)

  The VACUUM implementation in 9.3 had several bugs: It removed multixact xmax values without regard for the importance of the contained xids, it did not remove multixacts if the contained xids were too old, and it relied on hint bits - which might not have been set on replicas - when checking whether a row needed to be frozen.

  It is unlikely that databases on a primary are affected in which no VACUUM FREEZE or VACUUM with a nondefault vacuum_freeze_min_age was ever executed, and in which SELECT relminmxid FROM pg_class WHERE relkind = 'r' AND NOT oid = 1262 AND NOT relminmxid = 1 returns no rows.

  Possible consequences are:

  * Duplicated or vanishing rows.
  * Errors like "could not access status of transaction XXX" when accessing rows.
  * Primary and standby servers getting out of sync.

  It is strongly recommended to re-clone all standby servers after upgrading, especially if full_page_writes was set to false. On the primary it is recommended to execute a VACUUM of all tables in all databases after upgrading both the primary and any existing standbys, while having vacuum_freeze_table_age set to zero. This will fix latent corruption on primaries but will not be able to fix all pre-existing errors.

  If you think you are suffering from data loss due to this corruption on the primary, please contact the pgsql-hackers mailing list or your service provider; some data might be recoverable.

9.3 only, but should be mentioned first, as corruption due to this is quite likely. Commit: 2393c7d102368717283d7121a6ea8164e902b011

I had quite a hard time - likely noticeably - summarizing the second item in easy-to-understand terms. The interactions are quite complicated.
We could tell users they don't necessarily need to re-clone standbys if no xids have been truncated away (txid_current() < 2^31, and datfrozenxid of at least one database = 1), but given the replication issue that seems like unnecessary confusion. Questions? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-11-27 14:53:27 -0500, Noah Misch wrote: > How would you characterize the chances of this happening with default > *vacuum_freeze_*_age settings? Offhand, it seems you would need to encounter > this bug during each of ~10 generations of autovacuum_freeze_max_age before > the old rows actually become invisible. On second thought, it's quite possible to see problems well before that. A single occurrence of such an illegitimate increase in relfrozenxid can be enough to cause problems of a slightly different nature. As relfrozenxid has been updated, we might now, or after vacuuming some other tables, become eligible to truncate the clog. In that case we'll get ERRORs about "could not access status of transaction" if the tuple hasn't been fully hinted when scanning it later. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-01 13:33:42 +0100, Andres Freund wrote: > On 2013-11-27 14:53:27 -0500, Noah Misch wrote: > > How would you characterize the chances of this happening with default > > *vacuum_freeze_*_age settings? Offhand, it seems you would need to encounter > > this bug during each of ~10 generations of autovacuum_freeze_max_age before > > the old rows actually become invisible. > > On second thought, it's quite possible to see problems well before that. > A single occurrence of such an illegitimate increase in relfrozenxid can > be enough to cause problems of a slightly different nature. > As relfrozenxid has been updated, we might now, or after vacuuming some > other tables, become eligible to truncate the clog. In that case we'll > get ERRORs about "could not access status of transaction" if the tuple > hasn't been fully hinted when scanning it later. And indeed, a quick search shows up some threads that might suffer from it: BD7EE863F673A14EBF99D95562AEE15E44B1DA71@digi-pdc.digitilitiprod.int CAAzPmNxfDrV72wDmBEv5tcQOByE_wvGSeqRkQj0FizXmCYyaPQ@mail.gmail.com CAK9oVJwvAZLmdMrHMPg1+s37z16j+BZ8FbarZSpmrHsXxH-4GQ@mail.gmail.com Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Dec 01, 2013 at 01:55:45PM +0100, Andres Freund wrote: > On 2013-12-01 13:33:42 +0100, Andres Freund wrote: > > On 2013-11-27 14:53:27 -0500, Noah Misch wrote: > > > How would you characterize the chances of this happening with default > > > *vacuum_freeze_*_age settings? Offhand, it seems you would need to encounter > > > this bug during each of ~10 generations of autovacuum_freeze_max_age before > > > the old rows actually become invisible. > > > > On second thought, it's quite possible to see problems well before that. > > A single occurrence of such an illegitimate increase in relfrozenxid can > > be enough to cause problems of a slightly different nature. > > As relfrozenxid has been updated, we might now, or after vacuuming some > > other tables, become eligible to truncate the clog. In that case we'll > > get ERRORs about "could not access status of transaction" if the tuple > > hasn't been fully hinted when scanning it later. Agreed. Probably, the use of hint bits and the low frequency with which TruncateCLOG() can actually remove something have kept this below the radar. > And indeed, a quick search shows up some threads that might suffer from > it: > BD7EE863F673A14EBF99D95562AEE15E44B1DA71@digi-pdc.digitilitiprod.int This system had multiple problems, a missing pg_subtrans file and a missing TOAST chunk for pg_attribute. I don't see a pg_clog problem connecting it to the freeze bug at hand. > CAAzPmNxfDrV72wDmBEv5tcQOByE_wvGSeqRkQj0FizXmCYyaPQ@mail.gmail.com This report is against PostgreSQL 8.1.11, which never had a commit like b4b6923. If a similar bug is at work, this older version acquired it through a different vector. > CAK9oVJwvAZLmdMrHMPg1+s37z16j+BZ8FbarZSpmrHsXxH-4GQ@mail.gmail.com Possible match, but suggestions of lower-level problems cloud the diagnosis. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 2013-12-01 12:49:40 -0500, Noah Misch wrote: > This system had multiple problems, a missing pg_subtrans file and a missing > TOAST chunk for pg_attribute. I don't see a pg_clog problem connecting it to > the freeze bug at hand. Those all sound like possible problems caused by the bug, no? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Dec 01, 2013 at 06:56:10PM +0100, Andres Freund wrote: > On 2013-12-01 12:49:40 -0500, Noah Misch wrote: > > This system had multiple problems, a missing pg_subtrans file and a missing > > TOAST chunk for pg_attribute. I don't see a pg_clog problem connecting it to > > the freeze bug at hand. > > Those all sound like possible problems caused by the bug, no? pg_subtrans has a lifecycle unrelated to datfrozenxid. I am not aware of a mechanism connecting that problem to the bug at hand. The missing TOAST chunk (in pg_statistic, not pg_attribute as I wrote above) could happen from the XID space wrapping with that TOAST table page marked all-visible but not frozen. The bug reporter describes the start of that error coinciding with a minor version upgrade, so that would be an odd coincidence: 8.4.3 did not have the bug as we know it, so considerable time would typically pass between the upgrade and that symptom appearing. Can't rule it out, but this user report fits the known bug symptoms only loosely. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Andres Freund <andres@2ndquadrant.com> writes: > The VACUUM implementation in 9.3 had several bugs: It removed multixact > xmax values without regard of the importance of contained xids, it did > not remove multixacts if the contained xids were too old and it relied > on hint bits when checking whether a row needed to be frozen which might > not have been set on replicas. Uh ... what does the last have to do with it? Surely we don't run VACUUM on replicas. Or are you talking about what might happen when VACUUM is run on a former replica that's been promoted to master? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> schrieb: >Andres Freund <andres@2ndquadrant.com> writes: >> The VACUUM implementation in 9.3 had several bugs: It removed >multixact >> xmax values without regard of the importance of contained xids, it >did >> not remove multixacts if the contained xids were too old and it >relied >> on hint bits when checking whether a row needed to be frozen which >might >> not have been set on replicas. > >Uh ... what does the last have to do with it? Surely we don't run >VACUUM on replicas. Or are you talking about what might happen when >VACUUM is run on a former replica that's been promoted to master? Unfortunately not. The problem is that xl_heap_freeze's redo function simply re-executes heap_freeze_tuple() instead of the record logging much about each tuple... Andres -- Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch <noah@leadboat.com> schrieb: >On Sun, Dec 01, 2013 at 06:56:10PM +0100, Andres Freund wrote: >> On 2013-12-01 12:49:40 -0500, Noah Misch wrote: >> > This system had multiple problems, a missing pg_subtrans file and a >missing >> > TOAST chunk for pg_attribute. I don't see a pg_clog problem >connecting it to >> > the freeze bug at hand. >> >> Those all sound like possible problems caused by the bug, no? > >pg_subtrans has a lifecycle unrelated to datfrozenxid. I am not aware >of a >mechanism connecting that problem to the bug at hand. TransactionIdIsInProgress returns true for future xids. Which triggers the use of XactLockTableWait. Which then does SubTransGetParent of a far-future xid... Andres -- Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> schrieb: >> Uh ... what does the last have to do with it? Surely we don't run >> VACUUM on replicas. Or are you talking about what might happen when >> VACUUM is run on a former replica that's been promoted to master? > Unfortunately not. The problem is that xl_heap_freeze's redo function simply re-executes heap_freeze_tuple() instead of logging much about each tuple... That was a pretty stupid choice ... we should think seriously about changing that for 9.4. In general the application of a WAL record needs to be 100% deterministic. regards, tom lane
Andres Freund <andres@2ndquadrant.com> writes: > * Fix possible data corruptions due to incomplete vacuuming (Andres Freund, Heikki Linnakangas) > Due to this bug (auto-)vacuum could sometimes treat a partial vacuum as > a full table vacuum mistakenly increasing relfrozenxid as a result. This > could happen if it managed to truncate the tail end of the table due to > dead space. Possible consequences are: > * Errors like "could not access status of transaction XXX" when > accessing such rows. > * Vanishing rows after more than 2^31 transactions have passed. Is there really a significant risk of clog access errors due to this bug? IIUC, the risk is that tuples in pages that vacuum skips due to being all-visible might not be frozen when intended. But it seems just about certain that such tuples would be properly hinted already, which means that nothing would ever go to clog to confirm that. So ISTM the only real risk is that rows would become invisible after 2^31 transactions (and then visible again after 2^31 more). And even then you'd need persistent bad luck, in the form of incorrect advancements of relfrozenxid having happened often enough to prevent any anti-wraparound vacuums from getting launched. Am I missing something? It's certainly a bad bug, but it seems to me that the probability of data loss is low enough that it's not so surprising this has escaped detection for so long. regards, tom lane
On 2013-12-01 17:15:31 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > * Fix possible data corruptions due to incomplete vacuuming (Andres Freund, Heikki Linnakangas) > > > Due to this bug (auto-)vacuum could sometimes treat a partial vacuum as > > a full table vacuum mistakenly increasing relfrozenxid as a result. This > > could happen if it managed to truncate the tail end of the table due to > > dead space. Possible consequences are: > > * Errors like "could not access status of transaction XXX" when > > accessing such rows. > > * Vanishing rows after more than 2^31 transactions have passed. > > Is there really a significant risk of clog access errors due to this bug? > IIUC, the risk is that tuples in pages that vacuum skips due to being > all-visible might not be frozen when intended. But it seems just about > certain that such tuples would be properly hinted already, which means > that nothing would ever go to clog to confirm that. So ISTM the only real > risk is that rows would become invisible after 2^31 transactions (and then > visible again after 2^31 more).

Unfortunately it's not actually too hard to hit, due to the following part of the code in vacuumlazy.c:

    /* We need buffer cleanup lock so that we can prune HOT chains. */
    if (!ConditionalLockBufferForCleanup(buf))
    {
        /*
         * If we're not scanning the whole relation to guard against XID
         * wraparound, it's OK to skip vacuuming a page.  The next vacuum
         * will clean it up.
         */
        if (!scan_all)
        {
            ReleaseBuffer(buf);
            continue;
        }
        ...

If you have some concurrency you hit that path pretty often. And once such a vacuum has gone through the table it will have a higher relfrozenxid, so an impending "wave" of anti-wraparound vacuums won't scan it. Also, the hint bits won't be set on potential standbys, so that's not necessarily preventing problems. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-01 15:54:41 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > Tom Lane <tgl@sss.pgh.pa.us> schrieb: > >> Uh ... what does the last have to do with it? Surely we don't run > >> VACUUM on replicas. Or are you talking about what might happen when > >> VACUUM is run on a former replica that's been promoted to master? > > > Unfortunately not. The problem is that xl_heap_freeze's redo function simply re-executes heap_freeze_tuple() instead of logging much about each tuple... > > That was a pretty stupid choice ... we should think seriously about > changing that for 9.4. In general the application of a WAL record > needs to be 100% deterministic. Completely agreed. I'm not really sure what led to that design choice except the desire to save a bit of WAL volume. It's a pretty old piece of code - from a good while before I followed development in any detail. It's actually in the original commit (48188e1621bb6711e7d092bee48523b18cd80177) that introduced today's form of freezing. If it had been a more robust format all along, potential damage from the replication bug could have been fixed by a VACUUM FREEZE... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2013-12-01 17:15:31 -0500, Tom Lane wrote: >> Is there really a significant risk of clog access errors due to this bug? >> IIUC, the risk is that tuples in pages that vacuum skips due to being >> all-visible might not be frozen when intended. > Unfortunately it's not actually too hard to hit due to following part of the > code in vacuumlazy.c: > /* > * If we're not scanning the whole relation to guard against XID > * wraparound, it's OK to skip vacuuming a page. The next vacuum > * will clean it up. > */ Ah. So it's only been *seriously* broken since commit bbb6e559c, ie 9.2. regards, tom lane
On 2013-12-01 18:02:27 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2013-12-01 17:15:31 -0500, Tom Lane wrote: > >> Is there really a significant risk of clog access errors due to this bug? > >> IIUC, the risk is that tuples in pages that vacuum skips due to being > >> all-visible might not be frozen when intended. > > > Unfortunately it's not actually too hard to hit due to following part of the > > code in vacuumlazy.c: > > > /* > > * If we're not scanning the whole relation to guard against XID > > * wraparound, it's OK to skip vacuuming a page. The next vacuum > > * will clean it up. > > */ > > Ah. So it's only been *seriously* broken since commit bbb6e559c, ie 9.2. Well, even before that, crash recovery/replication didn't necessarily preserve the hint bits. Even more so if somebody dared to set full_page_writes=off. I personally think full_page_writes=off should conflict with wal_level != minimal, btw, but I don't see much chance of gaining acceptance for that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services