Обсуждение: postgresql 9.6 - cannot freeze committed xmax
Hello,
Recently our team performed an upgrade on one of our old postgres 9.2 databases. "for science" we tried a direct upgrade from 9.2 to 9.6 on our staging environment . The initial plan was to do 9.2 -> 9.4 and 9.4 -> 9.6.There's still a lot of unknown on my side about
- what does this error exactly implies? -> are we trying to freeze committed transactions?
-Why would that be an issue?
-Why would that be an issue?
- How can I fix this? If it's possible
I would welcome any information that could put me in the right direction.
Cheers,
Alexandre
Alexandre Garcia wrote: > Recently our team performed an upgrade on one of our old postgres 9.2 > databases. "for science" we tried a direct upgrade from 9.2 to 9.6 on our > staging environment . The initial plan was to do 9.2 -> 9.4 and 9.4 -> 9.6. > > The upgrade turned out successful on staging and we decided to go with it > on prod as well. > Prod starting throwing the following errors during autovacuum -> 'cannot > freeze committed xmax <xid>' on 2 different tables. Running vacuum manually > revealed more tables affected by the same error. > > The staging database did not present any error but the process to sync prod > to stage includes a sanitize script that removes sensitive information and > it somehow seems to fix the issue on stage (we've done a sync from prod to > stage after the upgrade) > > I've been doing a lot of search about this and even tried to go through the > code that throws that specific error. This particular error condition is a sanity check that was only introduced in 9.6.7, so you would not find too many reports of that (this exact error message wording doesn't exist prior to that). It is possible that we missed some corner case when writing that check. Upgrades from 9.2 are particularly unusual since the xmax header was reused in the 9.3 era to mean something completely different under some circumstances. I'm not in a position to do deeper debugging for you at this time, though. The commit in question is https://git.postgresql.org/pg/commitdiff/986a9153b9708071adf6ce2c9131266f3431f4ec Wild guess: maybe we should be checking HEAP_LOCKED_UPGRADED before bailing out. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-02-26 20:09:31 -0300, Alvaro Herrera wrote: > Alexandre Garcia wrote: > > > Recently our team performed an upgrade on one of our old postgres 9.2 > > databases. "for science" we tried a direct upgrade from 9.2 to 9.6 on our > > staging environment . The initial plan was to do 9.2 -> 9.4 and 9.4 -> 9.6. > > > > The upgrade turned out successful on staging and we decided to go with it > > on prod as well. > > Prod starting throwing the following errors during autovacuum -> 'cannot > > freeze committed xmax <xid>' on 2 different tables. Running vacuum manually > > revealed more tables affected by the same error. > > > > The staging database did not present any error but the process to sync prod > > to stage includes a sanitize script that removes sensitive information and > > it somehow seems to fix the issue on stage (we've done a sync from prod to > > stage after the upgrade) > > > > I've been doing a lot of search about this and even tried to go through the > > code that throws that specific error. Could you show pg_controldata from before/after pg_upgrade, and the output of SELECT oid::regclass, relfrozenxid, age(relfrozenxid), txid_current() FROM pg_class WHERE oid = 'problematic_table'::regclass; from both before/after? > This particular error condition is a sanity check that was only > introduced in 9.6.7, so you would not find too many reports of that > (this exact error message wording doesn't exist prior to that). It is > possible that we missed some corner case when writing that check. > Upgrades from 9.2 are particularly unusual since the xmax header was > reused in the 9.3 era to mean something completely different under some > circumstances. I'm not in a position to do deeper debugging for you at > this time, though. > > The commit in question is > https://git.postgresql.org/pg/commitdiff/986a9153b9708071adf6ce2c9131266f3431f4ec > > Wild guess: maybe we should be checking HEAP_LOCKED_UPGRADED before > bailing out. Doesn't !(tuple->t_infomask & HEAP_XMAX_LOCK_ONLY) already guard against that? Greetings, Andres Freund
Thanks for your answers, took me a little while to restore a backup from before the upgrade on an empty 9.2 server but here we are => https://gist.github.com/kuuji/cc73c3c193a0d58aa0e651fbd7bfec58
On Mon, Feb 26, 2018 at 6:21 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-02-26 20:09:31 -0300, Alvaro Herrera wrote:
> Alexandre Garcia wrote:
>
> > Recently our team performed an upgrade on one of our old postgres 9.2
> > databases. "for science" we tried a direct upgrade from 9.2 to 9.6 on our
> > staging environment . The initial plan was to do 9.2 -> 9.4 and 9.4 -> 9.6.
> >
> > The upgrade turned out successful on staging and we decided to go with it
> > on prod as well.
> > Prod starting throwing the following errors during autovacuum -> 'cannot
> > freeze committed xmax <xid>' on 2 different tables. Running vacuum manually
> > revealed more tables affected by the same error.
> >
> > The staging database did not present any error but the process to sync prod
> > to stage includes a sanitize script that removes sensitive information and
> > it somehow seems to fix the issue on stage (we've done a sync from prod to
> > stage after the upgrade)
> >
> > I've been doing a lot of search about this and even tried to go through the
> > code that throws that specific error.
Could you show pg_controldata from before/after pg_upgrade, and the
output of
SELECT oid::regclass, relfrozenxid, age(relfrozenxid), txid_current()
FROM pg_class
WHERE oid = 'problematic_table'::regclass;
from both before/after?
> This particular error condition is a sanity check that was only
> introduced in 9.6.7, so you would not find too many reports of that
> (this exact error message wording doesn't exist prior to that). It is
> possible that we missed some corner case when writing that check.
> Upgrades from 9.2 are particularly unusual since the xmax header was
> reused in the 9.3 era to mean something completely different under some
> circumstances. I'm not in a position to do deeper debugging for you at
> this time, though.
>
> The commit in question is
> https://git.postgresql.org/pg/commitdiff/986a9153b9708071adf6ce2c9131266f3431f4ec
>
> Wild guess: maybe we should be checking HEAP_LOCKED_UPGRADED before
> bailing out.
Doesn't !(tuple->t_infomask & HEAP_XMAX_LOCK_ONLY) already guard against
that?
Greetings,
Andres Freund
Hi, On 2018-02-28 21:12:58 +0000, Alexandre Garcia wrote: > Thanks for your answers, took me a little while to restore a backup from > before the upgrade on an empty 9.2 server but here we are => > https://gist.github.com/kuuji/cc73c3c193a0d58aa0e651fbd7bfec58 Thanks. > > > > Prod starting throwing the following errors during autovacuum -> > > 'cannot > > > > freeze committed xmax <xid>' on 2 different tables. Running vacuum > > manually > > > > revealed more tables affected by the same error. What's an example <xid> in that message that triggers for problematic_table? I need that to properly interpret the data you provided. Thanks, Andres
Oh yes sorry => `ERROR: cannot freeze committed xmax 43076385` on that same table
On Wed, Feb 28, 2018 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-02-28 21:12:58 +0000, Alexandre Garcia wrote:
> Thanks for your answers, took me a little while to restore a backup from
> before the upgrade on an empty 9.2 server but here we are =>
> https://gist.github.com/kuuji/cc73c3c193a0d58aa0e651fbd7bfec58
Thanks.
> > > > Prod starting throwing the following errors during autovacuum ->
> > 'cannot
> > > > freeze committed xmax <xid>' on 2 different tables. Running vacuum
> > manually
> > > > revealed more tables affected by the same error.
What's an example <xid> in that message that triggers for
problematic_table? I need that to properly interpret the data you
provided.
Thanks,
Andres
Hi Alexandre, Alvaro, Please not that we try not to top-post on this list. Alvaro, On 2018-02-28 21:17:08 +0000, Alexandre Garcia wrote: > Oh yes sorry => `ERROR: cannot freeze committed xmax 43076385` on that > same table In light of Latest checkpoint's NextXID: 0/128653693 I find this confusing. Alexandre, did you ever use pg_resetxlog on that cluster? Greetings, Andres Freund
Sorry about the top-post.
Alexandre, did you ever use pg_resetxlog on that cluster?
Greetings,
Andres Freund
This database used to be managed by someone else and we started managing it recently. Our team didn't run pg_resetxlog for sure.
I just chatted with the previous team and it looks like it was never ran either. This is a fairly old database and it's not its first upgrade.
I just realized we're using 9.6.6 on our staging environment, which explains why we don't see that message altogether since it was introduced in 9.6.7.
Andres Freund wrote: > Alvaro, > > On 2018-02-28 21:17:08 +0000, Alexandre Garcia wrote: > > Oh yes sorry => `ERROR: cannot freeze committed xmax 43076385` on that > > same table > > In light of > Latest checkpoint's NextXID: 0/128653693 > I find this confusing. Ooh. If you SELECT FOR UPDATE a tuple in 9.2, bit 0x0040 gets set in infomask, and nothing else. If you pg_upgrade and later try to freeze such a tuple, it will fail with the error reported. The correct test to use is HEAP_XMAX_IS_LOCKED_ONLY, which also tests for the above condition. I will verify this theory and push a patch shortly, if it proves correct. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Ooh.
If you SELECT FOR UPDATE a tuple in 9.2, bit 0x0040 gets set in
infomask, and nothing else. If you pg_upgrade and later try to freeze
such a tuple, it will fail with the error reported.
The correct test to use is HEAP_XMAX_IS_LOCKED_ONLY, which also tests
for the above condition.
I will verify this theory and push a patch shortly, if it proves
correct.
Oh good news :). I have my old 9.2 around if you need me to do more testing on it.
Alexandre Garcia wrote: > > > > Ooh. > > > > If you SELECT FOR UPDATE a tuple in 9.2, bit 0x0040 gets set in > > infomask, and nothing else. If you pg_upgrade and later try to freeze > > such a tuple, it will fail with the error reported. > > > > The correct test to use is HEAP_XMAX_IS_LOCKED_ONLY, which also tests > > for the above condition. > > > > I will verify this theory and push a patch shortly, if it proves > > correct. > > Oh good news :). I have my old 9.2 around if you need me to do more > testing on it. Not necessary -- I reproduced the problem (quite easily -- just SELECT FOR UPDATE a tuple in 9.2, then pg_upgrade, then VACUUM FREEZE the table in the upgraded server) and confirm that it doesn't happen in 9.6.6, happens in 9.6.8 (didn't try 9.6.7 but code is the same as 9.6.8), and is fixed with the attached patch. Will push soon ... probably tomorrow as I have to leave the building now. Cheers -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Hi, On 2018-02-28 20:30:19 -0300, Alvaro Herrera wrote: > From 4ecaa6cdc041d9f44d54f9d56588ea73c1c6f3c7 Mon Sep 17 00:00:00 2001 > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: Wed, 28 Feb 2018 20:16:35 -0300 > Subject: [PATCH] Fix freeze xmax bug > > --- > src/backend/access/heap/heapam.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > index 8a846e7dba..dc762f913d 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -6799,7 +6799,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, > * independent of committedness, since a committed lock holder has > * released the lock). > */ > - if (!(tuple->t_infomask & HEAP_XMAX_LOCK_ONLY) && > + if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) > && Good catch. Wonder if this a big enough fix? Isn't the bug "one layer up"? We shouldn't do the TransactionIdPrecedes() calls in the surrounding block, in the first place, right? Right now we could very well hit both ERROR: found xmax %u from before relfrozenxid %u and ERROR: cannot freeze committed xmax %u and freeze_xmax, totally_frozen will be wrong. I think that's largely harmless, but we should fix it nonetheless, and inspect other relevant sites. Greetings, Andres Freund
Andres Freund wrote: > Wonder if this a big enough fix? Isn't the bug "one layer up"? We > shouldn't do the TransactionIdPrecedes() calls in the surrounding block, > in the first place, right? Right now we could very well hit both > ERROR: found xmax %u from before relfrozenxid %u > and > ERROR: cannot freeze committed xmax %u > and freeze_xmax, totally_frozen will be wrong. I think that's largely > harmless, but we should fix it nonetheless, and inspect other relevant > sites. Hmm ... After reading the 9.2 code, I don't think so. These values must absolutely be still within the freeze limits. The version pre-pgupgrade should have replaced the xmax with InvalidTransactionId as soon as the tuple was freezable. In fact, I checked 9.2's code and heap_tuple_needs_freeze tests "if xmax not a multi and < cutoff_xid"; and in that case heap_freeze_tuple does change the value to 0. I hope I'm not misunderstanding what you mean. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I pushed my proposed fix all the way back to 9.3. As I said earlier, I couldn't detect a bug of the sort you describe: a) as I argued in my previous email, no tuple earlier than the relfreezexid should survive, because 9.2 did remove those when freezing the table. b) One thing we could do but do not, is detect the case where there is an Xmax that is newer than the cutoff_xid but comes from before the pg_upgrade. We don't freeze it in that case. But we do return it as totally_frozen=false, so it'll be frozen later anyway -- nothing bad happens. I don't think this is worth tinkering with. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-03-01 18:32:08 -0300, Alvaro Herrera wrote: > I pushed my proposed fix all the way back to 9.3. > > As I said earlier, I couldn't detect a bug of the sort you describe: > > a) as I argued in my previous email, no tuple earlier than the > relfreezexid should survive, because 9.2 did remove those when > freezing the table. You're right, I think I was thinking of the < 9.3.x case where multixacts weren't removed at all. Greetings, Andres Freund