Обсуждение: postgresql 9.6 - cannot freeze committed xmax

Поиск
Список
Период
Сортировка

postgresql 9.6 - cannot freeze committed xmax

От
Alexandre Garcia
Дата:
Hello,

Recently our team performed an upgrade on one of our old postgres 9.2 databases. "for science" we tried a direct upgrade from 9.2 to 9.6  on our staging environment . The initial plan was to do 9.2 -> 9.4 and 9.4 -> 9.6.

The upgrade turned out successful on staging and we decided to go with it on prod as well.
Prod starting throwing the following errors during autovacuum -> 'cannot freeze committed xmax <xid>' on 2 different tables. Running vacuum manually revealed more tables affected by the same error.

The staging database did not present any error but the process to sync prod to stage includes a sanitize script that removes sensitive information and it somehow seems to fix the issue on stage (we've done a sync from prod to stage after the upgrade)

I've been doing a lot of search about this and even tried to go through the code that throws that specific error. I tried to temporarily increase the `vacuum_freeze_min_age` setting and run a vacuum, trying to bypass the freezing process. But this was unsuccessful (still throwing the error on some tables, and not anymore on some others)

There's still a lot of unknown on my side about
  - what does this error exactly implies? -> are we trying to freeze committed transactions?
  -Why would that be an issue?
 - How can I fix this? If it's possible


I would welcome any information that could put me in the right direction.

Cheers,
Alexandre

Re: postgresql 9.6 - cannot freeze committed xmax

От
Alvaro Herrera
Дата:
Alexandre Garcia wrote:

> Recently our team performed an upgrade on one of our old postgres 9.2
> databases. "for science" we tried a direct upgrade from 9.2 to 9.6  on our
> staging environment . The initial plan was to do 9.2 -> 9.4 and 9.4 -> 9.6.
> 
> The upgrade turned out successful on staging and we decided to go with it
> on prod as well.
> Prod starting throwing the following errors during autovacuum -> 'cannot
> freeze committed xmax <xid>' on 2 different tables. Running vacuum manually
> revealed more tables affected by the same error.
> 
> The staging database did not present any error but the process to sync prod
> to stage includes a sanitize script that removes sensitive information and
> it somehow seems to fix the issue on stage (we've done a sync from prod to
> stage after the upgrade)
>
> I've been doing a lot of search about this and even tried to go through the
> code that throws that specific error.

This particular error condition is a sanity check that was only
introduced in 9.6.7, so you would not find too many reports of that
(this exact error message wording doesn't exist prior to that).  It is
possible that we missed some corner case when writing that check.
Upgrades from 9.2 are particularly unusual since the xmax header was
reused in the 9.3 era to mean something completely different under some
circumstances.  I'm not in a position to do deeper debugging for you at
this time, though.

The commit in question is
https://git.postgresql.org/pg/commitdiff/986a9153b9708071adf6ce2c9131266f3431f4ec

Wild guess: maybe we should be checking HEAP_LOCKED_UPGRADED before
bailing out.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: postgresql 9.6 - cannot freeze committed xmax

От
Andres Freund
Дата:
On 2018-02-26 20:09:31 -0300, Alvaro Herrera wrote:
> Alexandre Garcia wrote:
> 
> > Recently our team performed an upgrade on one of our old postgres 9.2
> > databases. "for science" we tried a direct upgrade from 9.2 to 9.6  on our
> > staging environment . The initial plan was to do 9.2 -> 9.4 and 9.4 -> 9.6.
> > 
> > The upgrade turned out successful on staging and we decided to go with it
> > on prod as well.
> > Prod starting throwing the following errors during autovacuum -> 'cannot
> > freeze committed xmax <xid>' on 2 different tables. Running vacuum manually
> > revealed more tables affected by the same error.
> > 
> > The staging database did not present any error but the process to sync prod
> > to stage includes a sanitize script that removes sensitive information and
> > it somehow seems to fix the issue on stage (we've done a sync from prod to
> > stage after the upgrade)
> >
> > I've been doing a lot of search about this and even tried to go through the
> > code that throws that specific error.

Could you show pg_controldata from before/after pg_upgrade, and the
output of
SELECT oid::regclass, relfrozenxid, age(relfrozenxid), txid_current()
FROM pg_class
WHERE oid = 'problematic_table'::regclass;
from both before/after?


> This particular error condition is a sanity check that was only
> introduced in 9.6.7, so you would not find too many reports of that
> (this exact error message wording doesn't exist prior to that).  It is
> possible that we missed some corner case when writing that check.
> Upgrades from 9.2 are particularly unusual since the xmax header was
> reused in the 9.3 era to mean something completely different under some
> circumstances.  I'm not in a position to do deeper debugging for you at
> this time, though.
> 
> The commit in question is
> https://git.postgresql.org/pg/commitdiff/986a9153b9708071adf6ce2c9131266f3431f4ec
> 
> Wild guess: maybe we should be checking HEAP_LOCKED_UPGRADED before
> bailing out.

Doesn't !(tuple->t_infomask & HEAP_XMAX_LOCK_ONLY) already guard against
that?

Greetings,

Andres Freund


Re: postgresql 9.6 - cannot freeze committed xmax

От
Alexandre Garcia
Дата:
Thanks for your answers, took me a little while to restore a backup from before the upgrade on an empty 9.2 server but here we are => https://gist.github.com/kuuji/cc73c3c193a0d58aa0e651fbd7bfec58





On Mon, Feb 26, 2018 at 6:21 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-02-26 20:09:31 -0300, Alvaro Herrera wrote:
> Alexandre Garcia wrote:
>
> > Recently our team performed an upgrade on one of our old postgres 9.2
> > databases. "for science" we tried a direct upgrade from 9.2 to 9.6  on our
> > staging environment . The initial plan was to do 9.2 -> 9.4 and 9.4 -> 9.6.
> >
> > The upgrade turned out successful on staging and we decided to go with it
> > on prod as well.
> > Prod starting throwing the following errors during autovacuum -> 'cannot
> > freeze committed xmax <xid>' on 2 different tables. Running vacuum manually
> > revealed more tables affected by the same error.
> >
> > The staging database did not present any error but the process to sync prod
> > to stage includes a sanitize script that removes sensitive information and
> > it somehow seems to fix the issue on stage (we've done a sync from prod to
> > stage after the upgrade)
> >
> > I've been doing a lot of search about this and even tried to go through the
> > code that throws that specific error.

Could you show pg_controldata from before/after pg_upgrade, and the
output of
SELECT oid::regclass, relfrozenxid, age(relfrozenxid), txid_current()
FROM pg_class
WHERE oid = 'problematic_table'::regclass;
from both before/after?


> This particular error condition is a sanity check that was only
> introduced in 9.6.7, so you would not find too many reports of that
> (this exact error message wording doesn't exist prior to that).  It is
> possible that we missed some corner case when writing that check.
> Upgrades from 9.2 are particularly unusual since the xmax header was
> reused in the 9.3 era to mean something completely different under some
> circumstances.  I'm not in a position to do deeper debugging for you at
> this time, though.
>
> The commit in question is
> https://git.postgresql.org/pg/commitdiff/986a9153b9708071adf6ce2c9131266f3431f4ec
>
> Wild guess: maybe we should be checking HEAP_LOCKED_UPGRADED before
> bailing out.

Doesn't !(tuple->t_infomask & HEAP_XMAX_LOCK_ONLY) already guard against
that?

Greetings,

Andres Freund

Re: postgresql 9.6 - cannot freeze committed xmax

От
Andres Freund
Дата:
Hi,

On 2018-02-28 21:12:58 +0000, Alexandre Garcia wrote:
> Thanks for your answers, took me a little while to restore a backup from
> before the upgrade on an empty 9.2 server but here we are =>
> https://gist.github.com/kuuji/cc73c3c193a0d58aa0e651fbd7bfec58

Thanks.

> > > > Prod starting throwing the following errors during autovacuum ->
> > 'cannot
> > > > freeze committed xmax <xid>' on 2 different tables. Running vacuum
> > manually
> > > > revealed more tables affected by the same error.

What's an example <xid> in that message that triggers for
problematic_table?  I need that to properly interpret the data you
provided.

Thanks,

Andres


Re: postgresql 9.6 - cannot freeze committed xmax

От
Alexandre Garcia
Дата:
Oh yes sorry => `ERROR:  cannot freeze committed xmax 43076385` on that same table

On Wed, Feb 28, 2018 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-02-28 21:12:58 +0000, Alexandre Garcia wrote:
> Thanks for your answers, took me a little while to restore a backup from
> before the upgrade on an empty 9.2 server but here we are =>
> https://gist.github.com/kuuji/cc73c3c193a0d58aa0e651fbd7bfec58

Thanks.

> > > > Prod starting throwing the following errors during autovacuum ->
> > 'cannot
> > > > freeze committed xmax <xid>' on 2 different tables. Running vacuum
> > manually
> > > > revealed more tables affected by the same error.

What's an example <xid> in that message that triggers for
problematic_table?  I need that to properly interpret the data you
provided.

Thanks,

Andres

Re: postgresql 9.6 - cannot freeze committed xmax

От
Andres Freund
Дата:
Hi Alexandre, Alvaro,

Please not that we try not to top-post on this list.

Alvaro,

On 2018-02-28 21:17:08 +0000, Alexandre Garcia wrote:
> Oh yes sorry => `ERROR:  cannot freeze committed xmax 43076385` on that
> same table

In light of
Latest checkpoint's NextXID:          0/128653693
I find this confusing.

Alexandre, did you ever use pg_resetxlog on that cluster?

Greetings,

Andres Freund


Re: postgresql 9.6 - cannot freeze committed xmax

От
Alexandre Garcia
Дата:
Sorry about the top-post.

Alexandre, did you ever use pg_resetxlog on that cluster?

Greetings,

Andres Freund

This database used to be managed by someone else and we started managing it recently. Our team didn't run pg_resetxlog for sure.

I just chatted with the previous team and it looks like it was never ran either. This is a fairly old database and it's not its first upgrade.

Re: postgresql 9.6 - cannot freeze committed xmax

От
Alexandre Garcia
Дата:
I just realized we're using 9.6.6 on our staging environment, which explains why we don't see that message altogether since it was introduced in 9.6.7.


Re: postgresql 9.6 - cannot freeze committed xmax

От
Alvaro Herrera
Дата:
Andres Freund wrote:

> Alvaro,
> 
> On 2018-02-28 21:17:08 +0000, Alexandre Garcia wrote:
> > Oh yes sorry => `ERROR:  cannot freeze committed xmax 43076385` on that
> > same table
> 
> In light of
> Latest checkpoint's NextXID:          0/128653693
> I find this confusing.

Ooh.

If you SELECT FOR UPDATE a tuple in 9.2, bit 0x0040 gets set in
infomask, and nothing else.  If you pg_upgrade and later try to freeze
such a tuple, it will fail with the error reported.

The correct test to use is HEAP_XMAX_IS_LOCKED_ONLY, which also tests
for the above condition.

I will verify this theory and push a patch shortly, if it proves
correct.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: postgresql 9.6 - cannot freeze committed xmax

От
Alexandre Garcia
Дата:


Ooh.

If you SELECT FOR UPDATE a tuple in 9.2, bit 0x0040 gets set in
infomask, and nothing else.  If you pg_upgrade and later try to freeze
such a tuple, it will fail with the error reported.

The correct test to use is HEAP_XMAX_IS_LOCKED_ONLY, which also tests
for the above condition.

I will verify this theory and push a patch shortly, if it proves
correct.


Oh good news  :). I have my old 9.2 around if you need me to do more testing on it.

Re: postgresql 9.6 - cannot freeze committed xmax

От
Alvaro Herrera
Дата:
Alexandre Garcia wrote:
> >
> > Ooh.
> >
> > If you SELECT FOR UPDATE a tuple in 9.2, bit 0x0040 gets set in
> > infomask, and nothing else.  If you pg_upgrade and later try to freeze
> > such a tuple, it will fail with the error reported.
> >
> > The correct test to use is HEAP_XMAX_IS_LOCKED_ONLY, which also tests
> > for the above condition.
> >
> > I will verify this theory and push a patch shortly, if it proves
> > correct.
>
> Oh good news  :). I have my old 9.2 around if you need me to do more
> testing on it.

Not necessary -- I reproduced the problem (quite easily -- just SELECT
FOR UPDATE a tuple in 9.2, then pg_upgrade, then VACUUM FREEZE the table
in the upgraded server) and confirm that it doesn't happen in 9.6.6,
happens in 9.6.8 (didn't try 9.6.7 but code is the same as 9.6.8), and
is fixed with the attached patch.

Will push soon ... probably tomorrow as I have to leave the building
now.

Cheers

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: postgresql 9.6 - cannot freeze committed xmax

От
Andres Freund
Дата:
Hi,

On 2018-02-28 20:30:19 -0300, Alvaro Herrera wrote:
> From 4ecaa6cdc041d9f44d54f9d56588ea73c1c6f3c7 Mon Sep 17 00:00:00 2001
> From: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Date: Wed, 28 Feb 2018 20:16:35 -0300
> Subject: [PATCH] Fix freeze xmax bug
> 
> ---
>  src/backend/access/heap/heapam.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 8a846e7dba..dc762f913d 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -6799,7 +6799,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>               * independent of committedness, since a committed lock holder has
>               * released the lock).
>               */
> -            if (!(tuple->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> +            if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)
> &&

Good catch.

Wonder if this a big enough fix? Isn't the bug "one layer up"? We
shouldn't do the TransactionIdPrecedes() calls in the surrounding block,
in the first place, right? Right now we could very well hit both
ERROR: found xmax %u from before relfrozenxid %u
and
ERROR: cannot freeze committed xmax %u
and freeze_xmax, totally_frozen will be wrong.  I think that's largely
harmless, but we should fix it nonetheless, and inspect other relevant
sites.

Greetings,

Andres Freund


Re: postgresql 9.6 - cannot freeze committed xmax

От
Alvaro Herrera
Дата:
Andres Freund wrote:

> Wonder if this a big enough fix? Isn't the bug "one layer up"? We
> shouldn't do the TransactionIdPrecedes() calls in the surrounding block,
> in the first place, right? Right now we could very well hit both
> ERROR: found xmax %u from before relfrozenxid %u
> and
> ERROR: cannot freeze committed xmax %u
> and freeze_xmax, totally_frozen will be wrong.  I think that's largely
> harmless, but we should fix it nonetheless, and inspect other relevant
> sites.

Hmm ...

After reading the 9.2 code, I don't think so.  These values must
absolutely be still within the freeze limits.  The version pre-pgupgrade
should have replaced the xmax with InvalidTransactionId as soon as the
tuple was freezable.

In fact, I checked 9.2's code and heap_tuple_needs_freeze tests "if xmax
not a multi and < cutoff_xid"; and in that case heap_freeze_tuple does
change the value to 0.

I hope I'm not misunderstanding what you mean.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: postgresql 9.6 - cannot freeze committed xmax

От
Alvaro Herrera
Дата:
I pushed my proposed fix all the way back to 9.3.

As I said earlier, I couldn't detect a bug of the sort you describe:

a) as I argued in my previous email, no tuple earlier than the
   relfreezexid should survive, because 9.2 did remove those when
   freezing the table.

b) One thing we could do but do not, is detect the case where there is
   an Xmax that is newer than the cutoff_xid but comes from before the
   pg_upgrade.  We don't freeze it in that case.  But we do return it as
   totally_frozen=false, so it'll be frozen later anyway -- nothing bad
   happens.  I don't think this is worth tinkering with.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: postgresql 9.6 - cannot freeze committed xmax

От
Andres Freund
Дата:
On 2018-03-01 18:32:08 -0300, Alvaro Herrera wrote:
> I pushed my proposed fix all the way back to 9.3.
> 
> As I said earlier, I couldn't detect a bug of the sort you describe:
> 
> a) as I argued in my previous email, no tuple earlier than the
>    relfreezexid should survive, because 9.2 did remove those when
>    freezing the table.

You're right, I think I was thinking of the < 9.3.x case where
multixacts weren't removed at all.

Greetings,

Andres Freund