Обсуждение: VACUUM unable to accomplish because of a non-existent MultiXactId

Поиск
Список
Период
Сортировка

VACUUM unable to accomplish because of a non-existent MultiXactId

От
Kouber Saparev
Дата:
Hello,

Recently I spotted an auto vacuum that is constantly being run over a pg_toast table in the database. Interestingly the
underlyingtable did not have that amount of writes that would trigger the auto vacuum every minute. 

When I tried to run the VACUUM manually it died with a non-existent MultiXactId error:

db=# vacuum analyze verbose pg_toast.pg_toast_376621;
INFO:  vacuuming "pg_toast.pg_toast_376621"
ERROR:  MultiXactId 2915905228 does no longer exist -- apparent wraparound

db=# select txid_current();
 txid_current
--------------
   2583853583
(1 row)

db=# select datfrozenxid from pg_database where datname=‘db';
datfrozenxid
--------------
   2161848861
(1 row)


We did a pg_upgrade about a month ago to upgrade from 9.2 to 9.4, so I guess there might be some relation, as I do not
rememberhaving that issue before the upgrade. 

The question is, how can I clean the table in question and finally be able to run a vacuum over it?

Regards,
—
Kouber Saparev

Re: VACUUM unable to accomplish because of a non-existent MultiXactId

От
Alvaro Herrera
Дата:
Kouber Saparev wrote:
> Hello,
>
> Recently I spotted an auto vacuum that is constantly being run over a pg_toast table in the database. Interestingly
theunderlying table did not have that amount of writes that would trigger the auto vacuum every minute. 
>
> When I tried to run the VACUUM manually it died with a non-existent MultiXactId error:
>
> db=# vacuum analyze verbose pg_toast.pg_toast_376621;
> INFO:  vacuuming "pg_toast.pg_toast_376621"
> ERROR:  MultiXactId 2915905228 does no longer exist -- apparent wraparound
>
> db=# select txid_current();
>  txid_current
> --------------
>    2583853583
> (1 row)
>
> db=# select datfrozenxid from pg_database where datname=‘db';
> datfrozenxid
> --------------
>    2161848861
> (1 row)
>
>
> We did a pg_upgrade about a month ago to upgrade from 9.2 to 9.4, so I guess there might be some relation, as I do
notremember having that issue before the upgrade. 
>
> The question is, how can I clean the table in question and finally be able to run a vacuum over it?

Hmm, I don't think there can be multixacts in toast tables at all,
normally.  SELECT FOR UPDATE fails on a toast table, so I don't see a
mechanism for this to happen at all.  Maybe some odd corner case in 9.2
that left things in a strange state.

The state of your data is probably caused by some weird corner case of
the upgrade.  Can you see in the log files that the toast table has been
failing vacuum since the upgrade, or is it more recent than that?  (In
other words, is there any working vacuum after the upgrade?)

It would be useful to debug this that you attached gdb to a backend, set
breakpoint on errfinish, then run vacuum on that table.  Then you can
extract the page number from the backtrace.  With the page number we can
try pageinspect and heap_page_items until we find the culprit and
perhaps identify how it got in that state.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: VACUUM unable to accomplish because of a non-existent MultiXactId

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Kouber Saparev wrote:
>> Recently I spotted an auto vacuum that is constantly being run over a pg_toast table in the database. Interestingly
theunderlying table did not have that amount of writes that would trigger the auto vacuum every minute. 
>>
>> When I tried to run the VACUUM manually it died with a non-existent MultiXactId error:
>>
>> db=# vacuum analyze verbose pg_toast.pg_toast_376621;
>> INFO:  vacuuming "pg_toast.pg_toast_376621"
>> ERROR:  MultiXactId 2915905228 does no longer exist -- apparent wraparound

> Hmm, I don't think there can be multixacts in toast tables at all,
> normally.

Couldn't creation and deletion of a toasted object within the same
transaction do it?

            regards, tom lane


Re: VACUUM unable to accomplish because of a non-existent MultiXactId

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Kouber Saparev wrote:
> >> Recently I spotted an auto vacuum that is constantly being run over a pg_toast table in the database.
Interestinglythe underlying table did not have that amount of writes that would trigger the auto vacuum every minute. 
> >>
> >> When I tried to run the VACUUM manually it died with a non-existent MultiXactId error:
> >>
> >> db=# vacuum analyze verbose pg_toast.pg_toast_376621;
> >> INFO:  vacuuming "pg_toast.pg_toast_376621"
> >> ERROR:  MultiXactId 2915905228 does no longer exist -- apparent wraparound
>
> > Hmm, I don't think there can be multixacts in toast tables at all,
> > normally.
>
> Couldn't creation and deletion of a toasted object within the same
> transaction do it?

I tried and couldn't find a way to cause one to appear.  When a row is
created, it has xmax=0 so when it's deleted the xmax can just be set to
the deleting xact.  And we don't ever update pg_toast rows, AFAIK.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: VACUUM unable to accomplish because of a non-existent MultiXactId

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Couldn't creation and deletion of a toasted object within the same
>> transaction do it?

> I tried and couldn't find a way to cause one to appear.  When a row is
> created, it has xmax=0 so when it's deleted the xmax can just be set to
> the deleting xact.  And we don't ever update pg_toast rows, AFAIK.

Ah, nevermind, brain fade ... I was momentarily confusing this with
combo CIDs.  Right, there are no updates or row locks taken on toast
table rows (unless someone were to do one manually?)

            regards, tom lane


Re: VACUUM unable to accomplish because of a non-existent MultiXactId

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> Couldn't creation and deletion of a toasted object within the same
> >> transaction do it?
>
> > I tried and couldn't find a way to cause one to appear.  When a row is
> > created, it has xmax=0 so when it's deleted the xmax can just be set to
> > the deleting xact.  And we don't ever update pg_toast rows, AFAIK.
>
> Ah, nevermind, brain fade ... I was momentarily confusing this with
> combo CIDs.  Right, there are no updates or row locks taken on toast
> table rows (unless someone were to do one manually?)

Both UPDATE and SELECT FOR UPDATE fail right away:

alvherre=# update pg_toast.pg_toast_16398 set chunk_id = 7 where chunk_id = 5;
ERROR:  cannot change TOAST relation "pg_toast_16398"

alvherre=# select * from pg_toast.pg_toast_16398 for update;
ERROR:  cannot lock rows in TOAST relation "pg_toast_16398"

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: VACUUM unable to accomplish because of a non-existent MultiXactId

От
Kouber Saparev
Дата:

The state of your data is probably caused by some weird corner case of
the upgrade.  Can you see in the log files that the toast table has been
failing vacuum since the upgrade, or is it more recent than that?  (In
other words, is there any working vacuum after the upgrade?)

We upgraded to 9.4.5 on 19 October, and there was a successful automatic vacuum over pg_toast_376621 just 3 days later - on 22 October:

Oct 22 08:16:49 db-master postgres[10589]: [3-1] []: LOG:  automatic vacuum of table “db.pg_toast.pg_toast_376621": index scans: 1
Oct 22 08:16:49 db-master postgres[10589]: [3-2]      pages: 0 removed, 784361 remain
Oct 22 08:16:49 db-master postgres[10589]: [3-3]      tuples: 110 removed, 3768496 remain, 0 are dead but not yet removable
Oct 22 08:16:49 db-master postgres[10589]: [3-4]      buffer usage: 37193 hits, 44891 misses, 32311 dirtied
Oct 22 08:16:49 db-master postgres[10589]: [3-5]      avg read rate: 0.954 MB/s, avg write rate: 0.686 MB/s
Oct 22 08:16:49 db-master postgres[10589]: [3-6]      system usage: CPU 1.10s/1.67u sec elapsed 367.73 sec

The next automatic vacuum came 8 days later - on 30 October and failed and it is failing ever since:

Oct 30 14:22:01 db-master postgres[16160]: [3-1] []: ERROR:  MultiXactId 2915905228 does no longer exist -- apparent wraparound
Oct 30 14:22:01 db-master postgres[16160]: [3-2] []: CONTEXT:  automatic vacuum of table “db.pg_toast.pg_toast_376621”

So I guess something happened between 22 and 30 October and there is no relation to the pg_upgrade we did on 19 October.

It would be useful to debug this that you attached gdb to a backend, set
breakpoint on errfinish, then run vacuum on that table.  Then you can
extract the page number from the backtrace.  With the page number we can
try pageinspect and heap_page_items until we find the culprit and
perhaps identify how it got in that state.

I will try to obtain the page number, and will then send you the results, thank you.

Can we somehow do it on one of our replicas (after detaching it), i.e. is the corrupted record propagated through the replication channel, and in the meantime fix the table on the master?

Thanks!

Kouber Saparev

Re: VACUUM unable to accomplish because of a non-existent MultiXactId

От
Alvaro Herrera
Дата:
Kouber Saparev wrote:
>
> > The state of your data is probably caused by some weird corner case of
> > the upgrade.  Can you see in the log files that the toast table has been
> > failing vacuum since the upgrade, or is it more recent than that?  (In
> > other words, is there any working vacuum after the upgrade?)
>
> We upgraded to 9.4.5 on 19 October, and there was a successful automatic vacuum over pg_toast_376621 just 3 days
later- on 22 October: 
>
> Oct 22 08:16:49 db-master postgres[10589]: [3-1] []: LOG:  automatic vacuum of table “db.pg_toast.pg_toast_376621":
indexscans: 1 
> Oct 22 08:16:49 db-master postgres[10589]: [3-2]      pages: 0 removed, 784361 remain
> Oct 22 08:16:49 db-master postgres[10589]: [3-3]      tuples: 110 removed, 3768496 remain, 0 are dead but not yet
removable
> Oct 22 08:16:49 db-master postgres[10589]: [3-4]      buffer usage: 37193 hits, 44891 misses, 32311 dirtied
> Oct 22 08:16:49 db-master postgres[10589]: [3-5]      avg read rate: 0.954 MB/s, avg write rate: 0.686 MB/s
> Oct 22 08:16:49 db-master postgres[10589]: [3-6]      system usage: CPU 1.10s/1.67u sec elapsed 367.73 sec
>
> The next automatic vacuum came 8 days later - on 30 October and failed and it is failing ever since:
>
> Oct 30 14:22:01 db-master postgres[16160]: [3-1] []: ERROR:  MultiXactId 2915905228 does no longer exist -- apparent
wraparound
> Oct 30 14:22:01 db-master postgres[16160]: [3-2] []: CONTEXT:  automatic vacuum of table
“db.pg_toast.pg_toast_376621”
>
> So I guess something happened between 22 and 30 October and there is no relation to the pg_upgrade we did on 19
October.

OK, so it wasn't pg_upgrade.  Good to know.

> > It would be useful to debug this that you attached gdb to a backend, set
> > breakpoint on errfinish, then run vacuum on that table.  Then you can
> > extract the page number from the backtrace.  With the page number we can
> > try pageinspect and heap_page_items until we find the culprit and
> > perhaps identify how it got in that state.
>
> I will try to obtain the page number, and will then send you the results, thank you.
>
> Can we somehow do it on one of our replicas (after detaching it), i.e.
> is the corrupted record propagated through the replication channel,
> and in the meantime fix the table on the master?

I suppose the corrupted data should be in any replicas as well, but I
have no way to be sure.  Also, I don't know how to get the data to a
valid state, other than dropping the table and reloading it.  I suppose
you can take a complete pg_dump of the data in that table?

If you want to reset it to a known-good state, you could just save the
pg_toast file aside (and its corresponding main table file, just in
case).  Doing forensics in a copy is better practice anyway, you know.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services