Обсуждение: Re: MultiXactId Error in Autovacuum

Поиск
Список
Период
Сортировка

Re: MultiXactId Error in Autovacuum

От
Karl Hafner
Дата:
Thank you for your suggestion!

Since the autovacuum process is failing each time because of: "ERROR:  MultiXactId 2683601542 does no longer exist -- apparent wraparound", I have expected the manual vacuum also to fail because of the same error.
I've tried it anyway. Well ... manual vacuum also fails unfortunately.
I did not run vacuum with --full, because I want to avoid a locked table.

BR, 
Karl



On Sat, Nov 16, 2013 at 9:25 PM, Strahinja Kustudić <strahinjak@nordeus.com> wrote:
If autovacuum doesn't want to finish the vacuuming of that table, why not just kill the autovacuum process on that table (you could use pg_cancel_backend())and run vacuum manually on that table?


Strahinja Kustudić
| Lead System Engineer | Nordeus


On Fri, Nov 15, 2013 at 10:33 AM, Karl Hafner <karl@scoreloop.com> wrote:
Sorry, dear mailing list, somehow my email got sent before I finished it :-(

I will go on and repeat. Please bear with me!


I am currently worried about an error that shows up in our log files since a few weeks: 

2013-11-15 07:13:13 UTC [22668]: [2-1] ERROR:  MultiXactId 2683601542 does no longer exist -- apparent wraparound
2013-11-15 07:13:13 UTC [22668]: [3-1] CONTEXT:  automatic vacuum of table "scoreloop.public.gamer_device_sightings"

It is always the same MultiXactId.
It seems that "autovacuum" is active on that table, but it seems to have restarted today in the morning:  

select query_start, query, state, pid from pg_stat_activity where query LIKE 'autovacuum%' ;

# select query_start, query, state, pid from pg_stat_activity where query LIKE 'autovacuum%' ;
         
query_start  |                                      query                                       | state  | pid  
-------------------------------+----------------------------------------------------------------------------------+--------+------
 2013-11-15 07:13:13.281202+00 | autovacuum: VACUUM ANALYZE public.gamer_device_sightings (to prevent wraparound) | active | 4360


There are a few locks on tat table: 

# select c.oid,c.relname,l.pid,l.mode,l.granted from pg_class c join pg_locks l on c.oid=l.relation where l.pid='4360' order by l.pid;
  oid  |                 relname                  | pid  |           mode           | granted 
-------+------------------------------------------+------+--------------------------+---------
 19862 | index_gamer_device_sightings_on_gamer_id | 4360 | RowExclusiveLock         | t
 19719 | gamer_device_sightings_pkey                      | 4360 | RowExclusiveLock         | t
 19309 | gamer_device_sightings                               | 4360 | ShareUpdateExclusiveLock | t



The error I described seems to terminate the auto vacuum process: It has never finished on that table. 
It is the only table without a finished vacuum. So I guess the above error is the reason for this.

Is there anything that I can do to get that fixed? Should I worry about the error? (I would not if the vacuum would run through).

Any tips, any help would be much appreciated!

Thank you very much! And sorry once more for my posting mistake
Karl 














On Fri, Nov 15, 2013 at 10:16 AM, Karl Hafner <karl@scoreloop.com> wrote:
Dear mailing-list, 

I am currently worried about an error that shows up in our log files since a few weeks: 

2013-11-15 07:13:13 UTC [22668]: [2-1] ERROR:  MultiXactId 2683601542 does no longer exist -- apparent wraparound
2013-11-15 07:13:13 UTC [22668]: [3-1] CONTEXT:  automatic vacuum of table "scoreloop.public.gamer_device_sightings"

It seems that "autovacuum" is active on that table, but it seems to have restarted today in the morning:  

select query_start, query, state, pid from pg_stat_activity where query LIKE 'autovacuum%' ;

XXX=# select query_start, query, state, pid from pg_stat_activity where query LIKE 'autovacuum%' ;
         
query_start          |                                      query                                       | state  | pid  
-------------------------------+----------------------------------------------------------------------------------+--------+------
 2013-11-15 07:13:13.281202+00 | autovacuum: VACUUM ANALYZE public.gamer_device_sightings (to prevent wraparound) | active | 4360



Re: MultiXactId Error in Autovacuum

От
Albe Laurenz
Дата:
Karl Hafner wrote:
> Since the autovacuum process is failing each time because of: "ERROR:  MultiXactId 2683601542 does no
> longer exist -- apparent wraparound", I have expected the manual vacuum also to fail because of the
> same error.
> I've tried it anyway. Well ... manual vacuum also fails unfortunately.

Do you know if you experienced transaction ID wraparound as the
message suggests?

Are there any other questionable messages in the log?

What do you get for the following queries:

SELECT txid_current();

SELECT datname, datfrozenxid FROM pg_database;

SELECT current_setting('autovacuum_freeze_max_age'),
       current_setting('vacuum_freeze_min_age'),
       current_setting('vacuum_freeze_table_age');

Yours,
Laurenz Albe

Re: MultiXactId Error in Autovacuum

От
Karl Hafner
Дата:
Hi Laurenz, 

thank you for looking into my issue. 


Do you know if you experienced transaction ID wraparound as the
message suggests?

I am not aware of a wraparound in the past. I am quite sure no. 


Are there any other questionable messages in the log?

The log shows nothing special except this error that is always associated with the same MultiXactId 2683601542.
 

What do you get for the following queries:

SELECT txid_current();

txid_current 
--------------
1673482898
 

SELECT datname, datfrozenxid FROM pg_database;


SELECT datname, datfrozenxid FROM pg_database;

datname  | datfrozenxid 
-----------+--------------
postgres  |   1497699916
template0 |   1497699916
MYDB      |   1335927866
template1 |   1497699916


 

SELECT current_setting('autovacuum_freeze_max_age'),
       current_setting('vacuum_freeze_min_age'),
       current_setting('vacuum_freeze_table_age');


current_setting | current_setting | current_setting 
-----------------+-----------------+-----------------
200000000       | 50000000        | 150000000


Thank you!
Best regards, 
Karl

Re: MultiXactId Error in Autovacuum

От
Albe Laurenz
Дата:
Karl Hafner wrote:
>> Do you know if you experienced transaction ID wraparound as the
>> message suggests?
> 
> I am not aware of a wraparound in the past. I am quite sure no.
> 
>> Are there any other questionable messages in the log?
> 
> The log shows nothing special except this error that is always associated with the same MultiXactId
> 2683601542.

Well, probably you are right, because PostgreSQL would shutdown
before a wraparound happens, and you would have noticed that.

I don't know how it happened, but your database has suffered
corruption.  Any chance you can restore from a backup?
Can you still "SELECT *" from the table that has the problem?

Yours,
Laurenz Albe

Re: MultiXactId Error in Autovacuum

От
Karl Hafner
Дата:

Well, probably you are right, because PostgreSQL would shutdown
before a wraparound happens, and you would have noticed that.
 
I don't know how it happened, but your database has suffered
corruption.  Any chance you can restore from a backup?
Can you still "SELECT *" from the table that has the problem?
 
I cannot run a full SELECT * on that table. It stops after a few minutes with the same error.
I am currently testing different things on a DB created from a backup like exporting per month/per day … 
but I am running into this error again and again. 
Running a VACUUM --FULL gives me:  WARNING:  concurrent delete in progress within table "the_bad_table" … then it also breaks.


BR, 
Karl

Re: MultiXactId Error in Autovacuum

От
Karl Hafner
Дата:
Hi all, 

sorry for not getting back earlier. I wanted to let you know that we found the reason for our error (and a possible solution for our problem) in the meantime: 

The starting point of our worries was a log file error message: "ERROR: MultiXactId 2683601542 does no longer exist -- apparent wraparound". 
Autovacuum never finished on one of our tables because of this.

So we started to investigate and found out:  

We had a few rows in our table that suffered from some inconsistencies in its rowlock status, probably caused by a disc crash a few month ago.
Statements like "SELECT * FROM mytable" or autovacuum run into an error and exit when reading one of these inconstistent rows.

We digged deeper: Those rows were marked already DEAD but still pointed to MultiXactIds that were not listed in
pg_multixact anymore (what probably messed up the lock information)

The interesting point is: Different statements read those information in a different order.

  (a) Statements that select a single row like "SELECT * FROM mytable WHERE id = 'xxxx'" would recognize the status DEAD 
       BEFORE checking the LOCK status. Those statements just return 0 rows.

  (b) Statements that read every row like "SELECT id FROM mytable" process the LOCK information BEFORE checking if a row is already marked DEAD.
      If there is something wrong (like in our case due to inconsistent lock information) the statement quits with an ERROR. 
      In our case autovacuum always died because of this.

The trick was patching postgres for case (b) in a way that it would just print a WARNING and the row ID for these defect rows and continue.

Finally we knew the IDs of the defect rows (we had to install the contrib module pgrowlocks to find the locked rows) using the query
"SELECT id FROM mytable AS m, pgrowlocks('mytable') AS p WHERE p.locked_row = m.ctid".
Just deleting these rows using "DELETE FROM mytable WHERE id = 'xxxx'" did not work because of case (a). It returned 0 rows.
So we copied all but the defect rows into a new table, dropped the old table, and renamed the new one - quite an effort for finally deleting two rows :-)

We are wondering it there is a deeper reason behind treating case (a) and (b) in a different way or if it would be better to get rid of these inconsistencies in future postgres versions.

Thanks for all your input and help.
BR, 
Karl






On Wed, Nov 20, 2013 at 3:20 PM, Karl Hafner <karl@scoreloop.com> wrote:

Well, probably you are right, because PostgreSQL would shutdown
before a wraparound happens, and you would have noticed that.
 
I don't know how it happened, but your database has suffered
corruption.  Any chance you can restore from a backup?
Can you still "SELECT *" from the table that has the problem?
 
I cannot run a full SELECT * on that table. It stops after a few minutes with the same error.
I am currently testing different things on a DB created from a backup like exporting per month/per day … 
but I am running into this error again and again. 
Running a VACUUM --FULL gives me:  WARNING:  concurrent delete in progress within table "the_bad_table" … then it also breaks.


BR, 
Karl


Re: MultiXactId Error in Autovacuum

От
Simon Riggs
Дата:
On 3 December 2013 08:55, Karl Hafner <karl@scoreloop.com> wrote:

> We are wondering it there is a deeper reason behind treating case (a) and
> (b) in a different way or if it would be better to get rid of these
> inconsistencies in future postgres versions.

I believe your problems have been fixed in the latest point release, 9.3.2

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services