Обсуждение: Template zero xid issue
Our template 0 database has an XID issue:
postgres=# select datname, age(datfrozenxid) from pg_database;
datname | age
--------------+------------
postgres | 1073965753
.
.
.
template1 | 1073892377
template0 | 2146706522
(7 rows)
When we try a vacuum full, it skips template 0:
[root@p02c06d130 mxl]# /mxl/var/pgsql/bin/vacuumdb -a -z -U postgres
vacuumdb: vacuuming database "postgres"
VACUUM
.
.
.
VACUUM
vacuumdb: vacuuming database "template1"
VACUUM
Suggestions for how to fix this issue?
Thanks,
Keaton
"Keaton Adams" <kadams@mxlogic.com> writes: > Our template 0 database has an XID issue: No, it doesn't. template0 never needs vacuuming. regards, tom lane
Our production DB shut down because of it: <2007-08-06 09:44:08 MDT>LOG: background writer process (PID 24439) exited with exit code 0 <2007-08-06 09:44:08 MDT>LOG: terminating any other active server processes <2007-08-06 09:44:08 MDT>LOG: all server processes terminated; reinitializing <2007-08-06 09:44:08 MDT>LOG: database system was shut down at 2007-08-06 09:44:08 MDT <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>LOG: checkpoint record is at 1A7/4C987A2C <2007-08-06 09:44:08 MDT>LOG: redo record is at 1A7/4C987A2C; undo record is at 0/0; shutdown TRUE <2007-08-06 09:44:08 MDT>LOG: next transaction ID: 2146747335; next OID: 677503791 <2007-08-06 09:44:08 MDT>LOG: next MultiXactId: 2937; next MultiXactOffset: 5901 <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>LOG: database system is ready <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>LOG: transaction ID wrap limit is 2147484146, limited by database "template0" <2007-08-06 09:44:08 MDT>WARNING: database "template0" must be vacuumed within 736811 transactions <2007-08-06 09:44:08 MDT>HINT: To avoid a database shutdown, execute a full-database VACUUM in "template0". <2007-08-06 09:44:08 MDT>LOG: shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>LOG: database system is shut down -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, August 06, 2007 9:51 AM To: Keaton Adams Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Template zero xid issue "Keaton Adams" <kadams@mxlogic.com> writes: > Our template 0 database has an XID issue: No, it doesn't. template0 never needs vacuuming. regards, tom lane
So it is odd that over 14,000 XIDs are being consumed on template zero in just over one minute? bash-3.00$ date; psql -dpostgres -frun.sql Mon Aug 6 10:18:12 MDT 2007 datname | age --------------+------------ postgres | 1074342864 . . . template1 | 1074269488 template0 | 1074078525 (7 rows) bash-3.00$ date; psql -dpostgres -frun.sql Mon Aug 6 10:19:27 MDT 2007 datname | age --------------+------------ postgres | 1074357079 . . . template1 | 1074283703 template0 | 1074092740 (7 rows) What would be using template0 at this rate? -Keaton -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Keaton Adams Sent: Monday, August 06, 2007 10:14 AM To: Tom Lane Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Template zero xid issue Our production DB shut down because of it: <2007-08-06 09:44:08 MDT>LOG: background writer process (PID 24439) exited with exit code 0 <2007-08-06 09:44:08 MDT>LOG: terminating any other active server processes <2007-08-06 09:44:08 MDT>LOG: all server processes terminated; reinitializing <2007-08-06 09:44:08 MDT>LOG: database system was shut down at 2007-08-06 09:44:08 MDT <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>LOG: checkpoint record is at 1A7/4C987A2C <2007-08-06 09:44:08 MDT>LOG: redo record is at 1A7/4C987A2C; undo record is at 0/0; shutdown TRUE <2007-08-06 09:44:08 MDT>LOG: next transaction ID: 2146747335; next OID: 677503791 <2007-08-06 09:44:08 MDT>LOG: next MultiXactId: 2937; next MultiXactOffset: 5901 <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>LOG: database system is ready <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>LOG: transaction ID wrap limit is 2147484146, limited by database "template0" <2007-08-06 09:44:08 MDT>WARNING: database "template0" must be vacuumed within 736811 transactions <2007-08-06 09:44:08 MDT>HINT: To avoid a database shutdown, execute a full-database VACUUM in "template0". <2007-08-06 09:44:08 MDT>LOG: shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>FATAL: the database system is shutting down <2007-08-06 09:44:08 MDT>LOG: database system is shut down -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, August 06, 2007 9:51 AM To: Keaton Adams Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Template zero xid issue "Keaton Adams" <kadams@mxlogic.com> writes: > Our template 0 database has an XID issue: No, it doesn't. template0 never needs vacuuming. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
"Keaton Adams" <kadams@mxlogic.com> writes: > Our production DB shut down because of it: [ squint... ] Which PG version is this exactly? Is template0 marked as datallowconn in pg_database, or not? regards, tom lane
Sorry, I should have mentioned that to begin with: psql 8.1.4 mxl=# select datname, datallowconn from pg_database; datname | datallowconn --------------+-------------- postgres | t . . . template1 | t template0 | f (7 rows) It's set to false. mxl=# \c template0; FATAL: database "template0" is not currently accepting connections Previous connection kept -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, August 06, 2007 10:30 AM To: Keaton Adams Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Template zero xid issue "Keaton Adams" <kadams@mxlogic.com> writes: > Our production DB shut down because of it: [ squint... ] Which PG version is this exactly? Is template0 marked as datallowconn in pg_database, or not? regards, tom lane
"Keaton Adams" <kadams@mxlogic.com> writes: > So it is odd that over 14,000 XIDs are being consumed on template zero > in just over one minute? You misunderstand what you are looking at --- that's the cluster-wide XID consumption rate, not any one database's. regards, tom lane
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Keaton Adams wrote: > Sorry, I should have mentioned that to begin with: psql 8.1.4 1. Upgrade to the actual stable release, which is 8.1.9 2. IIRC there is a bug in 8.1.4 that causes template0 to increment when it shouldn't. Update the template0 to allow connections, vacuum full, dissallow connections, and then upgrade. Sincerely, Joshua D. Drake > > > mxl=# select datname, datallowconn from pg_database; > datname | datallowconn > --------------+-------------- > postgres | t > . > . > . > template1 | t > template0 | f > (7 rows) > > It's set to false. > > mxl=# \c template0; > FATAL: database "template0" is not currently accepting connections > Previous connection kept > > > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Monday, August 06, 2007 10:30 AM > To: Keaton Adams > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Template zero xid issue > > "Keaton Adams" <kadams@mxlogic.com> writes: >> Our production DB shut down because of it: > > [ squint... ] Which PG version is this exactly? Is template0 marked as > datallowconn in pg_database, or not? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org/ > - -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFGt1HXATb/zqfZUUQRAuOHAJ9jSKZXU9CzLWcgy2DYSRXLMTsqTQCgnw4V jVsSNPjay0w6MHxGWsdu8no= =oJhU -----END PGP SIGNATURE-----
"Keaton Adams" <kadams@mxlogic.com> writes: > Sorry, I should have mentioned that to begin with: psql 8.1.4 Hmm, you know that's pretty old don't you? The latest release in that branch is 8.1.9 (and there'll be an 8.1.10 before long). > It's set to false. > mxl=3D# \c template0; > FATAL: database "template0" is not currently accepting connections > Previous connection kept That's what it should be, but nonconnectable databases are supposed to be ignored in figuring the XID wraparound limit, so I don't see how you got the warning mentioning template0. Do you have autovacuum enabled? I think there are some relevant post-8.1.4 changes in autovac. regards, tom lane
"Joshua D. Drake" <jd@commandprompt.com> writes: > 2. IIRC there is a bug in 8.1.4 that causes template0 to increment when > it shouldn't. Update the template0 to allow connections, vacuum full, > dissallow connections, and then upgrade. That should be "vacuum freeze" but otherwise this is the correct recovery process. However, before destroying the evidence I'd like to find out how he got into this state ... regards, tom lane
We needed the DB up right away; one of the operations guys connected through a standalone backend to vacuum template0 and now the DB is up again: "sudo -u postgres /mxl/var/pgsql/bin/postgres -D /mxl/var/pgsql/data -O -P template0 Once I got in, I just ran a VACUUM VERBOSE ANALYZE; to "reset" the XID counter." So just to be clear on this statement: "You misunderstand what you are looking at --- that's the cluster-wide XID consumption rate, not any one database's." In the docs it says that "the value stored in the datfrozenxid field is the freeze cutoff XID that was used by the last database-wide vacuum operation. All normal XIDs older than this cutoff XID are guaranteed to have been replaced by FrozenXID within that database." The query to check is: select datname, age(datfrozenxid) from pg_database; "The age column measures the number of transactions from the cutoff XID to the current transaction's XID. When the age approaches two billion, the database must be vacuumed again to avoid risk of wraparound failures." So, if XID is database instance (cluster-wide), that would explain why the actual increment of the age value is the same across all DB's, even though the actual XID numbers are different per db? bash-3.00$ date; psql -dpostgres -frun.sql Mon Aug 6 11:20:03 MDT 2007 datname | age --------------+------------ postgres | 1075005030 mxl | 1075004446 . . . template1 | 1074931654 template0 | 1074740691 (7 rows) bash-3.00$ date; psql -dpostgres -frun.sql Mon Aug 6 11:23:53 MDT 2007 datname | age --------------+------------ postgres | 1075046568 mxl | 1075045984 . . . template1 | 1074973192 template0 | 1074782229 (7 rows) Postgres increment: 41,538 Mxl increment: 41,538 Template 1 increment: 41,538 Template 0 increment: 41,538 But, the actual reset of the FrozenXID needs to take place at the individual DB level, and that's where the full-db vacuum comes into play, correct? And the bug in 8.1.4 is allowing the FrozenXID on template0 to increment when it shouldn't be doing so, and since the normal vacuum ignores a read-only db such as template0, that is why we had to force a vacuum after the DB shut down? Do I have this right? Thanks again, Keaton -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, August 06, 2007 11:02 AM To: Joshua D. Drake Cc: Keaton Adams; pgsql-general@postgresql.org Subject: Re: [GENERAL] Template zero xid issue "Joshua D. Drake" <jd@commandprompt.com> writes: > 2. IIRC there is a bug in 8.1.4 that causes template0 to increment when > it shouldn't. Update the template0 to allow connections, vacuum full, > dissallow connections, and then upgrade. That should be "vacuum freeze" but otherwise this is the correct recovery process. However, before destroying the evidence I'd like to find out how he got into this state ... regards, tom lane
"Keaton Adams" <kadams@mxlogic.com> writes: > Do I have this right? It's not clear. You didn't answer the question about whether you were running autovac, but even if you were I'm not sure why you got the complaint about template0. AFAICS the only way for template0 to be installed as oldest_datname is if vac_truncate_clog() is invoked in template0, allowing it to be the initial selection before the loop starts. Which could happen if autovac is on, since it will try to vacuum template0, and in 8.1.4 it neglects to use the FREEZE option. But even then, if we just finished a DB-wide vacuum on template0, it should not have had the oldest datfrozenxid; and certainly not one old enough to provoke complaints. Actually there's a whole lot that isn't clear about this. Your first report showed a warning --- not an error --- about template0, so it's not at all clear that the "database system is shutting down" messages had anything to do with that at all. And you later posted some queries that weren't drawing any such message. What was the exact sequence of events here? Alvaro: you have any thoughts about this? I'm suspicious that we ought to change 8.1's vac_truncate_clog() to not include the current database automatically, since it might be not-datallowconn; but I don't see how that potential bug could actually be exposed as a fault. regards, tom lane
Auto vacuum is not enabled: #----------------------------------------------------------------------- ---- # AUTOVACUUM PARAMETERS #----------------------------------------------------------------------- ---- #autovacuum = off # enable autovacuum subprocess? #autovacuum_naptime = 60 # time between autovacuum runs, in secs #autovacuum_vacuum_threshold = 1000 # min # of tuple updates before # vacuum #autovacuum_analyze_threshold = 500 # min # of tuple updates before # analyze #autovacuum_vacuum_scale_factor = 0.4 # fraction of rel size before # vacuum #autovacuum_analyze_scale_factor = 0.2 # fraction of rel size before # analyze #autovacuum_vacuum_cost_delay = -1 # default vacuum cost delay for # autovac, -1 means use # vacuum_cost_delay #autovacuum_vacuum_cost_limit = -1 # default vacuum cost limit for # autovac, -1 means use # vacuum_cost_limit Instead, the vacuumdb command is run on the DBs to perform a full DB vacuum every Saturday. The DB template0 is skipped by the vacuumdb process, according to the output: /xyz/var/pgsql/bin/vacuumdb -a -z -U postgres vacuumdb: vacuuming database "postgres" VACUUM vacuumdb: vacuuming database "mxl" VACUUM vacuumdb: vacuuming database "template1" VACUUM 1. Ops has a job that monitors datfrozenxid. An event was triggered this morning because the count on template0 was approaching the limit. I don't have the original output, but it looked something like this: mxl=# select datname, age(datfrozenxid) from pg_database; datname | age --------------+------------ postgres | 1478038975 mxl_survey | 1478038263 mxl | 1478036889 template1 | 1478010730 template0 | 2146747335 2. As we were attempting to figure out why this was the case, Postgres stopped accepting commands, as reported in the log: <2007-08-06 09:42:50 MDT>ERROR: database is not accepting commands to avoid wraparound data loss in database "template0" <2007-08-06 09:42:50 MDT>HINT: Stop the postmaster and use a standalone backend to vacuum database "template0". 3. One of the admins initiated a shutdown request to the DB. The shutdown was NOT automatically done by Postgres as I had originally stated, but it became necessary nonetheless: <2007-08-06 09:44:05 MDT>LOG: received fast shutdown request <2007-08-06 09:44:05 MDT>LOG: aborting any active transactions <2007-08-06 09:44:05 MDT>FATAL: terminating connection due to administrator command <2007-08-06 09:44:05 MDT>FATAL: terminating connection due to administrator command <2007-08-06 09:44:05 MDT>FATAL: terminating connection due to administrator command 4. During the shutdown we continued to receive these messages in the log: <2007-08-06 09:44:08 MDT>LOG: transaction ID wrap limit is 2147484146, limited by database "template0" <2007-08-06 09:44:08 MDT>WARNING: database "template0" must be vacuumed within 736811 transactions <2007-08-06 09:44:08 MDT>HINT: To avoid a database shutdown, execute a full-database VACUUM in "template0". 5. Once the DB was down, a standalone backend was used to execute a vacuum against template0: sudo -u postgres /mxl/var/pgsql/bin/postgres -D /mxl/var/pgsql/data -O -P template0 >VACUUM VERBOSE ANALYZE; 6. The DB started successfully and normal operations resumed. We do have a second 8.1.4 DB running the same type of transaction load and the datfrozenxid query shows template0 having a larger 'age' value than any of the other databases in the cluster: postgres=# select datname, age(datfrozenxid) from pg_database; datname | age --------------+------------ postgres | 1108184564 mxl | 1108182478 template1 | 1108156319 template0 | 1497605405 So before it grows into the 2.1 billion value we'll need to do a manual reset of the FrozenXID on this DB as well. Of course, we are looking into an 8.1.9 upgrade to avoid this problem in the future. Thanks, Keaton -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, August 06, 2007 1:11 PM To: Keaton Adams Cc: Joshua D. Drake; Alvaro Herrera; pgsql-general@postgresql.org Subject: Re: [GENERAL] Template zero xid issue "Keaton Adams" <kadams@mxlogic.com> writes: > Do I have this right? It's not clear. You didn't answer the question about whether you were running autovac, but even if you were I'm not sure why you got the complaint about template0. AFAICS the only way for template0 to be installed as oldest_datname is if vac_truncate_clog() is invoked in template0, allowing it to be the initial selection before the loop starts. Which could happen if autovac is on, since it will try to vacuum template0, and in 8.1.4 it neglects to use the FREEZE option. But even then, if we just finished a DB-wide vacuum on template0, it should not have had the oldest datfrozenxid; and certainly not one old enough to provoke complaints. Actually there's a whole lot that isn't clear about this. Your first report showed a warning --- not an error --- about template0, so it's not at all clear that the "database system is shutting down" messages had anything to do with that at all. And you later posted some queries that weren't drawing any such message. What was the exact sequence of events here? Alvaro: you have any thoughts about this? I'm suspicious that we ought to change 8.1's vac_truncate_clog() to not include the current database automatically, since it might be not-datallowconn; but I don't see how that potential bug could actually be exposed as a fault. regards, tom lane
"Keaton Adams" <kadams@mxlogic.com> writes: > Auto vacuum is not enabled: Hmph. That shoots down the one semi-plausible idea I had about how you got into this state. It seems that the unexpected limit_datname value must have come from flatfiles.c, but that *certainly* won't pick up any database with datallowconn = false. Is it possible that someone was manually tinkering with the content of pg_database.datallowconn? I see several other hard-to-explain aspects about the log excerpts you sent, notably the key warning itself: > <2007-08-06 09:44:08 MDT>WARNING: database "template0" must be vacuumed > within 736811 transactions 8.1 was designed to clamp down *hard* at 1000000 transactions from the wraparound point, so how did you get to consume almost 300000 more transactions beyond that limit? Could you send me the whole postmaster log from the point of the first sign of trouble, rather than just excerpts? I suspect there might be useful evidence you omitted. regards, tom lane