Discussion: Recovering a database in danger of transaction wrap-around


Recovering a database in danger of transaction wrap-around

From
Steven Rosenstein
Date:
I recently inherited a server with a PostgreSQL 8.1.8 database.  The reason
I inherited it was that "it wasn't working anymore".

A quick look in the logfiles showed the following:

LOG:  transaction ID wrap limit is 2147484146, limited by database "vsa"
WARNING:  database "vsa" must be vacuumed within 998573 transactions
HINT:  To avoid a database shutdown, execute a full-database VACUUM in
"vsa".
FATAL:  database is not accepting commands to avoid wraparound data loss in
database "vsa"
HINT:  Stop the postmaster and use a standalone backend to vacuum database
"vsa".

I found out quickly enough what that means...
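For the record, "standalone backend" here means the 8.1 postgres executable run
directly against the data directory, roughly like this (the data directory path
below is only an example; substitute the real PGDATA):

$ pg_ctl stop -D /var/lib/pgsql/data      # make sure the postmaster is down first
$ postgres -D /var/lib/pgsql/data vsa     # single-user backend on database "vsa"
backend> VACUUM;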

I did as instructed, and fired up the standalone backend.  I then started
VACUUM.  About four days later, the standalone backend terminated with the
message:

WARNING:  database "vsa" must be vacuumed within 997403 transactions
HINT:  To avoid a database shutdown, execute a full-database VACUUM in
"vsa".
WARNING:  database "vsa" must be vacuumed within 997402 transactions
HINT:  To avoid a database shutdown, execute a full-database VACUUM in
"vsa".
WARNING:  database "vsa" must be vacuumed within 997401 transactions
HINT:  To avoid a database shutdown, execute a full-database VACUUM in
"vsa".
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
CONTEXT:  writing block 465 of relation 1663/16384/863912
-bash-3.00$

I used lsof to monitor which files the backend was actually working on.  It
took two of the four days for it to vacuum a single table with 43
one-gigabyte extents.  I have one table with over 300 extents.  I'm looking
at a vacuum process which can ultimately take weeks (if not months) to
complete.  Is there an easier way of monitoring VACUUM's progress in a
standalone backend?
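8.1 has no built-in progress indicator; the closest I could get was watching which
relation files the backend has open and checking how large each relation is.  Roughly
(the data directory path is an example; the OIDs are the ones from the CONTEXT line
above, laid out on disk as base/<database-oid>/<relfilenode>):

$ lsof -p $(pgrep -f 'postgres -D') | grep base/      # relation files currently open
$ ls -l /var/lib/pgsql/data/base/16384/863912*        # all 1 GB segments of one relation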

As far as I can tell, absolutely no tuning was done to Postgres and
autovacuum was in the default "off" state.  This explains why it wasn't
working anymore.

The server itself has 8 GB of RAM, but a very poor I/O channel.  The
physical size of the database on disk is 856 GB.

Bottom line: Is there *any* way of faking out the 1 million transaction
limit which prevents the postmaster from running, long enough for me to use
pg_dump to rescue the data?  I found one article which described
xidStopLimit, but I have no idea where to go to reset it, or whether changing
it would have any beneficial effect.  Is there any way of rescuing the
contents of this database, or do I need to accept the inevitable?  If I
restart VACUUM in the standalone backend, does it have to repeat all the
work it did before the abend above?

One other possibility I came up with is to use COPY from the standalone
backend on each table (rough sketch below), but we're talking about 100+
tables in each of 15 schemas.
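A rough sketch of that idea, in case it is useful (the data directory, output
directory, and tables.txt are all assumptions; tables.txt would hold one
schema-qualified table name per line, prepared beforehand from a schema-only
dump or a copy of pg_tables):

while read tbl; do
  echo "COPY $tbl TO '/backup/copy/$tbl.tsv';"
done < tables.txt | postgres -D /var/lib/pgsql/data vsa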

Thanks in advance for any suggestions, advice, or words of condolence you
can offer,
--- Steve
___________________________________________________________________________________

Steven Rosenstein
IT Architect/Developer | IBM Virtual Server Administration
Voice/FAX: 845-689-2064 | Cell: 646-345-6978 | Tieline: 930-6001
Text Messaging: 6463456978 @ mobile.mycingular.com
Email: srosenst @ us.ibm.com

"Learn from the mistakes of others because you can't live long enough to
make them all yourself." -- Eleanor Roosevelt


Re: Recovering a database in danger of transaction wrap-around

From
Tom Lane
Date:
Steven Rosenstein <srosenst@us.ibm.com> writes:
> I did as instructed, and fired up the standalone backend.  I then started
> VACUUM.  About four days later, the standalone backend terminated with the
> message:

> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> CONTEXT:  writing block 465 of relation 1663/16384/863912

Ugh.  Something sent the standalone backend a SIGQUIT signal.  You need
to find out what did that.

> I used lsof to monitor which files the backend was actually working on.  It
> took two of the four days for it to vacuum a single table with 43
> one-gigabyte extents.  I have one table with over 300 extents.  I'm looking
> at a vacuum process which can ultimately take weeks (if not months) to
> complete.

Yipes.  You are just using plain VACUUM, right, not VACUUM FULL?
Have you checked that vacuum_cost_delay isn't enabled?
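In the standalone session, checking that (and forcing it off for good measure) might
look like the following; the maintenance_work_mem figure is just an example for a
machine with 8 GB of RAM:

backend> SHOW vacuum_cost_delay;
backend> SET vacuum_cost_delay = 0;
backend> SET maintenance_work_mem = 524288;   -- 512 MB (value is in KB)
backend> VACUUM;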

> Bottom line.  Is there *any* way of faking out the 1 million transaction
> limit which prevents the postmaster from running, long enough for me to use
> pg_dump to rescue the data?

In 8.1 those limits are all hard-wired; you'd need to modify
SetTransactionIdLimit() in src/backend/access/transam/varsup.c
and recompile.  Might be worth doing, if you think these tables
have been bloated by a complete lack of vacuuming.

            regards, tom lane

Re: Recovering a database in danger of transaction wrap-around

From
Tino Schwarze
Date:
On Fri, Jan 25, 2008 at 02:10:06PM -0500, Tom Lane wrote:

> > I used lsof to monitor which files the backend was actually working on.  It
> > took two of the four days for it to vacuum a single table with 43
> > one-gigabyte extents.  I have one table with over 300 extents.  I'm looking
> > at a vacuum process which can ultimately take weeks (if not months) to
> > complete.
>
> Yipes.  You are just using plain VACUUM, right, not VACUUM FULL?
> Have you checked that vacuum_cost_delay isn't enabled?

pg_dump/pg_restore may be a lot faster here - we're in an emergency
situation anyway and after that, the whole DB will be clean again, all
indices rebuilt nicely, no bloat in the tables.
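Roughly, assuming the wraparound check can first be bypassed so the postmaster
accepts connections again (all paths below are examples):

$ pg_dump -Fc -f /backup/vsa.dump vsa          # custom-format dump of the old cluster
$ initdb -D /var/lib/pgsql/data.new            # fresh cluster; start its postmaster
$ pg_restore -C -d postgres /backup/vsa.dump   # recreate "vsa" and reload it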

Tino.

--
www.craniosacralzentrum.de
www.spiritualdesign-chemnitz.de

Tino Schwarze * Lortzingstraße 21 * 09119 Chemnitz

Re: Recovering a database in danger of transaction wrap-around

From
Steven Rosenstein
Date:
I used plain old VACUUM.  Do you think VACUUM FULL might be faster or more
effective?

Here is the Resource Usage section from the config file (sorry for the mild
spam).  Everything looks like it is still in a default state.  I'm sure
upping shared_buffers and max_fsm_pages would make a difference, but it's
been a while since I did any real tuning work on a Postgres database, and
I'm not comfortable making changes to a sick database.

One other thing I should mention: we allocate 1 GB (out of 8 GB) of memory
to shared memory (/proc/sys/kernel/shmmax=1073741824); however, when I look
at shared memory (ipcs), the Postgres segments only consume about 12 MB.
I'm sure that's a tuning parameter somewhere.  Do you think increasing
shared_buffers and max_fsm_pages might help?

#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

# - Memory -

shared_buffers = 1000                   # min 16 or max_connections*2, 8KB each
#temp_buffers = 1000                    # min 100, 8KB each
#max_prepared_transactions = 5          # can be 0 or more
# note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
#work_mem = 1024                        # min 64, size in KB
#maintenance_work_mem = 16384           # min 1024, size in KB
#max_stack_depth = 2048                 # min 100, size in KB

# - Free Space Map -

max_fsm_pages = 176928                  # min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000               # min 100, ~70 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000           # min 25
#preload_libraries = ''

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0                  # 0-1000 milliseconds
#vacuum_cost_page_hit = 1               # 0-10000 credits
#vacuum_cost_page_miss = 10             # 0-10000 credits
#vacuum_cost_page_dirty = 20            # 0-10000 credits
#vacuum_cost_limit = 200                # 0-10000 credits
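For comparison, values in this neighborhood are the sort of thing one might try for a
one-off bulk VACUUM on an 8 GB machine; the numbers are purely illustrative, not
settings tested on this system:

shared_buffers = 50000                  # ~400 MB at 8 KB per buffer
maintenance_work_mem = 524288           # 512 MB (in KB); VACUUM uses this heavily
max_fsm_pages = 2000000                 # must cover dead space across the whole cluster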



Re: Recovering a database in danger of transaction wrap-around

From
Steven Rosenstein
Date:
I am unable to run pg_dump.  Even though I can successfully start the
postmaster, every time I try to do anything I receive the warning that the
postmaster will not allow any activity because we have crossed the
wraparound safety threshold.  pg_dump dies almost immediately:

Starting dump to
/vsa/backups/db/20080125170401.vsa_pgsql_db_bak.gz...pg_dump: [archiver
(db)] connection to database "vsa" failed: FATAL:  database is not
accepting commands to avoid wraparound data loss in database "vsa"
HINT:  Stop the postmaster and use a standalone backend to vacuum database
"vsa".
The backup completed with the condition coded = 0

Any other suggestions?


Re: Recovering a database in danger of transaction wrap-around

From
Tom Lane
Date:
Steven Rosenstein <srosenst@us.ibm.com> writes:
> I used plain old VACUUM.  Do you think VACUUM FULL might be faster or more
> effective?

No.  I think you probably want to do a dump and reload, but first you
have to get past the anti-wraparound check.

One possibility I hadn't thought of before is to use a standalone
backend to increment the pg_database.datfrozenxid values by a few
thousand transactions.  This would be a bad idea if you intended
to keep using the DB, but if you're just trying to get to a state
where you can run pg_dump, it seems acceptable.

            regards, tom lane

Re: Recovering a database in danger of transaction wrap-around

From
Steven Rosenstein
Date:
Incrementing pg_database.datfrozenxid sounded like an excellent idea.
Before I made any changes I dumped the contents of pg_database to see what
it contained.  Here is that dump:

backend> select * from pg_database;
WARNING:  database "vsa" must be vacuumed within 997398 transactions
HINT:  To avoid a database shutdown, execute a full-database VACUUM in
"vsa".
         1: datname      (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datdba       (typeid = 26, len = 4, typmod = -1, byval = t)
         3: encoding     (typeid = 23, len = 4, typmod = -1, byval = t)
         4: datistemplate        (typeid = 16, len = 1, typmod = -1, byval = t)
         5: datallowconn         (typeid = 16, len = 1, typmod = -1, byval = t)
         6: datconnlimit         (typeid = 23, len = 4, typmod = -1, byval = t)
         7: datlastsysoid        (typeid = 26, len = 4, typmod = -1, byval = t)
         8: datvacuumxid         (typeid = 28, len = 4, typmod = -1, byval = t)
         9: datfrozenxid         (typeid = 28, len = 4, typmod = -1, byval = t)
        10: dattablespace        (typeid = 26, len = 4, typmod = -1, byval = t)
        11: datconfig    (typeid = 1009, len = -1, typmod = -1, byval = f)
        12: datacl       (typeid = 1034, len = -1, typmod = -1, byval = f)
        ----
         1: datname = "postgres"        (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datdba = "10"       (typeid = 26, len = 4, typmod = -1, byval = t)
         3: encoding = "8"      (typeid = 23, len = 4, typmod = -1, byval = t)
         4: datistemplate = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
         5: datallowconn = "t"  (typeid = 16, len = 1, typmod = -1, byval = t)
         6: datconnlimit = "-1" (typeid = 23, len = 4, typmod = -1, byval = t)
         7: datlastsysoid = "10792"     (typeid = 26, len = 4, typmod = -1, byval = t)
         8: datvacuumxid = "2146484345" (typeid = 28, len = 4, typmod = -1, byval = t)
         9: datfrozenxid = "1072742522" (typeid = 28, len = 4, typmod = -1, byval = t)
        10: dattablespace = "1663"      (typeid = 26, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "vsa"     (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datdba = "10"       (typeid = 26, len = 4, typmod = -1, byval = t)
         3: encoding = "0"      (typeid = 23, len = 4, typmod = -1, byval = t)
         4: datistemplate = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
         5: datallowconn = "t"  (typeid = 16, len = 1, typmod = -1, byval = t)
         6: datconnlimit = "-1" (typeid = 23, len = 4, typmod = -1, byval = t)
         7: datlastsysoid = "10792"     (typeid = 26, len = 4, typmod = -1, byval = t)
         8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
         9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
        10: dattablespace = "1663"      (typeid = 26, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "template1"       (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datdba = "10"       (typeid = 26, len = 4, typmod = -1, byval = t)
         3: encoding = "8"      (typeid = 23, len = 4, typmod = -1, byval = t)
         4: datistemplate = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
         5: datallowconn = "t"  (typeid = 16, len = 1, typmod = -1, byval = t)
         6: datconnlimit = "-1" (typeid = 23, len = 4, typmod = -1, byval = t)
         7: datlastsysoid = "10792"     (typeid = 26, len = 4, typmod = -1, byval = t)
         8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
         9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
        10: dattablespace = "1663"      (typeid = 26, len = 4, typmod = -1, byval = t)
        12: datacl = "{postgres=CT/postgres}"   (typeid = 1034, len = -1, typmod = -1, byval = f)
        ----
         1: datname = "template0"       (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datdba = "10"       (typeid = 26, len = 4, typmod = -1, byval = t)
         3: encoding = "8"      (typeid = 23, len = 4, typmod = -1, byval = t)
         4: datistemplate = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
         5: datallowconn = "f"  (typeid = 16, len = 1, typmod = -1, byval = t)
         6: datconnlimit = "-1" (typeid = 23, len = 4, typmod = -1, byval = t)
         7: datlastsysoid = "10792"     (typeid = 26, len = 4, typmod = -1, byval = t)
         8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
         9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
        10: dattablespace = "1663"      (typeid = 26, len = 4, typmod = -1, byval = t)
        12: datacl = "{postgres=CT/postgres}"   (typeid = 1034, len = -1, typmod = -1, byval = f)
        ----

The database which contains all the working schemas and tables is the one
called "vsa".  If you look at datvacuumxid and datfrozenxid, they both seem
perfectly reasonable in the vsa database.  However, the same values in the
"postgres" database are approaching what appear to be the hard limits.
Before I charged ahead and made any changes I wanted to confirm that it was
"datfrozenxid" in the "postgres" database which I should *increment* by a
couple of thousand, and not "datvacuumxid" in the "postgres" database which
should be *decremented* by a couple of thousand.  My intuition says the
value should be decreased so that the absolute value is more than 1,000,000
less than 2^31, but of course I might be wrong.

I have no idea what the "postgres" database is, where it came from, or why
the transaction IDs are so out of skew.  I don't think it is created at
database creation.  The person responsible for installing Postgres left the
company a few weeks ago and is not available to ask.  I checked on another
server with a similar configuration.  It has a "postgres" database, but the
values for datvacuumxid and datfrozenxid are the same as in the other three
databases: 499 each.

Re: Recovering a database in danger of transaction wrap-around

From
Tom Lane
Date:
Steven Rosenstein <srosenst@us.ibm.com> writes:
>          1: datname = "postgres"        (typeid = 19, len = 64, typmod = -1, byval = f)
>          8: datvacuumxid = "2146484345" (typeid = 28, len = 4, typmod = -1, byval = t)
>          9: datfrozenxid = "1072742522" (typeid = 28, len = 4, typmod = -1, byval = t)
>         ----
>          1: datname = "vsa"     (typeid = 19, len = 64, typmod = -1, byval = f)
>          8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
>          9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
>         ----
>          1: datname = "template1"       (typeid = 19, len = 64, typmod = -1, byval = f)
>          8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
>          9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
>         ----
>          1: datname = "template0"       (typeid = 19, len = 64, typmod = -1, byval = f)
>          8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
>          9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)

Apparently, "postgres" is the only one of these that has ever had a
database-wide VACUUM done on it :-(.  A look at an 8.1 database here
confirms that 499 is what would be in those fields immediately after
initdb, so it's never been changed.

> If you look at datvacuumxid and datfrozenxid, they both seem
> perfectly reasonable in the vsa database.

No, they aren't, at least not for an installation that's existed awhile.

> However, the same values in the
> "postgres" database are approaching what appear to be the hard limits.

"postgres" is approaching the wraparound point, which is by no means a
hard limit.  The internal transaction counter (which you could check on
with pg_controldata) is presumably even closer to the wrap point.
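For example (the data directory path is an assumption):

$ pg_controldata /var/lib/pgsql/data | grep NextXID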

> Before I charged ahead and made any changes I wanted to confirm that it was
> "datfrozenxid" in the "postgres" database which I should *increment* by a
> couple of thousand, and not "datvacuumxid" in the "postgres" database which
> should be *decremented* by a couple of thousand.

No, you need to leave "postgres" alone and increment the other ones, to
make it look like they got vacuumed sometime closer to current time.
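In the standalone backend that could be a direct catalog update along these lines; the
value 5000 is an arbitrary example a few thousand above the current 499, and this is
strictly a get-to-pg_dump hack, not something to leave behind on a database you intend
to keep using:

backend> UPDATE pg_database SET datfrozenxid = '5000' WHERE datname IN ('vsa', 'template1', 'template0');

Then shut the backend down, start the postmaster, and run pg_dump right away.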

> I have no idea what the "postgres" database is, where it came from, or why
> the transaction IDs are so out of skew.  I don't think it is created at
> database creation.

http://www.postgresql.org/docs/8.1/static/manage-ag-createdb.html

> The person responsible for installing Postgres left the
> company a few weeks ago and is not available to ask.  I checked on another
> server with a similar configuration.  It has a "postgres" database, but the
> values for datvacuumxid and datfrozenxid are the same as in the other three
> databases: 499 each.

Then it's not being managed properly either ...

            regards, tom lane