Thread: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

[ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade
Date:
We had some downtime recently and I thought it would be a good idea to do a periodic VACUUM FULL of one of our large Postgres DBs.

However, when I attempted the VACUUM FULL, I saw the following error:

INFO:  vacuuming "pg_catalog.pg_statistic"

vacuumdb: vacuuming of database "db1" failed: ERROR:  missing chunk number 0 for toast value 30382746 in pg_toast_2619

Given that pg_statistic is inessential data (it can be rebuilt by analyzing each table), I did a 'DELETE FROM pg_statistic;' which removed all the rows.  However, when I ran the VACUUM FULL again, I received the same error.
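(For reference, this is roughly what I ran, from memory rather than an exact transcript; the statistics can be rebuilt afterwards with a plain ANALYZE:)

DELETE FROM pg_statistic;   -- wipe the cached statistics rows
ANALYZE;                    -- rebuild statistics for every table in the current database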

I also took a full base backup and restored it to another server.  On the restored server, I attempted the VACUUM FULL operation and it ran with no errors. 

Does anyone know what this could point to?  Is my original database corrupt?  I have tried REINDEX on the pg_toast_2619 table and other solutions I have found around the internet with no success. 

Thanks for any input you can provide.

Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Tom Lane
Date:
Arjun Ranade <ranade@nodalexchange.com> writes:
> We had some downtime recently and I thought it would be a good idea to do a
> periodic VACUUM FULL of one of our large Postgres DB's.

> However, when I tried to attempt the VACUUM FULL, I saw the following error:
> INFO:  vacuuming "pg_catalog.pg_statistic"
> vacuumdb: vacuuming of database "db1" failed: ERROR:  missing chunk number
> 0 for toast value 30382746 in pg_toast_2619

We hear reports like this just often enough to make it seem like there's
some bug that afflicts pg_statistic specifically.  Nobody's ever found
a cause though.

> Given that pg_statistic is inessential data (it can be rebuilt by analyzing
> each table), I did a 'DELETE FROM pg_statistic;' which removed all the
> rows.

That would've been my advice ...

> However, when I ran the VACUUM FULL again, I received the same error.

... but that's really interesting.  Is this still reproducible?  If you
could get a stack trace from the point of the error, that might yield
useful data.  (Set a gdb breakpoint at "errfinish", run the VACUUM FULL,
and when it stops, get the stack with "bt".  Don't use VERBOSE, or you'll
reach errfinish for each line of verbose output...)
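As a concrete sketch, assuming you can attach gdb to the backend as the same OS user or root (the PID placeholder below is whatever your own session reports):

-- in the psql session that will run the vacuum:
SELECT pg_backend_pid();

$ gdb -p <pid reported above>
(gdb) break errfinish
(gdb) continue

-- back in the psql session:
VACUUM FULL pg_statistic;

(gdb) bt    -- once the breakpoint fires, this prints the stack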

Also, what PG version is this exactly?
        regards, tom lane



Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade
Date:
Yes, it is still reproducible.  When I try to VACUUM FULL right now, I get the same error.

This is postgres 9.4.  I am new to gdb, but based on what you said, I have the following output:

Breakpoint 1, 0x0000000000768be0 in errfinish ()
(gdb) bt
#0  0x0000000000768be0 in errfinish ()
#1  0x000000000076998c in elog_finish ()
#2  0x0000000000495960 in ?? ()
#3  0x0000000000496075 in heap_tuple_fetch_attr ()
#4  0x0000000000496572 in toast_insert_or_update ()
#5  0x0000000000492ce1 in ?? ()
#6  0x0000000000493733 in rewrite_heap_tuple ()
#7  0x000000000053ebdf in ?? ()
#8  0x000000000053f68e in cluster_rel ()
#9  0x0000000000590a1b in ?? ()
#10 0x0000000000590f9f in vacuum ()
#11 0x000000000068ee77 in standard_ProcessUtility ()
#12 0x00007f50c6b583d4 in ?? () from /usr/pgsql-9.4/lib/pglogical.so
#13 0x000000000068bb17 in ?? ()
#14 0x000000000068caad in ?? ()
#15 0x000000000068d132 in PortalRun ()
#16 0x000000000068979e in ?? ()
#17 0x000000000068ae18 in PostgresMain ()
#18 0x0000000000635d69 in PostmasterMain ()
#19 0x00000000005cd248 in main ()
(gdb) cont



Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Tom Lane
Date:
Arjun Ranade <ranade@nodalexchange.com> writes:
> Yes, it is still reproducible.  When I try to VACUUM FULL right now, I get
> the same error.

> This is postgres 9.4.  I am new to gdb, but based on what you said, I have
> the following output:

> Breakpoint 1, 0x0000000000768be0 in errfinish ()
> (gdb) bt
> #0  0x0000000000768be0 in errfinish ()
> #1  0x000000000076998c in elog_finish ()
> #2  0x0000000000495960 in ?? ()
> #3  0x0000000000496075 in heap_tuple_fetch_attr ()
> #4  0x0000000000496572 in toast_insert_or_update ()
> #5  0x0000000000492ce1 in ?? ()
> #6  0x0000000000493733 in rewrite_heap_tuple ()
> #7  0x000000000053ebdf in ?? ()
> #8  0x000000000053f68e in cluster_rel ()
> #9  0x0000000000590a1b in ?? ()
> #10 0x0000000000590f9f in vacuum ()
> #11 0x000000000068ee77 in standard_ProcessUtility ()

Hm.  I suspect what is happening is that VACUUM is trying to copy a tuple
that you deleted, but which it thinks is still visible to some open
transaction.  Do you have any really old transactions hanging around,
according to pg_stat_activity or pg_prepared_xacts?
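For instance, something along these lines should surface any stragglers (columns as of 9.4):

SELECT pid, backend_start, xact_start, state, application_name
FROM pg_stat_activity
ORDER BY xact_start;    -- oldest open transactions first; sessions with no open transaction sort last

SELECT * FROM pg_prepared_xacts;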

If that is not it, probably truncating pg_statistic instead of just
doing "delete from pg_statistic" would solve the problem.  But we're
no closer to understanding what was the original source of the issue.
        regards, tom lane



Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade
Date:
There is nothing in pg_prepared_xacts; however, in pg_stat_activity there are two pglogical processes that have a "backend_start" of "2017-06-17", which is the last time we restarted this server.  Both are not waiting and have null for "state."  This might be expected behavior for pglogical.

I did try TRUNCATE before, even as the postgres user, but it didn't allow me to truncate because pg_statistic is a system catalog.




Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Tom Lane
Date:
Arjun Ranade <ranade@nodalexchange.com> writes:
> There is nothing in pg_prepared_xacts, however, in pg_stat_activity there
> are two pglogical processes that have a "backend_start" of "2017-06-17"
> when the last time we restarted this server.  Both are not waiting and have
> null for "state."   This might be expected behavior for pglogical.

I don't know much about pglogical, but if it's holding back the xmin
horizon (as it appears to be doing) that is a really bad pglogical bug.

If you try vacuum verbose (doesn't have to be FULL) on some table that
gets lots of update traffic, does it report a large number of dead-but-
not-removable rows?

> I did try TRUNCATE before but even as the postgres user, but for some
> reason it didn't allow me to truncate because it was a system table.

Mmm.  You could get around that, but possibly not without restarting
in single-user mode, which is probably more interruption in service
than you want.  It might be easier to force-quit the pglogical sessions.
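A sketch of force-quitting them, assuming the stale sessions can be picked out by application_name (adjust the filter to match whatever pg_stat_activity actually shows for them):

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE application_name LIKE 'pglogical%'
  AND backend_start < now() - interval '30 days';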
        regards, tom lane



Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade
Date:
You might be onto something.  I force-quit the pglogical supervisor and the other session that had a backend_start of a few months ago.  That didn't fix the issue.  Additionally, those processes didn't start back up, which makes me worry about the status of my logical replication.  The subscribers are reporting that they are still replicating, but I can't verify that.

However... when I was checking on the subscribers, I dropped their replication links and recreated them (to see if that would get the pglogical processes to start back up).  I checked the pg_log on the master (same server having the vacuum problem) to see if there was anything about the subscriptions resuming and I found these messages in the log:

2017-10-31 22:16:42.658 EDT pid=28445 database= user= rhost= tid=3/2970517 sessionid=59f92ded.6f1d WARNING:  oldest xmin is far in the past
2017-10-31 22:16:42.658 EDT pid=28445 database= user= rhost= tid=3/2970517 sessionid=59f92ded.6f1d HINT:  Close open transactions soon to avoid wraparound problems.
2017-10-31 22:16:45.815 EDT pid=28955 database= user= rhost= tid=3/2970521 sessionid=59f92e8d.711b WARNING:  oldest xmin is far in the past
2017-10-31 22:16:45.815 EDT pid=28955 database= user= rhost= tid=3/2970521 sessionid=59f92e8d.711b HINT:  Close open transactions soon to avoid wraparound problems

However, I am now in a situation where I cannot find any old transactions from pg_stat_activity or pg_prepared_xacts. 

It may also be worth noting that the last time our db was restarted (~6 months ago, same date as the old processes that I killed) was due to the datafile partition filling up.

FWIW, here are my vacuum related settings on this db:

name    setting          
autovacuum on
autovacuum_analyze_scale_factor 0.1
autovacuum_analyze_threshold 50
autovacuum_freeze_max_age 200000000
autovacuum_max_workers 3
autovacuum_multixact_freeze_max_age 400000000
autovacuum_naptime 60
autovacuum_vacuum_cost_delay 20
autovacuum_vacuum_cost_limit -1
autovacuum_vacuum_scale_factor 0.2
autovacuum_vacuum_threshold 50
autovacuum_work_mem -1
log_autovacuum_min_duration -1
vacuum_cost_delay 0
vacuum_cost_limit 200
vacuum_cost_page_dirty 20
vacuum_cost_page_hit 1
vacuum_cost_page_miss 10
vacuum_defer_cleanup_age 0
vacuum_freeze_min_age 50000000
vacuum_freeze_table_age 150000000
vacuum_multixact_freeze_min_age 5000000
vacuum_multixact_freeze_table_age 150000000


Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade
Date:
So we had a maintenance window scheduled today and I restarted the DB.  When it came back up, it reported the same toast error when vacuuming pg_statistic.  It also reported the "oldest xmin is far in the past" warning.

I even brought the DB into single-user mode and tried to manually vacuum pg_statistic, but got the same error.  I also tried to TRUNCATE pg_statistic in single-user mode, which failed (see below).

At this point, I am not sure what to make of the issue.  The DB is still complaining about old XMIN even when there are no users connected to it after a fresh restart:

2017-11-01 21:45:26.838 EDT pid=14654 database= user= rhost= tid=11/39 sessionid=59fa7847.393e WARNING:  oldest xmin is far in the past
2017-11-01 21:45:26.838 EDT pid=14654 database= user= rhost= tid=11/39 sessionid=59fa7847.393e HINT:  Close open transactions soon to avoid wraparound problems.

If we cannot understand what is happening, we will likely restore from a backup, rebuild all of our standbys/logical replicas, and start fresh.  It would be too costly for the DB to shut down during a workday due to the wraparound issue.

Any insight you can provide would be appreciated.

Thanks,
AR

Relevant log output from single user mode:

$ /usr/pgsql-9.4/bin/postmaster --single -D /var/lib/pgsql/9.4/data prod

PostgreSQL stand-alone backend 9.4.14
backend> vacuum full pg_statistic;
2017-11-01 21:35:38.781 EDT pid=12732 database= user= rhost= tid=1/3 sessionid=59fa765f.31bc ERROR:  missing chunk number 0 for toast value 30382746 in pg_toast_2619
2017-11-01 21:35:38.783 EDT pid=12732 database= user= rhost= tid=1/3 sessionid=59fa765f.31bc STATEMENT:  vacuum full pg_statistic;

backend> reindex table pg_statistic;
backend> vacuum full pg_statistic;
2017-11-01 21:36:43.602 EDT pid=12732 database= user= rhost= tid=1/7 sessionid=59fa765f.31bc ERROR:  missing chunk number 0 for toast value 30382746 in pg_toast_2619
2017-11-01 21:36:43.602 EDT pid=12732 database= user= rhost= tid=1/7 sessionid=59fa765f.31bc STATEMENT:  vacuum full pg_statistic;

backend> truncate table pg_statistic;
2017-11-01 21:37:44.867 EDT pid=12732 database= user= rhost= tid=1/14 sessionid=59fa765f.31bc ERROR:  permission denied: "pg_statistic" is a system catalog
2017-11-01 21:37:44.867 EDT pid=12732 database= user= rhost= tid=1/14 sessionid=59fa765f.31bc STATEMENT:  truncate table pg_statistic;





Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade
Date:
Here is a normal vacuum (without FULL) of pg_statistic.  It does report a large number of dead-but-not-removable rows:

# vacuum analyze verbose pg_statistic;
WARNING:  oldest xmin is far in the past
HINT:  Close open transactions soon to avoid wraparound problems.
INFO:  vacuuming "pg_catalog.pg_statistic"
INFO:  index "pg_statistic_relid_att_inh_index" now contains 482186 row versions in 1325 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_statistic": found 0 removable, 567740 nonremovable row versions in 34667 out of 34667 pages
DETAIL:  558093 dead row versions cannot be removed yet.
There were 21663 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.12u sec elapsed 0.13 sec.
WARNING:  oldest xmin is far in the past
HINT:  Close open transactions soon to avoid wraparound problems.
INFO:  vacuuming "pg_toast.pg_toast_2619"
INFO:  index "pg_toast_2619_index" now contains 3319 row versions in 12 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_2619": found 0 removable, 3319 nonremovable row versions in 667 out of 667 pages
DETAIL:  3097 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM



Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Tom Lane
Date:
Arjun Ranade <ranade@nodalexchange.com> writes:
> So we had a maintenance window scheduled today and I restarted the DB.
> When it came back up, it reported the same toast error when vacuum'ing
> pg_statistic.  It also reported the "oldest xmin is far in the past" error.

> I even brought the DB into single user mode and tried to manually vacuum
> pg_statistic but got the same error.  I also tried to TRUNCATE pg_statistic
> in single user mode which failed (see below).

> At this point, I am not sure what to make of the issue.  The DB is still
> complaining about old XMIN even when there are no users connected to it
> after a fresh restart:

Clearly, something is holding back the xmin horizon, and if the behavior
persisted past a DB restart then it surely isn't a regular transaction.

Are you *really* sure there are no prepared transactions?

It now occurs to me that there's one other mechanism that could hold back
the xmin horizon, and that is replication slots.  Looking into the
pg_replication_slots view should show you what's up there.  I'm not
an expert on replication slots, but I do know that they were new as of
9.4.  I am thinking maybe your replication setup has "leaked" a slot
that's not being used but is still preventing global xmin from advancing.
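Off the top of my head, something like this should show whether a slot's horizon is lagging far behind the current transaction counter:

SELECT slot_name, slot_type, active, xmin, catalog_xmin,
       age(xmin) AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;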
        regards, tom lane



Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade
Date:
Hi Tom,

I am positive that there are no prepared transactions, certainly none that would persist past a DB shutdown and the single user mode work I did.

However, your point about the replication slots seems valid.  Also, a leaked replication slot seems likely, since it was a replication-slot issue with pglogical that caused the DB to crash before it was restarted a few months ago.

Below is what I got from querying the replication slots view.  It seems that you are correct: https://www.postgresql.org/docs/9.4/static/catalog-pg-replication-slots.html says that 'catalog_xmin' is "The oldest transaction affecting the system catalogs that this slot needs the database to retain. VACUUM cannot remove catalog tuples deleted by any later transaction."

db1=# select * from pg_replication_slots;
 slot_name                                     | plugin           | slot_type | datoid | database | active | xmin      | catalog_xmin | restart_lsn
-----------------------------------------------+------------------+-----------+--------+----------+--------+-----------+--------------+--------------
 repmgr_slot_3                                 |                  | physical  |        |          | t      | 252861646 |              | 157/E1755E90
 repmgr_slot_4                                 |                  | physical  |        |          | t      | 324101780 | 162630233    | 157/E1755E90
 repmgr_slot_2                                 |                  | physical  |        |          | t      |           |              | 157/E1755E90
 pgl_tcs_dadfbc8d5_tcspg01_large_tab_ch356c67c | pglogical_output | logical   | 21705  | prod     | t      |           | 324101794    | 157/E16D62D8
 pgl_tcs_dadfbc8d5_tcspg01_all_tables          | pglogical_output | logical   | 21705  | prod     | t      |           | 324101794    | 157/E16D62D8
(5 rows)

db1=# SELECT datname, datfrozenxid FROM pg_database;
  datname  | datfrozenxid
-----------+--------------
 template1 |    162630233
 template0 |    162630233
 repmgr_db |    138378717
 postgres  |    138378717
 rundeck   |    138378717
 db1       |    162630233
(6 rows)





Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade
Date:
After dropping the replication slot, VACUUM FULL runs fine now and no longer reports the "oldest xmin is far in the past" warning.
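(For the archives: the drop itself is just the standard call; the slot name below is a placeholder, not the name of the slot we actually removed.)

SELECT pg_drop_replication_slot('stuck_slot_name');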

Thank you so much for your help Tom!

Arjun



Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Tom Lane
Date:
Arjun Ranade <ranade@nodalexchange.com> writes:
> After dropping the replication slot, VACUUM FULL runs fine now and no
> longer reports the "oldest xmin is far in the past"

Excellent.  Maybe we should think about providing better tools to notice
"stuck" replication slots.

In the meantime, you probably realize this already, but if global xmin
has been stuck for months then you're going to have terrible bloat
everywhere.  Database-wide VACUUM FULL seems called for.
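The simplest way to do that is probably vacuumdb from the shell during a maintenance window; keep in mind that VACUUM FULL takes an exclusive lock on each table while it rewrites it:

$ vacuumdb --all --full --analyze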
        regards, tom lane



Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Stephen Frost
Date:
Tom, Arjun,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Arjun Ranade <ranade@nodalexchange.com> writes:
> > After dropping the replication slot, VACUUM FULL runs fine now and no
> > longer reports the "oldest xmin is far in the past"
>
> Excellent.  Maybe we should think about providing better tools to notice
> "stuck" replication slots.

+1

> In the meantime, you probably realize this already, but if global xmin
> has been stuck for months then you're going to have terrible bloat
> everywhere.  Database-wide VACUUM FULL seems called for.

This, really, is also a lesson in "monitor your distance to transaction
wraparound": you really should know something is up long before the
warnings from PG start showing up in the logs.
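Even a simple periodic check like this (pick your own alert thresholds)
makes the problem visible well ahead of time:

SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;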

Thanks!

Stephen

Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade
Date:
Yes, we are now in the process of adding custom metrics/alerts around the xmin horizon across all of our postgres databases. 

We will do a DB-wide VACUUM FULL as well (ironically, this incident started because VACUUM FULL failed last weekend).

Appreciate all the input on this.

Arjun

