Обсуждение: re PG 9.6x and found xmin from before relfrozenxid and removal of pg_internal.init file(s)
re PG 9.6x and found xmin from before relfrozenxid and removal of pg_internal.init file(s)
От
Reid Thompson
Дата:
We have a planned upgrade that would permanentlty remedy this. regarding the below errors on our PG 9.6.x instance. 2020-09-28 09:08:15.741 EDT,,,26212,,5f71e03f.6664,1,,2020-09-28 09:08:15 EDT,250/9136136,0,ERROR,XX001,"found xmin 2675436435from before relfrozenxid 321165377",,,,,"automatic vacuum of table ""postgres.pg_catalog.pg_authid""",,,,"" 2020-09-28 09:08:15.751 EDT,,,26212,,5f71e03f.6664,2,,2020-09-28 09:08:15 EDT,250/9136138,0,ERROR,XX001,"found xmin 2675019557from before relfrozenxid 321165377",,,,,"automatic vacuum of table ""postgres.pg_catalog.pg_auth_members""",,,,"" I know that a db restart will reset the 'counter' (reset the pg_internal.init files??) on this issue. We would rather avoid a restart if possible. Some research on the internet found a page suggesting that removal of pg_internal.init file(s) on a running system would provide the same temporary resolution as a restart??? Is this a valid temporary work around for this issue? If it is, is there a particular pg_internal.init file that needs to be removed, or just remove all pg_internal.init that can be found? Thanks, reid
Reid Thompson <Reid.Thompson@omnicell.com> writes: > We have a planned upgrade that would permanentlty remedy this. > regarding the below errors on our PG 9.6.x instance. > 2020-09-28 09:08:15.741 EDT,,,26212,,5f71e03f.6664,1,,2020-09-28 09:08:15 EDT,250/9136136,0,ERROR,XX001,"found xmin 2675436435from before relfrozenxid 321165377",,,,,"automatic vacuum of table ""postgres.pg_catalog.pg_authid""",,,,"" > 2020-09-28 09:08:15.751 EDT,,,26212,,5f71e03f.6664,2,,2020-09-28 09:08:15 EDT,250/9136138,0,ERROR,XX001,"found xmin 2675019557from before relfrozenxid 321165377",,,,,"automatic vacuum of table ""postgres.pg_catalog.pg_auth_members""",,,,"" > I know that a db restart will reset the 'counter' (reset the > pg_internal.init files??) on this issue. We would rather avoid a > restart if possible. Some research on the internet found a page > suggesting that removal of pg_internal.init file(s) on a running system > would provide the same temporary resolution as a restart??? Is this a > valid temporary work around for this issue? If it is, is there a > particular pg_internal.init file that needs to be removed, or just > remove all pg_internal.init that can be found? I'm a bit dubious that that'd actually help, but it's perfectly safe if you want to try it. pg_internal.init is just a cache file that will be rebuilt if it's missing. regards, tom lane
On Mon, 2020-09-28 at 12:15 -0400, Tom Lane wrote: > Reid Thompson <Reid.Thompson@omnicell.com> writes: > > We have a planned upgrade that would permanentlty remedy this. > > regarding the below errors on our PG 9.6.x instance. > > > 2020-09-28 09:08:15.741 EDT,,,26212,,5f71e03f.6664,1,,2020-09-28 09:08:15 EDT,250/9136136,0,ERROR,XX001,"found xmin 2675436435from before relfrozenxid 321165377",,,,,"automatic vacuum of table > > ""postgres.pg_catalog.pg_authid""",,,,"" > > 2020-09-28 09:08:15.751 EDT,,,26212,,5f71e03f.6664,2,,2020-09-28 09:08:15 EDT,250/9136138,0,ERROR,XX001,"found xmin 2675019557from before relfrozenxid 321165377",,,,,"automatic vacuum of table > > ""postgres.pg_catalog.pg_auth_members""",,,,"" > > > I know that a db restart will reset the 'counter' (reset the > > pg_internal.init files??) on this issue. We would rather avoid a > > restart if possible. Some research on the internet found a page > > suggesting that removal of pg_internal.init file(s) on a running system > > would provide the same temporary resolution as a restart??? Is this a > > valid temporary work around for this issue? If it is, is there a > > particular pg_internal.init file that needs to be removed, or just > > remove all pg_internal.init that can be found? > > I'm a bit dubious that that'd actually help, but it's perfectly safe > if you want to try it. pg_internal.init is just a cache file that > will be rebuilt if it's missing. > > regards, tom lane appears to allow to vacuum to complete... and stops the error messages to the log file. [local]:db-server-101:5432 rthompso@xxxx=> set role super; SET Time: 0.900 ms [local]:db-server-101:5432 rthompso@xxxx=# vacuum pg_catalog.pg_authid; vacuum pg_catalog.pg_auth_members; ERROR: found xmin 2675436435 from before relfrozenxid 321165377 Time: 14.698 ms ERROR: found xmin 2675019557 from before relfrozenxid 321165377 Time: 3.956 ms [local]:db-server-101:5432 rthompso@xxxx=# vacuum pg_catalog.pg_authid; vacuum pg_catalog.pg_auth_members; ERROR: found xmin 2675436435 from before relfrozenxid 321165377 Time: 13.889 ms ERROR: found xmin 2675019557 from before relfrozenxid 321165377 Time: 2.479 ms [local]:db-server-101:5432 rthompso@xxxx=# vacuum pg_catalog.pg_authid; vacuum pg_catalog.pg_auth_members; VACUUM Time: 22.778 ms VACUUM Time: 4.486 ms [local]:db-server-101:5432 rthompso@xxxx=# vacuum verbose pg_catalog.pg_authid; vacuum verbose pg_catalog.pg_auth_members; INFO: vacuuming "pg_catalog.pg_authid" INFO: index "pg_authid_rolname_index" now contains 123 row versions in 2 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: index "pg_authid_oid_index" now contains 123 row versions in 2 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: "pg_authid": found 0 removable, 123 nonremovable row versions in 3 out of 3 pages DETAIL: 0 dead row versions cannot be removed yet. There were 55 unused item pointers. Skipped 0 pages due to buffer pins. 0 pages are entirely empty. CPU 0.00s/0.00u sec elapsed 0.00 sec. VACUUM Time: 25.909 ms INFO: vacuuming "pg_catalog.pg_auth_members" INFO: index "pg_auth_members_role_member_index" now contains 143 row versions in 2 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: index "pg_auth_members_member_role_index" now contains 143 row versions in 2 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: "pg_auth_members": found 0 removable, 143 nonremovable row versions in 1 out of 1 pages DETAIL: 0 dead row versions cannot be removed yet. There were 13 unused item pointers. Skipped 0 pages due to buffer pins. 0 pages are entirely empty. CPU 0.00s/0.00u sec elapsed 0.00 sec. VACUUM Time: 6.494 ms $ grep pg_auth *Mon*csv ...snip... 2020-09-28 14:00:23.564 EDT,,,9998,,5f7224b7.270e,1,,2020-09-28 14:00:23 EDT,271/14248144,0,ERROR,XX001,"found xmin 2675436435from before relfrozenxid 321165377",,,,,"automatic vacuum of table ""postgres.pg_catalog.pg_authid""",,,,"" 2020-09-28 14:00:23.575 EDT,,,9998,,5f7224b7.270e,2,,2020-09-28 14:00:23 EDT,271/14248146,0,ERROR,XX001,"found xmin 2675019557from before relfrozenxid 321165377",,,,,"automatic vacuum of table ""postgres.pg_catalog.pg_auth_members""",,,,"" 2020-09-28 14:00:28.571 EDT,,,10283,,5f7224bc.282b,1,,2020-09-28 14:00:28 EDT,271/14248188,0,ERROR,XX001,"found xmin 2675436435from before relfrozenxid 321165377",,,,,"automatic vacuum of table ""postgres.pg_catalog.pg_authid""",,,,"" 2020-09-28 14:00:28.581 EDT,,,10283,,5f7224bc.282b,2,,2020-09-28 14:00:28 EDT,271/14248190,0,ERROR,XX001,"found xmin 2675019557from before relfrozenxid 321165377",,,,,"automatic vacuum of table ""postgres.pg_catalog.pg_auth_members""",,,,"" 2020-09-28 14:00:33.577 EDT,,,10432,,5f7224c1.28c0,1,,2020-09-28 14:00:33 EDT,271/14248210,0,ERROR,XX001,"found xmin 2675436435from before relfrozenxid 321165377",,,,,"automatic vacuum of table ""postgres.pg_catalog.pg_authid""",,,,"" 2020-09-28 14:00:33.587 EDT,,,10432,,5f7224c1.28c0,2,,2020-09-28 14:00:33 EDT,271/14248212,0,ERROR,XX001,"found xmin 2675019557from before relfrozenxid 321165377",,,,,"automatic vacuum of table ""postgres.pg_catalog.pg_auth_members""",,,,"" 2020-09-28 14:00:37.822 EDT,"rthompso","ateb",8067,"[local]",5f72248e.1f83,4,"VACUUM",2020-09-28 13:59:42 EDT,270/18855606,0,ERROR,XX001,"foundxmin 2675436435 from before relfrozenxid 321165377",,,,,,"vacuum pg_catalog.pg_authid;",,,"psql" 2020-09-28 14:00:37.824 EDT,"rthompso","ateb",8067,"[local]",5f72248e.1f83,5,"VACUUM",2020-09-28 13:59:42 EDT,270/18855608,0,ERROR,XX001,"foundxmin 2675019557 from before relfrozenxid 321165377",,,,,,"vacuum pg_catalog.pg_auth_members;",,,"psql" 2020-09-28 14:00:38.586 EDT,,,10565,,5f7224c6.2945,1,,2020-09-28 14:00:38 EDT,271/14248236,0,LOG,00000,"automatic vacuumof table ""postgres.pg_catalog.pg_authid"": index scans: 1 2020-09-28 14:00:38.600 EDT,,,10565,,5f7224c6.2945,2,,2020-09-28 14:00:38 EDT,271/14248238,0,LOG,00000,"automatic vacuumof table ""postgres.pg_catalog.pg_auth_members"": index scans: 1 2020-09-28 14:00:44.011 EDT,,,10578,,5f7224cb.2952,13,,2020-09-28 14:00:43 EDT,271/14248265,0,LOG,00000,"automatic vacuumof table ""ateb.pg_catalog.pg_authid"": index scans: 0 2020-09-28 14:00:46.184 EDT,,,10578,,5f7224cb.2952,68,,2020-09-28 14:00:43 EDT,271/14248373,0,LOG,00000,"automatic vacuumof table ""ateb.pg_catalog.pg_auth_members"": index scans: 0 [postgres@db-server-101 pg_log]$
Reid Thompson <Reid.Thompson@omnicell.com> writes: > On Mon, 2020-09-28 at 12:15 -0400, Tom Lane wrote: >> I'm a bit dubious that that'd actually help, but it's perfectly safe >> if you want to try it. pg_internal.init is just a cache file that >> will be rebuilt if it's missing. > appears to allow to vacuum to complete... and stops the error messages > to the log file. Ah, after digging around in our git history, this symptom seems to match this bug fix: Author: Andres Freund <andres@anarazel.de> Branch: master Release: REL_11_BR [a54e1f158] 2018-06-12 11:13:21 -0700 Branch: REL_10_STABLE Release: REL_10_5 [2ce64caaf] 2018-06-12 11:13:21 -0700 Branch: REL9_6_STABLE Release: REL9_6_10 [6a46aba1c] 2018-06-12 11:13:21 -0700 Branch: REL9_5_STABLE Release: REL9_5_14 [14b3ec6f3] 2018-06-12 11:13:21 -0700 Branch: REL9_4_STABLE Release: REL9_4_19 [817f9f9a8] 2018-06-12 11:13:22 -0700 Branch: REL9_3_STABLE Release: REL9_3_24 [9b9b622b2] 2018-06-12 11:13:22 -0700 Fix bugs in vacuum of shared rels, by keeping their relcache entries current. When vacuum processes a relation it uses the corresponding relcache entry's relfrozenxid / relminmxid as a cutoff for when to remove tuples etc. Unfortunately for nailed relations (i.e. critical system catalogs) bugs could frequently lead to the corresponding relcache entry being stale. This set of bugs could cause actual data corruption as vacuum would potentially not remove the correct row versions, potentially reviving them at a later point. After 699bf7d05c some corruptions in this vein were prevented, but the additional error checks could also trigger spuriously. Examples of such errors are: ERROR: found xmin ... from before relfrozenxid ... and ERROR: found multixact ... from before relminmxid ... To be caused by this bug the errors have to occur on system catalog tables. The two bugs are: 1) Invalidations for nailed relations were ignored, based on the theory that the relcache entry for such tables doesn't change. Which is largely true, except for fields like relfrozenxid etc. This means that changes to relations vacuumed in other sessions weren't picked up by already existing sessions. Luckily autovacuum doesn't have particularly longrunning sessions. 2) For shared *and* nailed relations, the shared relcache init file was never invalidated while running. That means that for such tables (e.g. pg_authid, pg_database) it's not just already existing sessions that are affected, but even new connections are as well. That explains why the reports usually were about pg_authid et. al. To fix 1), revalidate the rd_rel portion of a relcache entry when invalid. This implies a bit of extra complexity to deal with bootstrapping, but it's not too bad. The fix for 2) is simpler, simply always remove both the shared and local init files. Author: Andres Freund Reviewed-By: Alvaro Herrera Discussion: https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@alap3.anarazel.de https://postgr.es/m/CAMa1XUhKSJd98JW4o9StWPrfS=11bPgG+_GDMxe25TvUY4Sugg@mail.gmail.com https://postgr.es/m/CAKMFJucqbuoDRfxPDX39WhA3vJyxweRg_zDVXzncr6+5wOguWA@mail.gmail.com https://postgr.es/m/CAGewt-ujGpMLQ09gXcUFMZaZsGJC98VXHEFbF-tpPB0fB13K+A@mail.gmail.com Backpatch: 9.3- That bug is pretty narrow, but it explains failures on pg_authid and related catalogs. So updating to 9.6.10 or later should be enough to prevent a recurrence. regards, tom lane