Обсуждение: ERROR: catalog is missing 9 attribute(s) for relid 10297
We have run into a situation where our postgresql 8.2.5 database appears to be corrupt and we are no longer able to run pg_dump. We don't know what is causing the data corruption issues and unfortunately this is one of those cases where we cannot reliably reproduce the problem (so no test case can be provided). The error message output from running pg_dump in verbose mode is: pg_dump: reading schemas pg_dump: SQL command failed pg_dump: Error message from server: ERROR: catalog is missing 9 attribute(s) for relid 10297 pg_dump: The command was: SELECT tableoid, oid, nspname, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = nspowner) as rolname, nspacl FROM pg_namespace pg_dump: *** aborted because of error Querying 'pg_catalog.pg_roles' directly also produces the error. The pg_roles object is simply a view of pg_authid that blanks out the password field. Querying the pg_authid table directly does not produce the error. Our server setup is: Linux 2.4 Postgresql 8.2.5 Our application is fairly transaction heavy, about 50-200 updates per second on a few small tables, and we have around 40 or so client connections querying the database. Checking the postgresql server log file did not provide any additional useful information beyond what pg_dump provided. The first time we ran into this problem on 2007-12-16 we decided that since the database system objects appeared to be corrupt we wiped the server clean and started over. We also checked for hardware issues but none were found. We checked for (disk errors, memory errors, fs errors) as well as (overheating drives, intermittent drive slowness not reported by the kernel, drive read prediction failures, errors that only surface under extreme load). We were not able to test (overheating CPU, bad power supply (bad voltage)). A few weeks later on Dec 25th the same exact error cropped up again "catalog is missing 9 attribute(s) for relid 10297". My questions are: 1) Does this error message generally indicate that the database system objects are corrupt? 2) To resolve this type of issue what would be the recommended course of action? In our case we decided that the database was a lost cause and we started over (unfortunately only to run into the same issue 2 weeks later). 3) Does anyone have any suggestions for additional logging or monitoring we could turn on to try and determine the root cause of this issue? Thanks for any help, Brendan
"O'Shea, Brendan" <boshea@akamai.com> writes: > pg_dump: reading schemas > pg_dump: SQL command failed > pg_dump: Error message from server: ERROR: catalog is missing 9 > attribute(s) for relid 10297 Ugh. Does it work if you do export PGOPTIONS="--ignore_system_indexes=1" first? If so, the problem would evidently be corruption of the indexes for pg_attribute, specifically pg_attribute_relid_attnum_index. It seems pretty suspicious that lightning would strike twice in the same place ... regards, tom lane
"O'Shea, Brendan" <boshea@akamai.com> writes: >> "Lane, Tom" <tgl@sss.pgh.pa.us> writes: >> Ugh. Does it work if you do >> export PGOPTIONS="--ignore_system_indexes=1" > I tried that, but unfortunately pg_dump still fails to run and the error > message is identical to previous attempts. Huh. So it's not index corruption then. Table corruption is still a possibility but it seems likely that you'd be getting other errors during the seqscan that looks for the rows. This suggests that the rows actually disappeared from pg_attribute, which is a bit hard to credit. The only mechanisms I can think of are that VACUUM decided they were dead or something physically truncated the table. The latter would probably have zapped a lot of other rows though. Have you checked to see if pg_roles is the only relation with this problem? Try select c.relname from pg_class c left join pg_attribute a on a.attrelid = c.oid and a.attnum > 0 group by c.oid,c.relname,c.relnatts having count(*) != c.relnatts; > Any suggestions for additional logging we might turn on to help > determine the cause of this issue? Maybe VACUUM VERBOSE on pg_attribute? Although you'd have to get lucky enough to catch the time that it zapped the rows, if that's what the problem is. regards, tom lane
"Lane, Tom" <tgl@sss.pgh.pa.us> writes:> > "O'Shea, Brendan" <boshea@akamai.com> writes: > >> "Lane, Tom" <tgl@sss.pgh.pa.us> writes: > >> Ugh. Does it work if you do > >> export PGOPTIONS="--ignore_system_indexes=1" > > > I tried that, but unfortunately pg_dump still fails to run > and the error > > message is identical to previous attempts. > > Huh. So it's not index corruption then. Table corruption is still a > possibility but it seems likely that you'd be getting other errors > during the seqscan that looks for the rows. This suggests > that the rows > actually disappeared from pg_attribute, which is a bit hard to credit. > The only mechanisms I can think of are that VACUUM decided they were > dead or something physically truncated the table. The latter would > probably have zapped a lot of other rows though. > The auto vacuum daemon is running pretty frequently on the database since one of our tables is very heavily updated, about 50-200 per second, and the rows themselves are large, about 1k in size on average and about 7000 rows in the table. We also run a nightly 'vacuum full' operation on this table since it can sometimes grow fairly large (around 50-100MB or more) if there are any long running transactions that last 20-30 minutes or more. Not sure if this is related but thought it was worth a mention. > Have you checked to see if pg_roles is the only relation with this > problem? Try > select c.relname from pg_class c left join pg_attribute a > on a.attrelid = c.oid and a.attnum > 0 > group by c.oid,c.relname,c.relnatts having count(*) != c.relnatts; > This SQL produces the output: "pg_roles" "pg_group" "pg_rules" "pg_user" "pg_shadow" Querying each table produces the error: "pg_roles" => ERROR: catalog is missing 9 attribute(s) for relid 10297 "pg_group" => ERROR: catalog is missing 3 attribute(s) for relid 10303 "pg_rules" => ERROR: invalid attribute number 0 for pg_rules "pg_user" => ERROR: catalog is missing 8 attribute(s) for relid 10306 "pg_shadow" => ERROR: catalog is missing 8 attribute(s) for relid 10300 > > Any suggestions for additional logging we might turn on to help > > determine the cause of this issue? > > Maybe VACUUM VERBOSE on pg_attribute? Although you'd have to > get lucky > enough to catch the time that it zapped the rows, if that's what the > problem is. > So are you saying that on our newly installed system we could periodically run "VACUUM VERBOSE pg_attribute", append the output to a log file, and then after the catalog error starts happening again we could go to this log file to look for potential clues? Running "VACUUM VERBOSE pg_attribute" now produces: INFO: vacuuming "pg_catalog.pg_attribute" INFO: scanned index "pg_attribute_relid_attnam_index" to remove 30 row versions DETAIL: CPU 0.00s/0.00u sec elapsed 0.03 sec. INFO: scanned index "pg_attribute_relid_attnum_index" to remove 30 row versions DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: "pg_attribute": removed 30 row versions in 1 pages DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: index "pg_attribute_relid_attnam_index" now contains 2670 row versions in 48 pages DETAIL: 30 index row versions were removed. 11 index pages have been deleted, 11 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: index "pg_attribute_relid_attnum_index" now contains 2670 row versions in 13 pages DETAIL: 30 index row versions were removed. 2 index pages have been deleted, 2 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: "pg_attribute": found 30 removable, 2670 nonremovable row versions in 62 pages DETAIL: 0 dead row versions cannot be removed yet. There were 965 unused item pointers. 20 pages contain useful free space. 0 pages are entirely empty. CPU 0.00s/0.00u sec elapsed 0.04 sec. Query returned successfully with no result in 500 ms. Thanks for your help, Brendan
> "Lane, Tom" <tgl@sss.pgh.pa.us> writes: > > Ugh. Does it work if you do > > export PGOPTIONS="--ignore_system_indexes=1" > > first? If so, the problem would evidently be corruption of > the indexes for pg_attribute, specifically > pg_attribute_relid_attnum_index. It seems pretty suspicious > that lightning would strike twice in the same place ... > > regards, tom lane > I tried that, but unfortunately pg_dump still fails to run and the error message is identical to previous attempts. Any suggestions for additional logging we might turn on to help determine the cause of this issue? Our current logging setup uses the default postgresql.conf setup with the following changes to the logging section: log_line_prefix='%t %p %u %h ' log_connections=on Thanks, Brendan