Обсуждение: pg_dumpall problem - duplicated users
Hi, few days ago we made vacuum of all databases on our server. After that we canot dump databases: # pg_dumpall -i -U postgres > test.sql pg_dump: query to obtain list of schemas failed: ERROR: More than one tuple returned by a subselect used as an expression. pg_dumpall: pg_dump failed on database "alibi", exiting There are duplicated system users, pg_catalogs, pg_temp1, pg_toast and public... template1=> select * from pg_catalog.pg_shadow where usesysid = 1; usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig ----------+----------+-------------+----------+-----------+-------------------------------------+----------+----------- postgres | 1 | t | t | t | | | postgres | 1 | t | t | t | md5c084502ed11efa9d3d96d29717a5e555 | | (2 rows) Any suggestions how to solve the problem? There are many databases on the server we can't loose. Applications can connect to databases - all works fine, hoewer we must make dumps... Thanks for any help, Greetings Przemek
Hi, i have similar problem - can't make dumps of any db with the same error message (and all applications using this pg serverworks fine) . But in my case, i think, only user 'postgres' is duplicated: template1=> select * from pg_catalog.pg_shadow where usename = 'postgres'; usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig ----------+----------+-------------+----------+-----------+-------------------------------------+----------+----------- postgres | 1 | t | t | t | | | postgres | 1 | t | t | t | md5c084502ed11efa9d3d96d29717a5e555 | | (2 rows) template1=> select * from pg_namespace ; nspname | nspowner | nspacl ------------+----------+-------- pg_catalog | 1 | {=U} pg_toast | 1 | {=} public | 1 | {=UC} pg_temp_1 | 1 | (4 rows) And every change i made to user postgres has effect to the one with passwd set. I can delete only him too. Someone can help?Im running out of time so every advice will be appreciated. Maybe someone know how to delete whole pg_shadow table andcreate users on db with no users at all (cause pg_shadow will be empty :). Thanks for any (and quick :) help, greetings, Bartek
"Bartek GRUbY" <grubby@epf.pl> writes: > i have similar problem - can't make dumps of any db with the same error message (and all applications using this pg serverworks fine) . But in my case, i think, only user 'postgres' is duplicated: > template1=> select * from pg_catalog.pg_shadow where usename = 'postgres'; > usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig > ----------+----------+-------------+----------+-----------+-------------------------------------+----------+----------- > postgres | 1 | t | t | t | | | > postgres | 1 | t | t | t | md5c084502ed11efa9d3d96d29717a5e555 | | > (2 rows) What PG version are you using exactly? Could we see the system columns for those rows, ie, select ctid,xmin,xmax,cmin,xmax,* from pg_shadow where usename = 'postgres'; regards, tom lane
Bartosz Nowak <grubby@go2.pl> writes: > postgres=> select ctid,xmin,xmax,cmin,xmax,* from pg_shadow where > usename = 'postgres'; > ctid | xmin | xmax | cmin | xmax | usename | usesysid | > usecreatedb | usesuper | usecatupd | passwd > | valuntil | useconfig > --------+------------+-------+-------+-------+----------+----------+-------------+----------+-----------+-------------------------------------+----------+----------- > (0,1) | 1 | 50469 | 50469 | 50469 | postgres | 1 | > t | t | t | > | | > (1,25) | 2559800612 | 0 | 0 | 0 | postgres | 1 | > t | t | t | md5c084502ed11efa9d3d96d29717a5e555 > | | > (2 rows) Hmm --- clearly, that second xmin is corrupt. I'd venture that this is the result of a rolled-back (crashed?) ALTER USER SET PASSWORD operation, in which somehow the new tuple's xmin got clobbered ... and more than likely, its XMIN_COMMITTED bit got set at the same time. You could manually delete either row, probably better to zap the second one: delete from pg_shadow where ctid = '(1,25)'; and then things should be OK. A dump and reload wouldn't be a bad idea though, since there may be other corruption elsewhere that you haven't noticed yet. regards, tom lane
Tom Lane napisał(a): >"Bartek GRUbY" <grubby@epf.pl> writes: > > >>i have similar problem - can't make dumps of any db with the same error message (and all applications using this pg serverworks fine) . But in my case, i think, only user 'postgres' is duplicated: >> >> > > > >>template1=> select * from pg_catalog.pg_shadow where usename = 'postgres'; >> usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig >>----------+----------+-------------+----------+-----------+-------------------------------------+----------+----------- >> postgres | 1 | t | t | t | | | >> postgres | 1 | t | t | t | md5c084502ed11efa9d3d96d29717a5e555 | | >>(2 rows) >> >> > >What PG version are you using exactly? Could we see the system columns >for those rows, ie, > > select ctid,xmin,xmax,cmin,xmax,* from pg_shadow where usename = 'postgres'; > > regards, tom lane > > I'm using exactly PostgreSQL 7.3.9 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3) :] If it helps: postgres=> select ctid,xmin,xmax,cmin,xmax,* from pg_shadow where usename = 'postgres'; ctid | xmin | xmax | cmin | xmax | usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig --------+------------+-------+-------+-------+----------+----------+-------------+----------+-----------+-------------------------------------+----------+----------- (0,1) | 1 | 50469 | 50469 | 50469 | postgres | 1 | t | t | t | | | (1,25) | 2559800612 | 0 | 0 | 0 | postgres | 1 | t | t | t | md5c084502ed11efa9d3d96d29717a5e555 | | (2 rows) If U have any ideas or more questions please write. Like I said, my deadline is near :[ Now i'm thinking about writing a script to backup all data and schemas from all dbs by myself and reinstallin PG... but it is a lot of work and i cant check if i did everything right for sure :( Greetings, Bartek
Bartosz Nowak <grubby@go2.pl> writes: > Tom Lane napisa�(a): >> You could manually delete either row, probably better to zap the second >> one: >> delete from pg_shadow where ctid = '(1,25)'; > Heh... i wish it was that easy - i tried it already :] When i delete the > 'second' postgres user (with passwd set) PG is acting like there is no > postgres account at all: > ... > And i cannot modify row of 'first' postgres user (without passwd set) > with or without 'second' present: > mw=> delete from pg_shadow where ctid = '(0,1)' ; > DELETE 0 Ugh. That's looking more like you have a transaction ID wraparound problem. How long has it been since pg_shadow was last vacuumed? You could try a "VACUUM FREEZE pg_shadow" and see if the rows act any more normally after that. (Better take a filesystem-level backup first, so you can get out of it if that makes things worse.) regards, tom lane
Tom Lane napisał(a): > Bartosz Nowak <grubby@go2.pl> writes: > > >> postgres=> select ctid,xmin,xmax,cmin,xmax,* from pg_shadow where >> usename = 'postgres'; >> ctid | xmin | xmax | cmin | xmax | usename | usesysid | >> usecreatedb | usesuper | usecatupd | >> passwd | valuntil | useconfig >> --------+------------+-------+-------+-------+----------+----------+-------------+----------+-----------+-------------------------------------+----------+----------- >> > > > > >> (0,1) | 1 | 50469 | 50469 | 50469 | postgres | 1 | >> t | t | t >> | | | >> (1,25) | 2559800612 | 0 | 0 | 0 | postgres | 1 | >> t | t | t | >> md5c084502ed11efa9d3d96d29717a5e555 | | >> (2 rows) >> > > > Hmm --- clearly, that second xmin is corrupt. I'd venture that this > is the result of a rolled-back (crashed?) ALTER USER SET PASSWORD > operation, in which somehow the new tuple's xmin got clobbered ... and > more than likely, its XMIN_COMMITTED bit got set at the same time. > > You could manually delete either row, probably better to zap the second > one: > delete from pg_shadow where ctid = '(1,25)'; > and then things should be OK. A dump and reload wouldn't be a bad idea > though, since there may be other corruption elsewhere that you haven't > noticed yet. > > regards, tom lane > > Heh... i wish it was that easy - i tried it already :] When i delete the 'second' postgres user (with passwd set) PG is acting like there is no postgres account at all: pg_dumpall -i -U mw > test.sql pg_dump: SQL command failed pg_dump: Error message from server: ERROR: pg_class_aclcheck: invalid user id 1 pg_dump: The command was: SELECT (SELECT usename FROM pg_user WHERE usesysid = datdba) as dba, pg_encoding_to_char(encoding) as encoding, datpath FROM pg_database WHERE datname = 'alibi' pg_dumpall: pg_dump failed on database "alibi", exiting pg_dumpall -i -U postgres > test.sql pg_dumpall: could not connect to database "template1": FATAL: user "postgres" does not exist psql -U postgres psql: FATAL: user "postgres" does not exist And i cannot modify row of 'first' postgres user (without passwd set) with or without 'second' present: mw=> delete from pg_shadow where ctid = '(0,1)' ; DELETE 0 mw=> update pg_catalog.pg_shadow set passwd='test' where usename='postgres' and ctid = '(0,1)'; UPDATE 0 So further help will be welcomed :) And thank U for trying to help me. Greetings, Bartek
Tom Lane napisał(a): >Bartosz Nowak <grubby@go2.pl> writes: > > >>Tom Lane napisał(a): >> >> >>>You could manually delete either row, probably better to zap the second >>>one: >>>delete from pg_shadow where ctid = '(1,25)'; >>> >>> > > > >>Heh... i wish it was that easy - i tried it already :] When i delete the >>'second' postgres user (with passwd set) PG is acting like there is no >>postgres account at all: >>... >>And i cannot modify row of 'first' postgres user (without passwd set) >>with or without 'second' present: >>mw=> delete from pg_shadow where ctid = '(0,1)' ; >>DELETE 0 >> >> > >Ugh. That's looking more like you have a transaction ID wraparound >problem. How long has it been since pg_shadow was last vacuumed? > >You could try a "VACUUM FREEZE pg_shadow" and see if the rows act any >more normally after that. (Better take a filesystem-level backup >first, so you can get out of it if that makes things worse.) > > regards, tom lane > > We vacuum whole base regularly, and lately it was about week ago (then the problem occured). Since pg_dump don't work we are making filesystem-level backups, and still im experimenting on local copy of this db for safety. VACUUM FREEZE pg_shadow changed the pg_shadow table: select ctid,xmin,xmax,cmin,xmax,* from pg_shadow where usename = 'postgres'; ctid | xmin | xmax | cmin | xmax | usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig -------+------+-------+-------+-------+----------+----------+-------------+----------+-----------+-------------------------------------+----------+----------- (0,1) | 1 | 50469 | 50469 | 50469 | postgres | 1 | t | t | t | | | (1,4) | 2 | 0 | 0 | 0 | postgres | 1 | t | t | t | md5c084502ed11efa9d3d96d29717a5e555 | | (2 rows) But rows still act unnormally :[, i.e.: update pg_catalog.pg_shadow set passwd='test' where usename='postgres' and ctid = '(0,1)'; UPDATE 0 delete from pg_shadow where ctid = '(0,1)' ; DELETE 0 delete from pg_shadow where usesysid = 1 and passwd != 'md5c084502ed11efa9d3d96d29717a5e555' ; DELETE 0 delete from pg_shadow where usesysid = 1 ; DELETE 1 select ctid,xmin,xmax,cmin,xmax,* from pg_shadow where usename = 'postgres'; ctid | xmin | xmax | cmin | xmax | usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig -------+------+-------+-------+-------+----------+----------+-------------+----------+-----------+--------+----------+----------- (0,1) | 1 | 50469 | 50469 | 50469 | postgres | 1 | t | t | t | | | (1 row) update pg_catalog.pg_shadow set passwd='test' where usename='postgres' ; UPDATE 0 pg_dumpall -i -U mw > test.sql pg_dump: SQL command failed pg_dump: Error message from server: ERROR: pg_class_aclcheck: invalid user id 1 pg_dump: The command was: select (select usename from pg_user where usesysid = datdba) as dba, encoding, datpath from pg_database where datname = 'alibi' pg_dumpall: pg_dump failed on alibi, exiting Greetings, Bartek -- Pozdrawiam, GRUbY
Bartosz Nowak <grubby@go2.pl> writes: > Tom Lane napisa�(a): >>> Yeah. We need to look at the contents of the infomask for these rows. > Item 1 -- Length: 95 Offset: 8096 (0x1fa0) Flags: USED > XID: min (1) CMIN|XMAX: 50469 CMAX|XVAC: 0 > Block Id: 0 linp Index: 29 Attributes: 8 Size: 24 > infomask: 0x0501 (HASNULL|XMIN_COMMITTED|XMAX_COMMITTED) > t_bits: [0]: 0x1f > 1fa0: 01000000 25c50000 00000000 00000000 ....%........... > 1fb0: 1d000800 0105181f 706f7374 67726573 ........postgres Well, you are definitely suffering from transaction ID wraparound. Other rows in the table bear XIDs as high as 2559800779, so your current transaction counter is at least that much. Transaction 50469, which obsoleted this row, was a *long* time ago. The reason that this row is suddenly being seen as valid, rather than deleted, is that the XID counter has wrapped around and so 50469 is now seen as "in the future" not "in the past" --- that is, the code thinks the row is deleted, but by a concurrent transaction that started later than our own transaction. So it's valid, but you can't delete it because it's already deleted. I can't see any evidence of data corruption. I think you just forgot to vacuum pg_shadow for a very long time. This row should have been physically deleted long ago --- but evidently VACUUM was never run across pg_shadow, until it was too late. While you could perhaps manually repair this one row with a hex editor, I don't think that will get you out of trouble. The odds are good that there are similar problems elsewhere, including both system catalogs and user tables (ie, if you weren't vacuuming pg_shadow, what else weren't you vacuuming?). I'm not sure that you are going to be able to get out of this without losing data. Here's what I'd try (but bear in mind that I just made up this procedure on the spot): 1. Stop the postmaster and take a physical dump of the $PGDATA tree, if you don't have one already. This will at least let you get back to where you are now if the rest doesn't work. 2. Restart the postmaster, but don't let anyone else in (for safety I'd suggest modifying pg_hba.conf to guarantee this). You're going to be down for a little bit :-( 3. Do a VACUUM FREEZE (database-wide) in *each* database, or at least each one you care about recovering. *Don't* use FULL. 4. Stop the postmaster again. Use pg_controldata to note the current transaction counter ("latest checkpoint's NextXID"). Then use pg_resetxlog with the -x switch to deliberately back up the XID counter to something less than 2 billion. 5. Restart the postmaster, and try to pg_dump everything. Also do whatever sanity tests you can think of --- look for missing and duplicated rows in particular. 6. If you think you have a good dump, initdb and reload the dump, and you're back in business. If it didn't work, well, revert to your last backup :-( After you get out of your immediate problem, you had better look at your vacuuming procedures and find out why you are in this fix. See http://www.postgresql.org/docs/7.3/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND regards, tom lane
Bartosz Nowak <grubby@go2.pl> writes: > By the way: isn't that situation a bug in PG? What about a situation > when this row was obsoleted just before XID counter wrap around and > there is no way to do vacuum between those facts? The same situation > would appear. No, because XID space is circular: you always have exactly 2 billion transactions before any newly-created row wraps around. (I strongly suggest you read the manual page I pointed you to.) The bug is that PG didn't force you to vacuum the table within that time. 8.1 will have code that warns increasingly stridently and eventually shuts down the server before wraparound can occur... > Why should i note the current trans. counter? See below. Also, if it's not something a bit past 2 billion then this whole theory is wrong and you don't want to proceed. > And shouldn't i set XID counter to the max possible value (2^32 = over 4 billion) to avoid some transactions to be seenas "in the future" ? No. The VACUUM FREEZE will replace every XID in the "immediate past" 2 billion transactions with FrozenXid, which will make that stuff safe against backing up the XID counter, and then you want to back up so that the oldest transactions are seen as "in the past" again. You need to use a starting XID enough less than 2 billion that you have time to validate and dump the data before those oldest transactions go out of range again ... but not more than 2 billion less than where it is when you shut down, else you will have some unfrozen rows "in the future". At least that's my theory about how to get out of this. Like I said, it's untried. regards, tom lane