Обсуждение: Problems dumping database from 7.3.1
I've been using a PGSQL 7.3.1 system as my development box for some time, and so far, it worked fine. (This is running onSlackware distro with a 2.2.19 kernel under VMWare 4) However, yesterday, I added a column to a base table (that is inherited by several other tables). Shortly afterwards, wheneverI tried to access either the base table or child tables, I get the message saying that pgsql has lost communicationwith the server. Assuming that I had stumbled across a bug, I decided to upgrade to 7.4.2. Before starting, I tried to do a pg_dump, and itfailed. Again, it failed partway through, loosing the connection to the server. Below is the erorr I am getting gtg@Vmware1:~$ pg_dump -f /tmp/sw4.sql -U postgres switchboards pg_dump: connection not open pg_dump: lost synchronization with server, resetting connection pg_dump: SQL command to dump the contents of table "StaffMember" failed: PQendcopy() failed. pg_dump: Error message from server: FATAL: The database system is starting up pg_dump: The command was: COPY public."StaffMember" ("Id", "ExDirectory", "Name", "JobTitle", "Title", "FirstName", "MiddleInitial","LastName", "Status", "WelshSpeakingAbility", "JobTitleC", "OrgGroupID", "OtherNumber", "PostNumber", "BandId","JobDescE", "JobDescW", "EMailAddr", "NDSObject", "SectionId", dnid) TO stdout; gtg@Vmware1:~$ The log file says that the server process was terminated by a signal 11, and it then restarts. Any ideas as to how I can bring my database back to life ? Thanks, GTG Gordon Ross, Network Manager/Rheolwr Rhydwaith Countryside Council for Wales/Cyngor Cefn Gwlad Cymru
"Gordon Ross" <G.Ross@ccw.gov.uk> writes: > I've been using a PGSQL 7.3.1 system as my development box for some time, and so far, it worked fine. (This is runningon Slackware distro with a 2.2.19 kernel under VMWare 4) > However, yesterday, I added a column to a base table (that is inherited by several other tables). Shortly afterwards, wheneverI tried to access either the base table or child tables, I get the message saying that pgsql has lost communicationwith the server. Sounds pretty messy. If you're really lucky, an in-place update to 7.3.6 might fix it, but I can't offhand think of any bug fixes in the 7.3.* branch that had anything to do with inheritance. Of course it's not necessarily true that this has anything to do with inheritance, either. What we'd need to know before we could speculate much further is where exactly the crash is happening. A gdb backtrace from the SEGV would help a lot. I'd suggest: 1. Start a psql session. Figure out the PID of the connected backend (pg_stat_activity may help, or just use "ps"). 2. In another shell window, attach to the backend process with gdb: gdb /path/to/postgres-executable PID ... gdb> (You can do this either as root or as the postgres user.) 3. Tell gdb to let the backend continue: gdb> cont 4. Do whatever it takes in the psql session to provoke crash. gdb should announce that the backend has gotten a signal 11. Now say gdb> bt gdb> quit and send along the output of bt. regards, tom lane
>4. Do whatever it takes in the psql session to provoke crash. gdb >should announce that the backend has gotten a signal 11. Now say > gdb> bt > gdb> quit > >and send along the output of bt. Program received signal SIGSEGV, Segmentation fault. 0x806ba0a in nocachegetattr () at eval.c:88 88 eval.c: No such file or directory. (gdb) bt #0 0x806ba0a in nocachegetattr () at eval.c:88 #1 0x80d41ec in ExecEvalVar () at eval.c:88 #2 0x80d549c in ExecEvalExpr () at eval.c:88 #3 0x80d5809 in ExecTargetList () at eval.c:88 #4 0x80d5a9b in ExecProject () at eval.c:88 #5 0x80d5b79 in ExecScan () at eval.c:88 #6 0x80dbcee in ExecSeqScan () at eval.c:88 #7 0x80d3929 in ExecProcNode () at eval.c:88 #8 0x80d76d0 in ExecProcAppend () at eval.c:88 #9 0x80d3919 in ExecProcNode () at eval.c:88 #10 0x80dbac1 in ExecResult () at eval.c:88 #11 0x80d3909 in ExecProcNode () at eval.c:88 #12 0x80d288e in ExecutePlan () at eval.c:88 #13 0x80d1f80 in ExecutorRun () at eval.c:88 #14 0x8121aab in ProcessQuery () at eval.c:88 #15 0x811ff7d in pg_exec_query_string () at eval.c:88 #16 0x8121093 in PostgresMain () at eval.c:88 #17 0x8108b9a in DoBackend () at eval.c:88 #18 0x810848f in BackendStartup () at eval.c:88 #19 0x810763c in ServerLoop () at eval.c:88 #20 0x810719a in PostmasterMain () at eval.c:88 #21 0x80e46ff in main () at eval.c:88 #22 0x400e92eb in __libc_start_main (main=0x80e4530 <main>, argc=2, ubp_av=0xbffffd44, init=0x806a7b0 <_init>, fini=0x81785fc <_fini>, ---Type <return> to continue, or q <return> to quit--- rtld_fini=0x4000c130 <_dl_fini>, stack_end=0xbffffd3c) at ../sysdeps/generic/libc-start.c:129 Gordon Ross, Network Manager/Rheolwr Rhydwaith Countryside Council for Wales/Cyngor Cefn Gwlad Cymru
"Gordon Ross" <G.Ross@ccw.gov.uk> writes: >> 4. Do whatever it takes in the psql session to provoke crash >> and send along the output of bt. > Program received signal SIGSEGV, Segmentation fault. > 0x806ba0a in nocachegetattr () at eval.c:88 > 88 eval.c: No such file or directory. > (gdb) bt > #0 0x806ba0a in nocachegetattr () at eval.c:88 > #1 0x80d41ec in ExecEvalVar () at eval.c:88 > #2 0x80d549c in ExecEvalExpr () at eval.c:88 > #3 0x80d5809 in ExecTargetList () at eval.c:88 Hmm. The file/line references are obviously wrong, but let's assume that the function names are right (they at least look plausible). What exactly was the query that provoked this? Also, could we see \d output for the table(s) involved? regards, tom lane
switchboards=# \d "DNOwner"; Table "public.DNOwner" Column | Type | Modifiers -------------+---------+---------------------------------------------------- Id | integer | not null default nextval('"DNOwner_id_seq"'::text) ExDirectory | boolean | default 'f' Name | text | dnid | integer | switchboards=# select * from "DNOwner"; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. The DNOwner table is the base table. There are other tables that inherit from this table, and doing a simple select on thoseproduce the same result. GTG >>> Tom Lane <tgl@sss.pgh.pa.us> 05/12/04 17:51 PM >>> "Gordon Ross" <G.Ross@ccw.gov.uk> writes: >> 4. Do whatever it takes in the psql session to provoke crash >> and send along the output of bt. > Program received signal SIGSEGV, Segmentation fault. > 0x806ba0a in nocachegetattr () at eval.c:88 > 88 eval.c: No such file or directory. > (gdb) bt > #0 0x806ba0a in nocachegetattr () at eval.c:88 > #1 0x80d41ec in ExecEvalVar () at eval.c:88 > #2 0x80d549c in ExecEvalExpr () at eval.c:88 > #3 0x80d5809 in ExecTargetList () at eval.c:88 Hmm. The file/line references are obviously wrong, but let's assume that the function names are right (they at least look plausible). What exactly was the query that provoked this? Also, could we see \d output for the table(s) involved? regards, tom lane
"Gordon Ross" <G.Ross@ccw.gov.uk> writes: > switchboards=# select * from "DNOwner"; > server closed the connection unexpectedly > The DNOwner table is the base table. There are other tables that inherit from this table, and doing a simple select onthose produce the same result. Hmm. Does 'select * from only "DNOwner";' work, by any chance? I have a suspicion now that you were bit by the bug repaired in this 7.3.2 bug fix: 2003-01-04 19:56 tgl * src/backend/optimizer/prep/prepunion.c (REL7_3_STABLE): Fix inherited UPDATE for cases where child column numbering doesn't match parent table. This used to work, but was broken in 7.3 by rearrangement of code that handles targetlist sorting. Add a regression test to catch future breakage. What I am thinking happened is that you did an 'UPDATE "DNOwner"' to insert values for the new column, and that because of the aforesaid bug, the updates of the child tables actually corrupted the data (by storing the intended new dnid value into the wrong column position). The crash is probably happening because SELECT is trying to interpret an integer as a text field, or something along that line. If that theory is correct, then the parent table should still be okay (thus the request to try an ONLY select). However, I'm afraid all your child tables are toast, or at least all the rows you updated this way are toast. You could maybe get to a dumpable state by deleting all the child rows, but if the DELETE crashes then you'll have little recourse but to drop the child tables completely. Updating Postgres to 7.3.2 or later will prevent fresh occurrences of the same problem, but unfortunately won't bring back your old data :-(. I hope you have backups ... regards, tom lane
*SIGH* Oh, well. I suppose I should have updated up development machine a while ago. ./configure; make here we come. Thank you for looking at this. At least I can be re-assured that the problem is fixed in later versions of Postgres. GTG >>> Tom Lane <tgl@sss.pgh.pa.us> 05/12/04 20:36 PM >>> "Gordon Ross" <G.Ross@ccw.gov.uk> writes: > switchboards=# select * from "DNOwner"; > server closed the connection unexpectedly > The DNOwner table is the base table. There are other tables that inherit from this table, and doing a simple select onthose produce the same result. Hmm. Does 'select * from only "DNOwner";' work, by any chance? I have a suspicion now that you were bit by the bug repaired in this 7.3.2 bug fix: 2003-01-04 19:56 tgl * src/backend/optimizer/prep/prepunion.c (REL7_3_STABLE): Fix inherited UPDATE for cases where child column numbering doesn't match parent table. This used to work, but was broken in 7.3 by rearrangement of code that handles targetlist sorting. Add a regression test to catch future breakage. What I am thinking happened is that you did an 'UPDATE "DNOwner"' to insert values for the new column, and that because of the aforesaid bug, the updates of the child tables actually corrupted the data (by storing the intended new dnid value into the wrong column position). The crash is probably happening because SELECT is trying to interpret an integer as a text field, or something along that line. If that theory is correct, then the parent table should still be okay (thus the request to try an ONLY select). However, I'm afraid all your child tables are toast, or at least all the rows you updated this way are toast. You could maybe get to a dumpable state by deleting all the child rows, but if the DELETE crashes then you'll have little recourse but to drop the child tables completely. Updating Postgres to 7.3.2 or later will prevent fresh occurrences of the same problem, but unfortunately won't bring back your old data :-(. I hope you have backups ... regards, tom lane