Обсуждение: Problems dumping database from 7.3.1

Поиск
Список
Период
Сортировка

Problems dumping database from 7.3.1

От
"Gordon Ross"
Дата:
I've been using a PGSQL 7.3.1 system as my development box for some time, and so far, it worked fine. (This is running
onSlackware distro with a 2.2.19 kernel under VMWare 4) 

However, yesterday, I added a column to a base table (that is inherited by several other tables). Shortly afterwards,
wheneverI tried to access either the base table or child tables, I get the message saying that pgsql has lost
communicationwith the server. 

Assuming that I had stumbled across a bug, I decided to upgrade to 7.4.2. Before starting, I tried to do a pg_dump, and
itfailed. Again, it failed partway through, loosing the connection to the server. 

Below is the erorr I am getting

gtg@Vmware1:~$ pg_dump -f /tmp/sw4.sql -U postgres switchboards
pg_dump: connection not open
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table "StaffMember" failed: PQendcopy() failed.
pg_dump: Error message from server: FATAL:  The database system is starting up

pg_dump: The command was: COPY public."StaffMember" ("Id", "ExDirectory", "Name", "JobTitle", "Title", "FirstName",
"MiddleInitial","LastName", "Status", "WelshSpeakingAbility", "JobTitleC", "OrgGroupID", "OtherNumber", "PostNumber",
"BandId","JobDescE", "JobDescW", "EMailAddr", "NDSObject", "SectionId", dnid) TO stdout; 
gtg@Vmware1:~$

The log file says that the server process was terminated by a signal 11, and it then restarts.


Any ideas as to how I can bring my database back to life ?

Thanks,

GTG

Gordon Ross,
Network Manager/Rheolwr Rhydwaith
Countryside Council for Wales/Cyngor Cefn Gwlad Cymru


Re: Problems dumping database from 7.3.1

От
Tom Lane
Дата:
"Gordon Ross" <G.Ross@ccw.gov.uk> writes:
> I've been using a PGSQL 7.3.1 system as my development box for some time, and so far, it worked fine. (This is
runningon Slackware distro with a 2.2.19 kernel under VMWare 4) 
> However, yesterday, I added a column to a base table (that is inherited by several other tables). Shortly afterwards,
wheneverI tried to access either the base table or child tables, I get the message saying that pgsql has lost
communicationwith the server. 

Sounds pretty messy.  If you're really lucky, an in-place update to
7.3.6 might fix it, but I can't offhand think of any bug fixes in the
7.3.* branch that had anything to do with inheritance.

Of course it's not necessarily true that this has anything to do with
inheritance, either.  What we'd need to know before we could speculate
much further is where exactly the crash is happening.  A gdb backtrace
from the SEGV would help a lot.  I'd suggest:

1. Start a psql session.  Figure out the PID of the connected backend
(pg_stat_activity may help, or just use "ps").

2. In another shell window, attach to the backend process with gdb:
    gdb /path/to/postgres-executable PID
    ...
    gdb>
(You can do this either as root or as the postgres user.)

3. Tell gdb to let the backend continue:
    gdb> cont

4. Do whatever it takes in the psql session to provoke crash.  gdb
should announce that the backend has gotten a signal 11.  Now say
    gdb> bt
    gdb> quit

and send along the output of bt.

            regards, tom lane

Re: Problems dumping database from 7.3.1

От
"Gordon Ross"
Дата:
>4. Do whatever it takes in the psql session to provoke crash.  gdb
>should announce that the backend has gotten a signal 11.  Now say
>    gdb> bt
>    gdb> quit
>
>and send along the output of bt.

Program received signal SIGSEGV, Segmentation fault.
0x806ba0a in nocachegetattr () at eval.c:88
88      eval.c: No such file or directory.
(gdb) bt
#0  0x806ba0a in nocachegetattr () at eval.c:88
#1  0x80d41ec in ExecEvalVar () at eval.c:88
#2  0x80d549c in ExecEvalExpr () at eval.c:88
#3  0x80d5809 in ExecTargetList () at eval.c:88
#4  0x80d5a9b in ExecProject () at eval.c:88
#5  0x80d5b79 in ExecScan () at eval.c:88
#6  0x80dbcee in ExecSeqScan () at eval.c:88
#7  0x80d3929 in ExecProcNode () at eval.c:88
#8  0x80d76d0 in ExecProcAppend () at eval.c:88
#9  0x80d3919 in ExecProcNode () at eval.c:88
#10 0x80dbac1 in ExecResult () at eval.c:88
#11 0x80d3909 in ExecProcNode () at eval.c:88
#12 0x80d288e in ExecutePlan () at eval.c:88
#13 0x80d1f80 in ExecutorRun () at eval.c:88
#14 0x8121aab in ProcessQuery () at eval.c:88
#15 0x811ff7d in pg_exec_query_string () at eval.c:88
#16 0x8121093 in PostgresMain () at eval.c:88
#17 0x8108b9a in DoBackend () at eval.c:88
#18 0x810848f in BackendStartup () at eval.c:88
#19 0x810763c in ServerLoop () at eval.c:88
#20 0x810719a in PostmasterMain () at eval.c:88
#21 0x80e46ff in main () at eval.c:88
#22 0x400e92eb in __libc_start_main (main=0x80e4530 <main>, argc=2,
    ubp_av=0xbffffd44, init=0x806a7b0 <_init>, fini=0x81785fc <_fini>,
---Type <return> to continue, or q <return> to quit---
    rtld_fini=0x4000c130 <_dl_fini>, stack_end=0xbffffd3c)
    at ../sysdeps/generic/libc-start.c:129

Gordon Ross,
Network Manager/Rheolwr Rhydwaith
Countryside Council for Wales/Cyngor Cefn Gwlad Cymru


Re: Problems dumping database from 7.3.1

От
Tom Lane
Дата:
"Gordon Ross" <G.Ross@ccw.gov.uk> writes:
>> 4. Do whatever it takes in the psql session to provoke crash
>> and send along the output of bt.

> Program received signal SIGSEGV, Segmentation fault.
> 0x806ba0a in nocachegetattr () at eval.c:88
> 88      eval.c: No such file or directory.
> (gdb) bt
> #0  0x806ba0a in nocachegetattr () at eval.c:88
> #1  0x80d41ec in ExecEvalVar () at eval.c:88
> #2  0x80d549c in ExecEvalExpr () at eval.c:88
> #3  0x80d5809 in ExecTargetList () at eval.c:88

Hmm.  The file/line references are obviously wrong, but let's assume
that the function names are right (they at least look plausible).
What exactly was the query that provoked this?
Also, could we see \d output for the table(s) involved?

            regards, tom lane

Re: Problems dumping database from 7.3.1

От
"Gordon Ross"
Дата:
switchboards=# \d "DNOwner";
                           Table "public.DNOwner"
   Column    |  Type   |                     Modifiers
-------------+---------+----------------------------------------------------
 Id          | integer | not null default nextval('"DNOwner_id_seq"'::text)
 ExDirectory | boolean | default 'f'
 Name        | text    |
 dnid        | integer |

switchboards=# select * from "DNOwner";
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.


The DNOwner table is the base table. There are other tables that inherit from this table, and doing a simple select on
thoseproduce the same result. 

GTG

>>> Tom Lane <tgl@sss.pgh.pa.us> 05/12/04 17:51 PM >>>
"Gordon Ross" <G.Ross@ccw.gov.uk> writes:
>> 4. Do whatever it takes in the psql session to provoke crash
>> and send along the output of bt.

> Program received signal SIGSEGV, Segmentation fault.
> 0x806ba0a in nocachegetattr () at eval.c:88
> 88      eval.c: No such file or directory.
> (gdb) bt
> #0  0x806ba0a in nocachegetattr () at eval.c:88
> #1  0x80d41ec in ExecEvalVar () at eval.c:88
> #2  0x80d549c in ExecEvalExpr () at eval.c:88
> #3  0x80d5809 in ExecTargetList () at eval.c:88

Hmm.  The file/line references are obviously wrong, but let's assume
that the function names are right (they at least look plausible).
What exactly was the query that provoked this?
Also, could we see \d output for the table(s) involved?

            regards, tom lane


Re: Problems dumping database from 7.3.1

От
Tom Lane
Дата:
"Gordon Ross" <G.Ross@ccw.gov.uk> writes:
> switchboards=# select * from "DNOwner";
> server closed the connection unexpectedly

> The DNOwner table is the base table. There are other tables that inherit from this table, and doing a simple select
onthose produce the same result. 

Hmm.  Does 'select * from only "DNOwner";' work, by any chance?

I have a suspicion now that you were bit by the bug repaired in this
7.3.2 bug fix:

2003-01-04 19:56  tgl

    * src/backend/optimizer/prep/prepunion.c (REL7_3_STABLE): Fix
    inherited UPDATE for cases where child column numbering doesn't
    match parent table.  This used to work, but was broken in 7.3 by
    rearrangement of code that handles targetlist sorting.    Add a
    regression test to catch future breakage.

What I am thinking happened is that you did an 'UPDATE "DNOwner"' to
insert values for the new column, and that because of the aforesaid
bug, the updates of the child tables actually corrupted the data (by
storing the intended new dnid value into the wrong column position).
The crash is probably happening because SELECT is trying to interpret
an integer as a text field, or something along that line.

If that theory is correct, then the parent table should still be okay
(thus the request to try an ONLY select).  However, I'm afraid all your
child tables are toast, or at least all the rows you updated this way
are toast.  You could maybe get to a dumpable state by deleting all the
child rows, but if the DELETE crashes then you'll have little recourse
but to drop the child tables completely.

Updating Postgres to 7.3.2 or later will prevent fresh occurrences of
the same problem, but unfortunately won't bring back your old data :-(.
I hope you have backups ...

            regards, tom lane

Re: Problems dumping database from 7.3.1

От
"Gordon Ross"
Дата:
*SIGH* Oh, well. I suppose I should have updated up development machine a while ago. ./configure; make here we come.

Thank you for looking at this. At least I can be re-assured that the problem is fixed in later versions of Postgres.

GTG


>>> Tom Lane <tgl@sss.pgh.pa.us> 05/12/04 20:36 PM >>>
"Gordon Ross" <G.Ross@ccw.gov.uk> writes:
> switchboards=# select * from "DNOwner";
> server closed the connection unexpectedly

> The DNOwner table is the base table. There are other tables that inherit from this table, and doing a simple select
onthose produce the same result. 

Hmm.  Does 'select * from only "DNOwner";' work, by any chance?

I have a suspicion now that you were bit by the bug repaired in this
7.3.2 bug fix:

2003-01-04 19:56  tgl

    * src/backend/optimizer/prep/prepunion.c (REL7_3_STABLE): Fix
    inherited UPDATE for cases where child column numbering doesn't
    match parent table.  This used to work, but was broken in 7.3 by
    rearrangement of code that handles targetlist sorting.    Add a
    regression test to catch future breakage.

What I am thinking happened is that you did an 'UPDATE "DNOwner"' to
insert values for the new column, and that because of the aforesaid
bug, the updates of the child tables actually corrupted the data (by
storing the intended new dnid value into the wrong column position).
The crash is probably happening because SELECT is trying to interpret
an integer as a text field, or something along that line.

If that theory is correct, then the parent table should still be okay
(thus the request to try an ONLY select).  However, I'm afraid all your
child tables are toast, or at least all the rows you updated this way
are toast.  You could maybe get to a dumpable state by deleting all the
child rows, but if the DELETE crashes then you'll have little recourse
but to drop the child tables completely.

Updating Postgres to 7.3.2 or later will prevent fresh occurrences of
the same problem, but unfortunately won't bring back your old data :-(.
I hope you have backups ...

            regards, tom lane