Обсуждение: simple update closes connection. why?
hi,
this simple update closes the connection.  I use the database to dynamically update titles and directory on my web
pages.  
I show table structure first then the command then, uh, the closed connection.
calling=# \d calling;
                Table "public.calling"
   Column    |          Type          | Modifiers
-------------+------------------------+-----------
 filename    | character varying(40)  |
 description | character varying(180) |
 uname       | character varying(30)  |
Indexes: unique_filename unique btree (filename),
         unique_filename_idx btree (filename)
calling=# select * from calling limit 5;
    filename     |             description             | uname
-----------------+-------------------------------------+-------
 c_java_python   |                                     |
 conf            | old conf directory                  |
 mod_perl_webapp | main weba pplication, now a symlink |
 python          |                                     |
 ssh+ssl         |                                     |
(5 rows)
calling=# update calling set uname='pencilhead';
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!# \q
[pencilhead@www postgres]$ psql -d calling
Password:
Welcome to psql 7.3.1, the PostgreSQL interactive terminal.
Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit
calling=# select count(*) from calling;
 count
-------
  1683
(1 row)
calling=# \dp calling
       Access privileges for database "calling"
 Schema | Table |        Access privileges
--------+-------+---------------------------------
 public | calling   | {=,pencilhead=arwdRxt}
(1 row)
joe speigle
			
		hi,
I increased the debugging in postgresql.conf but not many changes
calling=# update calling set uname='joe';
DEBUG:  StartTransactionCommand
DEBUG:  ProcessQuery
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!#
--
joe speigle
			
		joseph speigle <joe.speigle@jklh.us> writes:
> this simple update closes the connection.
My first guess is there's some data corruption in the table or one of
its indexes.  It would be useful to get a stack trace showing where
in the backend the core dump occurs --- can you manage that?
Other things to check: can you do a "select * from calling" without
crashing?  What shows up in the postmaster log when the crash occurs?
If you just want to fix the database ASAP, you could try a REINDEX
on the table --- if the problem is in the indexes that should get
rid of it.  But it would also destroy the evidence of exactly what
went wrong.
            regards, tom lane
			
		tom,
i have a few things for you.  namely, i increased the debug settings for logging and i ran gdb by attaching to the
runningprocess.   
history:
i had altered the table when it was okay with the following sequence
alter table calling add column uname character varying(20);
 alter table calling alter column uname set not null;
said I couldn't do that because there were null values in it, so that's when i tried the
update calling set uname='joe';
and got errors.
maybe the attempt to set to not null with empty records was bad.
=======================================================================================================
pg_dump -d 'calling' > dump.sql
....................
pg_dump: DEBUG:  StartTransactionCommand
pg_dump: DEBUG:  ProcessUtility
pg_dump: DEBUG:  CommitTransactionCommand
DEBUG:  StartTransactionCommand
DEBUG:  ProcessUtility
pg_dump: DEBUG:  StartTransactionCommand
pg_dump: DEBUG:  ProcessUtility
pg_dump: dumpClasses(): SQL command failed
pg_dump: Error message from server: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
pg_dump: The command was: FETCH 100 FROM _pg_dump_cursor
DEBUG:  reaping dead processes
DEBUG:  child process (pid 16468) was terminated by signal 11
LOG:  server process (pid 16468) was terminated by signal 11
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing shared memory and semaphores
DEBUG:  shmem_exit(0)
DEBUG:  invoking IpcMemoryCreate(size=3620864)
-bash-2.05b$ LOG:  database system was interrupted at 2004-02-01 01:16:07 CST
LOG:  checkpoint record is at 0/344EA680
LOG:  redo record is at 0/344EA680; undo record is at 0/0; shutdown TRUE
LOG:  next transaction id: 8800413; next oid: 614903
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  ReadRecord: record with zero length at 0/344EA6C0
LOG:  redo is not required
LOG:  database system is ready
DEBUG:  proc_exit(0)
DEBUG:  shmem_exit(0)
DEBUG:  exit(0)
DEBUG:  reaping dead processes
==================================================================
postgres 17460  0.0  0.3 10952 3476 pts/4    S    01:52   0:00 postgres: joe calling [local] idle
postgres 17461  0.0  0.0  2620  704 pts/4    R    01:52   0:00 ps -axu
-bash-2.05b$ gdb attach 17460
GNU gdb Red Hat Linux (5.2.1-4)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...attach: No such file or directory.
Attaching to process 17460
Reading symbols from /usr/local/pgsql/bin/postgres...done.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libreadline.so.4...done.
Loaded symbols for /usr/lib/libreadline.so.4
Reading symbols from /usr/lib/libncurses.so.5...done.
Loaded symbols for /usr/lib/libncurses.so.5
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/i686/libm.so.6...done.
Loaded symbols for /lib/i686/libm.so.6
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x420dabc2 in recv () from /lib/i686/libc.so.6
(gdb) c
Continuing.
DEBUG:  StartTransactionCommand
DEBUG:  ProcessQuery
Program received signal SIGSEGV, Segmentation fault.
0x0806b286 in nocachegetattr ()
(gdb) where
#0  0x0806b286 in nocachegetattr ()
#1  0x080cef8b in ExecEvalVar ()
#2  0x080d01e0 in ExecEvalExpr ()
#3  0x080d04ea in ExecTargetList ()
#4  0x080d0797 in ExecProject ()
#5  0x080d0839 in ExecScan ()
#6  0x080d674b in ExecSeqScan ()
#7  0x080ce842 in ExecProcNode ()
#8  0x080d736e in ExecLimit ()
#9  0x080ce879 in ExecProcNode ()
#10 0x080cd7cb in ExecutePlan ()
#11 0x080cce92 in ExecutorRun ()
#12 0x0811a645 in ProcessQuery ()
#13 0x08118dc1 in pg_exec_query_string ()
#14 0x08119c21 in PostgresMain ()
#15 0x08101ac2 in DoBackend ()
#16 0x0810163a in BackendStartup ()
#17 0x08100781 in ServerLoop ()
#18 0x08100230 in PostmasterMain ()
#19 0x080defe4 in main ()
#20 0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6
(gdb)
RECOMPILE  RECOMPILE  RECOMPILE  RECOMPILE  RECOMPILE  RECOMPILE  RECOMPILE   -g
(gdb) bt
#0  0x0806b854 in nocachegetattr (tuple=0x82f5b78, attnum=1, tupleDesc=0x406f23a0, isnull=0xbfffe9d3 "") at
heaptuple.c:349
#1  0x080f79a3 in ExecEvalVar (variable=0x82f5080, econtext=0x82f5b08, isNull=0xbfffe9d3 "") at execQual.c:377
#2  0x080f972b in ExecEvalExpr (expression=0x82f5080, econtext=0x82f5b08, isNull=0xbfffe9d3 "", isDone=0xbfffe9f0)
    at execQual.c:1682
#3  0x080f9c3d in ExecTargetList (targetlist=0x82f50a8, nodomains=1, targettype=0x82f5ee8, values=0x82f5fc8,
    econtext=0x82f5b08, isDone=0xbfffec04) at execQual.c:2058
#4  0x080fa025 in ExecProject (projInfo=0x82f5fa0, isDone=0xbfffec04) at execQual.c:2282
#5  0x080fa140 in ExecScan (node=0x82f59f0, accessMtd=0x8102498 <SeqNext>) at execScan.c:133
#6  0x0810259c in ExecSeqScan (node=0x82f59f0) at nodeSeqscan.c:133
#7  0x080f6c28 in ExecProcNode (node=0x82f59f0, parent=0x0) at execProcnode.c:291
#8  0x080f56ac in ExecutePlan (estate=0x82f5e20, plan=0x82f59f0, operation=CMD_SELECT, numberTuples=0,
    direction=ForwardScanDirection, destfunc=0x82f5fe0) at execMain.c:954
#9  0x080f4a87 in ExecutorRun (queryDesc=0x82f5df8, estate=0x82f5e20, direction=ForwardScanDirection, count=0)
    at execMain.c:195
#10 0x0815ffc9 in ProcessQuery (parsetree=0x82f33e8, plan=0x82f59f0, dest=Remote, completionTag=0xbfffedd0 "")
    at pquery.c:242
#11 0x0815e232 in pg_exec_query_string (query_string=0x82f3048, dest=Remote, parse_context=0x82ed1c8) at postgres.c:838
#12 0x0815f4da in PostgresMain (argc=4, argv=0xbffff070, username=0x82ac1a1 "joe") at postgres.c:2016
#13 0x0813e836 in DoBackend (port=0x82ac070) at postmaster.c:2293
#14 0x0813df3f in BackendStartup (port=0x82ac070) at postmaster.c:1915
#15 0x0813cd37 in ServerLoop () at postmaster.c:1000
#16 0x0813c810 in PostmasterMain (argc=1, argv=0x8293198) at postmaster.c:779
#17 0x0810e22f in main (argc=1, argv=0xbffff9d4) at main.c:210
#18 0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6
(gdb)
(gdb) info all-registers
eax            0x0      0
ecx            0x924    2340
edx            0xc      12
ebx            0x1a     26
esp            0xbfffe8d0       0xbfffe8d0
ebp            0xbfffe918       0xbfffe918
esi            0x40012020       1073815584
edi            0xbffff9d4       -1073743404
eip            0x806b854        0x806b854
eflags         0x10283  66179
cs             0x23     35
ss             0x2b     43
ds             0x2b     43
es             0x2b     43
fs             0x0      0
gs             0x0      0
st0            0        (raw 0x00000000000000000000)
st1            0        (raw 0x00000000000000000000)
st2            0        (raw 0x00000000000000000000)
st3            0        (raw 0x00000000000000000000)
st4            1        (raw 0x3fff8000000000000000)
st5            1        (raw 0x3fff8000000000000000)
st6            0        (raw 0x00000000000000000000)
st7            0        (raw 0x00000000000000000000)
fctrl          0x37f    895
fstat          0x4020   16416
ftag           0xffff   65535
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
xmm0           {f = {0x0, 0x0, 0x0, 0x0}}       {f = {0, 0, 0, 0}}
xmm1           {f = {0x0, 0x0, 0x0, 0x0}}       {f = {0, 0, 0, 0}}
xmm2           {f = {0x0, 0x0, 0x0, 0x0}}       {f = {0, 0, 0, 0}}
---Type <return> to continue, or q <return> to quit---
xmm3           {f = {0x0, 0x0, 0x0, 0x0}}       {f = {0, 0, 0, 0}}
xmm4           {f = {0x0, 0x0, 0x0, 0x0}}       {f = {0, 0, 0, 0}}
xmm5           {f = {0x0, 0x0, 0x0, 0x0}}       {f = {0, 0, 0, 0}}
xmm6           {f = {0x0, 0x0, 0x0, 0x0}}       {f = {0, 0, 0, 0}}
xmm7           {f = {0x0, 0x0, 0x0, 0x0}}       {f = {0, 0, 0, 0}}
mxcsr          0x1f80   8064
orig_eax       0xffffffff       -1
(gdb)
> joseph speigle <joe.speigle@jklh.us> writes:
> > this simple update closes the connection.
>
> My first guess is there's some data corruption in the table or one of
> its indexes.  It would be useful to get a stack trace showing where
> in the backend the core dump occurs --- can you manage that?
>
> Other things to check: can you do a "select * from calling" without
> crashing?  What shows up in the postmaster log when the crash occurs?
>
> If you just want to fix the database ASAP, you could try a REINDEX
> on the table --- if the problem is in the indexes that should get
> rid of it.  But it would also destroy the evidence of exactly what
> went wrong.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
--
joe speigle
			
		i can't reindex the table. I dropped the index but it wont recreate it because there are duplicates in the table ERROR: Cannot create unique index. Table contains non-unique values joe speigle
joseph speigle <joe.speigle@jklh.us> writes:
> i had altered the table when it was okay with the following sequence
> alter table calling add column uname character varying(20);
>  alter table calling alter column uname set not null;
> said I couldn't do that because there were null values in it, so that's when i tried the
> update calling set uname='joe';
> and got errors.
Hmm.  That's interesting and possibly relevant, but it can't be the
whole story --- people have been doing that for years.  What else can
you tell us about the history of this table?
> (gdb) bt
> #0  0x0806b854 in nocachegetattr (tuple=0x82f5b78, attnum=1, tupleDesc=0x406f23a0, isnull=0xbfffe9d3 "") at
heaptuple.c:349
I'm beginning to think you must have corruption in the system catalogs,
because there's no way that control should have reached that line for
this table, seeing that all the columns are variable-width.  Could we
see the output of
select * from pg_attribute where attnum > 0 and attrelid = 'calling'::regclass;
            regards, tom lane
			
		okay.  I"m making progress here.  :)   I want to know if I should just delete the oid that's too very large?
because, I can issue a select up to row 1100.  So, Select * from calling limit 1100 returns all rows, whereas
select * from calling limit 1101 kills the process.
Here is what i get for the select you asked for:
calling=# select * from pg_attribute where attnum > 0 and attrelid = 'calling'::regclass;
-[ RECORD 1 ]-+------------
attrelid      | 577720
attname       | filename
atttypid      | 1043
attstattarget | -1
attlen        | -1
attnum        | 1
attndims      | 0
attcacheoff   | -1
atttypmod     | 44
attbyval      | f
attstorage    | x
attisset      | f
attalign      | i
attnotnull    | f
atthasdef     | f
attisdropped  | f
attislocal    | t
attinhcount   | 0
-[ RECORD 2 ]-+------------
attrelid      | 577720
attname       | description
atttypid      | 1043
attstattarget | -1
attlen        | -1
attnum        | 2
attndims      | 0
attcacheoff   | -1
atttypmod     | 184
attbyval      | f
attstorage    | x
attisset      | f
attalign      | i
attnotnull    | f
atthasdef     | f
attisdropped  | f
attislocal    | t
attinhcount   | 0
-[ RECORD 3 ]-+------------
attrelid      | 577720
attname       | uname
atttypid      | 1043
attstattarget | -1
attlen        | -1
attnum        | 3
attndims      | 0
attcacheoff   | -1
atttypmod     | 34
attbyval      | f
attstorage    | x
attisset      | f
attalign      | i
attnotnull    | f
atthasdef     | f
attisdropped  | f
attislocal    | t
attinhcount   | 0
and here is also something intersting
calling=# select oid from calling;
snip
.....
    583116
     583117
     583118
     583119
     583120
     584227
 2617245723
     588066
     588067
     578306
     578307
     578310
     588068
...... snip
( so thinking I can try to get where > 584227...)
calling=# select oid from calling where oid > 584227;
    oid
------------
 1342255253
     588066
     588067
     588068
     588069
     588090
     588091
     588093
     588094
     596284
     596285
     596286
     596287
snip
..................................
(it's a moving target!)
calling=# select oid from calling where oid > 584227;
-[ RECORD 1 ]---
oid | 1006632986
-[ RECORD 2 ]---
oid | 588066
-[ RECORD 3 ]---
oid | 588067
-[ RECORD 4 ]---
calling=# select * from pg_class where relname='calling';
-[ RECORD 1 ]--+--------------------------------------------------
relname        | calling
relnamespace   | 2200
reltype        | 577721
relowner       | 110
relam          | 0
relfilenode    | 577720
relpages       | 159
reltuples      | 1685
reltoastrelid  | 0
reltoastidxid  | 0
relhasindex    | t
relisshared    | f
relkind        | r
relnatts       | 3
relchecks      | 0
reltriggers    | 0
relukeys       | 0
relfkeys       | 0
relrefs        | 0
relhasoids     | t
relhaspkey     | f
relhasrules    | f
relhassubclass | f
relacl         | {=,joe=arwdRxt,perlcalling=arwdRxt,perl_user=arwdRxt}
calling=# select count(*) from calling;
-[ RECORD 1 ]
count | 1820
calling=#
On Sun, Feb 01, 2004 at 12:02:50PM -0500, Tom Lane wrote:
> joseph speigle <joe.speigle@jklh.us> writes:
> > i had altered the table when it was okay with the following sequence
>
> > alter table calling add column uname character varying(20);
> >  alter table calling alter column uname set not null;
> > said I couldn't do that because there were null values in it, so that's when i tried the
> > update calling set uname='joe';
> > and got errors.
>
> Hmm.  That's interesting and possibly relevant, but it can't be the
> whole story --- people have been doing that for years.  What else can
> you tell us about the history of this table?
>
> > (gdb) bt
> > #0  0x0806b854 in nocachegetattr (tuple=0x82f5b78, attnum=1, tupleDesc=0x406f23a0, isnull=0xbfffe9d3 "") at
heaptuple.c:349
>
> I'm beginning to think you must have corruption in the system catalogs,
> because there's no way that control should have reached that line for
> this table, seeing that all the columns are variable-width.  Could we
> see the output of
> select * from pg_attribute where attnum > 0 and attrelid = 'calling'::regclass;
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
--
joe speigle
			
		joseph speigle <joe.speigle@jklh.us> writes:
> okay.  I"m making progress here.  :)   I want to know if I should just delete the oid that's too very large?
> because, I can issue a select up to row 1100.  So, Select * from calling limit 1100 returns all rows, whereas
> select * from calling limit 1101 kills the process.
Okay, so you have a corrupted tuple --- from the evidence so far I'd
guess that the length word of the first column in that row is clobbered.
The wacky OIDs in some of the other rows look like data corruption too.
At this point my thoughts are heading in the direction of hardware
problems.  Have you tried running any memory or disk diagnostics?
memtest86 and badblocks seem to be the most widely used tests.
            regards, tom lane
			
		no, but perhaps I will. but since there's only this one corrupted table, it was trying to add a third column "uname" whichwas where it went bad. It might have been this command alter table calling add uname character varying(30) not null default 'joe'; Or, maybe I am confused and it didnt' like altertable calling alter column uname set not null; Thanks for helping out! All my commands were: \d calling; alter table calling add uname character varying(30) not null default 'joe'; alter table calling add uname character varying(30) not null; alter table calling add uname character varying(30); alter table calling uname character varying(30) set not null; alter table calling modify uname character varying(30) set not null; alter table calling alter column uname character varying(30) set not null; alter table calling alter column uname set not null; update calling set uname='joe'; \q update calling set uname='joe'; \q \dt \d calling; select * from calling limit 5; update calling set uname='joe'; \q update calling set uname='joe'; \q update calling set uname='joe'; \q select count(*) from calling; \d calling; \q \c calling; \dt On Sun, Feb 01, 2004 at 04:35:14PM -0500, Tom Lane wrote: > joseph speigle <joe.speigle@jklh.us> writes: > > okay. I"m making progress here. :) I want to know if I should just delete the oid that's too very large? > > because, I can issue a select up to row 1100. So, Select * from calling limit 1100 returns all rows, whereas > > select * from calling limit 1101 kills the process. > > Okay, so you have a corrupted tuple --- from the evidence so far I'd > guess that the length word of the first column in that row is clobbered. > The wacky OIDs in some of the other rows look like data corruption too. > > At this point my thoughts are heading in the direction of hardware > problems. Have you tried running any memory or disk diagnostics? > memtest86 and badblocks seem to be the most widely used tests. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org -- joe speigle
joseph speigle <joe.speigle@jklh.us> writes:
> no, but perhaps I will.  but since there's only this one corrupted
> table, it was trying to add a third column "uname" which was where it
> went bad. It might have been this command alter table calling add
> uname character varying(30) not null default 'joe';
Seems unlikely.  ALTER TABLE ADD COLUMN has been in Postgres since
forever; I think it's real unlikely that the sequence of operations
you showed would have misbehaved.  If you'd done some weird sequence
of column adds and drops, I'd be more prepared to credit a bug (since
DROP COLUMN is new in 7.3) ... but even then, damage consisting of
isolated corruption of different kinds in a few different rows doesn't
sound like a very probable symptom.  I suspect the damage was already
there and the UPDATE was just the first operation that happened to touch
the clobbered row.
I'd look for hardware explanations if I were you.
            regards, tom lane