Обсуждение: simple update closes connection. why?
hi,
this simple update closes the connection. I use the database to dynamically update titles and directory on my web
pages.
I show table structure first then the command then, uh, the closed connection.
calling=# \d calling;
Table "public.calling"
Column | Type | Modifiers
-------------+------------------------+-----------
filename | character varying(40) |
description | character varying(180) |
uname | character varying(30) |
Indexes: unique_filename unique btree (filename),
unique_filename_idx btree (filename)
calling=# select * from calling limit 5;
filename | description | uname
-----------------+-------------------------------------+-------
c_java_python | |
conf | old conf directory |
mod_perl_webapp | main weba pplication, now a symlink |
python | |
ssh+ssl | |
(5 rows)
calling=# update calling set uname='pencilhead';
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!# \q
[pencilhead@www postgres]$ psql -d calling
Password:
Welcome to psql 7.3.1, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help on internal slash commands
\g or terminate with semicolon to execute query
\q to quit
calling=# select count(*) from calling;
count
-------
1683
(1 row)
calling=# \dp calling
Access privileges for database "calling"
Schema | Table | Access privileges
--------+-------+---------------------------------
public | calling | {=,pencilhead=arwdRxt}
(1 row)
joe speigle
hi,
I increased the debugging in postgresql.conf but not many changes
calling=# update calling set uname='joe';
DEBUG: StartTransactionCommand
DEBUG: ProcessQuery
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!#
--
joe speigle
joseph speigle <joe.speigle@jklh.us> writes:
> this simple update closes the connection.
My first guess is there's some data corruption in the table or one of
its indexes. It would be useful to get a stack trace showing where
in the backend the core dump occurs --- can you manage that?
Other things to check: can you do a "select * from calling" without
crashing? What shows up in the postmaster log when the crash occurs?
If you just want to fix the database ASAP, you could try a REINDEX
on the table --- if the problem is in the indexes that should get
rid of it. But it would also destroy the evidence of exactly what
went wrong.
regards, tom lane
tom,
i have a few things for you. namely, i increased the debug settings for logging and i ran gdb by attaching to the
runningprocess.
history:
i had altered the table when it was okay with the following sequence
alter table calling add column uname character varying(20);
alter table calling alter column uname set not null;
said I couldn't do that because there were null values in it, so that's when i tried the
update calling set uname='joe';
and got errors.
maybe the attempt to set to not null with empty records was bad.
=======================================================================================================
pg_dump -d 'calling' > dump.sql
....................
pg_dump: DEBUG: StartTransactionCommand
pg_dump: DEBUG: ProcessUtility
pg_dump: DEBUG: CommitTransactionCommand
DEBUG: StartTransactionCommand
DEBUG: ProcessUtility
pg_dump: DEBUG: StartTransactionCommand
pg_dump: DEBUG: ProcessUtility
pg_dump: dumpClasses(): SQL command failed
pg_dump: Error message from server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_dump: The command was: FETCH 100 FROM _pg_dump_cursor
DEBUG: reaping dead processes
DEBUG: child process (pid 16468) was terminated by signal 11
LOG: server process (pid 16468) was terminated by signal 11
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing shared memory and semaphores
DEBUG: shmem_exit(0)
DEBUG: invoking IpcMemoryCreate(size=3620864)
-bash-2.05b$ LOG: database system was interrupted at 2004-02-01 01:16:07 CST
LOG: checkpoint record is at 0/344EA680
LOG: redo record is at 0/344EA680; undo record is at 0/0; shutdown TRUE
LOG: next transaction id: 8800413; next oid: 614903
LOG: database system was not properly shut down; automatic recovery in progress
LOG: ReadRecord: record with zero length at 0/344EA6C0
LOG: redo is not required
LOG: database system is ready
DEBUG: proc_exit(0)
DEBUG: shmem_exit(0)
DEBUG: exit(0)
DEBUG: reaping dead processes
==================================================================
postgres 17460 0.0 0.3 10952 3476 pts/4 S 01:52 0:00 postgres: joe calling [local] idle
postgres 17461 0.0 0.0 2620 704 pts/4 R 01:52 0:00 ps -axu
-bash-2.05b$ gdb attach 17460
GNU gdb Red Hat Linux (5.2.1-4)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...attach: No such file or directory.
Attaching to process 17460
Reading symbols from /usr/local/pgsql/bin/postgres...done.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libreadline.so.4...done.
Loaded symbols for /usr/lib/libreadline.so.4
Reading symbols from /usr/lib/libncurses.so.5...done.
Loaded symbols for /usr/lib/libncurses.so.5
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/i686/libm.so.6...done.
Loaded symbols for /lib/i686/libm.so.6
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x420dabc2 in recv () from /lib/i686/libc.so.6
(gdb) c
Continuing.
DEBUG: StartTransactionCommand
DEBUG: ProcessQuery
Program received signal SIGSEGV, Segmentation fault.
0x0806b286 in nocachegetattr ()
(gdb) where
#0 0x0806b286 in nocachegetattr ()
#1 0x080cef8b in ExecEvalVar ()
#2 0x080d01e0 in ExecEvalExpr ()
#3 0x080d04ea in ExecTargetList ()
#4 0x080d0797 in ExecProject ()
#5 0x080d0839 in ExecScan ()
#6 0x080d674b in ExecSeqScan ()
#7 0x080ce842 in ExecProcNode ()
#8 0x080d736e in ExecLimit ()
#9 0x080ce879 in ExecProcNode ()
#10 0x080cd7cb in ExecutePlan ()
#11 0x080cce92 in ExecutorRun ()
#12 0x0811a645 in ProcessQuery ()
#13 0x08118dc1 in pg_exec_query_string ()
#14 0x08119c21 in PostgresMain ()
#15 0x08101ac2 in DoBackend ()
#16 0x0810163a in BackendStartup ()
#17 0x08100781 in ServerLoop ()
#18 0x08100230 in PostmasterMain ()
#19 0x080defe4 in main ()
#20 0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6
(gdb)
RECOMPILE RECOMPILE RECOMPILE RECOMPILE RECOMPILE RECOMPILE RECOMPILE -g
(gdb) bt
#0 0x0806b854 in nocachegetattr (tuple=0x82f5b78, attnum=1, tupleDesc=0x406f23a0, isnull=0xbfffe9d3 "") at
heaptuple.c:349
#1 0x080f79a3 in ExecEvalVar (variable=0x82f5080, econtext=0x82f5b08, isNull=0xbfffe9d3 "") at execQual.c:377
#2 0x080f972b in ExecEvalExpr (expression=0x82f5080, econtext=0x82f5b08, isNull=0xbfffe9d3 "", isDone=0xbfffe9f0)
at execQual.c:1682
#3 0x080f9c3d in ExecTargetList (targetlist=0x82f50a8, nodomains=1, targettype=0x82f5ee8, values=0x82f5fc8,
econtext=0x82f5b08, isDone=0xbfffec04) at execQual.c:2058
#4 0x080fa025 in ExecProject (projInfo=0x82f5fa0, isDone=0xbfffec04) at execQual.c:2282
#5 0x080fa140 in ExecScan (node=0x82f59f0, accessMtd=0x8102498 <SeqNext>) at execScan.c:133
#6 0x0810259c in ExecSeqScan (node=0x82f59f0) at nodeSeqscan.c:133
#7 0x080f6c28 in ExecProcNode (node=0x82f59f0, parent=0x0) at execProcnode.c:291
#8 0x080f56ac in ExecutePlan (estate=0x82f5e20, plan=0x82f59f0, operation=CMD_SELECT, numberTuples=0,
direction=ForwardScanDirection, destfunc=0x82f5fe0) at execMain.c:954
#9 0x080f4a87 in ExecutorRun (queryDesc=0x82f5df8, estate=0x82f5e20, direction=ForwardScanDirection, count=0)
at execMain.c:195
#10 0x0815ffc9 in ProcessQuery (parsetree=0x82f33e8, plan=0x82f59f0, dest=Remote, completionTag=0xbfffedd0 "")
at pquery.c:242
#11 0x0815e232 in pg_exec_query_string (query_string=0x82f3048, dest=Remote, parse_context=0x82ed1c8) at postgres.c:838
#12 0x0815f4da in PostgresMain (argc=4, argv=0xbffff070, username=0x82ac1a1 "joe") at postgres.c:2016
#13 0x0813e836 in DoBackend (port=0x82ac070) at postmaster.c:2293
#14 0x0813df3f in BackendStartup (port=0x82ac070) at postmaster.c:1915
#15 0x0813cd37 in ServerLoop () at postmaster.c:1000
#16 0x0813c810 in PostmasterMain (argc=1, argv=0x8293198) at postmaster.c:779
#17 0x0810e22f in main (argc=1, argv=0xbffff9d4) at main.c:210
#18 0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6
(gdb)
(gdb) info all-registers
eax 0x0 0
ecx 0x924 2340
edx 0xc 12
ebx 0x1a 26
esp 0xbfffe8d0 0xbfffe8d0
ebp 0xbfffe918 0xbfffe918
esi 0x40012020 1073815584
edi 0xbffff9d4 -1073743404
eip 0x806b854 0x806b854
eflags 0x10283 66179
cs 0x23 35
ss 0x2b 43
ds 0x2b 43
es 0x2b 43
fs 0x0 0
gs 0x0 0
st0 0 (raw 0x00000000000000000000)
st1 0 (raw 0x00000000000000000000)
st2 0 (raw 0x00000000000000000000)
st3 0 (raw 0x00000000000000000000)
st4 1 (raw 0x3fff8000000000000000)
st5 1 (raw 0x3fff8000000000000000)
st6 0 (raw 0x00000000000000000000)
st7 0 (raw 0x00000000000000000000)
fctrl 0x37f 895
fstat 0x4020 16416
ftag 0xffff 65535
fiseg 0x0 0
fioff 0x0 0
foseg 0x0 0
fooff 0x0 0
fop 0x0 0
xmm0 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}}
xmm1 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}}
xmm2 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}}
---Type <return> to continue, or q <return> to quit---
xmm3 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}}
xmm4 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}}
xmm5 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}}
xmm6 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}}
xmm7 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}}
mxcsr 0x1f80 8064
orig_eax 0xffffffff -1
(gdb)
> joseph speigle <joe.speigle@jklh.us> writes:
> > this simple update closes the connection.
>
> My first guess is there's some data corruption in the table or one of
> its indexes. It would be useful to get a stack trace showing where
> in the backend the core dump occurs --- can you manage that?
>
> Other things to check: can you do a "select * from calling" without
> crashing? What shows up in the postmaster log when the crash occurs?
>
> If you just want to fix the database ASAP, you could try a REINDEX
> on the table --- if the problem is in the indexes that should get
> rid of it. But it would also destroy the evidence of exactly what
> went wrong.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
--
joe speigle
i can't reindex the table. I dropped the index but it wont recreate it because there are duplicates in the table ERROR: Cannot create unique index. Table contains non-unique values joe speigle
joseph speigle <joe.speigle@jklh.us> writes:
> i had altered the table when it was okay with the following sequence
> alter table calling add column uname character varying(20);
> alter table calling alter column uname set not null;
> said I couldn't do that because there were null values in it, so that's when i tried the
> update calling set uname='joe';
> and got errors.
Hmm. That's interesting and possibly relevant, but it can't be the
whole story --- people have been doing that for years. What else can
you tell us about the history of this table?
> (gdb) bt
> #0 0x0806b854 in nocachegetattr (tuple=0x82f5b78, attnum=1, tupleDesc=0x406f23a0, isnull=0xbfffe9d3 "") at
heaptuple.c:349
I'm beginning to think you must have corruption in the system catalogs,
because there's no way that control should have reached that line for
this table, seeing that all the columns are variable-width. Could we
see the output of
select * from pg_attribute where attnum > 0 and attrelid = 'calling'::regclass;
regards, tom lane
okay. I"m making progress here. :) I want to know if I should just delete the oid that's too very large?
because, I can issue a select up to row 1100. So, Select * from calling limit 1100 returns all rows, whereas
select * from calling limit 1101 kills the process.
Here is what i get for the select you asked for:
calling=# select * from pg_attribute where attnum > 0 and attrelid = 'calling'::regclass;
-[ RECORD 1 ]-+------------
attrelid | 577720
attname | filename
atttypid | 1043
attstattarget | -1
attlen | -1
attnum | 1
attndims | 0
attcacheoff | -1
atttypmod | 44
attbyval | f
attstorage | x
attisset | f
attalign | i
attnotnull | f
atthasdef | f
attisdropped | f
attislocal | t
attinhcount | 0
-[ RECORD 2 ]-+------------
attrelid | 577720
attname | description
atttypid | 1043
attstattarget | -1
attlen | -1
attnum | 2
attndims | 0
attcacheoff | -1
atttypmod | 184
attbyval | f
attstorage | x
attisset | f
attalign | i
attnotnull | f
atthasdef | f
attisdropped | f
attislocal | t
attinhcount | 0
-[ RECORD 3 ]-+------------
attrelid | 577720
attname | uname
atttypid | 1043
attstattarget | -1
attlen | -1
attnum | 3
attndims | 0
attcacheoff | -1
atttypmod | 34
attbyval | f
attstorage | x
attisset | f
attalign | i
attnotnull | f
atthasdef | f
attisdropped | f
attislocal | t
attinhcount | 0
and here is also something intersting
calling=# select oid from calling;
snip
.....
583116
583117
583118
583119
583120
584227
2617245723
588066
588067
578306
578307
578310
588068
...... snip
( so thinking I can try to get where > 584227...)
calling=# select oid from calling where oid > 584227;
oid
------------
1342255253
588066
588067
588068
588069
588090
588091
588093
588094
596284
596285
596286
596287
snip
..................................
(it's a moving target!)
calling=# select oid from calling where oid > 584227;
-[ RECORD 1 ]---
oid | 1006632986
-[ RECORD 2 ]---
oid | 588066
-[ RECORD 3 ]---
oid | 588067
-[ RECORD 4 ]---
calling=# select * from pg_class where relname='calling';
-[ RECORD 1 ]--+--------------------------------------------------
relname | calling
relnamespace | 2200
reltype | 577721
relowner | 110
relam | 0
relfilenode | 577720
relpages | 159
reltuples | 1685
reltoastrelid | 0
reltoastidxid | 0
relhasindex | t
relisshared | f
relkind | r
relnatts | 3
relchecks | 0
reltriggers | 0
relukeys | 0
relfkeys | 0
relrefs | 0
relhasoids | t
relhaspkey | f
relhasrules | f
relhassubclass | f
relacl | {=,joe=arwdRxt,perlcalling=arwdRxt,perl_user=arwdRxt}
calling=# select count(*) from calling;
-[ RECORD 1 ]
count | 1820
calling=#
On Sun, Feb 01, 2004 at 12:02:50PM -0500, Tom Lane wrote:
> joseph speigle <joe.speigle@jklh.us> writes:
> > i had altered the table when it was okay with the following sequence
>
> > alter table calling add column uname character varying(20);
> > alter table calling alter column uname set not null;
> > said I couldn't do that because there were null values in it, so that's when i tried the
> > update calling set uname='joe';
> > and got errors.
>
> Hmm. That's interesting and possibly relevant, but it can't be the
> whole story --- people have been doing that for years. What else can
> you tell us about the history of this table?
>
> > (gdb) bt
> > #0 0x0806b854 in nocachegetattr (tuple=0x82f5b78, attnum=1, tupleDesc=0x406f23a0, isnull=0xbfffe9d3 "") at
heaptuple.c:349
>
> I'm beginning to think you must have corruption in the system catalogs,
> because there's no way that control should have reached that line for
> this table, seeing that all the columns are variable-width. Could we
> see the output of
> select * from pg_attribute where attnum > 0 and attrelid = 'calling'::regclass;
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
--
joe speigle
joseph speigle <joe.speigle@jklh.us> writes:
> okay. I"m making progress here. :) I want to know if I should just delete the oid that's too very large?
> because, I can issue a select up to row 1100. So, Select * from calling limit 1100 returns all rows, whereas
> select * from calling limit 1101 kills the process.
Okay, so you have a corrupted tuple --- from the evidence so far I'd
guess that the length word of the first column in that row is clobbered.
The wacky OIDs in some of the other rows look like data corruption too.
At this point my thoughts are heading in the direction of hardware
problems. Have you tried running any memory or disk diagnostics?
memtest86 and badblocks seem to be the most widely used tests.
regards, tom lane
no, but perhaps I will. but since there's only this one corrupted table, it was trying to add a third column "uname" whichwas where it went bad. It might have been this command alter table calling add uname character varying(30) not null default 'joe'; Or, maybe I am confused and it didnt' like altertable calling alter column uname set not null; Thanks for helping out! All my commands were: \d calling; alter table calling add uname character varying(30) not null default 'joe'; alter table calling add uname character varying(30) not null; alter table calling add uname character varying(30); alter table calling uname character varying(30) set not null; alter table calling modify uname character varying(30) set not null; alter table calling alter column uname character varying(30) set not null; alter table calling alter column uname set not null; update calling set uname='joe'; \q update calling set uname='joe'; \q \dt \d calling; select * from calling limit 5; update calling set uname='joe'; \q update calling set uname='joe'; \q update calling set uname='joe'; \q select count(*) from calling; \d calling; \q \c calling; \dt On Sun, Feb 01, 2004 at 04:35:14PM -0500, Tom Lane wrote: > joseph speigle <joe.speigle@jklh.us> writes: > > okay. I"m making progress here. :) I want to know if I should just delete the oid that's too very large? > > because, I can issue a select up to row 1100. So, Select * from calling limit 1100 returns all rows, whereas > > select * from calling limit 1101 kills the process. > > Okay, so you have a corrupted tuple --- from the evidence so far I'd > guess that the length word of the first column in that row is clobbered. > The wacky OIDs in some of the other rows look like data corruption too. > > At this point my thoughts are heading in the direction of hardware > problems. Have you tried running any memory or disk diagnostics? > memtest86 and badblocks seem to be the most widely used tests. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org -- joe speigle
joseph speigle <joe.speigle@jklh.us> writes:
> no, but perhaps I will. but since there's only this one corrupted
> table, it was trying to add a third column "uname" which was where it
> went bad. It might have been this command alter table calling add
> uname character varying(30) not null default 'joe';
Seems unlikely. ALTER TABLE ADD COLUMN has been in Postgres since
forever; I think it's real unlikely that the sequence of operations
you showed would have misbehaved. If you'd done some weird sequence
of column adds and drops, I'd be more prepared to credit a bug (since
DROP COLUMN is new in 7.3) ... but even then, damage consisting of
isolated corruption of different kinds in a few different rows doesn't
sound like a very probable symptom. I suspect the damage was already
there and the UPDATE was just the first operation that happened to touch
the clobbered row.
I'd look for hardware explanations if I were you.
regards, tom lane