Discussion: Segfaults with 8.1.3 on amd64
Hi,

our 8.1.3 system on quad Xeon has been happily chugging away for weeks with no stability problems until yesterday:

/var/log/syslog:May 4 11:57:17 cayenne kernel: postmaster[19291]: segfault at 0000000000000000 rip 00002aaaab5e8c00 rsp 00007fffffffd418 error 4
/var/log/syslog.0:May 3 09:39:06 cayenne kernel: postmaster[32698]: segfault at 0000000000000000 rip 00002aaaab5e8c00 rsp 00007fffffffd418 error 4
/var/log/syslog.0:May 3 11:02:00 cayenne kernel: postmaster[12427]: segfault at 0000000000000000 rip 00002aaaab5e8c00 rsp 00007fffffffd418 error 4

I don't know what the rip + rsp values represent, but is it interesting that they are identical in all three cases?

Not a single OS change has occurred on the machine - in fact the only thing happening other than pg itself is me tail'ing the logs..

I'm using Debian sarge with the 8.1.3 debs from backports.org which I trust; I doubt running postmaster under gdb will be workable due to the performance penalty.

The pg logs don't show much of interest:

2006-05-04 11:57:17 BST LOG: server process (PID 19291) was terminated by signal 11
2006-05-04 11:57:17 BST LOG: terminating any other active server processes
2006-05-04 11:57:17 BST WARNING: terminating connection because of crash of another server process
2006-05-04 11:57:17 BST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2006-05-04 11:57:17 BST HINT: In a moment you should be able to reconnect to the database and repeat your command.
[loads of these]
2006-05-04 11:57:18 BST FATAL: the database system is in recovery mode
2006-05-04 11:57:18 BST FATAL: the database system is in recovery mode
2006-05-04 11:57:18 BST FATAL: the database system is in recovery mode
2006-05-04 11:57:18 BST FATAL: the database system is in recovery mode
2006-05-04 11:57:18 BST LOG: all server processes terminated; reinitializing
2006-05-04 11:57:18 BST FATAL: the database system is starting up
2006-05-04 11:57:18 BST FATAL: the database system is starting up
2006-05-04 11:57:18 BST FATAL: the database system is starting up
2006-05-04 11:57:18 BST FATAL: the database system is starting up
2006-05-04 11:57:18 BST FATAL: the database system is starting up
2006-05-04 11:57:18 BST LOG: database system was interrupted at 2006-05-04 11:56:17 BST
2006-05-04 11:57:18 BST LOG: checkpoint record is at 68/A9D2F2E8
2006-05-04 11:57:18 BST LOG: redo record is at 68/A9D17DD0; undo record is at 0/0; shutdown FALSE
2006-05-04 11:57:18 BST LOG: next transaction ID: 728532363; next OID: 183302937
2006-05-04 11:57:18 BST LOG: next MultiXactId: 46957; next MultiXactOffset: 98539
2006-05-04 11:57:18 BST LOG: database system was not properly shut down; automatic recovery in progress
2006-05-04 11:57:18 BST LOG: redo starts at 68/A9D17DD0
2006-05-04 11:57:18 BST FATAL: the database system is starting up
[ loads of these]
2006-05-04 11:57:19 BST LOG: record with zero length at 68/ABAF4F48
2006-05-04 11:57:19 BST LOG: redo done at 68/ABAF4F18
2006-05-04 11:57:19 BST LOG: could not truncate directory "pg_multixact/members": apparent wraparound
2006-05-04 11:57:19 BST LOG: database system is ready
2006-05-04 11:57:19 BST LOG: transaction ID wrap limit is 1362094701, limited by database "postgres"

Encouragingly, pg_config shows that --enable-debug was passed as a ./configure argument:

CONFIGURE = '--build=x86_64-linux' '--prefix=/usr' '--includedir=/usr/include' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc'
'--localstatedir=/var' '--libexecdir=/usr/lib/postgresql-8.1' '--srcdir=.' '--disable-maintainer-mode' '--mandir=/usr/share/postgresql/8.1/man' '--with-docdir=/usr/share/doc/postgresql-doc-8.1' '--datadir=/usr/share/postgresql/8.1' '--bindir=/usr/lib/postgresql/8.1/bin' '--includedir=/usr/include/postgresql/' '--enable-nls' '--enable-integer-datetimes' '--enable-debug' '--disable-rpath' '--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-krb5' '--with-openssl' '--with-gnu-ld' '--with-tclconfig=/usr/lib/tcl8.4' '--with-tkconfig=/usr/lib/tk8.4' '--with-includes=/usr/include/tcl8.4' '--with-pgport=5432' '--enable-thread-safety' 'CC=cc' 'CFLAGS=-g -Wall -O2 -Wl,--as-needed' 'build_alias=x86_64-linux'
CC = cc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/tcl8.4
CFLAGS = -g -Wall -O2 -Wl,--as-needed -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wendif-labels -fno-strict-aliasing -g
CFLAGS_SL = -fpic
LDFLAGS =
LDFLAGS_SL =
LIBS = -lpgport -lpam -lssl -lcrypto -lkrb5 -lz -lreadline -lcrypt -lresolv -lnsl -ldl -lm
VERSION = PostgreSQL 8.1.3

How can I enable coredumps or something similarly useful for debugging purposes?

Cheers,
Gavin.
On Thu, May 04, 2006 at 12:22:01PM +0100, Gavin Hamill wrote:
> Hi, our 8.1.3 system on quad Xeon has been happily chugging away for
> weeks with no stability problems until yesterday:
>
> /var/log/syslog:May 4 11:57:17 cayenne kernel: postmaster[19291]:
> segfault at 0000000000000000 rip 00002aaaab5e8c00 rsp 00007fffffffd418
> error 4

<snip>

> I don't know what the rip + rsp values represent, but is it interesting
> that they are identical in all three cases?

At a guess rip = return instruction pointer, rsp = return stack point. The fact that they're all the same seems to rule out hardware.

> I'm using Debian sarge with the 8.1.3 debs from backports.org which I
> trust; I doubt running postmaster under gdb will be workable due to the
> performance penalty.

I didn't think attaching gdb had much effect on performance, but you may be right.

<snip other usual output on server crash>

> How can I enable coredumps or something similarly useful for debugging
> purposes?

Before starting the server, run "ulimit -S -c unlimited"

If done properly it should enable core dumps for the backend.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
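For anyone wondering what that ulimit call actually changes: below is a minimal Python sketch, assuming a Unix host where the standard `resource` module is available. The helper names `raise_core_soft_limit` and `describe` are made up for illustration. Note one difference: it raises the soft core-file limit only as far as the hard limit allows, whereas "unlimited" in the shell command asks for no cap at all.

```python
import resource


def raise_core_soft_limit():
    """Raise this process's soft core-file size limit up to its hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit, but only up to
    # the hard limit, so the hard value is reused for both fields.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def describe(limit):
    """Render an rlimit value (bytes, or RLIM_INFINITY) for printing."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


if __name__ == "__main__":
    soft, hard = raise_core_soft_limit()
    print("core soft limit:", describe(soft))
    print("core hard limit:", describe(hard))
```

The limit is per process and inherited by children, which is why it has to be set in the shell (or init script) that launches the postmaster rather than in some other session.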
Martijn van Oosterhout wrote:
> On Thu, May 04, 2006 at 12:22:01PM +0100, Gavin Hamill wrote:
>> Hi, our 8.1.3 system on quad Xeon has been happily chugging away for
>> weeks with no stability problems until yesterday:
>>
>> /var/log/syslog:May 4 11:57:17 cayenne kernel: postmaster[19291]:
>> segfault at 0000000000000000 rip 00002aaaab5e8c00 rsp
>> 00007fffffffd418 error 4
>
> <snip>
>
>> I don't know what the rip + rsp values represent, but is it
>> interesting that they are identical in all three cases?
>
> At a guess rip = return instruction pointer, rsp = return stack
> point. The fact that they're all the same seems to rule out hardware.

The R* registers in AMD64 are just the 64-bit extensions to the standard registers. They couldn't use EIP because that was taken in the expansion from 16-bit to 32-bit. So RIP is simply the 64-bit instruction pointer, RSP the 64-bit stack pointer.

--
Guy Rouillier
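To make the syslog line easier to read, here is a rough Python sketch that pulls the fields out of a kernel "segfault at" message and decodes the trailing error number using the low three bits of the x86 page-fault error code (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The function name `decode_segfault` and the dictionary keys are invented for illustration, and parsing the error number as hex is an assumption about the kernel's message format; for a single digit such as 4 it makes no difference.

```python
import re

# Matches the message layout seen in the syslog lines quoted in this thread.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault log line, or None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # assumption: printed in hex
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),  # faulting memory address
        "rip": int(m.group("rip"), 16),    # instruction pointer
        "rsp": int(m.group("rsp"), 16),    # stack pointer
        "present": bool(err & 1),          # 0 = page not present
        "write": bool(err & 2),            # 0 = read access
        "user_mode": bool(err & 4),        # 1 = fault happened in user mode
    }


if __name__ == "__main__":
    sample = (
        "May 4 11:57:17 cayenne kernel: postmaster[19291]: segfault at "
        "0000000000000000 rip 00002aaaab5e8c00 rsp 00007fffffffd418 error 4"
    )
    for key, value in decode_segfault(sample).items():
        print(f"{key}: {value}")
```

Read this way, "segfault at 0000000000000000 ... error 4" is a user-mode read of an unmapped page at address zero, which is what a NULL pointer dereference looks like.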
Martijn van Oosterhout wrote:
> On Thu, May 04, 2006 at 12:22:01PM +0100, Gavin Hamill wrote:
>
> At a guess rip = return instruction pointer, rsp = return stack point.
> The fact that they're all the same seems to rule out hardware.

That's good to hear (in one way... :)

> Before starting the server, run "ulimit -S -c unlimited"
>
> If done properly it should enable core dumps for the backend.
>
> Have a nice day,

Great stuff - it's crashed again and dropped 6MB of core which points the finger squarely at Slony - I'll ask on the relevant list :)

Core was generated by `postgres: sharp laterooms 194.24.250.135(54478) UPDATE '.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libpam.so.0...(no debugging symbols found)...done.
....
Reading symbols from /usr/lib/postgresql/8.1/lib/slony1_funcs.so...done.
Loaded symbols for /usr/lib/postgresql/8.1/lib/slony1_funcs.so
Reading symbols from /usr/lib/postgresql/8.1/lib/xxid.so...done.
Loaded symbols for /usr/lib/postgresql/8.1/lib/xxid.so
#0 0x00002aaaab5e8c00 in strlen () from /lib/libc.so.6
(gdb) bt
#0 0x00002aaaab5e8c00 in strlen () from /lib/libc.so.6
#1 0x00002aaaca65b062 in slon_quote_literal (str=0x0) at slony1_funcs.c:1044
#2 0x00002aaaca65c348 in _Slony_I_logTrigger (fcinfo=0x8f5ec5) at slony1_funcs.c:783
#3 0x00000000005ca9f9 in fmgr_internal_function ()
#4 0x00000000004ce6a4 in FreeTriggerDesc ()
#5 0x00000000004cf42e in ExecARUpdateTriggers ()
#6 0x00000000004cf873 in ExecARUpdateTriggers ()
#7 0x00000000004cfb10 in AfterTriggerEndQuery ()
#8 0x000000000055ef05 in FreeQueryDesc ()
#9 0x000000000055fecf in PortalRun ()
#10 0x000000000055f78f in PortalRun ()
#11 0x000000000055b721 in pg_plan_queries ()
#12 0x000000000055e14c in PostgresMain ()
#13 0x0000000000539cc1 in ClosePostmasterPorts ()
#14 0x0000000000539797 in ClosePostmasterPorts ()
#15 0x0000000000537d3d in PostmasterMain ()
#16 0x000000000053704e in PostmasterMain ()
#17 0x00000000004fdb58 in main ()

Cheers,
Gavin.
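As a small aid for reading backtraces like the one above, here is a hedged Python sketch that scans gdb "bt" output for frames whose arguments are shown as 0x0. The regular expression only covers the simple frame layout seen in this thread, and the names `frames` and `null_pointer_args` are hypothetical helpers, not part of gdb or Slony.

```python
import re

# Covers frames such as:
#   #1 0x... in func (arg=0x0) at file.c:123
#   #0 0x... in func () from /lib/libc.so.6
FRAME_RE = re.compile(
    r"#(?P<num>\d+)\s+(?:0x[0-9a-f]+\s+in\s+)?(?P<func>\w+)\s*\((?P<args>[^)]*)\)"
    r"(?:\s+at\s+(?P<file>\S+):(?P<line>\d+))?"
    r"(?:\s+from\s+(?P<lib>\S+))?"
)


def frames(backtrace):
    """Yield one dict per stack frame found in gdb 'bt' text."""
    for m in FRAME_RE.finditer(backtrace):
        args = {}
        for part in m.group("args").split(","):
            if "=" in part:
                name, value = part.split("=", 1)
                args[name.strip()] = value.strip()
        source = None
        if m.group("file"):
            source = (m.group("file"), int(m.group("line")))
        yield {
            "num": int(m.group("num")),
            "func": m.group("func"),
            "args": args,
            "source": source,
        }


def null_pointer_args(backtrace):
    """List (function, argument, source) for every argument equal to 0x0."""
    return [
        (frame["func"], name, frame["source"])
        for frame in frames(backtrace)
        for name, value in frame["args"].items()
        if value == "0x0"
    ]


if __name__ == "__main__":
    sample = "\n".join([
        "#0 0x00002aaaab5e8c00 in strlen () from /lib/libc.so.6",
        "#1 0x00002aaaca65b062 in slon_quote_literal (str=0x0) at slony1_funcs.c:1044",
        "#2 0x00002aaaca65c348 in _Slony_I_logTrigger (fcinfo=0x8f5ec5) at slony1_funcs.c:783",
    ])
    for func, arg, source in null_pointer_args(sample):
        print(f"{func}: argument {arg} is NULL at {source}")
```

On the trace above this singles out frame #1: slon_quote_literal was called with str=0x0 and handed it to strlen, which matches the fault address of zero in the syslog lines.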