Re: database vacuum from cron hanging
From | Kevin Grittner
---|---
Subject | Re: database vacuum from cron hanging
Date |
Msg-id | s34cde25.009@gwmta.wicourts.gov
In reply to | database vacuum from cron hanging ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List | pgsql-hackers
From an email from the tech who built the box:

The hardware is:
Server: IBM x346
Model: 8840-42U
CPU: (2) 3.6 GHz Xeon processors
RAM: 8 GB
RAID 5

OS is SUSE Linux Professional 9.3, fully patched. [as of about a week ago]

SuSE Linux 9.3 (i586) - Kernel \r (\l).

Linux dtr2.wicourts.gov 2.6.11.4-21.9-bigsmp #1 SMP Fri Aug 19 11:58:59 UTC 2005 i686 i686 i386 GNU/Linux

kgrittn@dtr2:~/postgresql-snapshot> gcc -v
Reading specs from /usr/lib/gcc-lib/i586-suse-linux/3.3.5/specs
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man --enable-languages=c,c++,f77,objc,java,ada --disable-checking --libdir=/usr/lib --enable-libgcj --with-slibdir=/lib --with-system-zlib --enable-shared --enable-__cxa_atexit i586-suse-linux
Thread model: posix
gcc version 3.3.5 20050117 (prerelease) (SUSE Linux)

I'm not sure what you mean regarding pg_config -- could you clarify? Your email came through as I was trying to figure out where to find the core dump.
We restarted the server with cassert, and I find this in the log prior to my attempt to vacuum:

[2005-10-12 09:09:49.634 CDT] 16603 LOG: database system was shut down at 2005-10-12 08:57:25 CDT
[2005-10-12 09:09:49.635 CDT] 16603 LOG: checkpoint record is at 8/A601C904
[2005-10-12 09:09:49.635 CDT] 16603 LOG: redo record is at 8/A601C904; undo record is at 0/0; shutdown TRUE
[2005-10-12 09:09:49.635 CDT] 16603 LOG: next transaction ID: 2313003; next OID: 857185
[2005-10-12 09:09:49.635 CDT] 16603 LOG: next MultiXactId: 1; next MultiXactOffset: 0
[2005-10-12 09:09:49.649 CDT] 16603 LOG: database system is ready
[2005-10-12 09:09:49.656 CDT] 16603 LOG: transaction ID wrap limit is 1073743018, limited by database "dtr"
[2005-10-12 09:09:49.684 CDT] 16607 LOG: autovacuum: processing database "postgres"
[2005-10-12 09:10:00.521 CDT] 16611 LOG: autovacuum: processing database "dtr"
[2005-10-12 09:10:04.522 CDT] 16618 <dtr dtr 127.0.0.1(33121)> ERROR: duplicate key violates unique constraint "DbTranRepositoryPK"
[2005-10-12 09:10:04.522 CDT] 16617 <dtr dtr 127.0.0.1(33120)> ERROR: duplicate key violates unique constraint "DbTranRepositoryPK"
[2005-10-12 09:10:04.522 CDT] 16616 <dtr dtr 127.0.0.1(33119)> ERROR: duplicate key violates unique constraint "DbTranRepositoryPK"
[2005-10-12 09:10:04.522 CDT] 16619 <dtr dtr 127.0.0.1(33122)> ERROR: duplicate key violates unique constraint "DbTranRepositoryPK"
[2005-10-12 09:10:04.522 CDT] 16615 <dtr dtr 127.0.0.1(33118)> ERROR: duplicate key violates unique constraint "DbTranRepositoryPK"
[2005-10-12 09:10:05.154 CDT] 16619 <dtr dtr 127.0.0.1(33122)> ERROR: duplicate key violates unique constraint "DbTranRepositoryPK"
TRAP: FailedAssertion("!(buf->refcount > 0)", File: "bufmgr.c", Line: 812)
[2005-10-12 09:10:05.695 CDT] 16602 LOG: server process (PID 16619) was terminated by signal 6
[2005-10-12 09:10:05.695 CDT] 16602 LOG: terminating any other active server processes
[2005-10-12 09:10:05.695 CDT] 16617 <dtr dtr 127.0.0.1(33120)> WARNING: terminating connection because of crash of another server process
[2005-10-12 09:10:05.695 CDT] 16617 <dtr dtr 127.0.0.1(33120)> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2005-10-12 09:10:05.695 CDT] 16617 <dtr dtr 127.0.0.1(33120)> HINT: In a moment you should be able to reconnect to the database and repeat your command.
[2005-10-12 09:10:05.695 CDT] 16618 <dtr dtr 127.0.0.1(33121)> WARNING: terminating connection because of crash of another server process
[2005-10-12 09:10:05.695 CDT] 16618 <dtr dtr 127.0.0.1(33121)> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2005-10-12 09:10:05.695 CDT] 16618 <dtr dtr 127.0.0.1(33121)> HINT: In a moment you should be able to reconnect to the database and repeat your command.
[2005-10-12 09:10:05.697 CDT] 16615 <dtr dtr 127.0.0.1(33118)> WARNING: terminating connection because of crash of another server process
[2005-10-12 09:10:05.697 CDT] 16615 <dtr dtr 127.0.0.1(33118)> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2005-10-12 09:10:05.697 CDT] 16615 <dtr dtr 127.0.0.1(33118)> HINT: In a moment you should be able to reconnect to the database and repeat your command.
[2005-10-12 09:10:05.698 CDT] 16616 <dtr dtr 127.0.0.1(33119)> WARNING: terminating connection because of crash of another server process
[2005-10-12 09:10:05.698 CDT] 16616 <dtr dtr 127.0.0.1(33119)> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2005-10-12 09:10:05.698 CDT] 16616 <dtr dtr 127.0.0.1(33119)> HINT: In a moment you should be able to reconnect to the database and repeat your command.
[2005-10-12 09:10:05.698 CDT] 16614 <dtr dtr 127.0.0.1(33117)> WARNING: terminating connection because of crash of another server process
[2005-10-12 09:10:05.698 CDT] 16614 <dtr dtr 127.0.0.1(33117)> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2005-10-12 09:10:05.698 CDT] 16614 <dtr dtr 127.0.0.1(33117)> HINT: In a moment you should be able to reconnect to the database and repeat your command.
[2005-10-12 09:10:05.703 CDT] 16602 LOG: all server processes terminated; reinitializing
[2005-10-12 09:10:05.719 CDT] 16620 LOG: database system was interrupted at 2005-10-12 09:09:49 CDT
[2005-10-12 09:10:05.719 CDT] 16620 LOG: checkpoint record is at 8/A601C904
[2005-10-12 09:10:05.719 CDT] 16620 LOG: redo record is at 8/A601C904; undo record is at 0/0; shutdown TRUE
[2005-10-12 09:10:05.719 CDT] 16620 LOG: next transaction ID: 2313003; next OID: 857185
[2005-10-12 09:10:05.719 CDT] 16620 LOG: next MultiXactId: 1; next MultiXactOffset: 0
[2005-10-12 09:10:05.719 CDT] 16620 LOG: database system was not properly shut down; automatic recovery in progress
[2005-10-12 09:10:05.719 CDT] 16621 <dtr dtr 127.0.0.1(33123)> FATAL: the database system is starting up
[2005-10-12 09:10:05.720 CDT] 16620 LOG: redo starts at 8/A601C948
[2005-10-12 09:10:05.721 CDT] 16622 <dtr dtr 127.0.0.1(33124)> FATAL: the database system is starting up
[2005-10-12 09:10:05.723 CDT] 16623 <dtr dtr 127.0.0.1(33125)> FATAL: the database system is starting up
[2005-10-12 09:10:05.725 CDT] 16624 <dtr dtr 127.0.0.1(33126)> FATAL: the database system is starting up
[2005-10-12 09:10:05.727 CDT] 16625 <dtr dtr 127.0.0.1(33127)> FATAL: the database system is starting up
[2005-10-12 09:10:05.729 CDT] 16626 <dtr dtr 127.0.0.1(33128)> FATAL: the database system is starting up
[2005-10-12 09:10:05.734 CDT] 16620 LOG: record with zero length at 8/A6112FB4
[2005-10-12 09:10:05.734 CDT] 16620 LOG: redo done at 8/A6112F8C
[2005-10-12 09:10:05.761 CDT] 16620 LOG: database system is ready
[2005-10-12 09:10:05.762 CDT] 16620 LOG: transaction ID wrap limit is 1073743018, limited by database "dtr"

>>> Tom Lane <tgl@sss.pgh.pa.us> 10/12/05 9:26 AM >>>
I wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> (gdb) p BufferDescriptors[781]
>> $1 = {tag = {rnode = {spcNode = 1663, dbNode = 16385, relNode = 2666}, blockNum = 1}, flags = 70, usage_count = 5, refcount = 4294967294,
>> wait_backend_pid = 748, buf_hdr_lock = 0 '\0', buf_id = 781, freeNext = -2, io_in_progress_lock = 1615, content_lock = 1616}

> Whoa. refcount -2?

After meditating overnight, I have a theory. There seem to be two basic categories of possible explanations for the above state:

1. Some path of control decrements refcount more times than it increments it.

2. Occasionally, an intended increment gets lost.

Yesterday I was thinking in terms of #1, but it really doesn't seem to fit the observed facts very well. I don't see a reason why such a bug would preferentially affect pg_constraint_contypid_index; also it seems like it would be fairly easily repeatable by many people. The pin tracking logic is all internal to individual backends and doesn't look very vulnerable to, say, timing-related glitches.

On the other hand, it's not hard to concoct a plausible explanation using #2: suppose that two backends wanting to pin the same buffer at about the same time pick up the same original value of refcount, add one, store back. This is not supposed to happen, of course, but maybe the compiler is optimizing some code in a way that gives this effect (i.e., by reading refcount before the buffer header spinlock has been acquired).
Now we can account for pg_constraint_contypid_index being hit: we know you use domains a lot, and that uncached catalog search in GetDomainConstraints would result in a whole lot of concurrent accesses to that particular index, so it would be a likely place for such a bug to manifest. And we can account for you being the only one seeing it: this theory makes it compiler- and platform-dependent.

Accordingly: what's the platform exactly? (CPU type, and OS just in case.) What compiler was used? (If gcc, show "gcc -v" output.) Also please show the output of "pg_config".

regards, tom lane