Re: database vacuum from cron hanging

From: Kevin Grittner
Subject: Re: database vacuum from cron hanging
Date:
Msg-id: s34cde25.009@gwmta.wicourts.gov
In response to: database vacuum from cron hanging  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List: pgsql-hackers
From an email from the tech who built the box:
The hardware is:

Server:  IBM x346
Model:  8840-42U
CPU:  (2) 3.6 GHz Xeon processors
RAM:  8 GB
RAID 5

OS is SUSE Linux Professional 9.3, fully patched. [as of about a week ago]
SuSE Linux 9.3 (i586) - Kernel \r (\l).
Linux dtr2.wicourts.gov 2.6.11.4-21.9-bigsmp #1 SMP Fri Aug 19 11:58:59 UTC 2005 i686 i686 i386 GNU/Linux

kgrittn@dtr2:~/postgresql-snapshot> gcc -v
Reading specs from /usr/lib/gcc-lib/i586-suse-linux/3.3.5/specs
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man --enable-languages=c,c++,f77,objc,java,ada --disable-checking --libdir=/usr/lib --enable-libgcj --with-slibdir=/lib --with-system-zlib --enable-shared --enable-__cxa_atexit i586-suse-linux
Thread model: posix
gcc version 3.3.5 20050117 (prerelease) (SUSE Linux)

I'm not sure what you mean regarding pg_config -- could you clarify?

Your email came through as I was trying to figure out where to find the core dump.  We restarted the server with cassert, and I find this in the log prior to my attempt to vacuum:
[2005-10-12 09:09:49.634 CDT] 16603 LOG:  database system was shut down at 2005-10-12 08:57:25 CDT
[2005-10-12 09:09:49.635 CDT] 16603 LOG:  checkpoint record is at 8/A601C904
[2005-10-12 09:09:49.635 CDT] 16603 LOG:  redo record is at 8/A601C904; undo record is at 0/0; shutdown TRUE
[2005-10-12 09:09:49.635 CDT] 16603 LOG:  next transaction ID: 2313003; next OID: 857185
[2005-10-12 09:09:49.635 CDT] 16603 LOG:  next MultiXactId: 1; next MultiXactOffset: 0
[2005-10-12 09:09:49.649 CDT] 16603 LOG:  database system is ready
[2005-10-12 09:09:49.656 CDT] 16603 LOG:  transaction ID wrap limit is 1073743018, limited by database "dtr"
[2005-10-12 09:09:49.684 CDT] 16607 LOG:  autovacuum: processing database "postgres"
[2005-10-12 09:10:00.521 CDT] 16611 LOG:  autovacuum: processing database "dtr"
[2005-10-12 09:10:04.522 CDT] 16618 <dtr dtr 127.0.0.1(33121)> ERROR:  duplicate key violates unique constraint "DbTranRepositoryPK"
[2005-10-12 09:10:04.522 CDT] 16617 <dtr dtr 127.0.0.1(33120)> ERROR:  duplicate key violates unique constraint "DbTranRepositoryPK"
[2005-10-12 09:10:04.522 CDT] 16616 <dtr dtr 127.0.0.1(33119)> ERROR:  duplicate key violates unique constraint "DbTranRepositoryPK"
[2005-10-12 09:10:04.522 CDT] 16619 <dtr dtr 127.0.0.1(33122)> ERROR:  duplicate key violates unique constraint "DbTranRepositoryPK"
[2005-10-12 09:10:04.522 CDT] 16615 <dtr dtr 127.0.0.1(33118)> ERROR:  duplicate key violates unique constraint "DbTranRepositoryPK"
[2005-10-12 09:10:05.154 CDT] 16619 <dtr dtr 127.0.0.1(33122)> ERROR:  duplicate key violates unique constraint "DbTranRepositoryPK"
TRAP: FailedAssertion("!(buf->refcount > 0)", File: "bufmgr.c", Line: 812)
[2005-10-12 09:10:05.695 CDT] 16602 LOG:  server process (PID 16619) was terminated by signal 6
[2005-10-12 09:10:05.695 CDT] 16602 LOG:  terminating any other active server processes
[2005-10-12 09:10:05.695 CDT] 16617 <dtr dtr 127.0.0.1(33120)> WARNING:  terminating connection because of crash of another server process
[2005-10-12 09:10:05.695 CDT] 16617 <dtr dtr 127.0.0.1(33120)> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2005-10-12 09:10:05.695 CDT] 16617 <dtr dtr 127.0.0.1(33120)> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[2005-10-12 09:10:05.695 CDT] 16618 <dtr dtr 127.0.0.1(33121)> WARNING:  terminating connection because of crash of another server process
[2005-10-12 09:10:05.695 CDT] 16618 <dtr dtr 127.0.0.1(33121)> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2005-10-12 09:10:05.695 CDT] 16618 <dtr dtr 127.0.0.1(33121)> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[2005-10-12 09:10:05.697 CDT] 16615 <dtr dtr 127.0.0.1(33118)> WARNING:  terminating connection because of crash of another server process
[2005-10-12 09:10:05.697 CDT] 16615 <dtr dtr 127.0.0.1(33118)> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2005-10-12 09:10:05.697 CDT] 16615 <dtr dtr 127.0.0.1(33118)> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[2005-10-12 09:10:05.698 CDT] 16616 <dtr dtr 127.0.0.1(33119)> WARNING:  terminating connection because of crash of another server process
[2005-10-12 09:10:05.698 CDT] 16616 <dtr dtr 127.0.0.1(33119)> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2005-10-12 09:10:05.698 CDT] 16616 <dtr dtr 127.0.0.1(33119)> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[2005-10-12 09:10:05.698 CDT] 16614 <dtr dtr 127.0.0.1(33117)> WARNING:  terminating connection because of crash of another server process
[2005-10-12 09:10:05.698 CDT] 16614 <dtr dtr 127.0.0.1(33117)> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2005-10-12 09:10:05.698 CDT] 16614 <dtr dtr 127.0.0.1(33117)> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[2005-10-12 09:10:05.703 CDT] 16602 LOG:  all server processes terminated; reinitializing
[2005-10-12 09:10:05.719 CDT] 16620 LOG:  database system was interrupted at 2005-10-12 09:09:49 CDT
[2005-10-12 09:10:05.719 CDT] 16620 LOG:  checkpoint record is at 8/A601C904
[2005-10-12 09:10:05.719 CDT] 16620 LOG:  redo record is at 8/A601C904; undo record is at 0/0; shutdown TRUE
[2005-10-12 09:10:05.719 CDT] 16620 LOG:  next transaction ID: 2313003; next OID: 857185
[2005-10-12 09:10:05.719 CDT] 16620 LOG:  next MultiXactId: 1; next MultiXactOffset: 0
[2005-10-12 09:10:05.719 CDT] 16620 LOG:  database system was not properly shut down; automatic recovery in progress
[2005-10-12 09:10:05.719 CDT] 16621 <dtr dtr 127.0.0.1(33123)> FATAL:  the database system is starting up
[2005-10-12 09:10:05.720 CDT] 16620 LOG:  redo starts at 8/A601C948
[2005-10-12 09:10:05.721 CDT] 16622 <dtr dtr 127.0.0.1(33124)> FATAL:  the database system is starting up
[2005-10-12 09:10:05.723 CDT] 16623 <dtr dtr 127.0.0.1(33125)> FATAL:  the database system is starting up
[2005-10-12 09:10:05.725 CDT] 16624 <dtr dtr 127.0.0.1(33126)> FATAL:  the database system is starting up
[2005-10-12 09:10:05.727 CDT] 16625 <dtr dtr 127.0.0.1(33127)> FATAL:  the database system is starting up
[2005-10-12 09:10:05.729 CDT] 16626 <dtr dtr 127.0.0.1(33128)> FATAL:  the database system is starting up
[2005-10-12 09:10:05.734 CDT] 16620 LOG:  record with zero length at 8/A6112FB4
[2005-10-12 09:10:05.734 CDT] 16620 LOG:  redo done at 8/A6112F8C
[2005-10-12 09:10:05.761 CDT] 16620 LOG:  database system is ready
[2005-10-12 09:10:05.762 CDT] 16620 LOG:  transaction ID wrap limit is 1073743018, limited by database "dtr"


>>> Tom Lane <tgl@sss.pgh.pa.us> 10/12/05 9:26 AM >>>
I wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> (gdb) p BufferDescriptors[781]
>> $1 = {tag = {rnode = {spcNode = 1663, dbNode = 16385, relNode = 2666}, blockNum = 1}, flags = 70, usage_count = 5, refcount = 4294967294,
>> wait_backend_pid = 748, buf_hdr_lock = 0 '\0', buf_id = 781, freeNext = -2, io_in_progress_lock = 1615, content_lock = 1616}

> Whoa.  refcount -2?
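
To spell out the raw value: gdb is printing the pin count as an unsigned 32-bit field, so a counter driven two below zero wraps around to 4294967294, which is simply -2 reinterpreted. A trivial standalone check (not PostgreSQL code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t refcount = 0;

    refcount -= 2;                                /* two more unpins than pins: wraps around */
    printf("as unsigned: %u\n", refcount);        /* prints 4294967294 */
    printf("as signed:   %d\n", (int) refcount);  /* prints -2 on two's-complement machines */
    return 0;
}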

After meditating overnight, I have a theory.  There seem to be two basic
categories of possible explanations for the above state:

1. Some path of control decrements refcount more times than it increments it.
2. Occasionally, an intended increment gets lost.

Yesterday I was thinking in terms of #1, but it really doesn't seem to
fit the observed facts very well.  I don't see a reason why such a bug
would preferentially affect pg_constraint_contypid_index; also it seems
like it would be fairly easily repeatable by many people.  The pin
tracking logic is all internal to individual backends and doesn't look
very vulnerable to, say, timing-related glitches.

On the other hand, it's not hard to concoct a plausible explanation
using #2: suppose that two backends wanting to pin the same buffer at
about the same time pick up the same original value of refcount, add
one, store back.  This is not supposed to happen of course, but maybe
the compiler is optimizing some code in a way that gives this effect
(ie, by reading refcount before the buffer header spinlock has been
acquired).  Now we can account for pg_constraint_contypid_index being
hit: we know you use domains a lot, and that uncached catalog search in
GetDomainConstraints would result in a whole lot of concurrent accesses
to that particular index, so it would be a likely place for such a bug
to manifest.  And we can account for you being the only one seeing it:
this theory makes it compiler- and platform-dependent.
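
To make that interleaving concrete, here is a minimal standalone sketch (plain C with pthreads; it is not PostgreSQL's bufmgr code, and every name in it is made up for illustration). One variant does the read-modify-write entirely under the lock; the other models a load of the counter that has effectively been hoisted above the lock acquisition:

#include <pthread.h>
#include <stdio.h>

/* Stand-in for the buffer header's pin count, guarded by a lock. */
static unsigned refcount = 0;
static pthread_mutex_t hdr_lock = PTHREAD_MUTEX_INITIALIZER;

/* Intended behaviour: read, add one, store back, all under the lock. */
static void pin_locked(void)
{
    pthread_mutex_lock(&hdr_lock);
    refcount++;
    pthread_mutex_unlock(&hdr_lock);
}

/* The suspected failure mode: the counter is (deliberately) read before the
 * lock is taken, so two callers can pick up the same old value, each add one,
 * and one of the increments is lost. */
static void pin_hoisted_load(void)
{
    unsigned old = refcount;        /* stale read, outside the lock */

    pthread_mutex_lock(&hdr_lock);
    refcount = old + 1;             /* stores back based on the stale value */
    pthread_mutex_unlock(&hdr_lock);
}

static void (*pin_fn)(void);        /* which variant the workers run */

static void *worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < 100000; i++)
        pin_fn();
    return NULL;
}

/* Run two concurrent "backends", each pinning 100000 times. */
static unsigned run_with(void (*fn)(void))
{
    pthread_t a, b;

    refcount = 0;
    pin_fn = fn;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return refcount;
}

int main(void)
{
    printf("locked increment: %u (expected 200000)\n", run_with(pin_locked));
    printf("hoisted load:     %u (typically short on an SMP box)\n",
           run_with(pin_hoisted_load));
    return 0;
}

Every pin the second variant loses has a matching unpin later, so once a few increments go missing the counter is eventually decremented past zero and shows up as a huge unsigned value like the one above.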

Accordingly: what's the platform exactly? (CPU type, and OS just in
case.)  What compiler was used?  (If gcc, show "gcc -v" output.)
Also please show the output of "pg_config".
        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to majordomo@postgresql.org so that your
       message can get through to the mailing list cleanly


