Re: Core dump

From: Dan Moschuk
Subject: Re: Core dump
Msg-id: 20001012164752.A3004@spirit.jaded.net
In reply to: Re: Core dump  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Core dump  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
| > Sparc solaris 2.7 with postgres 7.0.2
| > It seems to be reproducible; the server crashes on us about once every
| > few hours.
| 
| That's a very bizarre backtrace.  Why the multiple levels of recursive
| entry to the quickdie() signal handler?  I wonder if you aren't looking
| at some kind of Solaris bug --- perhaps it's not able to cope with a
| signal handler turning around and issuing new kernel calls.

I'm not sure that is the issue; see below.

| The core file you are looking at is probably *not* from the original
| failure, whatever that is.  The sequence is probably
| 
| 1. Some backend crashes for unknown reason, dumping core.
| 
| 2. Postmaster observes messy death of a child, decides that mass suicide
|    followed by restart is called for.  Postmaster sends SIGUSR1 to all
|    remaining backends to make them commit hara-kiri.
| 
| 3. One or more other backends crash trying to obey postmaster's command.
|    The corefile left for you to examine comes from whichever crashed
|    last.
| 
| So there are at least two problems here, but we only have evidence of
| the second one.
| 
| Since the problem is fairly reproducible, I'd suggest you temporarily
| dike out the elog(NOTICE) call in quickdie() (in
| src/backend/tcop/postgres.c), which will probably allow the backends
| to honor SIGUSR1 without dumping core.  Then you have a shot at seeing
| the core from the original failure.

I will try this; however, the database is currently running under light load.
Postgres only starts to choke, and eventually dies, under high load.

| Assuming that this works (ie, you find a core that's not got anything
| to do with quickdie()), I'd suggest an inquiry to Sun about whether
| their signal handler logic hasn't got a problem with write() issued
| from inside a signal handler.  Meanwhile let us know what the new
| backtrace shows.

I wrote a quick test program to test this theory.  Below is the code and the
output.

#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

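static void moo (int);

int
main (void)
{
        signal(SIGUSR1, moo);
        /* Deliver SIGUSR1 to ourselves; moo() runs before raise() returns. */
        raise(SIGUSR1);
        return 0;
}

static void
moo (int cow)
{
        (void) cow;     /* signal number, unused */
        printf("Getting ready for write()\n");
        write(STDOUT_FILENO, "Hello!\n", 7);
        printf("Done.\n");
}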

eclipse% ./x
Getting ready for write()
Hello!
Done.
eclipse% 

It would appear from that very rough test program that Solaris doesn't mind
system calls from within a signal handler.

-- 
Man is a rational animal who always loses his temper when he is called
upon to act in accordance with the dictates of reason.
        -- Oscar Wilde

