Thread: strange buildfarm failures

strange buildfarm failures

From:
Stefan Kaltenbrunner
Date:
two of my buildfarm members had different but pretty weird looking
failures lately:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03

and

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02


any ideas on what might be causing those?


Stefan


Re: strange buildfarm failures

From:
Stefan Kaltenbrunner
Date:
Stefan Kaltenbrunner wrote:
> two of my buildfarm members had different but pretty weird looking
> failures lately:
> 
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
> 
> and
> 
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
> 
> 
> any ideas on what might be causing those?

lionfish just failed too:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09


Stefan


Re: strange buildfarm failures

From:
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Stefan Kaltenbrunner wrote:
>> two of my buildfarm members had different but pretty weird looking
>> failures lately:
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
>> and
>> 
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
>> 
>> any ideas on what might be causing those?

> lionfish just failed too:

> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09

And had a similar failure a few days ago.  The curious thing is that
what we get in the postmaster log is

LOG:  server process (PID 23405) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes

You would think SIGABRT would come from an assertion failure, but
there's no preceding assertion message in the log.  The other
characteristic of these crashes is that *all* of the failing regression
instances report "terminating connection because of crash of another
server process", which suggests strongly that the crash was in an
autovacuum process (if it were bgwriter or stats collector the
postmaster would've said so).  So I think the recent autovac patches
are at fault.  I spent a bit of time trolling for a spot where the code
might abort() without having printed anything, but didn't find one.

If any of the buildfarm owners can get a stack trace from the core dump
of one of these events, it'd be mighty helpful.
        regards, tom lane


Re: strange buildfarm failures

From:
Alvaro Herrera
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> > Stefan Kaltenbrunner wrote:
> >> two of my buildfarm members had different but pretty weird looking
> >> failures lately:
> >> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
> >> and
> >> 
> >> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
> >> 
> >> any ideas on what might be causing those?

Just for the record, quagga and emu failures don't seem related to the
report below.  They don't crash; the regression.diffs contains data that
suggests that there may be data corruption of some sort.

INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
ERROR:  invalid cidr value: "%{"

This doesn't seem to make much sense.


> > lionfish just failed too:
> 
> > http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
> 
> And had a similar failure a few days ago.  The curious thing is that
> what we get in the postmaster log is
> 
> LOG:  server process (PID 23405) was terminated by signal 6: Aborted
> LOG:  terminating any other active server processes
> 
> You would think SIGABRT would come from an assertion failure, but
> there's no preceding assertion message in the log.  The other
> characteristic of these crashes is that *all* of the failing regression
> instances report "terminating connection because of crash of another
> server process", which suggests strongly that the crash was in an
> autovacuum process (if it were bgwriter or stats collector the
> postmaster would've said so).  So I think the recent autovac patches
> are at fault.  I spent a bit of time trolling for a spot where the code
> might abort() without having printed anything, but didn't find one.

Hmm.  I kept an eye on the buildfarm for a few days, but saw nothing
that could be connected to autovacuum so I neglected it.

This is the other failure:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14

It shows the same pattern.  I am baffled -- I don't understand how it
can die without reporting the error.

Apparently it crashes rather frequently, so it shouldn't be too
difficult to reproduce on manual runs.  If we could get it to run with a
higher debug level, it might prove helpful to further pinpoint the
problem.

The core file would be much better obviously (first and foremost to
confirm that it's autovacuum that's crashing ... )

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: strange buildfarm failures

From:
Stefan Kaltenbrunner
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> Stefan Kaltenbrunner wrote:
>>>> two of my buildfarm members had different but pretty weird looking
>>>> failures lately:
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
>>>> and
>>>>
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
>>>>
>>>> any ideas on what might be causing those?
> 
> Just for the record, quagga and emu failures don't seem related to the
> report below.  They don't crash; the regression.diffs contains data that
> suggests that there may be data corruption of some sort.
> 
> INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
> ERROR:  invalid cidr value: "%{"
> 
> This doesn't seem to make much sense.

yeah, on further reflection the failures from emu and quagga look
unrelated to the issue lionfish is experiencing

> 
> 
>>> lionfish just failed too:
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
>> And had a similar failure a few days ago.  The curious thing is that
>> what we get in the postmaster log is
>>
>> LOG:  server process (PID 23405) was terminated by signal 6: Aborted
>> LOG:  terminating any other active server processes
>>
>> You would think SIGABRT would come from an assertion failure, but
>> there's no preceding assertion message in the log.  The other
>> characteristic of these crashes is that *all* of the failing regression
>> instances report "terminating connection because of crash of another
>> server process", which suggests strongly that the crash was in an
>> autovacuum process (if it were bgwriter or stats collector the
>> postmaster would've said so).  So I think the recent autovac patches
>> are at fault.  I spent a bit of time trolling for a spot where the code
>> might abort() without having printed anything, but didn't find one.
> 
> Hmm.  I kept an eye on the buildfarm for a few days, but saw nothing
> that could be connected to autovacuum so I neglected it.
> 
> This is the other failure:
> 
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14
> 
> It shows the same pattern.  I am baffled -- I don't understand how it
> can die without reporting the error.

I should have mentioned that initially - but I think the failure from
2007-04-20 is not related at all.
That failure was very likely caused by the kernel running completely out
of memory (lionfish is a very resource-starved box with only 48MB of RAM
and, at the time, 128MB of swap - do we have a recent patch that
increases memory usage quite a lot?).
I immediately added another 128MB of swap after that, and I don't think
the failure from yesterday is the same (at least there are no kernel
logs that indicate a similar issue).
> 
> Apparently it crashes rather frequently, so it shouldn't be too
> difficult to reproduce on manual runs.  If we could get it to run with a
> higher debug level, it might prove helpful to further pinpoint the
> problem.

a manual run of the buildfarm script takes ~4.5 hours on lionfish ;-)

> 
> The core file would be much better obviously (first and foremost to
> confirm that it's autovacuum that's crashing ... )

I will see what I can come up with ...


Stefan


Re: strange buildfarm failures

From:
Stefan Kaltenbrunner
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> Stefan Kaltenbrunner wrote:
>>>> two of my buildfarm members had different but pretty weird looking
>>>> failures lately:
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
>>>> and
>>>>
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
>>>>
>>>> any ideas on what might be causing those?
> 
> Just for the record, quagga and emu failures don't seem related to the
> report below.  They don't crash; the regression.diffs contains data that
> suggests that there may be data corruption of some sort.
> 
> INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
> ERROR:  invalid cidr value: "%{"
> 
> This doesn't seem to make much sense.

no idea - but quagga and emu seem to have similar failures (in the sense
that they don't make any sense), and I have no reason to believe that the
hardware is at fault.

> 
> 
>>> lionfish just failed too:
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
>> And had a similar failure a few days ago.  The curious thing is that
>> what we get in the postmaster log is
>>
>> LOG:  server process (PID 23405) was terminated by signal 6: Aborted
>> LOG:  terminating any other active server processes
>>
>> You would think SIGABRT would come from an assertion failure, but
>> there's no preceding assertion message in the log.  The other
>> characteristic of these crashes is that *all* of the failing regression
>> instances report "terminating connection because of crash of another
>> server process", which suggests strongly that the crash was in an
>> autovacuum process (if it were bgwriter or stats collector the
>> postmaster would've said so).  So I think the recent autovac patches
>> are at fault.  I spent a bit of time trolling for a spot where the code
>> might abort() without having printed anything, but didn't find one.
> 
> Hmm.  I kept an eye on the buildfarm for a few days, but saw nothing
> that could be connected to autovacuum so I neglected it.
> 
> This is the other failure:
> 
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14
> 
> It shows the same pattern.  I am baffled -- I don't understand how it
> can die without reporting the error.
> 
> Apparently it crashes rather frequently, so it shouldn't be too
> difficult to reproduce on manual runs.  If we could get it to run with a
> higher debug level, it might prove helpful to further pinpoint the
> problem.
> 
> The core file would be much better obviously (first and foremost to
> confirm that it's autovacuum that's crashing ... )


well - I now have a core file, but it does not seem to be worth much
except to prove that autovacuum seems to be the culprit:

Core was generated by `postgres: autovacuum worker process                            '.
Program terminated with signal 6, Aborted.

[...]

#0  0x00000ed9 in ?? ()
warning: GDB can't find the start of the function at 0xed9.
   GDB is unable to find the start of the function at 0xed9
and thus can't determine the size of that function's stack frame.
This means that GDB may be unable to access that stack frame, or
the frames below it.
    This problem is most likely caused by an invalid program counter or
stack pointer.
    However, if you think GDB should simply search farther back
from 0xed9 for code which looks like the beginning of a
function, you can increase the range of the search using the `set
heuristic-fence-post' command.


Stefan


Re: strange buildfarm failures

From:
Alvaro Herrera
Date:
Stefan Kaltenbrunner wrote:

> well - I now have a core file, but it does not seem to be worth much
> except to prove that autovacuum seems to be the culprit:
> 
> Core was generated by `postgres: autovacuum worker process
>                              '.
> Program terminated with signal 6, Aborted.
> 
> [...]
> 
> #0  0x00000ed9 in ?? ()
> warning: GDB can't find the start of the function at 0xed9.

Interesting.  Notice how it doesn't have the database name in the ps
display.  This means it must have crashed between the initial
init_ps_display and the set_ps_display call just before starting to
vacuum.  So the bug is probably in the startup code; probably the code
dealing with the PGPROC, which is the newest and weirdest stuff.
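
For illustration, the window I mean looks roughly like this (paraphrased
from memory; the function names are real, but the argument lists here are
only a sketch, not copied from the tree):

    /* AutoVacWorkerMain(), heavily abridged -- illustration only */
    init_ps_display("autovacuum worker process", "", "", "");

    /* ... set up signals, attach to shared memory, pick a database,
     * InitPostgres() ... a crash anywhere in this window leaves the
     * ps display exactly as it appears in the core file above ... */

    set_ps_display(dbname, false);
    /* ... vacuum the database ... */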

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: strange buildfarm failures

From:
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:
> 
> > well - I now have a core file, but it does not seem to be worth much
> > except to prove that autovacuum seems to be the culprit:
> > 
> > Core was generated by `postgres: autovacuum worker process
> >                              '.
> > Program terminated with signal 6, Aborted.
> > 
> > [...]
> > 
> > #0  0x00000ed9 in ?? ()
> > warning: GDB can't find the start of the function at 0xed9.
> 
> Interesting.  Notice how it doesn't have the database name in the ps
> display.  This means it must have crashed between the initial
> init_ps_display and the set_ps_display call just before starting to
> vacuum.  So the bug is probably in the startup code; probably the code
> dealing with the PGPROC which is the newest and weirder stuff.

Oh, another thing that I think may be happening is that the stack is
restored in longjmp, so it is trying to report an error elsewhere but
it crashes because something got overwritten or something; i.e. a
bug in the error recovery code.  I don't know how feasible this is or
even if it makes sense (would longjmp() restore the ps display?), but we
had similar, very hard to debug errors in Mammoth Replicator, which is
why I'm mentioning it in case it rings a bell.

-- 
Alvaro Herrera                          Developer, http://www.PostgreSQL.org/
"The only difference is that Saddam would kill you on private, where the
Americans will kill you in public" (Mohammad Saleh, 39, a building contractor)


Re: strange buildfarm failures

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Oh, another thing that I think may be happening is that the stack is
> restored in longjmp, so it is trying to report an error elsewhere but
> it crashes because something got overwritten or something; i.e. a
> bug in the error recovery code.

Hm, something trying to elog before the setjmp's been executed?
Although I thought it was coded so that elog.c would just proc_exit
if there was noplace to longjmp to.  A mistake here might explain
the lack of any message in the postmaster log: if elog.c thinks it
should longjmp then it doesn't print the message first.
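(From memory, the check in errstart() is something like the following --
paraphrased, not quoted from elog.c:)

    /* in errstart(): if there is no handler to longjmp to, don't try */
    if (elevel == ERROR)
    {
        if (PG_exception_stack == NULL ||
            ExitOnAnyError ||
            proc_exit_inprogress)
            elevel = FATAL;     /* the FATAL path reports, then proc_exit()s */
    }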
        regards, tom lane


Re: strange buildfarm failures

From:
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Oh, another thing that I think may be happening is that the stack is
> > restored in longjmp, so it is trying to report an error elsewhere but
> > it crashes because something got overwritten or something; i.e. a
> > bug in the error recovery code.
> 
> Hm, something trying to elog before the setjmp's been executed?
> Although I thought it was coded so that elog.c would just proc_exit
> if there was noplace to longjmp to.  A mistake here might explain
> the lack of any message in the postmaster log: if elog.c thinks it
> should longjmp then it doesn't print the message first.

Well, there seems to be plenty of code which is happy to elog(ERROR)
before the longjmp target block has been set; for example
InitFileAccess(), which is called from BaseInit(), which comes before
sigsetjmp() in both postgres.c and autovacuum.c.  (This particular case
is elog(FATAL), not ERROR, however.)  mdinit() also does some memory
allocation which could certainly fail.

I'm wondering if it wouldn't be more robust to define a longjmp target
block before calling BaseInit(), and have it exit cleanly in case of
failure (which is what you say elog.c should be doing if there is no
target block).

errstart() checks whether PG_exception_stack is NULL.  Now, this
symbol is defined in elog.c and initialized to NULL, but I wonder whether
a child process inherits the value the postmaster set, or whether it
comes back to NULL.  If it were the latter, the backend would not inherit
any of the values the postmaster set, so I'm assuming PG_exception_stack
stays as the postmaster left it.  I wonder what happens if the child
process finds that this is an invalid point to jump to?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: strange buildfarm failures

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I'm wondering if it wouldn't be more robust to define a longjmp target
> block before calling BaseInit(), and have it exit cleanly in case of
> failure (which is what you say elog.c should be doing if there is no
> target block).

No, because elog is already supposed to deal with that case; and does,
every time you connect to a bad database name for example.  If it's
failing, the question to answer is why.  

> In errstart(), it is checked if PG_exception_stack is NULL.  Now, this
> symbol is defined in elog.c and initialized to NULL, but I wonder if a
> child process inherits the value that postmaster set, or it comes back
> to NULL.

Hmm ... I was about to say that the postmaster never sets
PG_exception_stack, but maybe an error out of a PG_TRY/PG_RE_THROW
could do it?  Does the postmaster ever execute PG_TRY?  (And if so,
should it?  The postmaster really ought not be dealing in anything
very hairy --- it should be passing such work off to children.)
        regards, tom lane


Re: strange buildfarm failures

From:
Tom Lane
Date:
I wrote:
> Hmm ... I was about to say that the postmaster never sets
> PG_exception_stack, but maybe an error out of a PG_TRY/PG_RE_THROW
> could do it?  Does the postmaster ever execute PG_TRY?

Doh, I bet that's it, and it's not the postmaster that's at issue
but PG_TRY blocks executed during subprocess startup.  Inheritance
of a PG_exception_stack setting from the postmaster could only happen if
the postmaster were to fork() within a PG_TRY block, which I think we
can safely say it doesn't.  But suppose we get an elog(ERROR) inside
a PG_TRY block when there is no outermost longjmp catcher.   elog.c
will think it should longjmp, and that will eventually lead to
executing

#define PG_RE_THROW()  \
	siglongjmp(*PG_exception_stack, 1)

with PG_exception_stack = NULL; which seems entirely likely to cause
a stack smash of gruesome dimensions.  What's more, nothing would have
been printed to the postmaster log beforehand, agreeing with observation.

Personally I think the correct fix is to make PG_RE_THROW deal sanely
with the case of PG_exception_stack = NULL, that is, turn it into an
elog(FATAL) with the original error text.  If you try to fix it by
making a setjmp occur earlier, there's still the problem of "what
about PG_TRY earlier than that"?

This might be more code than we want in a macro, though, especially
since this is presumably not a performance-critical path.  I'm tempted
to change the macro to just call a pg_re_throw() subroutine.  Thoughts?
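
Roughly what I'm thinking of -- an untested sketch, assuming the
subroutine lives in elog.c next to errfinish() so it can reach the
pending error data:

    /* elog.h */
    #define PG_RE_THROW()  pg_re_throw()

    /* elog.c */
    void
    pg_re_throw(void)
    {
        /* If there's a handler, throw the error to it as before */
        if (PG_exception_stack != NULL)
            siglongjmp(*PG_exception_stack, 1);

        /*
         * No handler: we must be inside a PG_TRY with no outer setjmp
         * block.  Promote the pending error to FATAL so it actually gets
         * reported, instead of longjmp'ing through a NULL pointer.
         */
        errordata[errordata_stack_depth].elevel = FATAL;
        errfinish(0);

        abort();                /* not reached */
    }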
        regards, tom lane


Re: strange buildfarm failures

From:
Alvaro Herrera
Date:
Tom Lane wrote:
> I wrote:
> > Hmm ... I was about to say that the postmaster never sets
> > PG_exception_stack, but maybe an error out of a PG_TRY/PG_RE_THROW
> > could do it?  Does the postmaster ever execute PG_TRY?
> 
> Doh, I bet that's it, and it's not the postmaster that's at issue
> but PG_TRY blocks executed during subprocess startup.  Inheritance
> of a PG_exception_stack setting from the postmaster could only happen if
> the postmaster were to fork() within a PG_TRY block, which I think we
> can safely say it doesn't.  But suppose we get an elog(ERROR) inside
> a PG_TRY block when there is no outermost longjmp catcher.   elog.c
> will think it should longjmp, and that will eventually lead to
> executing
> 
> #define PG_RE_THROW()  \
>     siglongjmp(*PG_exception_stack, 1)
> 
> with PG_exception_stack = NULL; which seems entirely likely to cause
> a stack smash of gruesome dimensions.  What's more, nothing would have
> been printed to the postmaster log beforehand, agreeing with observation.

I agree that that would be a bug and we should fix it, but I don't think
it explains the problem we're seeing because there is no PG_TRY block
in the autovac startup code that I can see :-(

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: strange buildfarm failures

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I agree that that would be a bug and we should fix it, but I don't think
> it explains the problem we're seeing because there is no PG_TRY block
> in the autovac startup code that I can see :-(

I'm wondering if there is some code path that invokes a PG_TRY deep in
the bowels of the system.  Anyway, I'll go fix this, and we should know
soon enough if it changes the buildfarm behavior.
        regards, tom lane


Re: strange buildfarm failures

From:
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I agree that that would be a bug and we should fix it, but I don't think
> > it explains the problem we're seeing because there is no PG_TRY block
> > in the autovac startup code that I can see :-(
> 
> I'm wondering if there is some code path that invokes a PG_TRY deep in
> the bowels of the system.

Well, I checked all the bowels involved in autovacuum startup.

> Anyway, I'll go fix this, and we should know soon enough if it changes
> the buildfarm behavior.

Agreed.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: strange buildfarm failures

From:
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> > > I agree that that would be a bug and we should fix it, but I don't think
> > > it explains the problem we're seeing because there is no PG_TRY block
> > > in the autovac startup code that I can see :-(
> > 
> > I'm wondering if there is some code path that invokes a PG_TRY deep in
> > the bowels of the system.
> 
> Well, I checked all the bowels involved in autovacuum startup.

Huh, hang on ... there is one caller, which is to set client_encoding
(call_string_assign_hook uses a PG_TRY block), but it is called *after*
the sigsetjmp block -- in InitPostgres.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: strange buildfarm failures

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Tom Lane wrote:
>>> I'm wondering if there is some code path that invokes a PG_TRY deep in
>>> the bowels of the system.

> Huh, hang on ... there is one caller, which is to set client_encoding
> (call_string_assign_hook uses a PG_TRY block), but it is called *after*
> the sigsetjmp block -- in InitPostgres.

While testing the PG_RE_THROW problem I noted that what I get here is
a SIGSEGV crash, rather than SIGABRT as seen on Stefan's machines, so
that's another hint that this may be unrelated.  Still, it's clearly
at risk of causing a problem as more PG_TRY's get added to the code,
so I'm going to fix it anyway.
        regards, tom lane


Re: strange buildfarm failures

From:
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:
> 
> > well - I now have a core file, but it does not seem to be worth much
> > except to prove that autovacuum seems to be the culprit:
> > 
> > Core was generated by `postgres: autovacuum worker process
> >                              '.
> > Program terminated with signal 6, Aborted.
> > 
> > [...]
> > 
> > #0  0x00000ed9 in ?? ()
> > warning: GDB can't find the start of the function at 0xed9.

I just noticed an ugly bug in the worker code which I'm fixing.  I think
this one would also throw SIGSEGV, not SIGABRT.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: strange buildfarm failures

From:
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > Stefan Kaltenbrunner wrote:
> >
> > > well - I now have a core file, but it does not seem to be worth much
> > > except to prove that autovacuum seems to be the culprit:
> > >
> > > Core was generated by `postgres: autovacuum worker process
> > >                              '.
> > > Program terminated with signal 6, Aborted.
> > >
> > > [...]
> > >
> > > #0  0x00000ed9 in ?? ()
> > > warning: GDB can't find the start of the function at 0xed9.
>
> I just noticed an ugly bug in the worker code which I'm fixing.  I think
> this one would also throw SIGSEGV, not SIGABRT.

Nailed it -- this is the actual bug that causes the abort.  But I am
surprised that it doesn't print the error message on Stefan's machine;
here it outputs


TRAP: FailedAssertion("!((((unsigned long)(elem)) > ShmemBase))", File: "/pgsql/source/00head/src/backend/storage/ipc/shmqueue.c", Line: 107)
16496 2007-05-02 11:30:31 CLT DEBUG:  server process (PID 16540) was terminated by signal 6: Aborted
16496 2007-05-02 11:30:31 CLT LOG:  server process (PID 16540) was terminated by signal 6: Aborted
16496 2007-05-02 11:30:31 CLT LOG:  terminating any other active server processes
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16541
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16498
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16500
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16499
16541 2007-05-02 11:30:33 CLT WARNING:  terminating connection because of crash of another server process


Maybe stderr is going somewhere else?  That would be strange, I think.

I'll commit the fix shortly; attached.

--
Alvaro Herrera                         http://www.flickr.com/photos/alvherre/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)

Attachments

Re: strange buildfarm failures

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Nailed it -- this is the actual bug that causes the abort.  But I am
> surprised that it doesn't print the error message on Stefan's machine;

Hm, maybe we need an fflush(stderr) in ExceptionalCondition?
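
(i.e. something like this at the tail end of it -- from memory of
assert.c, so treat the surrounding lines as a sketch:)

    /* ExceptionalCondition(), tail end */
    write_stderr("TRAP: %s(\"%s\", File: \"%s\", Line: %d)\n",
                 errorType, conditionName, fileName, lineNumber);

    fflush(stderr);     /* make sure the TRAP line is out before we abort */

    abort();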
        regards, tom lane


Re: strange buildfarm failures

From:
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> Nailed it -- this is the actual bug that causes the abort.  But I am
>> surprised that it doesn't print the error message in Stefan machine's;
>
> Hm, maybe we need an fflush(stderr) in ExceptionalCondition?

stderr is supposed to be line-buffered by default. Couldn't hurt I suppose.



-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com