Thread: Chasing "signal 11" issues

Chasing "signal 11" issues

From
"Tass Chapman"
Date:
Since Monday I have been seeing "terminated by signal 11" messages in my 7.4.6 + Slon 1.0.5 system, but only on the master.

I've done a pg_dumpall, initdb and restore, which reduced the frequency, but I still get them 6-8 times a day.

After turning up logging, it seemed to die when querying a very small table (2 rows, 4 columns, 8-character text strings), but selecting from it manually caused no issues. So I took the hit, shut down the system, and swapped out the RAM (following earlier list suggestions).

This seemed to work until 7 hours later, when the problem reappeared, and at a higher frequency too.
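
To put numbers on "higher frequency" before and after a change such as the RAM swap, the crash messages in the server log can be tallied per hour. The following is only a rough sketch: it assumes every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp (which depends on the logging configuration), and the function name `crashes_per_hour` is made up for illustration.

```python
import re
from collections import Counter

# Assumption: each log line begins with a "YYYY-MM-DD HH:MM:SS" timestamp.
# The first group captures the date plus the hour; the second the signal number.
CRASH_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}.*terminated by signal (\d+)"
)


def crashes_per_hour(lines, signal=11):
    """Count 'terminated by signal N' log lines, grouped by date and hour."""
    counts = Counter()
    for line in lines:
        match = CRASH_RE.search(line)
        if match and int(match.group(2)) == signal:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to `crashes_per_hour` and printing the sorted items gives one line per hour that had at least one crash, which makes it easier to see whether a hardware change really moved the rate.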

It is ONLY occurring on the master, not on any of the leaf (replicated) nodes, and seems to be triggered by a few different systems connecting (so no common code base).

Suggestions/help?

Re: Chasing "signal 11" issues

From
Douglas McNaught
Date:
"Tass Chapman" <tasseh.postgres@gmail.com> writes:

> Since Monday I have been seeing "terminated by signal 11" messages
> in my 7.4.6 + Slon 1.0.5 system, but only on the master.

This kind of thing is almost always a hardware problem.  'memtest86'
is probably a good first step, and see if any of your cooling fans
have stopped working.

-Doug

Re: Chasing "signal 11" issues

From
Tom Lane
Date:
Douglas McNaught <doug@mcnaught.org> writes:
> "Tass Chapman" <tasseh.postgres@gmail.com> writes:
>> Since Monday I have been seeing "terminated by signal 11" messages
>> in my 7.4.6 + Slon 1.0.5 system, but only on the master.

> This kind of thing is almost always a hardware problem.  'memtest86'
> is probably a good first step, and see if any of your cooling fans
> have stopped working.

If nothing about the software or the workload has changed recently,
I'd agree with Doug about what to look at.  Otherwise ... 7.4.6 is
pretty old and we have fixed a number of problems since then.  Even
if you don't have the energy to migrate to 8.* now, there's very little
excuse for not dropping in the latest 7.4 subrelease (7.4.12 I think).

            regards, tom lane

Re: Chasing "signal 11" issues

From
Scott Marlowe
Date:
On Thu, 2006-03-30 at 07:02, Tass Chapman wrote:
> Since Monday I have been seeing "terminated by signal 11" messages in
> my 7.4.6 + Slon 1.0.5 system, but only on the master.
>
> I've done a pg_dumpall, initdb and restore, which reduced the frequency,
> but I still get them 6-8 times a day.
>
> After turning up logging it seemed to die when calling a very small
> table (2 rows, 4 columns, 8 char text strings), but manually selecting
> caused no issues, so I then took a hit and shut down the system and
> swapped out the RAM (from earlier list suggestions).
>
> This seemed to work until 7 hours later when the problem has
> reappeared, at a higher frequency too.
>
> It is ONLY occurring on the master, not on any of the leaf (replicated)
> nodes, and seems to be triggered by a few different systems connecting
> (so no common code base)

As mentioned earlier, this tends to be caused by hardware.  Note that it
can be caused by buggy software or corrupted binaries as well.

It is possible that the binaries you're running have become corrupted
in some small way.  You might want to run md5sum across all the binaries
(postgresql, slony, etc...) on the bad and good machines and compare
them.
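
The checksum comparison described above can also be scripted. What follows is a minimal sketch rather than a definitive tool: it assumes both machines have the same packages installed in the same directory layout (otherwise the digests will differ for legitimate reasons), and the helper names `md5_of_file`, `build_manifest` and `compare_manifests` are invented for this example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map every regular file under root (by relative path) to its MD5."""
    root = Path(root)
    return {
        str(p.relative_to(root)): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(good, bad):
    """Return (changed, only_in_good, only_in_bad) as sorted path lists."""
    changed = sorted(k for k in good.keys() & bad.keys() if good[k] != bad[k])
    only_in_good = sorted(good.keys() - bad.keys())
    only_in_bad = sorted(bad.keys() - good.keys())
    return changed, only_in_good, only_in_bad
```

One way to use it is to run `build_manifest` over the PostgreSQL and Slony install directories on each machine, save the resulting dictionaries (for example as JSON), and feed both to `compare_manifests` on one host. Any path that shows up in `changed` is a binary worth reinstalling or inspecting further.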

If the problem is in the hardware, and I think it is, it could be
anywhere: bad drive, raid controller, raid cache, scsi interface, CPU,
memory, and so on.  So memtest86 might find the problem if it's mainboard
/ CPU / memory, but if it's an I/O problem, it won't.

The most common failures are mechanical in nature.  I've had machines
that were crashing, and all I had to do was reseat the CPU or memory or
heat sink and suddenly it was running fine.

However, you need to switch over to your failover machine immediately.
Running your main database on what is most likely faulty hardware is a
recipe for corruption of your database.