Thread: Spontaneous PostgreSQL Server Reboot?

Spontaneous PostgreSQL Server Reboot?

From
Andrew Biagioni
Date:
Help!

I'm having a severe problem with my production PostgreSQL servers, and
I'm running out of ideas as to what could be the problem.

On three different machines running the same PostgreSQL version (7.3.5)
on Linux and almost identical databases, I have been plagued by
occasional, unexplainable (to me) reboots of the computer.

On the first two machines it was clearly connected to running a very
complex query (multi-hour processing time);  on the third, I could find
no such correlation.  Please note that only one machine was Production
at any one time;  the problem never occurred with the Development
servers (equivalent configuration).  This in and of itself may hint at a
threading issue (the DB is serving web servers, plus the occasional
maintenance cron job - no cron job running when the last reboot occurred).

When the last reboot occurred (this past Friday evening), I scoured all
logs for any information;  the only log that shows any activity
within 5 minutes of the restart is $PG_DATA/serverlog:

2004-03-26 21:46:59 [10534]  LOG:  query: SELECT  v.* FROM [...]
2004-03-26 21:46:59 [10534]  LOG:  duration: 0.049330 sec

The machine restarted around 21:50 (based on syslog):

Mar 26 21:50:33 prod-db-2-1 syslogd 1.4.1#10: restart.

In particular, I found no panics, errors, warnings, ...  The machine is used
only as a DB server, nothing else running (not even a mail server).  I'm
running out of things to look for!
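
One way to quantify the quiet window before a restart like the one above is to compare timestamps across the two logs. What follows is only a minimal Python sketch, not part of the original setup. It assumes the classic syslog line format shown above (which carries no year, so the year is passed in), an English locale for month names, and that a "syslogd ... restart." line marks a boot. The function names are made up for illustration.

```python
from datetime import datetime

def parse_syslog_time(line, year):
    # Classic syslog stamps ("Mar 26 21:50:33") have no year; the caller supplies it.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def parse_serverlog_time(line):
    # serverlog lines start with "YYYY-MM-DD HH:MM:SS".
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

def restart_times(syslog_lines, year):
    # Assumption: syslogd logs a "restart." line each time the machine comes up.
    return [parse_syslog_time(line, year)
            for line in syslog_lines
            if "syslogd" in line and "restart" in line]

def silent_gap(serverlog_lines, restart):
    # Time between the last serverlog entry at or before the restart and the restart.
    earlier = [t for t in map(parse_serverlog_time, serverlog_lines) if t <= restart]
    return restart - max(earlier) if earlier else None

# Sample lines copied from the logs quoted above.
serverlog = [
    "2004-03-26 21:46:59 [10534]  LOG:  query: SELECT  v.* FROM [...]",
    "2004-03-26 21:46:59 [10534]  LOG:  duration: 0.049330 sec",
]
syslog = ["Mar 26 21:50:33 prod-db-2-1 syslogd 1.4.1#10: restart."]

for restart in restart_times(syslog, 2004):
    print(restart, "gap since last serverlog entry:", silent_gap(serverlog, restart))
```

A gap of a few minutes with nothing logged fits an abrupt power loss or hard reset better than an orderly shutdown, though on a lightly loaded server it may just mean no queries arrived.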

I'm running PostgreSQL 7.3.5 on Linux (pentium machines), configured as
follows:
1. Red Hat 7.3 with 120 GB HDD and 1 GB RAM;
2. Red Hat 7.3 with 80 GB HDD and 512 MB RAM;
3. Debian with a 2.2.20 kernel, 5.5 GB HDD and 384 MB RAM.  This is the
latest one in use.

Note that even the 5.5 GB machine has over 1 GB spare on the
/usr/local/pgsql/ partition.

Thanks for any and all help!

        Andrew Biagioni


Re: Spontaneous PostgreSQL Server Reboot?

From
Tom Lane
Date:
Andrew Biagioni <andrew.biagioni@e-greek.net> writes:
> On three different machines running the same PostgreSQL version (7.3.5)
> on Linux and almost identical databases, I have been plagued by
> occasional, unexplainable (to me) reboots of the computer.

Postgres can *not* cause a system reboot; it's only an unprivileged user
process.  (but see *)  You are dealing with either kernel bugs, hardware
errors, or some other root-level process requesting a reboot.

Even though it being three different machines would seem to rule out
hardware issues, I'd not jump to that conclusion ... you might be having
some kind of common-mode hardware failure.  Two questions to ask here:
    * did you buy all the RAM from the same vendor?
    * is the power utility flaky where you live?  (If you say "but
      I've got a UPS", how old are its batteries?)

My money would be on a kernel bug though.  Are you up2date on kernel
patches?

            regards, tom lane

(*) ObFinePrint: at least, PG can't directly trigger a reboot.  One
scenario to think about is flaky RAM in an address range that doesn't
get used until the machine is under significant load --- since you say
Postgres is the only significant load on the machine, it's entirely
possible that triggering of a hardware failure is closely correlated to
what Postgres is doing.  Similar remarks apply to broken disk hardware.

Re: Spontaneous PostgreSQL Server Reboot?

From
Alex Satrapa
Date:
Andrew Biagioni wrote:
> On three different machines running the same PostgreSQL version (7.3.5)
> on Linux and almost identical databases, I have been plagued by
> occasional, unexplainable (to me) reboots of the computer.

1) Do you have a software watchdog that is configured to reboot the
machine when it isn't tickled? The watchdog might possibly be rebooting
because the machine is under heavy load, or perhaps someone forgot to
enable the watchdog tickler.

2) Was there anyone in the machine room at the time?  One office that I
worked in suffered random power outages to several machines for weeks on
end, until we figured out that the cleaner was unplugging the main
Ethernet switch to use the power point for their vacuum cleaner.

3) Are the power supplies up to scratch?  Some power supplies just can't
handle the load when you run disk and processor heavily for minutes on
end. The result is the same as a brown-out or black-out - power
disappears, machine reboots.

Hope this helps prod someone's memory!
Alex

Re: Spontaneous PostgreSQL Server Reboot?

From
Rob Fielding
Date:
Hi,
memtest86 is your first friend. You should run it for a night. As also
mentioned, power could be an issue: you'd like to think memory failures
are fairly rare, and the chance of them affecting all three servers
fairly remote.

Hope this helps,

--

Rob Fielding
rob@dsvr.net

www.dsvr.co.uk              Development             Designer Servers Ltd

Re: Spontaneous PostgreSQL Server Reboot?

From
Andrew Biagioni
Date:
Tom Lane wrote:
> Andrew Biagioni <andrew.biagioni@e-greek.net> writes:
>
>>On three different machines running the same PostgreSQL version (7.3.5)
>>on Linux and almost identical databases, I have been plagued by
>>occasional, unexplainable (to me) reboots of the computer.
>
>
> Postgres can *not* cause a system reboot; it's only an unprivileged user
> process.  (but see *)  You are dealing with either kernel bugs, hardware
> errors, or some other root-level process requesting a reboot.

This is one thing I was hoping to hear;  any hints on kernel bug issues
(or logs of causes)?

> Even though it being three different machines would seem to rule out
> hardware issues, I'd not jump to that conclusion ... you might be having
> some kind of common-mode hardware failure.  Two questions to ask here:
>     * did you buy all the RAM from the same vendor?
>     * is the power utility flaky where you live?  (If you say "but
>       I've got a UPS", how old are its batteries?)
>
> My money would be on a kernel bug though.  Are you up2date on kernel
> patches?

kernel patches, yes;  kernel versions, no...

>             regards, tom lane
>
> (*) ObFinePrint: at least, PG can't directly trigger a reboot.  One
> scenario to think about is flaky RAM in an address range that doesn't
> get used until the machine is under significant load --- since you say
> Postgres is the only significant load on the machine, it's entirely
> possible that triggering of a hardware failure is closely correlated to
> what Postgres is doing.  Similar remarks apply to broken disk hardware.

I just got a suggestion to run badblocks and memtest86, to check the
hardware;  that and an updated kernel should help me figure out what is
going on...

Thanks,

        Andrew


Re: Spontaneous PostgreSQL Server Reboot?

From
Geoffrey
Date:
Alex Satrapa wrote:
> Andrew Biagioni wrote:
>
>> On three different machines running the same PostgreSQL version
>> (7.3.5) on Linux and almost identical databases, I have been
>> plagued by occasional, unexplainable (to me) reboots of the
>> computer.
>
>
> 1) Do you have a software watchdog that is configured to reboot the
> machine when it isn't tickled? The watchdog might possibly be
> rebooting because the machine is under heavy load, or perhaps someone
> forgot to enable the watchdog tickler.

There's also the possibility that the motherboard has a temperature
monitor and shuts the machine down if the temperature runs too high.  Is
the machine properly cooled?  Does it have plenty of space so as to have
the proper air flow?  What's the processor type? (we all know amd
processors run hotter).

--
Until later, Geoffrey                     Registered Linux User #108567
Building secure systems in spite of Microsoft

Re: Spontaneous PostgreSQL Server Reboot?

From
Geoffrey
Date:
Andrew Biagioni wrote:

> I just got a suggestion to run badblocks and memtest86, to check the
>  hardware;  that and an updated kernel should help me figure out what
> is going on...

It seems to me that since you reported the problem on three different
machines, you should be looking for a common problem.  Having bad ram in
all three machines I would find unlikely, even if it is all the same
batch.  Same would seem to apply to hard drives.  I'd be more suspicious
of power fluctuations or heat issues: things that can affect multiple
machines the same way, that is, if they are in the same location, using
the same power source.

I wouldn't rule out memory or drives, but I'd likely look for something
that affects all the machines.  Common batches of ram, drives or
identical motherboards could be a possibility, but I don't think as
likely as power and/or heat issues.



--
Until later, Geoffrey                     Registered Linux User #108567
Building secure systems in spite of Microsoft

Re: Spontaneous PostgreSQL Server Reboot?

From
Andrew Biagioni
Date:
Alex,

the answer is "no" to all of these.  We are a tiny start-up (2 guys, and
we do our own cleaning);  ambient temperature varies significantly but
is not related to the failure, and one machine starts beeping when it
gets too hot (then we added an extra case fan);  no fancy watchdogs
(maybe someday...  One can only dream :-> );  three different cases,
power supplies, motherboards, etc., etc. (one power supply is
extra-large, and that's the machine that started failing first!).

We originally blamed the problem on hardware failure (first machine);
then on OS version/configuration (second machine);  now we're out of
things to blame, except maybe unusually bad luck...

Thanks,

        Andrew

Alex Satrapa wrote:

> Andrew Biagioni wrote:
>
>> On three different machines running the same PostgreSQL version
>> (7.3.5) on Linux and almost identical databases, I have been plagued
>> by occasional, unexplainable (to me) reboots of the computer.
>
>
> 1) Do you have a software watchdog that is configured to reboot the
> machine when it isn't tickled? The watchdog might possibly be rebooting
> because the machine is under heavy load, or perhaps someone forgot to
> enable the watchdog tickler.
>
> 2) Was there anyone in the machine room at the time?  One office that I
> worked in suffered random power outages to several machines for weeks on
> end, until we figured out that the cleaner was unplugging the main
> Ethernet switch to use the power point for their vacuum cleaner.
>
> 3) Are the power supplies up to scratch?  some power supplies just can't
> handle the load when you run disk and processor heavily for minutes on
> end. The result is the same as a brown-out or black-out - power
> disappears, machine reboots.
>
> Hope this helps prod someone's memory!
> Alex

Re: Spontaneous PostgreSQL Server Reboot?

From
"scott.marlowe"
Date:
On Tue, 30 Mar 2004, Andrew Biagioni wrote:

> Alex,
>
> the answer is "no" to all of these.  We are a tiny start-up (2 guys, and
> we do our own cleaning);  ambient temperature varies significantly but
> is not related to the failure, and one machine starts beeping when it
> gets too hot (then we added an extra case fan);  no fancy watchdogs
> (maybe someday...  One can only dream :-> );  three different cases,
> power supplies, motherboards, etc., etc. (one power supply is
> extra-large, and that's the machine that started failing first!).
>
> We originally blamed the problem on hardware failure (first machine);
> then on OS version/configuration (second machine);  now we're out of
> things to blame, except maybe unusually bad luck...

What did memtest86 say?

Did the same person build all the machines?  I've seen plenty of folks
build machines and zap the memory when installing it.  >95% of all ESD
failures are partial / delayed failures, so just because a computer boots
up doesn't mean proper ESD procedures were followed, and if not, and if
you're in a dry environment like I am (I live in Denver) then it's quite
possible all three have bad CPU/mobo/memory or something like that.


Re: Spontaneous PostgreSQL Server Reboot?

From
Andrew Biagioni
Date:
scott.marlowe wrote:

> On Tue, 30 Mar 2004, Andrew Biagioni wrote:
>
>
>>Alex,
>>
>>the answer is "no" to all of these.  We are a tiny start-up (2 guys, and
>>we do our own cleaning);  ambient temperature varies significantly but
>>is not related to the failure, and one machine starts beeping when it
>>gets too hot (then we added an extra case fan);  no fancy watchdogs
>>(maybe someday...  One can only dream :-> );  three different cases,
>>power supplies, motherboards, etc., etc. (one power supply is
>>extra-large, and that's the machine that started failing first!).
>>
>>We originally blamed the problem on hardware failure (first machine);
>>then on OS version/configuration (second machine);  now we're out of
>>things to blame, except maybe unusually bad luck...
>
>
> What did memtest86 say?
>
> Did the same person build all the machines?  I've seen plenty of folks
> build machines and zap the memory when installing it.  >95% of all ESD
> failures are partial / delayed failures, so just because a computer boots
> up doesn't mean proper ESD procedures were followed, and if not, and if
> you're in a dry environment like I am (I live in Denver) then it's quite
> possible all three have bad CPU/mobo/memory or something like that.

Two different people built the machines;  we're both electrical
engineers with plenty of familiarity and experience with static issues,
so that particular issue is not likely.

As for memtest86 - I haven't been able to run it on two of the machines
yet (they are in production), and I have to restart the third one (it
was "retired" after the third time it died on us).

Meanwhile I found out some more details:
- the first machine had a software raid system that may have been unreliable
- the second machine had a much older kernel and sloppily-updated
modules, and it would hang -- not reboot
- the last machine to reboot MAY have been a line power issue (the whole
building lost power a few hours later, so I lost some info on other
machines' restarting -- I'll dig more).

So -- it's memtest86 and badblocks for all three (as soon as I can),
better UPS-ing, updated kernel(s), and checking more machines' logs;
then we'll see...
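
When checking more machines' logs, restarts that cluster in time across hosts point toward a shared cause such as line power, while isolated restarts point toward a per-machine fault. A minimal Python sketch of that comparison follows, under stated assumptions: restart timestamps have already been extracted per host, the 10-minute window is an arbitrary choice, `shared_restarts` is a hypothetical helper name, and every host and timestamp in the example other than the prod-db-2-1 restart is made up.

```python
from datetime import datetime, timedelta

def shared_restarts(restarts_by_host, window=timedelta(minutes=10)):
    """Return groups of restarts, from more than one host, that fall close together.

    A new group starts whenever the gap to the previous restart exceeds `window`.
    """
    events = sorted((t, host)
                    for host, times in restarts_by_host.items()
                    for t in times)
    groups, current = [], []
    for t, host in events:
        if current and t - current[-1][0] > window:
            groups.append(current)
            current = []
        current.append((t, host))
    if current:
        groups.append(current)
    # Keep only groups that involve at least two different hosts.
    return [g for g in groups if len({h for _, h in g}) > 1]

# Illustrative data: only the prod-db-2-1 entry comes from the thread;
# "other-host" and its timestamps are invented for the example.
example = {
    "prod-db-2-1": [datetime(2004, 3, 26, 21, 50, 33)],
    "other-host": [datetime(2004, 3, 20, 3, 0, 0),
                   datetime(2004, 3, 26, 21, 52, 0)],
}
for group in shared_restarts(example):
    print("possible common cause:", [(str(t), h) for t, h in group])
```

An empty result does not rule out power problems (a UPS may carry some machines through an event and not others), so this only narrows the search.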

Thanks to you all for the suggestions -- keep them coming!

        Andrew