Re: Spontaneous PostgreSQL Server Reboot?

From Andrew Biagioni
Subject Re: Spontaneous PostgreSQL Server Reboot?
Msg-id 406B984E.7030109@e-greek.net
In reply to Re: Spontaneous PostgreSQL Server Reboot?  ("scott.marlowe" <scott.marlowe@ihs.com>)
List pgsql-admin
scott.marlowe wrote:

> On Tue, 30 Mar 2004, Andrew Biagioni wrote:
>
>
>>Alex,
>>
>>the answer is "no" to all of these.  We are a tiny start-up (2 guys, and
>>we do our own cleaning);  ambient temperature varies significantly but
>>is not related to the failure, and one machine starts beeping when it
>>gets too hot (then we added an extra case fan);  no fancy watchdogs
>>(maybe someday...  One can only dream :-> );  three different cases,
>>power supplies, motherboards, etc., etc. (one power supply is
>>extra-large, and that's the machine that started failing first!).
>>
>>We originally blamed the problem on hardware failure (first machine);
>>then on OS version/configuration (second machine);  now we're out of
>>things to blame, except maybe unusually bad luck...
>
>
> What did memtest86 say?
>
> Did the same person build all the machines?  I've seen plenty of folks
> build machines and zap the memory when installing it.  >95% of all ESD
> failures are partial / delayed failures, so just because a computer boots
> up doesn't mean proper ESD procedures were followed, and if not, and if
> you're in a dry environment like I am (I live in Denver) then it's quite
> possible all three have bad CPU/mobo/memory or something like that.

Two different people built the machines;  we're both electrical
engineers with plenty of familiarity and experience with static issues,
so that particular issue is not likely.

As for memtest86 - I haven't been able to run it on two of the machines
yet (they are in production), and I have to restart the third one (it
was "retired" after the third time it died on us).

Meanwhile I found out some more details:
- the first machine had a software RAID setup that may have been unreliable
- the second machine had a much older kernel and sloppily-updated
  modules, and it would hang -- not reboot
- the last machine to reboot MAY have been a line power issue (the whole
  building lost power a few hours later, so I lost some info on other
  machines' restarting -- I'll dig more).
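
One way to check the line-power theory once the logs are recovered is to see whether reboots on different machines cluster in time. The following is only a rough sketch: the function name `coincident_reboots`, the 5-minute window, and the input shape (a dict of host name to reboot times) are all assumptions for illustration, not something taken from the actual machines.

```python
from datetime import datetime, timedelta


def coincident_reboots(reboots, window=timedelta(minutes=5)):
    """Find groups of reboots on different hosts that happened close together.

    `reboots` maps a host name to a list of datetime objects (hypothetical
    input; how the times are extracted from each machine's logs is up to you).
    Events are chained: an event joins the current group if it occurred no
    more than `window` after the previous event. Only groups that involve
    at least two distinct hosts are returned.
    """
    # Flatten to (time, host) pairs and order them chronologically.
    events = sorted((t, host) for host, times in reboots.items() for t in times)

    clusters = []
    current = []
    for event in events:
        # A gap larger than the window closes the current group.
        if current and event[0] - current[-1][0] > window:
            if len({host for _, host in current}) > 1:
                clusters.append(current)
            current = []
        current.append(event)

    # Do not forget the last open group.
    if len({host for _, host in current}) > 1:
        clusters.append(current)
    return clusters
```

The reboot times would have to be collected by hand or by a small parser, for example from `last reboot` output or the system logs on each box. An empty result would suggest the failures are independent; a group containing several hosts would point toward a shared cause such as line power.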

So -- it's memtest86 and badblocks for all three (as soon as I can),
better UPS-ing, updated kernel(s), and checking more machines' logs;
then we'll see...

Thanks to you all for the suggestions -- keep them coming!

        Andrew

