Discussion: Spontaneous PostgreSQL Server Reboot?
Help! I'm having a severe problem with my production PostgreSQL servers, and I'm running out of ideas as to what could be the problem.

On three different machines running the same PostgreSQL version (7.3.5) on Linux and almost identical databases, I have been plagued by occasional, unexplainable (to me) reboots of the computer. On the first two machines it was clearly connected to running a very complex query (multi-hour processing time); on the third, I could find no such correlation.

Please note that only one machine was in production at any one time; the problem never occurred with the development servers (equivalent configuration). This in and of itself may hint at a threading issue (the DB is serving web servers, plus the occasional maintenance cron job -- no cron job was running when the last reboot occurred).

When the last reboot occurred (this past Friday evening), I scoured all logs for any information; the only log that shows any activity within 5 minutes of the restart is $PG_DATA/serverlog:

2004-03-26 21:46:59 [10534] LOG:  query: SELECT v.* FROM [...]
2004-03-26 21:46:59 [10534] LOG:  duration: 0.049330 sec

The machine restarted around 21:50 (based on syslog):

Mar 26 21:50:33 prod-db-2-1 syslogd 1.4.1#10: restart.

In particular, I found no panics, errors, warnings, ... The machine is used only as a DB server; nothing else is running (not even a mail server). I'm running out of things to look for!

I'm running PostgreSQL 7.3.5 on Linux (Pentium machines), configured as follows:

1. Red Hat 7.3 with 120 GB HDD and 1 GB RAM;
2. Red Hat 7.3 with 80 GB HDD and 512 MB RAM;
3. Debian with a 2.2.20 kernel, 5.5 GB HDD and 384 MB RAM. This is the latest one in use.

Note that even the 5.5 GB machine has over 1 GB spare on the /usr/local/pgsql/ partition.

Thanks for any and all help!

Andrew Biagioni
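For anyone doing the same kind of timestamp correlation, a minimal sketch follows. The sample file below merely mirrors the serverlog excerpt above; the real file lives at $PG_DATA/serverlog, and the /tmp path is made up for illustration:

```shell
# Recreate the serverlog excerpt as a scratch sample (hypothetical path)
cat > /tmp/serverlog.sample <<'EOF'
2004-03-26 21:46:59 [10534] LOG:  query: SELECT v.* FROM ...
2004-03-26 21:46:59 [10534] LOG:  duration: 0.049330 sec
EOF

# Pull every entry logged in the five minutes before the 21:50 restart
grep -E '^2004-03-26 21:4[5-9]' /tmp/serverlog.sample
```

Running the same time-window grep against syslog, the postmaster log, and cron logs side by side makes it easy to see which subsystem logged last before the machine went down.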
Andrew Biagioni <andrew.biagioni@e-greek.net> writes:
> On three different machines running the same PostgreSQL version (7.3.5)
> on Linux and almost identical databases, I have been plagued by
> occasional, unexplainable (to me) reboots of the computer.

Postgres can *not* cause a system reboot; it's only an unprivileged user process. (But see * below.) You are dealing with either kernel bugs, hardware errors, or some other root-level process requesting a reboot.

Even though its being three different machines would seem to rule out hardware issues, I'd not jump to that conclusion ... you might be having some kind of common-mode hardware failure. Two questions to ask here:

* Did you buy all the RAM from the same vendor?
* Is the power utility flaky where you live? (If you say "but I've got a UPS", how old are its batteries?)

My money would be on a kernel bug, though. Are you up2date on kernel patches?

			regards, tom lane

(*) ObFinePrint: at least, PG can't directly trigger a reboot. One scenario to think about is flaky RAM in an address range that doesn't get used until the machine is under significant load --- since you say Postgres is the only significant load on the machine, it's entirely possible that the triggering of a hardware failure is closely correlated with what Postgres is doing. Similar remarks apply to broken disk hardware.
Andrew Biagioni wrote:
> On three different machines running the same PostgreSQL version (7.3.5)
> on Linux and almost identical databases, I have been plagued by
> occasional, unexplainable (to me) reboots of the computer.

1) Do you have a software watchdog that is configured to reboot the machine when it isn't tickled? The watchdog might possibly be rebooting because the machine is under heavy load, or perhaps someone forgot to enable the watchdog tickler.

2) Was there anyone in the machine room at the time? One office that I worked in suffered random power outages to several machines for weeks on end, until we figured out that the cleaner was unplugging the main Ethernet switch to use the power point for their vacuum cleaner.

3) Are the power supplies up to scratch? Some power supplies just can't handle the load when you run disk and processor heavily for minutes on end. The result is the same as a brown-out or black-out - power disappears, machine reboots.

Hope this helps prod someone's memory!
Alex
Hi,

memtest86 is your first friend. You should run it for a night. As also mentioned, power could be an issue: you'd like to think memory failures are fairly rare, and the odds of one affecting all your servers even more remote.

Hope this helps,

--
Rob Fielding
rob@dsvr.net
www.dsvr.co.uk
Development, Designer Servers Ltd
Tom Lane wrote:
> Andrew Biagioni <andrew.biagioni@e-greek.net> writes:
>> On three different machines running the same PostgreSQL version (7.3.5)
>> on Linux and almost identical databases, I have been plagued by
>> occasional, unexplainable (to me) reboots of the computer.
>
> Postgres can *not* cause a system reboot; it's only an unprivileged user
> process. (but see *) You are dealing with either kernel bugs, hardware
> errors, or some other root-level process requesting a reboot.

This is one thing I was hoping to hear; any hints on kernel bug issues (or logs of causes)?

> Even though it being three different machines would seem to rule out
> hardware issues, I'd not jump to that conclusion ... you might be having
> some kind of common-mode hardware failure. Two questions to ask here:
> * did you buy all the RAM from the same vendor?
> * is the power utility flaky where you live? (If you say "but
>   I've got a UPS", how old are its batteries?)
>
> My money would be on a kernel bug though. Are you up2date on kernel
> patches?

Kernel patches, yes; kernel versions, no...

> (*) ObFinePrint: at least, PG can't directly trigger a reboot. One
> scenario to think about is flaky RAM in an address range that doesn't
> get used until the machine is under significant load --- since you say
> Postgres is the only significant load on the machine, it's entirely
> possible that triggering of a hardware failure is closely correlated to
> what Postgres is doing. Similar remarks apply to broken disk hardware.

I just got a suggestion to run badblocks and memtest86 to check the hardware; that and an updated kernel should help me figure out what is going on...

Thanks,

Andrew

> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
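For what it's worth, `badblocks -sv <device>` is the usual read-only surface scan. An even cruder check that needs nothing beyond dd is to read every block and let the kernel log any I/O errors. The sketch below targets a scratch file rather than a real device (the /tmp path is made up), so it is safe to try anywhere; on a real machine you would point the second dd at the disk device instead:

```shell
# Stand-in for a disk: a 4 MB scratch file (a real run would read the device)
dd if=/dev/zero of=/tmp/scratch.img bs=1M count=4 2>/dev/null

# Read every block; a hard read error would abort dd with a nonzero status
dd if=/tmp/scratch.img of=/dev/null bs=64k 2>/dev/null && echo "read check passed"
```

This only exercises reads, so it won't catch write-path or cache failures; badblocks' destructive write mode (on an unmounted disk) is more thorough.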
Alex Satrapa wrote:
> Andrew Biagioni wrote:
>> On three different machines running the same PostgreSQL version
>> (7.3.5) on Linux and almost identical databases, I have been
>> plagued by occasional, unexplainable (to me) reboots of the
>> computer.
>
> 1) Do you have a software watchdog that is configured to reboot the
> machine when it isn't tickled? The watchdog might possibly be
> rebooting because the machine is under heavy load, or perhaps someone
> forgot to enable the watchdog tickler.

There's also the possibility that the motherboard has a temperature monitor and could shut down if the temperature runs too high. Is the machine properly cooled? Does it have plenty of space around it for proper air flow? What's the processor type? (We all know AMD processors run hotter.)

--
Until later, Geoffrey		Registered Linux User #108567
Building secure systems in spite of Microsoft
Andrew Biagioni wrote:
> I just got a suggestion to run badblocks and memtest86, to check the
> hardware; that and an updated kernel should help me figure out what
> is going on...

It seems to me that since you reported the problem on three different machines, you should be looking for a common problem. Having bad RAM in all three machines I would find unlikely, even if it is all from the same batch. The same would seem to apply to hard drives.

I'd be more suspicious of power fluctuations or heat issues -- things that can affect multiple machines the same way, that is, if they are in the same location, using the same power source. I wouldn't rule out memory or drives, but I'd look first for something that affects all the machines. Common batches of RAM, drives, or identical motherboards could be a possibility, but I don't think as likely as power and/or heat issues.

--
Until later, Geoffrey		Registered Linux User #108567
Building secure systems in spite of Microsoft
Alex,

the answer is "no" to all of these. We are a tiny start-up (2 guys, and we do our own cleaning); ambient temperature varies significantly but is not related to the failure, and one machine starts beeping when it gets too hot (after which we added an extra case fan); no fancy watchdogs (maybe someday... One can only dream :-> ); three different cases, power supplies, motherboards, etc., etc. (one power supply is extra-large, and that's the machine that started failing first!).

We originally blamed the problem on hardware failure (first machine); then on OS version/configuration (second machine); now we're out of things to blame, except maybe unusually bad luck...

Thanks,

Andrew

Alex Satrapa wrote:
> Andrew Biagioni wrote:
>> On three different machines running the same PostgreSQL version
>> (7.3.5) on Linux and almost identical databases, I have been plagued
>> by occasional, unexplainable (to me) reboots of the computer.
>
> 1) Do you have a software watchdog that is configured to reboot the
> machine when it isn't tickled? The watchdog might possibly be rebooting
> because the machine is under heavy load, or perhaps someone forgot to
> enable the watchdog tickler.
>
> 2) Was there anyone in the machine room at the time? One office that I
> worked in suffered random power outages to several machines for weeks on
> end, until we figured out that the cleaner was unplugging the main
> Ethernet switch to use the power point for their vacuum cleaner.
>
> 3) Are the power supplies up to scratch? Some power supplies just can't
> handle the load when you run disk and processor heavily for minutes on
> end. The result is the same as a brown-out or black-out - power
> disappears, machine reboots.
>
> Hope this helps prod someone's memory!
> Alex
On Tue, 30 Mar 2004, Andrew Biagioni wrote:
> Alex,
>
> the answer is "no" to all of these. We are a tiny start-up (2 guys, and
> we do our own cleaning); ambient temperature varies significantly but
> is not related to the failure, and one machine starts beeping when it
> gets too hot (then we added an extra case fan); no fancy watchdogs
> (maybe someday... One can only dream :-> ); three different cases,
> power supplies, motherboards, etc., etc. (one power supply is
> extra-large, and that's the machine that started failing first!).
>
> We originally blamed the problem on hardware failure (first machine);
> then on OS version/configuration (second machine); now we're out of
> things to blame, except maybe unusually bad luck...

What did memtest86 say?

Did the same person build all the machines? I've seen plenty of folks build machines and zap the memory when installing it. Over 95% of all ESD failures are partial / delayed failures, so just because a computer boots up doesn't mean proper ESD procedures were followed; and if not, and if you're in a dry environment like I am (I live in Denver), then it's quite possible all three have a bad CPU/mobo/memory or something like that.
scott.marlowe wrote:
> On Tue, 30 Mar 2004, Andrew Biagioni wrote:
>
>> We originally blamed the problem on hardware failure (first machine);
>> then on OS version/configuration (second machine); now we're out of
>> things to blame, except maybe unusually bad luck...
>
> What did memtest86 say?
>
> Did the same person build all the machines? I've seen plenty of folks
> build machines and zap the memory when installing it. >95% of all ESD
> failures are partial / delayed failures, so just because a computer boots
> up doesn't mean proper ESD procedures were followed, and if not, and if
> you're in a dry environment like I am (I live in Denver) then it's quite
> possible all three have bad CPU/mobo/memory or something like that.

Two different people built the machines; we're both electrical engineers with plenty of familiarity and experience with static issues, so that particular issue is not likely.

As for memtest86 -- I haven't been able to run it on two of the machines yet (they are in production), and I have to restart the third one (it was "retired" after the third time it died on us).
Meanwhile I found out some more details:

- the first machine had a software RAID system that may have been unreliable;
- the second machine had a much older kernel and sloppily-updated modules, and it would hang -- not reboot;
- the last machine to reboot MAY have been a line power issue (the whole building lost power a few hours later, so I lost some info on other machines' restarting -- I'll dig more).

So -- it's memtest86 and badblocks for all three (as soon as I can), better UPS-ing, updated kernel(s), and checking more machines' logs; then we'll see...

Thanks to you all for the suggestions -- keep them coming!

Andrew
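On the "checking more machines' logs" point: syslogd writes a "restart." marker whenever it starts (at boot, and also when HUPed), so counting those markers per host across the collected syslogs is a quick way to see which machines went down together -- simultaneous restarts point to a shared cause such as building power. A sketch with made-up hostnames and a sample file standing in for the real logs:

```shell
# Sample syslog lines (hostnames and path are hypothetical)
cat > /tmp/syslog.sample <<'EOF'
Mar 26 21:50:33 prod-db-2-1 syslogd 1.4.1#10: restart.
Mar 26 21:51:02 prod-web-1 syslogd 1.4.1#10: restart.
Mar 27 03:00:00 prod-db-2-1 CRON[123]: (root) CMD (run-parts /etc/cron.daily)
EOF

# Count restart markers per host; clustered timestamps suggest a power event
awk '/syslogd.*restart/ {print $4}' /tmp/syslog.sample | sort | uniq -c
```

Since the marker also appears on a plain syslogd restart, cross-check any hit against the boot timestamps before blaming power.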