Discussion: Excess idle processes, leaking WAL logs?
I'm a sysadmin, not a DBA, and I inherited a legacy app after the
developer left the company. It's 4 early mod_perl (1.29) + early DBI
(1.43) app servers, going against postgres 7.4.6.

The DB just crapped itself a few days ago. In the postmortem, we found
out that the number of processes on the server had been climbing as the
server ran, going from some 100 processes to about 350, which we believe
were mostly idle postgres processes (someone else got the page-out for
support). The uptime was around 6 months.

In the 3 days since it died, I've been watching it and, once again, the
processes are climbing slowly. They're idle postgres processes, fairly
evenly distributed across the app servers, and more interestingly, if I
do lsof|grep postgres, I see a large number of lines (now 55) like:

postmaste 24521 postgres 55u REG 58,0 16777216 2899982 /a/path/pdm/pgsql/data/pg_xlog/0000030F000000E2 (deleted)

I looked in the 7.4 docs about WAL. checkpoint_timeout is 300,
checkpoint_segments is 8, and there's plenty of space in pg_xlog, but
there are 18 files in there, some a few hours old, which I suspect would
not be the case if it were checkpointing properly.

Am I barking up the right tree over these leaking processes/connections,
or is the WAL logs thing just a red herring? Sadly, I can't edit the app
code, and I doubt I could upgrade the DB unless I can really
business-justify that a minor version change would fix it (I'd probably
have to stay in the 7.4 series due to timid managers). But if there are
little baby tweaks I'm missing, or if I can say "yeah, this version of
postgres leaks, plan for 3-month reboots or for moving to 7.4.x", that'd
work for me.

Thanks for any help!

-Greg
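For reference, a minimal sketch of watching both symptoms from a shell.
The data directory path comes from the lsof output above; the "idle"
process title and the exact lsof formatting are assumptions that vary by
platform, so treat these as starting points rather than exact commands:

    # Count postgres backends whose process title ends in "idle"
    # (a connected client with no query running). Title format is
    # platform-dependent; adjust the pattern to match your ps output.
    ps -eo args | grep -c '^postgres:.*idle$'

    # Count WAL segments that have been deleted on disk but are still
    # held open by postgres processes (the "(deleted)" lines above).
    lsof -u postgres | grep -c 'pg_xlog.*(deleted)'

Running both from cron every few minutes and logging the counts would
show whether the first number climbs steadily while the WAL count stays
flat, which would point at leaked connections rather than WAL trouble.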
I highly recommend moving to the latest 7.4.x. 7.4.6 is well over 2
years old, and the numerous security issues alone should be convincing
enough to upgrade. Updating this far appears to be a multi-step process,
though. See the release notes.

There was a race condition prior to 7.4.10 in transaction logging:
http://www.postgresql.org/docs/7.4/interactive/release-7-4-10.html

There's also a pg_clog race condition fixed in 7.4.12, but that doesn't
seem obviously related to me.

You didn't say if you're using clustering, but you may need to use
pg_resetxlog to correct this issue:
http://www.postgresql.org/docs/7.4/interactive/app-pgresetxlog.html

I am by no means an expert with PostgreSQL; however, it seems likely
that at least some of the problems you're having have been identified
and corrected.

--
Brandon Aiken
CS/IT Systems Engineer
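If you do go down the minor-upgrade road, it's worth being able to
confirm what's actually running before and after. A sketch, assuming
psql superuser access and the data directory path from the original
post:

    # What the running server reports
    psql -U postgres -c 'SELECT version();'

    # What the data directory was initialized with. Minor releases in
    # the same major series share the same on-disk catalog version,
    # which is why a 7.4.x update is normally a binary swap plus
    # restart (barring any extra steps the release notes call out).
    pg_controldata /a/path/pdm/pgsql/data | grep 'Catalog version'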
"Greg Cox" <ratness@hotmail.com> writes: > In the 3 days since it died, I've been watching it and, once again, the > processes are climbing slowly again. They're idle postgres processes, > fairly evenly distributed against the app servers, and more interestingly, This isn't a postgres bug, it's a problem in your application code; it's not closing sessions it doesn't need. Possibly a problem in a connection-pooling layer? > I looked in the 7.4 docs about WAL. checkpoint_timeout is 300, > checkpoint_segments is 8, plenty of space in pg_xlog, but there's 18 files > in there, some a few hours old, which I would suspect would not be the case > if it were checkpointing properly. No, that doesn't sound out of line. The steady state is to have about 2 * checkpoint_segments + 1 WAL files. You might have one or so more than I'd have expected but that's not enough to evidence a problem. If you are concerned about whether checkpoints are really happening, use pg_controldata to keep an eye on the last-checkpoint time in pg_control. I concur with the other comment that you really ought to be on 7.4.latest but that's not going to magically fix your connection-pooling issue. regards, tom lane