Обсуждение: pg_stat_activity showing non-existent processes

Поиск
Список
Период
Сортировка

pg_stat_activity showing non-existent processes

От
Jerry Sievers
Дата:
Hello; Briefly, we've been fighting an "old idle transaction" problem
on our Pg 8.0.3 Solaris 2.9 production system for a long time.  This
is due to some quirks in our app server code (to be fixed ASAP
<groan>).

Hourly we run a script that SIGTERMs all backends reported as being in
idle transaction state for 6 hours or better.  Many, many runs of this
procedure during past several weeks; no problem.

Suddently, we have a condition where the pg_stat_activity view is
reporting several such backends and there is no such PID in the system
any more.  These were either successfully terminated by an earlier
script run, died off abnormally or even by normal client disconnect.
(Unknown).

At any rate; I'm wondering what possible causes might be responsible
for pg_stat_activity's underlying functions to lose track of the valid
process list?

Great big thanks!

--
-------------------------------------------------------------------------------
Jerry Sievers   305 854-3001 (home)     WWW ECommerce Consultant
                305 321-1144 (mobile    http://www.JerrySievers.com/

Re: pg_stat_activity showing non-existent processes

От
Tom Lane
Дата:
Jerry Sievers <jerry@jerrysievers.com> writes:
> At any rate; I'm wondering what possible causes might be responsible
> for pg_stat_activity's underlying functions to lose track of the valid
> process list?

It sounds like the stats collector missed a few "backend quit"
messages.  This isn't real surprising: the stats messaging mechanism is
intentionally designed to drop messages under severe load, rather than
slow down backends.

We recently put in a filter that prevents reporting pg_stat_activity
lines for backends that are dead according to the up-to-date list in
shared memory.  I don't think that's in 8.0.3 though.

            regards, tom lane

Re: pg_stat_activity showing non-existent processes

От
Jerry Sievers
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Jerry Sievers <jerry@jerrysievers.com> writes:
> > At any rate; I'm wondering what possible causes might be responsible
> > for pg_stat_activity's underlying functions to lose track of the valid
> > process list?
>
> It sounds like the stats collector missed a few "backend quit"
> messages.  This isn't real surprising: the stats messaging mechanism is
> intentionally designed to drop messages under severe load, rather than
> slow down backends.

Tom, I appreciate your comments on this.

> We recently put in a filter that prevents reporting pg_stat_activity
> lines for backends that are dead according to the up-to-date list in
> shared memory.  I don't think that's in 8.0.3 though.

This machine is going to be upgraded to 8.1 ASAP.

In the meantime, I may adjust the cron run time for the batch job that
terminates these sessions in case it's running concurrently with other
batch traffic.

I believe the thing runs hourly at the 00 minute and this may likely
be a time that we're kicking off all sorts of other jobs too.


> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>

--
-------------------------------------------------------------------------------
Jerry Sievers   305 854-3001 (home)     WWW ECommerce Consultant
                305 321-1144 (mobile    http://www.JerrySievers.com/

Re: pg_stat_activity showing non-existent processes

От
"Kevin Grittner"
Дата:
>>> On Sat, Mar 25, 2006 at  8:40 pm, in message
<22874.1143340808@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jerry Sievers <jerry@jerrysievers.com> writes:
>> At any rate; I'm wondering what possible causes might be
responsible
>> for pg_stat_activity's underlying functions to lose track of the
valid
>> process list?
>
> It sounds like the stats collector missed a few "backend quit"
> messages.  This isn't real surprising: the stats messaging mechanism
is
> intentionally designed to drop messages under severe load, rather
than
> slow down backends.

Is there any way to tweak this in favor of more accurate information,
even if has a performance cost?  We're finding that during normal
operations we're not seeing most connections added to the
pg_stat_activity table.  We would like to be able to count on accurate
information there.  We've been considering adding functions to get at
the underlying structures to be able to retrieve it, but it would make a
lot of sense (for us, anyway) to make this table accurate instead.  What
would be involved in that?  Would it improve the accuracy of the other
statistics, as well?  Would anyone else be interested in something like
this (probably controlled by a configuration option), or are we unique
in this regard?

-Kevin


Re: pg_stat_activity showing non-existent processes

От
Tom Lane
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Is there any way to tweak this in favor of more accurate information,
> even if has a performance cost?  We're finding that during normal
> operations we're not seeing most connections added to the
> pg_stat_activity table.  We would like to be able to count on accurate
> information there.

That's basically a non-starter because of the delay in reporting from
the stats collector process (ie, even if the information was "completely
accurate" it'd still be stale by the time that your code gets its hands
on it).  I think you'd be talking about a complete redesign of the stats
subsystem to be able to use it that way.

Having said that, though, I'd be pretty surprised if the stats subsystem
was dropping more than a small fraction of messages --- I would think
that could only occur under very heavy load, and if that's your normal
operating state then it's time to upgrade your hardware ;-).  Maybe you
should investigate a bit more closely to find out why it's dropping so
much.

            regards, tom lane

Re: pg_stat_activity showing non-existent processes

От
"Kevin Grittner"
Дата:
>>> On Mon, Apr 3, 2006 at 11:52 am, in message
<14779.1144083156@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Is there any way to tweak this in favor of more accurate
information,
>> even if has a performance cost?  We're finding that during normal
>> operations we're not seeing most connections added to the
>> pg_stat_activity table.  We would like to be able to count on
accurate
>> information there.
>
> That's basically a non- starter because of the delay in reporting
from
> the stats collector process (ie, even if the information was
"completely
> accurate" it'd still be stale by the time that your code gets its
hands
> on it).  I think you'd be talking about a complete redesign of the
stats
> subsystem to be able to use it that way.

We want this for our monitoring software, to raise an alert when the
connection pool diverges from its nominal configuration beyond
prescribed limits or in excess of a prescribed duration.  What we're
looking for is not necessarily a table which is accurate immediately,
but one which won't entirely miss a connection.  Even then, if it only
misbehaves under extreme load, that would be OK; such extreme usage
might be worthy of note in and of itself.

Since we have converted to PostgreSQL we have not had this monitoring,
and folks are nervous that we will not detect a struggling middle tier
before it fails.  (Not something that happens often, but we really hate
having users tell us that something is broken, versus spotting the
impending failure and correcting it before it fails.)

> Having said that, though, I'd be pretty surprised if the stats
subsystem
> was dropping more than a small fraction of messages ---  I would
think
> that could only occur under very heavy load, and if that's your
normal
> operating state then it's time to upgrade your hardware ;- ).

We have a pair of database servers for our transaction repository.
Each has four Xeon processors.  One of these is Windows, one is Linux.
On the Windows machine, I see 10% CPU utilization.  On the Linux machine
I see a load average of 0.30.  The Linux machine seems to be very
reliable about showing the connections.  The Windows machine, when I
refresh a 20-connection pool, I either get no connections showing, or
only a few.

>  Maybe you
> should investigate a bit more closely to find out why it's dropping
so
> much.

It is probably related to something we've been seeing in the PostgreSQL
logs on the Windows servers:

[2006-04-03 08:28:25.990 ] 2072 FATAL:  could not read from statistics
collector pipe: No error
[2006-04-03 08:28:26.068 ] 2012 LOG:  statistics collector process (PID
3268) was terminated by signal 1

We're going to patch to try to capture more info from WinSock.

In src/port/pipe.c we plan to add before return ret in piperead():

if (ret == SOCKET_ERROR)
{
       ereport(LOG, (errmsg_internal("SOCKET ERROR: %ui",
WSAGetLastError())));
}

I hope to post more info, and possibly a patch, tomorrow.

-Kevin


Re: pg_stat_activity showing non-existent processes

От
Tom Lane
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> It is probably related to something we've been seeing in the PostgreSQL
> logs on the Windows servers:

> [2006-04-03 08:28:25.990 ] 2072 FATAL:  could not read from statistics
> collector pipe: No error
> [2006-04-03 08:28:26.068 ] 2012 LOG:  statistics collector process (PID
> 3268) was terminated by signal 1

We've heard reports of instability in the stats collector on Windows
before, though I'm not sure if this is exactly the symptom --- check
the list archives.  Nobody's been able to track it down yet.

            regards, tom lane

Re: pg_stat_activity showing non-existent processes

От
Tom Lane
Дата:
"Lane Van Ingen" <lvaningen@ESNCC.com> writes:
> Don't understand the 'target machine' message, either; in this case, we are
> running the application and the database server on the same box.

> 2006-04-04 03:12:05 FATAL:  could not read from statistics collector pipe:
> No error 2006-04-04 03:12:06 FATAL:  could not write to statistics collector
> pipe: No connection could be made because the target machine actively
> refused it.

I think that's Microsoftese for ECONNRESET, ie, the kernel bounced a
packet for lack of any listening process to deliver it to.  The real
question is what's causing the collector to fail (the "could not read").
While it'd be easy to make it retry read attempts, the reason it
considers that FATAL is that it really should never happen.  I'd like
to find out exactly what's happening before we try to fix it.

As Kevin mentioned, adding some more debug printout would be helpful.

            regards, tom lane

Re: pg_stat_activity showing non-existent processes

От
"Lane Van Ingen"
Дата:
Perhaps I might be able to help you track this problem down, but I could use
some help. The limited amount of discussion on this I Googled up didn't
help. We have never seen these messages before; we are assuming that their
appearance now (we have been running OK since last August) is related to two
recent changes we made to our Windows 2003 Server (SvcPk 1), PostgreSQL
8.0.4 installation:
  (1) started capturing statistics with the following config parms; all
other parms
      were installation defaults:
        debug_pretty_print        "on"
        log_min_duration_statement    "60"
        log_min_error_statement        "debug1"
        log_statement            "ddl"
        log_truncate_on_rotation    "on"
        stats_block_level            "on"
        stats_command_string        "on"
        stats_reset_on_server_start    "on"
        stats_row_level            "on"
        stats_start_collector        "on"
  (2) added additional application workload to this server

Don't understand the 'target machine' message, either; in this case, we are
running the application and the database server on the same box.

2006-04-04 03:12:05 FATAL:  could not read from statistics collector pipe:
No error 2006-04-04 03:12:06 FATAL:  could not write to statistics collector
pipe: No connection could be made because the target machine actively
refused it.
2006-04-04 04:16:58 FATAL:  could not read from statistics collector pipe:
No error 2006-04-04 04:16:58 FATAL:  could not write to statistics collector
pipe: No connection could be made because the target machine actively
refused it.
2006-04-04 05:47:26 FATAL:  could not read from statistics collector pipe:
No error 2006-04-04 05:47:27 LOG:  statistics collector process (PID 1776)
was terminated by signal 1

-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org]On Behalf Of Tom Lane
Sent: Tuesday, April 04, 2006 1:29 AM
To: Kevin Grittner
Cc: Jerry Sievers; pgsql-admin@postgresql.org; Peter Brant
Subject: Re: [ADMIN] pg_stat_activity showing non-existent processes


"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> It is probably related to something we've been seeing in the PostgreSQL
> logs on the Windows servers:

> [2006-04-03 08:28:25.990 ] 2072 FATAL:  could not read from statistics
> collector pipe: No error
> [2006-04-03 08:28:26.068 ] 2012 LOG:  statistics collector process (PID
> 3268) was terminated by signal 1

We've heard reports of instability in the stats collector on Windows
before, though I'm not sure if this is exactly the symptom --- check
the list archives.  Nobody's been able to track it down yet.

            regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend



Re: pg_stat_activity showing non-existent processes

От
"Lane Van Ingen"
Дата:
Does anyone out there have a Microsoft installation, know how to get the
dump(s) desired to diagnose the problems below, know where dumps end up
being placed, and know how to interpret dumps?

I am willing to try to help solve these problems, but I need some help ...

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Tuesday, April 04, 2006 3:52 PM
To: Lane Van Ingen
Cc: Kevin Grittner; Jerry Sievers; pgsql-admin@postgresql.org; Peter
Brant
Subject: Re: [ADMIN] pg_stat_activity showing non-existent processes

"Lane Van Ingen" <lvaningen@ESNCC.com> writes:
> Don't understand the 'target machine' message, either; in this case, we
are
> running the application and the database server on the same box.

> 2006-04-04 03:12:05 FATAL:  could not read from statistics collector pipe:
> No error 2006-04-04 03:12:06 FATAL:  could not write to statistics
collector
> pipe: No connection could be made because the target machine actively
> refused it.

I think that's Microsoftese for ECONNRESET, ie, the kernel bounced a
packet for lack of any listening process to deliver it to.  The real
question is what's causing the collector to fail (the "could not read").
While it'd be easy to make it retry read attempts, the reason it
considers that FATAL is that it really should never happen.  I'd like
to find out exactly what's happening before we try to fix it.

As Kevin mentioned, adding some more debug printout would be helpful.

            regards, tom lane
-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org]On Behalf Of Lane Van Ingen
Sent: Tuesday, April 04, 2006 3:40 PM
To: Tom Lane; Kevin Grittner
Cc: Jerry Sievers; pgsql-admin@postgresql.org; Peter Brant
Subject: Re: [ADMIN] pg_stat_activity showing non-existent processes

Perhaps I might be able to help you track this problem down, but I could use
some help. The limited amount of discussion on this I Googled up didn't
help. We have never seen these messages before; we are assuming that their
appearance now (we have been running OK since last August) is related to two
recent changes we made to our Windows 2003 Server (SvcPk 1), PostgreSQL
8.0.4 installation:
  (1) started capturing statistics with the following config parms; all
other parms
      were installation defaults:
        debug_pretty_print        "on"
        log_min_duration_statement    "60"
        log_min_error_statement        "debug1"
        log_statement            "ddl"
        log_truncate_on_rotation    "on"
        stats_block_level            "on"
        stats_command_string        "on"
        stats_reset_on_server_start    "on"
        stats_row_level            "on"
        stats_start_collector        "on"
  (2) added additional application workload to this server

Don't understand the 'target machine' message, either; in this case, we are
running the application and the database server on the same box.

2006-04-04 03:12:05 FATAL:  could not read from statistics collector pipe:
No error 2006-04-04 03:12:06 FATAL:  could not write to statistics collector
pipe: No connection could be made because the target machine actively
refused it.
2006-04-04 04:16:58 FATAL:  could not read from statistics collector pipe:
No error 2006-04-04 04:16:58 FATAL:  could not write to statistics collector
pipe: No connection could be made because the target machine actively
refused it.
2006-04-04 05:47:26 FATAL:  could not read from statistics collector pipe:
No error 2006-04-04 05:47:27 LOG:  statistics collector process (PID 1776)
was terminated by signal 1

-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org]On Behalf Of Tom Lane
Sent: Tuesday, April 04, 2006 1:29 AM
To: Kevin Grittner
Cc: Jerry Sievers; pgsql-admin@postgresql.org; Peter Brant
Subject: Re: [ADMIN] pg_stat_activity showing non-existent processes


"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> It is probably related to something we've been seeing in the PostgreSQL
> logs on the Windows servers:

> [2006-04-03 08:28:25.990 ] 2072 FATAL:  could not read from statistics
> collector pipe: No error
> [2006-04-03 08:28:26.068 ] 2012 LOG:  statistics collector process (PID
> 3268) was terminated by signal 1

We've heard reports of instability in the stats collector on Windows
before, though I'm not sure if this is exactly the symptom --- check
the list archives.  Nobody's been able to track it down yet.

            regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend



---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq



Re: pg_stat_activity showing non-existent processes

От
Jim Nasby
Дата:
On Apr 4, 2006, at 1:29 AM, Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> It is probably related to something we've been seeing in the
>> PostgreSQL
>> logs on the Windows servers:
>
>> [2006-04-03 08:28:25.990 ] 2072 FATAL:  could not read from
>> statistics
>> collector pipe: No error
>> [2006-04-03 08:28:26.068 ] 2012 LOG:  statistics collector process
>> (PID
>> 3268) was terminated by signal 1
>
> We've heard reports of instability in the stats collector on Windows
> before, though I'm not sure if this is exactly the symptom --- check
> the list archives.  Nobody's been able to track it down yet.

BTW, I was (attempting) to do some performance testing on windows
Sunday with pgbench and I was getting an error similar to that (sorry
for being so vague, but I'm ~2000 miles from that server right now).
This was on a database created with a scale factor of 40 running on a
dual Xeon with plain SCSI drives (base tables on a seperate drive
from pg_xlog) running on XP with SP1. Any attempt to run pgbench with
40 connections for more than about 50 transactions would result in
that error (as well as some other ugliness).

Based just on that limited experience it looks like the windows port
could use a lot more shaking out with pgbench. Perhaps it would be a
good idea to add pgbench to the buildfarm test.
--
Jim C. Nasby, Database Architect                decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"




--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461



Re: pg_stat_activity showing non-existent processes

От
Jim Nasby
Дата:
On Apr 3, 2006, at 12:52 PM, Tom Lane wrote:

> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Is there any way to tweak this in favor of more accurate information,
>> even if has a performance cost?  We're finding that during normal
>> operations we're not seeing most connections added to the
>> pg_stat_activity table.  We would like to be able to count on
>> accurate
>> information there.
>
> That's basically a non-starter because of the delay in reporting from
> the stats collector process (ie, even if the information was
> "completely
> accurate" it'd still be stale by the time that your code gets its
> hands
> on it).  I think you'd be talking about a complete redesign of the
> stats
> subsystem to be able to use it that way.

BTW, there's some effort going into adding monitoring probes such as
dtrace to PostgreSQL. These would likely be ideal for what you're
trying to do.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461