Обсуждение: Sudden slowdown of Pg server

Поиск
Список
Период
Сортировка

Sudden slowdown of Pg server

От
Jerry Sievers
Дата:
Hello;

I am going through a post mortem analysis of an infrequent but
recurring problem on a Pg 8.0.3 installation.  Application code
connects to Pg using J2EE pooled connections.

 PostgreSQL 8.0.3 on sparc-sun-solaris2.9, compiled by GCC sparc-sun-solaris2.8-gcc (GCC) 3.3.2

Database is quite large with respect to the number of tables, some of
which have up to 6 million tuples.  Typical idle/busy connection ratio
is 3/100 but occationally we'll catch 20 or more busy sessions.

The problem manifests itself and appears like a locking issue.  About
weekly throuput slows down and we notice the busy connection count
rising minute by minute.  2, 20, 40...  Before long, the app server
detects lack of responsiveness and fails over to another app server
(not Pg) which in turn attempts a bunch of new connections into
Postgres.

Sampling of the snapshots of pg_locks and pg_stat_activity tables
takes place each minute.

I am wishing for a few new ideas as to what to be watching; Here's
some observations that I've made.

1. At no time do any UN-granted locks show in pg_locks
2. The number of exclusive locks is small 1, 4, 8
3. Other locks type/mode are numerous but appear like normal workload.
4. There are at   least a few old '<IDLE> In Transaction' cases in
   activity view
5. No interesting error messages or warning in Pg logs.
6. No crash of Pg backend

Other goodies includes a bounty of poor performing queries which are
constantly being optimized now for good measure.  Aside from the heavy
queries, performance is generallly decent.

Resource related server configs have been boosted substantially but
have not undergone any formal R&D to verify that we're inthe safe
under heavy load.

An max_fsm_relations setting which is *below* our table and index
count was discovered by me today and will be increased this evening
during a maint cycle.

The slowdown and subsequent run-away app server takes place within a
small 2-5 minute window and I have as of yet not been able to get into
Psql during the event for a hands-on look.

Questions;

1. Is there any type of resource lock that can unconditionally block
   another session and NOT appear as UN-granted lock?

2. What in particular other runtime info would be most useful to
   sample here?

3. What Solaris side runtime stats might give some clues here
   (maybe?)( and how often to sample?  Assume needs to be aggressive
   due to how fast this problem crops up.

Any help appreciated

Thank you


--
-------------------------------------------------------------------------------
Jerry Sievers   305 854-3001 (home)     WWW ECommerce Consultant
                305 321-1144 (mobile    http://www.JerrySievers.com/

Re: Sudden slowdown of Pg server

От
"Jignesh K. Shah"
Дата:
lockstat is available in Solaris 9. That can help you to determine if
there are any kernel level locks that are occuring during that time.
Solaris 10 also has plockstat which can be used to identify userland
locks happening in your process.

Since you have Solaris 9, try the following:

You can run (as root)
lockstat sleep 5
and note the output which can be long.

I guess "prstat -am" output, "iostat -xczn 3", "vmstat 3" outputs will
help also.

prstat -am has a column called "LAT", if the value is in double digits,
then you have a locking issue which will probably result in higher "SLP"
value for the process.  (Interpretation is data and workload specific
which this email is too small to decode)

Once you have identified a particular process (if any) to be the source
of the problem, get its id and you can look at the outputs of following
command which (quite intrusive)
truss -c -p $pid   2> truss-syscount.txt

(Ctrl-C after a while to stop collecting)

truss -a -e -u":::" -p $pid 2> trussout.txt

(Ctrl-C after a while to stop collecting)

Regards,
Jignesh


Jerry Sievers wrote:

>Hello;
>
>I am going through a post mortem analysis of an infrequent but
>recurring problem on a Pg 8.0.3 installation.  Application code
>connects to Pg using J2EE pooled connections.
>
> PostgreSQL 8.0.3 on sparc-sun-solaris2.9, compiled by GCC sparc-sun-solaris2.8-gcc (GCC) 3.3.2
>
>Database is quite large with respect to the number of tables, some of
>which have up to 6 million tuples.  Typical idle/busy connection ratio
>is 3/100 but occationally we'll catch 20 or more busy sessions.
>
>The problem manifests itself and appears like a locking issue.  About
>weekly throuput slows down and we notice the busy connection count
>rising minute by minute.  2, 20, 40...  Before long, the app server
>detects lack of responsiveness and fails over to another app server
>(not Pg) which in turn attempts a bunch of new connections into
>Postgres.
>
>Sampling of the snapshots of pg_locks and pg_stat_activity tables
>takes place each minute.
>
>I am wishing for a few new ideas as to what to be watching; Here's
>some observations that I've made.
>
>1. At no time do any UN-granted locks show in pg_locks
>2. The number of exclusive locks is small 1, 4, 8
>3. Other locks type/mode are numerous but appear like normal workload.
>4. There are at   least a few old '<IDLE> In Transaction' cases in
>   activity view
>5. No interesting error messages or warning in Pg logs.
>6. No crash of Pg backend
>
>Other goodies includes a bounty of poor performing queries which are
>constantly being optimized now for good measure.  Aside from the heavy
>queries, performance is generallly decent.
>
>Resource related server configs have been boosted substantially but
>have not undergone any formal R&D to verify that we're inthe safe
>under heavy load.
>
>An max_fsm_relations setting which is *below* our table and index
>count was discovered by me today and will be increased this evening
>during a maint cycle.
>
>The slowdown and subsequent run-away app server takes place within a
>small 2-5 minute window and I have as of yet not been able to get into
>Psql during the event for a hands-on look.
>
>Questions;
>
>1. Is there any type of resource lock that can unconditionally block
>   another session and NOT appear as UN-granted lock?
>
>2. What in particular other runtime info would be most useful to
>   sample here?
>
>3. What Solaris side runtime stats might give some clues here
>   (maybe?)( and how often to sample?  Assume needs to be aggressive
>   due to how fast this problem crops up.
>
>Any help appreciated
>
>Thank you
>
>
>
>