Thread: postgres 8.1.4 deadlocking??


postgres 8.1.4 deadlocking??

From:
Rafal Pietrak
Date:
Hi,

I've run into the following problem (unfortunately, the database
contains sensitive customer information, so I can publish very little of it).

Currently a postgres process takes close to 100% CPU time.

I've restarted the process a moment ago, and it was calm for a brief
minute.

It started to consume the CPU once I issued:
    ALTER GROUP majster DROP USER rafal;
... which took close to 5 minutes to complete. Then I issued:
    VACUUM ANALYZE ;
... which I had to cancel when it didn't complete within another
10 minutes. CPU utilisation remained at 100% even after the VACUUM was
canceled.

pg_dumpall was successful during the high-CPU-usage period.

...another postgres restart, and another VACUUM ANALYZE; this one
completed in 3-5 secs. Now "ALTER GROUP ..." goes OK. And everything
seems to be working.

the database is currently almost empty: about 100k records across about
20 tables, with another 30 or so views. There are about 6k user IDs
(roles) created in the cluster.

Any ideas where I should look (system tables? process stats?) to
diagnose the problem? ... just to be prepared for when it appears again?

This is postgresql 8.1.4 on Debian Linux.
--
-R

Re: postgres 8.1.4 deadlocking??

From:
Casey Duncan
Date:
On Sep 27, 2006, at 7:28 AM, Rafal Pietrak wrote:

> Hi,
>
> I've run into the following problem (unfortunately, the database
> contains sensitive customer information, so I can publish very little
> of it).
>
> Currently a postgres process takes close to 100% CPU time.
>
> I've restarted the process a moment ago, and it was calm for a brief
> minute.
>
> It started to consume the CPU once I issued:
>     ALTER GROUP majster DROP USER rafal;
> ... which took close to 5 minutes to complete. Then I issued:

Sounds like it was blocked (unsure by what). You can use pg_locks to
check that.
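For example, something like this lists backends whose lock requests have not been granted (a sketch against the 8.1 catalogs; `pid` is the blocked backend's process id):

```sql
-- Ungranted lock requests, with the relation name where one applies.
-- A row here means that backend is waiting on a lock someone else holds.
SELECT l.pid, l.locktype, l.mode, c.relname
FROM pg_locks l
LEFT JOIN pg_class c ON c.oid = l.relation
WHERE NOT l.granted;
```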

>     VACUUM ANALYZE ;
> ... which I had to cancel when it didn't complete within another
> 10 minutes. CPU utilisation remained at 100% even after the VACUUM was
> canceled.

VACUUM sometimes hangs around for a while after it is cancelled,
which can be annoying. I think this has been improved in 8.2, FWIW.

What was consuming all the CPU? I assume from what you wrote
previously that it is a postgres backend. If so, what was it doing?
You can use ps, the pg_stat_activity view and strace to figure that out.

> pg_dumpall was successful during the high-CPU-usage period.
>
> ...another postgres restart, and another VACUUM ANALYZE; this one
> completed in 3-5 secs. Now "ALTER GROUP ..." goes OK. And everything
> seems to be working.

Was something else using the database previously? Was anything
different this time compared to the last?

>
> the database is currently almost empty: about 100k records across
> about 20 tables, with another 30 or so views. There are about 6k
> user IDs (roles) created in the cluster.

How big is the data on disk? Is it possible that you have queries
scanning tables containing lots of dead tuples? If so a VACUUM FULL
would help, but that's totally speculative.
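To check that speculation first: 8.1 has no per-table dead-row view, but VACUUM VERBOSE reports the counts directly. A sketch, where `mytable` is a placeholder name:

```sql
-- Report live/dead row counts for one table as INFO messages:
VACUUM VERBOSE mytable;
-- If dead rows dominate, compact the table in place
-- (VACUUM FULL takes an exclusive lock, so do it in a quiet moment):
VACUUM FULL mytable;
```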

> Any ideas where I should look (system tables? process stats?) to
> diagnose the problem? ... just to be prepared for when it appears again?

Postgres itself should not spontaneously start using all of the cpu.
Some query operations must be running unless there is an underlying
problem (hardware, os, fs). When you saw this happen did you happen
to check if it was user or system cpu usage? I had a corrupt
filesystem once that caused pg to go off into the weeds consuming all
cpu forever, but it was all *system* cpu time. Also I couldn't shut
pg down because it was presumably caught in an eternal system call
unable to respond to signals. Your description above sounds different
than this to me, however.

-Casey



Re: postgres 8.1.4 deadlocking??

From:
Rafal Pietrak
Date:
Oops, I missed the list recipient itself.

Thanks, Duncan, for the analysis.

This happened again, so I'm able to look at the details you pointed
out.

On Wed, 2006-09-27 at 09:33 -0700, Casey Duncan wrote:
> Sounds like it was blocked (unsure by what). You can use pg_locks to
> check that.

That view currently contains 240 entries ... the source of the
blocking. Now I have to dig out the ultimate reason why those locks end
up there (and are not being cleaned). Any ideas?

Would those locks go away by themselves ... in time?

> What was consuming all the CPU? I assume from what you wrote
> previously that it is a postgres backend. If so, what was it doing?

----------------ps output (just one backend)------------
postgres  4759  0.0  2.0  19368  5292 ?        S    21:02   0:00
postgres: czyn_738773 czyn 127.0.0.1(55431) UPDATE waiting
-------------------------------------------------------

"czyn" is the name of the database, "czyn_738773" is the database
username. The IP address is localhost, since the database is the
back-end for a web service.

Still, I'm not sure what that UPDATE could be.

> You can use ps, the pg_stat_activity view and strace to figure that out.

------------- excerpt from pg_stat_activity-----------------
 17660 | czyn |    4759 |    19541 | czyn_738773 | <command string not
enabled> |             | 2006-09-27 21:02:05.914728+02 | 127.0.0.1   |
55431
(18 rows)
------------------------------------------

But this is not very meaningful to me.

> > ...another postgress-process restart, and another VACCUM ANALYSE, this
> > one completes in 3-5 secs. Now "ALTER GROUP ..." goes OK. And
> > everything
> > seams working.
>
> Was something else using the database previously? Was something else
> different this time than the last?

No. My impression was (naive, I know) that not everything was correctly
cleaned up on the first restart. Now I think that on the second go I
was just a little luckier - the www service was calm for a little
longer. That was enough for VACUUM to complete.

Still, trouble started later.

> How big is the data on disk? Is it possible that you have queries

close to nothing: about 200MiB

> problem (hardware, os, fs). When you saw this happen did you happen
> to check if it was user or system cpu usage? I had a corrupt

It's the user CPU time.

As it currently looks like some unknown UPDATE is causing the trouble,
how can I identify what it is?
1. the database is not very sophisticated; apart from what I've
mentioned earlier, it has just a few stored procedures in PL/pgSQL -
nothing fancy. And there are no triggers, just specialized views
(although quite a lot of referential integrity constraints).
2. the machine is 'from the old days': 450MHz CPU
3. the system is currently open to the public, so there may be 'waves'
(peaks) of activity.
4. when the system was tested in-lab (small number of users - no risk /
not tested for heavier-load behavior), I didn't notice this 'blocking'.

Any ideas where I should look now, to pinpoint the problem?
--
-R

Re: postgres 8.1.4 deadlocking??

From:
Casey Duncan
Date:
On Sep 27, 2006, at 12:35 PM, Rafal Pietrak wrote:
> Thanks, Duncan, for the analysis.
>
> This happened again, so I'm able to look at the details you pointed
> out.
>
> On Wed, 2006-09-27 at 09:33 -0700, Casey Duncan wrote:
>> Sounds like it was blocked (unsure by what). You can use pg_locks to
>> check that.
>
> That view currently contains 240 entries ... the source of the
> blocking. Now I have to dig out the ultimate reason why those locks
> end up there (and are not being cleaned). Any ideas?

You need to look at the procpid that's holding the lock and then use
pg_stat_activity to see what it is doing. I'm not sure what locks
that query needs to hold, it may be blocked by any concurrent query
from a user in that group, I dunno.
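One way to pair waiters with holders in a single query (a sketch for 8.1, where pg_stat_activity's pid column is named procpid; locks are matched on the main columns that identify the locked object):

```sql
-- For each ungranted request, find a granted lock on the same object
-- and show what that holder's backend is currently running:
SELECT w.pid AS waiting_pid,
       h.pid AS holding_pid,
       a.usename,
       a.current_query
FROM pg_locks w
JOIN pg_locks h
  ON h.granted
 AND h.locktype = w.locktype
 AND h.relation      IS NOT DISTINCT FROM w.relation
 AND h.database      IS NOT DISTINCT FROM w.database
 AND h.transactionid IS NOT DISTINCT FROM w.transactionid
JOIN pg_stat_activity a ON a.procpid = h.pid
WHERE NOT w.granted;
```

Note that current_query stays "&lt;command string not enabled&gt;" until stats_command_string is turned on.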

>
> Would those locks go away by themselves ... in time?

Yeah, once the transactions holding the locks complete.

>
>> What was consuming all the CPU? I assume from what you wrote
>> previously that it is a postgres backend. If so, what was it doing?
>
> ----------------ps output (just one backend)------------
> postgres  4759  0.0  2.0  19368  5292 ?        S    21:02   0:00
> postgres: czyn_738773 czyn 127.0.0.1(55431) UPDATE waiting
> -------------------------------------------------------
>
> "czyn" is the name of the database, "czyn_738773" is the database
> username. The IP address is localhost, since the database is the
> back-end for a web service.

If the update is waiting, it is blocked and should not be consuming
any noticeable cpu. Was this the one using lots of CPU at that moment?

>
> Still, I'm not sure what that UPDATE could be.
>
>> You can use ps, the pg_stat_activity view and strace to figure
>> that out.
>
> ------------- excerpt from pg_stat_activity-----------------
>  17660 | czyn |    4759 |    19541 | czyn_738773 | <command string not
> enabled> |             | 2006-09-27 21:02:05.914728+02 | 127.0.0.1   |
> 55431
> (18 rows)
> ------------------------------------------
>
> But this is not very meaningful to me.

You need to add this to your postgresql.conf:

stats_command_string = on

Then send a HUP signal to the postmaster (you can use pg_ctl reload
for this, or 'kill -HUP').

Then you can see what the exact command is.

Another option that may be helpful to diagnose is:

log_min_duration_statement = 500

This will output the sql for all commands taking longer than 500ms.
You can adjust this up and down depending on what counts as "slow" in
your app. You probably only want it on temporarily, but I find it
sometimes helpful to always leave it on with a suitably high value
(like 1000). That way you can always troll the logs for laggards.
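Putting the two settings together, a postgresql.conf fragment for this kind of diagnosis might look like this (the 1000 ms threshold is a judgment call, per the above):

```
# Make pg_stat_activity.current_query show the actual statement
# instead of "<command string not enabled>":
stats_command_string = on

# Log the SQL of any statement that runs longer than 1000 ms:
log_min_duration_statement = 1000
```

Both take effect on a config reload; no server restart needed.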

>
>>> ...another postgres restart, and another VACUUM ANALYZE; this one
>>> completed in 3-5 secs. Now "ALTER GROUP ..." goes OK. And everything
>>> seems to be working.
>>
>> Was something else using the database previously? Was something else
>> different this time than the last?
>
> No. My impression was (naive, I know) that not everything was
> correctly cleaned up on the first restart. Now I think that on the
> second go I was just a little luckier - the www service was calm for
> a little longer. That was enough for VACUUM to complete.

Yeah, it sounds like things are fairly busy given the locks above.

>
> Still, trouble started later.
>
>> How big is the data on disk? Is it possible that you have queries
>
> close to nothing: about 200MiB

Sounds like the problem is more about concurrency than data size. How
many concurrent connections do you have (select count(*) from
pg_stat_activity)? How many are in transaction and running queries
concurrently?

If you see multiple instances of a given query (or the same one
appearing frequently), try running EXPLAIN ANALYZE on it. That'll
tell you what makes it slow.
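For instance, if the logs kept showing the same UPDATE, it could be checked like this (the table and column names here are made-up stand-ins, not from the czyn schema):

```sql
-- Run the suspect statement under EXPLAIN ANALYZE inside a transaction
-- so the data change can be rolled back afterwards:
BEGIN;
EXPLAIN ANALYZE
UPDATE accounts SET balance = balance - 1
WHERE user_name = 'czyn_738773';
ROLLBACK;
```

A Seq Scan node with a large actual row count in the output usually points at a missing index.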

>
>> problem (hardware, os, fs). When you saw this happen did you happen
>> to check if it was user or system cpu usage? I had a corrupt
>
> It's the user CPU time.

Yeah, just sounds like a scaling issue. Hopefully easily solved ;^)

>
> As it currently looks like some unknown UPDATE causes the trouble, how
> can I identify what it is?

See the config recommendations above.

> 1. the database is not very sophisticated; apart from what I've
> mentioned earlier, it has just a few stored procedures in PL/pgSQL -
> nothing fancy. And there are no triggers, just specialized views
> (although quite a lot of referential integrity constraints).

These can be a source of lock contention on updates (though this is
much better in 8.1 compared to 7.4). Once you know which update query
is getting blocked, you'll know for sure.

> 2. the machine is 'from the old days': 450MHz CPU

What is the iowait like on this box? You can use iostat to determine
this. That will tell you if the disks are also struggling.

> 3. the system is currently open to the public, so there may be
> 'waves' (peaks) of activity.
> 4. when the system was tested in-lab (small number of users - no risk /
> not tested for heavier-load behavior), I didn't notice this 'blocking'.

With any luck the above will help point out some suboptimal queries
that are causing the headache.

-Casey