Discussion: debug a lockup


debug a lockup

From: Scott Ribe
PostgreSQL appears to be locked up. A pgbench run that should have completed in a few seconds has been running for 14 hours. A psql invocation locks up. No CPU usage is showing in top.

I personally suspect infra issues (k8s pod, Pure block storage), but I'm getting pushback pointing the finger at PG. It's 18.1, and pgbench is the only client, FWIW.

Is there any way to introspect the current non-debug build to get a clue about what's going on in there?

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/






Re: debug a lockup

From: Tom Lane
Scott Ribe <scott_ribe@elevated-dev.com> writes:
> Any way to introspect the current non-debug build to get a clue what's going on in there?

Backend stack traces taken with gdb should yield at least some clue
even if you don't have debug symbols.

            regards, tom lane
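A minimal sketch of the gdb approach, assuming a Linux host with gdb installed and permission to ptrace the backend (root or CAP_SYS_PTRACE may be needed, depending on kernel.yama.ptrace_scope and the container's security settings). The helper names gdb_backtrace_cmd and backtrace are hypothetical. gdb's -p option attaches to a running PID, -batch makes it exit after running the -ex commands, and bt prints the stack; gdb detaches when it exits, so the backend keeps running.

```python
import shutil
import subprocess


def gdb_backtrace_cmd(pid: int) -> list[str]:
    """Build a non-interactive gdb command that attaches to `pid`,
    prints its stack trace, and exits (detaching from the process)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "bt"]


def backtrace(pid: int, timeout: float = 60.0) -> str:
    """Run gdb against `pid` and return whatever it printed.

    Assumes gdb is on PATH and that we are allowed to ptrace the target.
    Without debug symbols, frames show only exported function names.
    """
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb not found on PATH")
    result = subprocess.run(
        gdb_backtrace_cmd(pid), capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr
```

For example, print(backtrace(12345)) with the PID of a stuck backend taken from ps; taking a second trace a few seconds later shows whether the process is making any progress.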



Re: debug a lockup

From: Aislan Luiz Wendling
Hello,

Does it repeat on every run?

If possible, try to stop PostgreSQL gracefully.

Not working? Try an immediate stop and, as a last resort, an abort.

If the postgres service does not stop, try to kill the pgbench process.
First try kill -15 <pgbench PID>, and if that does not work, kill -9.

If nothing works, reboot the VM.
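The kill -15 then kill -9 escalation can be sketched as below, assuming Linux (liveness is checked through /proc) and a hypothetical helper name, stop_process. A zombie counts as gone, since it has already exited and is only waiting for its parent to reap it. SIGKILL is not instantaneous in every case: a process in uninterruptible sleep (state D, often storage I/O) may survive even kill -9 until that I/O completes.

```python
import os
import signal
import time


def _alive(pid: int) -> bool:
    """True if pid exists and is not a zombie (reads /proc/<pid>/stat; Linux only)."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            # The state letter is the first field after the ")" that closes comm.
            state = f.read().rsplit(")", 1)[1].split()[0]
    except (FileNotFoundError, ProcessLookupError):
        return False
    return state != "Z"


def stop_process(pid: int, grace: float = 10.0) -> str:
    """Send SIGTERM (kill -15); if the process is still alive after `grace`
    seconds, send SIGKILL (kill -9). Returns "SIGTERM" if the process exited
    within the grace period, otherwise "SIGKILL" (which is not waited on)."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not _alive(pid):
            return "SIGTERM"
        time.sleep(0.1)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "SIGTERM"  # it exited just after the grace period ran out
    return "SIGKILL"
```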

Open two terminals and start the pgbench process in one. In the other, run ps -ef | grep pgbench.

Find the parent process ID and run strace -f -p <PID> (the syntax may differ on your system; the point is to trace a process and its forks).

It shows which system call the process is waiting in. Normally the output scrolls too fast to read; when it stops, the process is waiting for something.
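On Linux, the same questions (which PIDs belong to pgbench, what their parent is, and whether they are running or sleeping) can also be answered from /proc without strace. A minimal sketch, assuming Linux; proc_stat and pids_named are hypothetical helper names. The state letter is R (running), S (interruptible sleep, e.g. waiting on a socket or a full pipe), or D (uninterruptible sleep, typically disk or storage I/O), among others, so a hung process sitting in D rather than S tends to point toward storage.

```python
import os


def proc_stat(pid: int) -> tuple[str, str, int]:
    """Return (comm, state, ppid) parsed from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # comm sits inside parentheses and may itself contain spaces or ")",
    # so split around the LAST ")" rather than on whitespace.
    lpar = data.index("(")
    rpar = data.rindex(")")
    comm = data[lpar + 1 : rpar]
    fields = data[rpar + 2 :].split()
    return comm, fields[0], int(fields[1])


def pids_named(name: str) -> list[int]:
    """Similar in spirit to `ps -ef | grep name`, but matches the exact
    process name (comm). The kernel truncates comm to 15 characters."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            comm, _state, _ppid = proc_stat(int(entry))
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # the process exited (or is hidden) while we scanned
        if comm == name:
            pids.append(int(entry))
    return sorted(pids)
```

For example, `for pid in pids_named("pgbench"): print(pid, proc_stat(pid))` lists each pgbench process with its state and parent PID.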


Hope it helps.

ALW

From: Scott Ribe <scott_ribe@elevated-dev.com>
Sent: Tuesday, February 10, 2026 11:55 AM
To: Pgsql-admin <pgsql-admin@lists.postgresql.org>
Subject: debug a lockup
 





Re: debug a lockup

From: Scott Ribe
OK, we figured it out--I think.

pgbench was stuck in restart_syscall(<...resuming interrupted read...

It was set to open 100 connections.

There were ~20 PG sessions in idle, and the last one (highest PID) was in auth.

That one was in a write to fd 2 (stderr).

So... this is running in Kubernetes. I was doing some load testing against a storage service (thus 100 connections). PG
was launched manually in a bash session connected to the pod, in k9s. There were ~20 bash sessions open in k9s in total,
across 15 nodes.

Theory: k9s glitched and stopped reading the piped file descriptor, the buffer filled, and PG blocked on the write. (I have
seen prior evidence of less-than-perfect handling of output by k9s.) In particular, I had connection logging on, so
at auth it would have been writing to stderr.
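The theory rests on ordinary pipe semantics: a pipe has a fixed-size kernel buffer, and once the reader stops draining it, a blocking write() sleeps until space frees up. A minimal sketch that measures that buffer by filling a pipe nobody reads; pipe_capacity is a hypothetical helper name. It switches the write end to non-blocking so the demo gets BlockingIOError (EAGAIN) instead of hanging the way a blocking write to stderr would.

```python
import os


def pipe_capacity() -> int:
    """Fill a pipe that nobody reads and return how many bytes fit
    before a further write would block."""
    r, w = os.pipe()
    try:
        # Non-blocking, so a full pipe raises BlockingIOError instead of
        # sleeping forever like a normal (blocking) write would.
        os.set_blocking(w, False)
        total = 0
        chunk = b"x" * 4096
        while True:
            try:
                total += os.write(w, chunk)
            except BlockingIOError:
                return total
    finally:
        os.close(r)
        os.close(w)
```

On Linux, print(pipe_capacity()) typically reports 65536 (the default 64 KiB pipe buffer), so once roughly that much unread log output has accumulated, the next blocking write to stderr stalls the writer.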

This happened in one of probably over 100 runs of the same test, so it is not readily reproducible, and I wanted to autopsy it
before killing off the hung processes. Unless someone pokes a hole in my theory, at this point I think neither
pgbench nor PG nor Pure/Portworx is at fault.
