Thread: "pg_ctl stop" stops working after a backend crash

"pg_ctl stop" stops working after a backend crash

From
Alexander Kuzmenkov
Date:
Hi hackers,

I noticed that sometimes, when I'm running the regression tests and a
backend crashes, the postmaster can get stuck in some weird state
where it doesn't terminate and doesn't respond to `pg_ctl stop`
anymore. I can semi-reliably reproduce this on 18.3 using a simple
script below.

```
#!/usr/bin/env bash
# Needs bash: the "&>" redirection below is not POSIX sh.
set -e
pg_ctl start -w

psql -X -d postgres -c "SELECT pg_sleep(60)" &>/dev/null &
sleep 0.3
# Exclude this session: its own query text also matches '%pg_sleep%'.
VICTIM=$(psql -X -d postgres -tAc \
    "SELECT pid FROM pg_stat_activity
     WHERE query LIKE '%pg_sleep%' AND pid <> pg_backend_pid() LIMIT 1")

kill -9 "$VICTIM"           # trigger crash recovery
sleep 0.01                  # let postmaster start reinitializing
timeout 8 pg_ctl stop -m fast &
STOP_PID=$!

sleep 6
if kill -0 "$STOP_PID" 2>/dev/null; then
    echo "pg_ctl stop froze. Active processes:"
    pgrep -al postgres
    exit 1
fi
echo "Successful shutdown"
```

A typical output of this is:

598814 postgres
598864 postgres: io worker 0
598865 postgres: io worker 1
598866 postgres: io worker 2
598868 postgres: checkpointer


These processes stay there indefinitely, but the shutdown completes
once I run `pkill -USR2 postgres`.

An unresponsive `pg_ctl stop` looks like a bug, so I thought I should
ask for your comments on this.


Best regards
Alexander Kuzmenkov
TigerData



Re: "pg_ctl stop" stops working after a backend crash

From
Tom Lane
Date:
Alexander Kuzmenkov <akuzmenkov@tigerdata.com> writes:
> I noticed that sometimes, when I'm running the regression tests and a
> backend crashes, the postmaster can get stuck in some weird state
> where it doesn't terminate and doesn't respond to `pg_ctl stop`
> anymore. I can semi-reliably reproduce this on 18.3 using a simple
> script below.

I experimented with this a bit.  I failed to reproduce it with
your example, but it did happen once I reduced the "sleep 0.01"
to "sleep 0.001".  So apparently, the postmaster misbehaves if
the stop signal arrives soon enough after a child crash (and
the window is tight enough that it's not too surprising we
hadn't noticed).  Didn't look at the logic.

            regards, tom lane
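
Since the hang appears to depend on how soon the stop request follows the
crash, one way to probe the window is to repeat the reproduction with
several delays. The following is a minimal sketch, not a tested harness:
`sweep_delays` and `repro_once` are hypothetical helper names, the
`pg_backend_pid()` filter is an addition to keep the lookup from matching
its own session, and it assumes PGDATA is set, `pg_ctl`, `psql` and
`timeout` are on PATH, and `sleep` accepts fractional seconds.

```
#!/usr/bin/env bash
# Sketch only: sweep the delay between the backend kill and "pg_ctl stop".

# sweep_delays REPRO DELAY...
# Calls "REPRO DELAY" once per delay. REPRO must return 0 if the shutdown
# completed, 1 if pg_ctl stop hung, and anything else on a setup error.
sweep_delays() {
    local repro=$1 delay status
    shift
    for delay in "$@"; do
        status=0
        "$repro" "$delay" || status=$?
        case $status in
            0) echo "delay=$delay: shutdown completed" ;;
            1) echo "delay=$delay: pg_ctl stop hung" ;;
            *) echo "delay=$delay: setup error ($status)" ;;
        esac
    done
}

# repro_once DELAY
# One crash-then-stop attempt, adapted from the script earlier in the thread.
repro_once() {
    local delay=$1 victim stop_pid
    pg_ctl start -w >/dev/null 2>&1 || return 2
    psql -X -d postgres -c "SELECT pg_sleep(60)" >/dev/null 2>&1 &
    sleep 0.3
    victim=$(psql -X -d postgres -tAc \
        "SELECT pid FROM pg_stat_activity
         WHERE query LIKE '%pg_sleep%' AND pid <> pg_backend_pid() LIMIT 1")
    if [ -z "$victim" ]; then
        pg_ctl stop -m fast >/dev/null 2>&1   # no victim found, clean up
        return 2
    fi
    kill -9 "$victim" || return 2             # trigger crash recovery
    sleep "$delay"                            # the delay under test
    timeout 8 pg_ctl stop -m fast >/dev/null 2>&1 &
    stop_pid=$!
    sleep 6
    if kill -0 "$stop_pid" 2>/dev/null; then
        # Workaround reported in the thread; note that it signals every
        # process named "postgres" on the machine.
        pkill -USR2 postgres
        wait "$stop_pid" || true              # let the pending stop finish
        return 1
    fi
    return 0
}
```

It could be invoked as `sweep_delays repro_once 0.01 0.005 0.001`, which
prints one result line per delay. Taking the reproducer as an argument
keeps the loop separate from the PostgreSQL-specific steps, so a different
reproduction can be swapped in.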