Обсуждение: [BUGS] postgresql hanging/stuck
The timeline below describes the issue we are having:
1. We are running 9.3.16 on Redhat 7.1 3.10.0-229.14.1.el7.x86_64
2. 500 max connection limit
3. 2 CPUs x 18 cores, hyperthreaded (72 threads), 512GB RAM
4. 80% OLTP, 20% OLAP
Until late February we didn't encounter any issues on this server beyond normal performance troubleshooting.
Since then we had 2 incidents where PostgreSQL connections went to about 300 and PostgreSQL became unresponsive and restarted itself.
We have monitors setup to page us when connections are over 250. PostgreSQL should still have about 250 connections available.
When this occurred, we saw a lot of connections in the authentication state (stuck)
They all looked like: postgres: username databasename ipaddress (pid) authentication
We tried killing some of the idle connections, but these attempts were unsuccessful.
No new connections could be established until the number of connections in the authentication state climbed past 500, and we were forced to restart the engine by killing the PostgreSQL (postmaster) PID.
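For anyone trying to reproduce this, the stuck backends can be counted from their process titles; on 9.3 the postmaster sets the title before authentication completes, so ps sees them even when they are not yet fully visible in pg_stat_activity. The pattern below is illustrative, not the exact command we ran:

```shell
# Count backends whose process title shows the "authentication" state,
# e.g. "postgres: username databasename ipaddress (pid) authentication"
ps -eo args | grep -c 'postgres: .* authentication'
```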
Both incidents came 2 weeks apart and other than our normal processing we didn't find any correlations on why this is happening.
During the incident (about 10-15min timeline)
- CPU usage is high (60-70%), but the box is still very responsive
- Memory allocations are ok, no paging. Nothing looks out of norm.
- We have our normal scenario engine running with about 30-40 active connections + some 5-10 active reporting and processing connections at a time (lots of quick/short queries)
- When we hit ~300 connections and PostgreSQL becomes unresponsive, it also stops writing to its log
- All new connections get stuck in the "authentication" state and queue up
- Kernel logs are clean, no system panic, nothing out of norm
We don't understand why so many connections would be stuck in the authentication state.
This server has not been modified other than the quarterly PostgreSQL update. Also, the server was rebooted just 2 days before the 2nd incident because of storage maintenance.
PostgreSQL Configuration CUSTOM settings:
shared_buffers = 8GB
work_mem = 10MB
maintenance_work_mem = 1GB
wal_level = hot_standby
checkpoint_segments = 32
checkpoint_timeout = 10min
checkpoint_completion_target = 0.9
checkpoint_warning = 30s
archive_mode = on
max_wal_senders = 2
effective_cache_size = 393216MB
Please let me know if there is anything I can do to alleviate this issue.
Hi,
On 2017-03-14 16:47:40 -0400, Andrzej Pilacik wrote:
> Since then we had 2 incidents where PostgreSQL connections went to about
> 300 and PostgreSQL became unresponsive and restarted itself.
Could you give a bit more detail about how that "restarted itself"
happened? Any chance you saw something about a "stuck spinlock" in the
logs?
> We have monitors setup to page us when connections are over 250.
> PostgreSQL should still have about 250 connections available.
>
> When this occurred, we saw a lot of connections in the authentication state
> (stuck)
>
> *They all looked like: postgres: username databasename ipaddress (pid)
> authentication*
The symptoms suggest you're hitting quite massive lock contention. But
that's just a guess, not yet enough information to nail that down. If
that's the case, an update to 9.6 might resolve the issue; we've made two
significant scalability improvements that could affect your case.
> During the incident (about 10-15min timeline)
>
> - CPUs are running high but the box is still very responsible. 60-70%
>
> - Memory allocations are ok, no paging. Nothing looks out of norm.
Have you checked that you're not running into issues with transparent
hugepages? The symptoms don't quite match, but it's still useful to make
sure. I don't know which kernel / distribution you're running, but a
search for "disable transparent hugepages $distro" should help.
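On RHEL 7 the check might look like the following (sysfs path assumed; the bracketed value is the active setting):

```shell
# Show the current transparent hugepage setting; "[never]" means disabled
cat /sys/kernel/mm/transparent_hugepage/enabled

# Disable until the next reboot (run as root); make it persistent via a
# boot-time script or kernel command line
echo never > /sys/kernel/mm/transparent_hugepage/enabled
```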
> - We have our normal scenario engine running with about 30-40 active
> connections + some 5-10 active reporting and processing connections at a
> time (lots of quick/short queries)
Is there a chance you could provide a profile using perf? See
https://wiki.postgresql.org/wiki/Profiling_with_perf
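A minimal system-wide capture during an incident, along the lines of that wiki page, might be:

```shell
# Sample call graphs across all CPUs for 30 seconds during the incident
perf record -a -g -- sleep 30

# Browse the recorded samples; run as the same user that did the record
perf report -n
```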
Greetings,
Andres Freund