Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica

Поиск
Список
Период
Сортировка
От Rui An
Тема Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica
Дата
Msg-id CABjTVMR39ci6LUVE2yWhpcufPee4nO3cu32sij3Yh8viksCztQ@mail.gmail.com
обсуждение исходный текст
Ответы Re: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica  (Andrey Borodin <x4mmm@yandex-team.ru>)
Re: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica  ("David G. Johnston" <david.g.johnston@gmail.com>)
Список pgsql-bugs
Hi, I have recently encountered the problem, as title, on some of my PostgreSQL 9.6.1 hot standby replicas when serving readonly queries.

The problem makes a lot of PostgreSQL processes stuck on waiting for MultiXactOffsetControlLock and eventually runs out of the max_connections quota.

For example:
mydb=# select count(*) from pg_stat_activity where state = 'active' and wait_event = 'MultiXactOffsetControlLock';
 count
------
    956
(1 row)

I have tried use pg_terminate_backend to kill the stuck prcoesses, but no help. They still stuck on waiting for MultiXactOffsetControlLock even if the pg_terminate_backend(pid) returns true.

Currently, the only thing I can do to resolve the problem is to restart my replicas.

Can someone help me find out how could the problem occurred and is there any other way to resolve the problem?

I have also recorded some strace and perf top output on stuck processes:

strace:
strace: Process 555 attached
select(0, NULL, NULL, NULL, {0, 78}       = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])               = 0
semop(60981276, [{0, -1, 0}, 1])               = 0
semop(61734963, [{14, -1, 0}, 1])             = 0
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
semop(61505580, [{12, -1, 0}, 1])             = 0
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])               = 0
semop(63471720, [{9, -1, 0}, 1])               = 0
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])               = 0
semop(63471720, [{9, -1, 0}, 1])               = 0
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])               = 0

perf top:
Samples: 208  of event 'cpu-clock', Event count (approx.): 36045672
Overhead  Shared Object   Symbol
    30.74%  postgres            [.] perform_spin_delay
    13.70%  [kernel]              [k] finish_task_switch
    13.66%  [kernel]              [k] 0x00007fff81838f35
      8.59%  postgres            [.] 0x0000000000319107
      4.02%  postgres            [.] 0x0000000000318f04
      2.47%  postgres            [.] pg_usleep
      2.10%  libc-2.23.so       [.] __select
      1.68%  postgres            [.] PGSemaphoreLock
      1.68%  postgres            [.] 0x0000000000318eb2
      1.54%  [kernel]              [k] pvclock_clocksource_read
      1.53%  postgres            [.] GetMultiXactIdMembers
      1.41%  postgres            [.] 0x0000000000318ea4
      1.31%  postgres            [.] LWLockAcquire
      1.30%  postgres            [.] 0x0000000000318e17
      1.23%  [kernel]              [k] do_select
      1.07%  postgres            [.] finish_spin_delay
      1.06%  [kernel]              [k] poll_freewait
      1.05%  libc-2.23.so       [.] semop
      0.96%  [kernel]              [k] poll_schedule_timeout
      0.72%  [kernel]              [k] SYSC_semtimedop
      0.69%  [kernel]              [k] core_sys_select
      0.69%  [kernel]              [k] pid_vnr
      0.61%  [kernel]              [k] dev_gro_receive
      0.61%  libc-2.23.so       [.] random
      0.61%  postgres            [.] 0x0000000000318e3d
      0.53%  [kernel]              [k] hrtimer_active
      0.53%  [kernel]              [k] ipc_obtain_object_check
      0.46%  postgres            [.] 0x0000000000318f00
      0.46%  postgres            [.] 0x0000000000319100
      0.46%  postgres            [.] 0x0000000000319103
      0.41%  postgres            [.] LWLockRelease
      0.41%  postgres            [.] 0x0000000000318f85
      0.36%  [kernel]              [k] run_timer_softirq
      0.36%  [kernel]              [k] sys_semop

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: CREATE INDEX CONCURRENTLY does not index prepared xact's data
Следующее
От: Andrey Borodin
Дата:
Сообщение: Re: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica