BUG #15124: standby hung, recovering process stuck waiting,possible undetected deadlock

Поиск
Список
Период
Сортировка
От PG Bug reporting form
Тема BUG #15124: standby hung, recovering process stuck waiting,possible undetected deadlock
Дата
Msg-id 152165774326.9520.987533011190563481@wrigleys.postgresql.org
обсуждение исходный текст
Список pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15124
Logged by:          Clinton Adams
Email address:      clinton.adams@gmail.com
PostgreSQL version: 9.6.6
Operating system:   CentOS Linux release 7.4.1708
Description:

One of our streaming standbys built up a wait queue behind the recovering
process, which rendered db inaccessible with all non-idle user processes
stuck in 'startup waiting'.

Some master log snippets:
2018-03-20 17:06:10.987 EDT,,,12954,,5a5ec246.329a,3015,,2018-01-16 22:25:58
EST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2018-03-20 17:29:24.490 EDT,,,5197,,5ab17d34.144d,1,,2018-03-20 17:29:24
EDT,113/10510168,2787541257,LOG,00000,"automatic vacuum of table
""ateb.pg_catalog.pg_index"": index scans: 1
2018-03-20 17:30:19.876 EDT,,,12954,,5a5ec246.329a,3016,,2018-01-16 22:25:58
EST,,0,LOG,00000,"checkpoint complete: wrote 4045 buffers (0.4%); 0
transaction log file(s) added, 68 removed, 0 recycled; write=1448.033 s,
sync=0.318 s, total=1448.888 s; sync files=1513, longest=0.003 s,
average=0.000 s; distance=1800115 kB, estimate=23855600 kB",,,,,,,,,""

Standby has -
max_standby_streaming_delay = 4h
max_standby_archive_delay = 8h
hot_standby_feedback = off
wal_keep_segments = 2000
hot_standby = on

All instances have
synchronous_commit = off 
commit_delay is not being used

Standby log showing first wait:
2018-03-20 17:29:34.588
EDT,"XXXXX","XXXXX",17933,"XX.XX.XX.XX:48164",5ab17d34.460d,1,"SELECT
waiting",2018-03-20 17:29:24 EDT,3/34326918,0,LOG,00000,"process 17933 still
waiting for AccessShareLock on relation 2610 of database 16403 after
10000.164 ms","Process holding the lock: 30493. Wait queue: 17933.",,,,,"

pid 30493 - 
postgres 30493  1.0 17.4 9019072 8594764 ?     Ss   Jan16 942:24 postgres:
startup process   recovering 0000000100009EFF00000035 waiting

relation 2610 is pg_index

Standby log showing wait queue:
2018-03-20 19:49:55.845
EDT,"XXXXX","XXXXX",9129,"XX.XX.XX.XX:57952",5ab18d0c.23a9,2,"startup
waiting",2018-03-20 18:37:00 EDT,177/101,0,LOG,00000,"process 9129 still
waiting for AccessShareLock on relation 2610 of database 16403 after
4375401.381 ms","Process holding the lock: 30493. Wait queue: 17933, 18459,
18515, 20148, 20196, 20446, 20460, 20474, 20475, 20476, 20477, 20478, 20479,
20890, 21076, 21104, 21234, 21493, 21505, 22736, 24478, 24744, 24757, 24894,
25140, 25152, 26389, 26396, 26397, 26398, 26400, 26399, 26401, 28159, 28433,
28455, 28580, 28850, 28862, 30119, 31929, 31931, 31943, 31956, 31968, 31969,
31970, 31971, 31972, 31974, 31988, 32213, 32260, 32272, 32411, 32617, 32678,
32691, 1590, 3466, 3480, 3492, 3777, 3789, 3929, 4192, 4204, 5515, 5524,
5523, 5525, 5526, 5527, 7377, 7390, 7403, 7687, 7700, 7824, 8102, 8115,
9368, 11277, 11293, 11309, 11666, 11705, 11718, 11843, 24149, 12132, 12144,
13473, 14743, 15002, 15393, 15405, 15418, 15711, 15723, 15860, 16134, 16146,
17468, 19406, 19424, 19440, 19732, 19745, 19883, 20157, 20169, 21513, 23420,
23434, 23447, 23753, 23794, 23806, 23944, 24218, 24230, 25568, 27492, 27506,
27519, 27825, 27851, 27973, 28260, 28272, 29625, 31564, 31578, 31591, 31877,
31890, 32025, 32305, 32317, 1227, 11021, 2268, 3091, 3121, 3135, 3422, 3461,
3475, 3607, 3883, 3895, 4719, 5058, 5461, 7073, 7086, 7099, 7384, 7399,
7523, 7803, 7815, 9129, 10997, 11015, 11029, 11316, 11330, 11458, 11737,
11749, 13057, 14927, 14940, 14959, 15290, 15302, 15437, 15706, 15719, 16275,
16653, 17029, 17044, 18521, 18869, 18887, 18900, 19165, 19178, 19303, 19569,
19582, 19950, 20874, 22702, 22714, 22727, 23003, 23015, 23146, 23408, 23421,
24686, 26545, 26599, 26612, 26938, 26951, 27075, 27355, 27367, 28248, 28703,
30479, 30495, 30511, 30791, 30792, 30929, 31175, 31187, 32482, 1852, 1873,
1886, 2151, 2164, 2289, 2555, 2567, 3727, 5612, 5626, 5640, 5961, 5973,
6110, 6356, 6368, 7647, 9407, 9425, 9438, 9704, 9716, 9840, 10099, 10111,
11388, 13169, 13183, 13195, 13460, 13476, 13612, 13857, 13869, 15161, 15413,
15444, 15758, 15774, 15816, 15817, 15819, 15820, 16785, 16975, 17002, 17015,
17336, 17349, 17474, 17746, 17758, 19060, 20888, 20903, 20916, 21193, 21208,
21331, 21602, 21614, 22905, 22980, 23255, 24320, 24958, 24977, 24989, 25294,
25306, 25437, 25729, 25741, 27109, 29150, 29169, 29186, 29499, 29539, 29551,
29676, 29930, 29973, 29985, 31364.",,,,,,,,"psql"

I confirmed that there were no idle in transactions on the standby. I could
not get access to db to query pg_locks view. Before restarting the standby
as last resort, there were 280 user processes in 'startup waiting', 7
'idle', and 3 in 'SELECT waiting'.

After restart of standby operations returned to normal, retrieving WAL
segments from archive and then going back to streaming. Other streaming
standbys did not show the problem.

An update to 9.6.8 is planned, but I do not see anything applicable to this
in the release notes. This is possibly related to "BUG #13657: Some kind of
undetected deadlock between query and "startup process" on replica.",
https://www.postgresql.org/message-id/20151001121745.368.58359%40wrigleys.postgresql.org





В списке pgsql-bugs по дате отправления:

Предыдущее
От: Teodor Sigaev
Дата:
Сообщение: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
Следующее
От: Keith Fiske
Дата:
Сообщение: Re: BUG #15123: pgAdmin 4 no error message