Обсуждение: BUG #15124: standby hung, recovering process stuck waiting,possible undetected deadlock
BUG #15124: standby hung, recovering process stuck waiting,possible undetected deadlock
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 15124 Logged by: Clinton Adams Email address: clinton.adams@gmail.com PostgreSQL version: 9.6.6 Operating system: CentOS Linux release 7.4.1708 Description: One of our streaming standbys built up a wait queue behind the recovering process, which rendered db inaccessible with all non-idle user processes stuck in 'startup waiting'. Some master log snippets: 2018-03-20 17:06:10.987 EDT,,,12954,,5a5ec246.329a,3015,,2018-01-16 22:25:58 EST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,"" 2018-03-20 17:29:24.490 EDT,,,5197,,5ab17d34.144d,1,,2018-03-20 17:29:24 EDT,113/10510168,2787541257,LOG,00000,"automatic vacuum of table ""ateb.pg_catalog.pg_index"": index scans: 1 2018-03-20 17:30:19.876 EDT,,,12954,,5a5ec246.329a,3016,,2018-01-16 22:25:58 EST,,0,LOG,00000,"checkpoint complete: wrote 4045 buffers (0.4%); 0 transaction log file(s) added, 68 removed, 0 recycled; write=1448.033 s, sync=0.318 s, total=1448.888 s; sync files=1513, longest=0.003 s, average=0.000 s; distance=1800115 kB, estimate=23855600 kB",,,,,,,,,"" Standby has - max_standby_streaming_delay = 4h max_standby_archive_delay = 8h hot_standby_feedback = off wal_keep_segments = 2000 hot_standby = on All instances have synchronous_commit = off commit_delay is not being used Standby log showing first wait: 2018-03-20 17:29:34.588 EDT,"XXXXX","XXXXX",17933,"XX.XX.XX.XX:48164",5ab17d34.460d,1,"SELECT waiting",2018-03-20 17:29:24 EDT,3/34326918,0,LOG,00000,"process 17933 still waiting for AccessShareLock on relation 2610 of database 16403 after 10000.164 ms","Process holding the lock: 30493. Wait queue: 17933.",,,,," pid 30493 - postgres 30493 1.0 17.4 9019072 8594764 ? Ss Jan16 942:24 postgres: startup process recovering 0000000100009EFF00000035 waiting relation 2610 is pg_index Standby log showing wait queue: 2018-03-20 19:49:55.845 EDT,"XXXXX","XXXXX",9129,"XX.XX.XX.XX:57952",5ab18d0c.23a9,2,"startup waiting",2018-03-20 18:37:00 EDT,177/101,0,LOG,00000,"process 9129 still waiting for AccessShareLock on relation 2610 of database 16403 after 4375401.381 ms","Process holding the lock: 30493. Wait queue: 17933, 18459, 18515, 20148, 20196, 20446, 20460, 20474, 20475, 20476, 20477, 20478, 20479, 20890, 21076, 21104, 21234, 21493, 21505, 22736, 24478, 24744, 24757, 24894, 25140, 25152, 26389, 26396, 26397, 26398, 26400, 26399, 26401, 28159, 28433, 28455, 28580, 28850, 28862, 30119, 31929, 31931, 31943, 31956, 31968, 31969, 31970, 31971, 31972, 31974, 31988, 32213, 32260, 32272, 32411, 32617, 32678, 32691, 1590, 3466, 3480, 3492, 3777, 3789, 3929, 4192, 4204, 5515, 5524, 5523, 5525, 5526, 5527, 7377, 7390, 7403, 7687, 7700, 7824, 8102, 8115, 9368, 11277, 11293, 11309, 11666, 11705, 11718, 11843, 24149, 12132, 12144, 13473, 14743, 15002, 15393, 15405, 15418, 15711, 15723, 15860, 16134, 16146, 17468, 19406, 19424, 19440, 19732, 19745, 19883, 20157, 20169, 21513, 23420, 23434, 23447, 23753, 23794, 23806, 23944, 24218, 24230, 25568, 27492, 27506, 27519, 27825, 27851, 27973, 28260, 28272, 29625, 31564, 31578, 31591, 31877, 31890, 32025, 32305, 32317, 1227, 11021, 2268, 3091, 3121, 3135, 3422, 3461, 3475, 3607, 3883, 3895, 4719, 5058, 5461, 7073, 7086, 7099, 7384, 7399, 7523, 7803, 7815, 9129, 10997, 11015, 11029, 11316, 11330, 11458, 11737, 11749, 13057, 14927, 14940, 14959, 15290, 15302, 15437, 15706, 15719, 16275, 16653, 17029, 17044, 18521, 18869, 18887, 18900, 19165, 19178, 19303, 19569, 19582, 19950, 20874, 22702, 22714, 22727, 23003, 23015, 23146, 23408, 23421, 24686, 26545, 26599, 26612, 26938, 26951, 27075, 27355, 27367, 28248, 28703, 30479, 30495, 30511, 30791, 30792, 30929, 31175, 31187, 32482, 1852, 1873, 1886, 2151, 2164, 2289, 2555, 2567, 3727, 5612, 5626, 5640, 5961, 5973, 6110, 6356, 6368, 7647, 9407, 9425, 9438, 9704, 9716, 9840, 10099, 10111, 11388, 13169, 13183, 13195, 13460, 13476, 13612, 13857, 13869, 15161, 15413, 15444, 15758, 15774, 15816, 15817, 15819, 15820, 16785, 16975, 17002, 17015, 17336, 17349, 17474, 17746, 17758, 19060, 20888, 20903, 20916, 21193, 21208, 21331, 21602, 21614, 22905, 22980, 23255, 24320, 24958, 24977, 24989, 25294, 25306, 25437, 25729, 25741, 27109, 29150, 29169, 29186, 29499, 29539, 29551, 29676, 29930, 29973, 29985, 31364.",,,,,,,,"psql" I confirmed that there were no idle in transactions on the standby. I could not get access to db to query pg_locks view. Before restarting the standby as last resort, there were 280 user processes in 'startup waiting', 7 'idle', and 3 in 'SELECT waiting'. After restart of standby operations returned to normal, retrieving WAL segments from archive and then going back to streaming. Other streaming standbys did not show the problem. An update to 9.6.8 is planned, but I do not see anything applicable to this in the release notes. This is possibly related to "BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.", https://www.postgresql.org/message-id/20151001121745.368.58359%40wrigleys.postgresql.org