Re: max_connections reached in postgres 9.3.3

Поиск
Список
Период
Сортировка
От Vasudevan, Ramya
Тема Re: max_connections reached in postgres 9.3.3
Дата
Msg-id 20EE50F73664E744AF948F0106FE6DFA2AD1101B@SEAMBX01.sea.corp.int.untd.com
обсуждение исходный текст
Ответ на Re: max_connections reached in postgres 9.3.3  (Merlin Moncure <mmoncure@gmail.com>)
Ответы Re: max_connections reached in postgres 9.3.3
Список pgsql-general
Merlin, Thank you for the response.

On the waiting queries - When  we reached 1500 connections, we had 759 connections  that were in active state (116
COMMIT,238 INSERT, UPDATE 176, 57 AUTHENTICATION, 133 BIND. These active INSERTS and UPDATES  also includes the 80-90
waitingsessions (We checked pg_stat_activity for 'waiting' state.  And pg_locks for granted=f). The blocking and the
waitingqueries were simple one row updates/inserts/deletes. These shouldn’t be blocking each other normally (unless, we
think,there was a problem writing to the disk). Correct me if I am wrong.
 

1) log_min_duration_statement  >>
 We have it set to 200ms and capturing all slow queries.

2) detailed system performance diagnostics during load event.   We need to know precise values for user, system, iowait
>>
 First occurance of 'max_connections reached' problem:  It seems to have started about 10:40. 

At 10:42, we ran out of connections:
Jun 17 10:42:21 FATAL:  remaining connection slots are reserved for non-replication superuser connections

AT 10:48:
 We dropped the disk cache (used
http://www.evanhoffman.com/evan/2012/09/25/problems-with-postgresql-9-0-on-centos-6-3-with-a-2-6-32-series-kernel/as
reference).However, that didn’t help.
 

vmstat information (with user cpu, system cpu, i/dle % and i/o wait):
-----------------------------------------------------------------------------------
At 10:40:19, user cpu started climbing, and cs (context switches) and in (interrupts) climbed up as well. No change
reallyin swap, bi (bytes in), bo (bytes out)
 
procs  -----------memory----------  ---swap--  -----io----  --procs  -----------memory----------  ---swap--
-----io---- --system--  -----cpu------  ---timestamp---
 
r      b                            swpd       free         buff        cache           si               so  bi    bo
 in     cs      us  sy  id  wa  st
 
7      0                            0          939888       145808      774588544       0                0   1524  772
 21177  32761   26  6   67  1   0   2014-06-17  10:40:06  PDT
 
14     0                            0          951436       145816      774589952       0                0   1128  1396
 22584  42339   34  6   59  1   0   2014-06-17  10:40:07  PDT
 
12     1                            0          916132       145816      774591360       0                0   2036  1956
 23132  35945   27  6   65  2   0   2014-06-17  10:40:08  PDT
 
7      0                            0          925444       145816      774593920       0                0   1604  3156
 23722  78796   35  8   55  1   0   2014-06-17  10:40:09  PDT
 
15     2                            0          946200       145816      774595840       0                0   1420  460
 24051  56856   35  8   57  1   0   2014-06-17  10:40:10  PDT
 
6      2                            0          979356       145816      774597376       0                0   1708  5016
 23379  33239   31  6   61  1   0   2014-06-17  10:40:11  PDT
 
6      0                            0          1005784      145816      774615616       0                0   1980  6792
 23584  36025   30  6   62  2   0   2014-06-17  10:40:12  PDT
 
9      1                            0          1004352      145816      774618880       0                0   1760  9784
 24826  63949   35  7   57  2   0   2014-06-17  10:40:13  PDT
 
18     0                            0          980156       145816      774621568       0                0   1260  9336
 23897  69468   35  8   56  1   0   2014-06-17  10:40:14  PDT
 
4      1                            0          960672       145816      774639168       0                0   1516  6420
 23235  38023   33  7   58  1   0   2014-06-17  10:40:15  PDT
 
19     0                            0          944560       145816      774642048       0                0   1880  6940
 23337  39658   32  6   60  2   0   2014-06-17  10:40:16  PDT
 
16     1                            0          897036       145816      774661440       0                0   2072  6024
 26300  52597   39  8   53  1   0   2014-06-17  10:40:17  PDT
 
16     5                            0          880428       145816      774664192       0                0   1460
12652 22617  28171   32  5   62  1   0   2014-06-17  10:40:18  PDT
 
11     0                            0          885412       145816      774666112       0                0   1840  7884
 25048  45393   39  7   53  1   0   2014-06-17  10:40:19  PDT 
 
* interrupts start climbing about here (in and cs columns)
24     1                            0          900336       145816      774685120       0                0   2248  4804
 26126  42707   42  7   50  1   0   2014-06-17  10:40:20  PDT
 
14     2                            0          895120       145816      774687744       0                0   1856
22660 27721  57019   46  8   45  1   0   2014-06-17  10:40:21  PDT
 
27     0                            0          881216       145816      774690496       0                0   1756  5872
 27643  44114   43  7   48  2   0   2014-06-17  10:40:22  PDT
 
19     0                            0          865320       145816      774692864       0                0   1704
25752 34434  132248  59  13  27  1   0   2014-06-17  10:40:23  PDT
 
8      0                            0          868056       145816      774711680       0                0   1308
10960 30492  81550   52  10  37  2   0   2014-06-17  10:40:24  PDT
 
33     0                            0          853888       145816      774713152       0                0   616   2960
 31951  61918   68  10  21  0   0   2014-06-17  10:40:25  PDT
 
44     0                            0          784172       145816      774714176       0                0   700   2292
 40585  77163   77  15  8   0   0   2014-06-17  10:40:26  PDT
 
procs  -----------memory----------  ---swap--  -----io----  --system--  -----cpu------  ---timestamp---
r      b                            swpd       free         buff        cache           si               so  bi    bo
 in     cs      us  sy  id  wa  st
 
32     0                            0          613712       145816      774715008       0                0   768   1172
 44675  108155  74  17  8   0   0   2014-06-17  10:40:27  PDT
 
68     0                            0          564336       145764      774600320       0                0   960   4892
 46838  145268  68  21  11  0   0   2014-06-17  10:40:28  PDT
 
49     1                            0          565880       145732      774426560       0                0   1340  4112
 47306  165764  68  21  11  0   0   2014-06-17  10:40:29  PDT
 
3      0                            0          552808       145716      774312896       0                0   2072  1336
 46749  160141  66  23  11  1   0   2014-06-17  10:40:30  PDT
 
26     2                            0          549716       145704      774139328       0                0   1260  1212
 47234  164933  67  22  10  0   0   2014-06-17  10:40:31  PDT
 
13     1                            0          579216       145676      773907072       0                0   1116  1328
 46831  173111  64  24  12  0   0   2014-06-17  10:40:32  PDT
 
44     0                            0          563372       145640      773703680       0                0   1384  3120
 49568  166285  68  23  9   0   0   2014-06-17  10:40:33  PDT
 
55     0                            0          545040       145624      773587776       0                0   1056  3864
 50042  167987  68  23  9   0   0   2014-06-17  10:40:34  PDT
 
86     2                            0          572432       145572      773430144       0                0   944   1084
 58868  140190  77  18  4   0   0   2014-06-17  10:40:35  PDT
 
63     0                            0          555424       145548      773372864       0                0   312   616
 61136  126323  82  16  3   0   0   2014-06-17  10:40:36  PDT
 
71     0                            0          584556       145500      773255936       0                0   888   852
 59944  128606  81  16  3   0   0   2014-06-17  10:40:37  PDT
 
45     1                            0          575360       145500      773140224       0                0   796   1844
 60107  128723  81  16  3   0   0   2014-06-17  10:40:38  PDT
 
37     0                            0          560712       145472      772965120       0                0   1440  4796
 57150  140963  76  19  5   0   0   2014-06-17  10:40:39  PDT
 
112    0                            0          541944       145424      772732480       0                0   2340  1272
 53304  154266  70  22  7   0   0   2014-06-17  10:40:40  PDT
 
55     1                            0          548428       145368      772440000       0                0   852   816
 51299  166797  71  22  7   0   0   2014-06-17  10:40:41  PDT
 
21     0                            0          584668       145288      772177216       0                0   1232  992
 49328  157423  70  23  7   0   0   2014-06-17  10:40:42  PDT
 
61     0                            0          552688       145236      771940608       0                0   2224  5216
 49301  165526  66  25  9   0   0   2014-06-17  10:40:43  PDT
 
29     1                            0          564316       145216      771679104       0                0   1340  4832
 48844  164812  68  24  8   0   0   2014-06-17  10:40:44  PDT
 
34     0                            0          566816       145192      771444288       0                0   1156  776
 49673  155265  71  22  7   0   0   2014-06-17  10:40:45  PDT
 
34     0                            0          569152       145176      771268608       0                0   1232
33408 49762  138163  74  20  6   0   0   2014-06-17  10:40:46  PDT
 
32     1                            0          582320       145144      771034752       0                0   712   596
 47012  156240  69  22  9   0   0   2014-06-17  10:40:47  PDT
 


iostat information, consolidated from output of iostat -x to one minute intervals:
--------------------------------------------------------------------------------------------------
 RPS: Reads in blocks per second
 WPS: writes in blocks per second
 BLK_R: Blocks read
 BLK_W: Blocks written

This is our file system mount point information:
 site and site2 - mounts holding the data files.
 data - pg_xlog
 archive - archive logs

In general, We do not see anything that stands out in terms of I/O leading up to the problem. 

i/o information on all mount points:
site - RPS went up after the cache was cleared, expected. But, no pattern that I can see leading up to the issue:
20140617 10:36:       site: RPS:  1433.91 WPS:   599.67 BLK_R:  1434.27 BLK_W:     1.08
20140617 10:37:       site: RPS:  1364.76 WPS:   583.80 BLK_R:  1364.87 BLK_W:     0.88
20140617 10:38:       site: RPS:  1392.00 WPS:  1412.67 BLK_R:  1392.07 BLK_W:     0.75
20140617 10:39:       site: RPS:  1481.12 WPS:   664.47 BLK_R:  1481.20 BLK_W:     0.88
20140617 10:40:       site: RPS:  1242.64 WPS:  2034.67 BLK_R:  1242.73 BLK_W:     1.08 (problem)
20140617 10:41:       site: RPS:   758.62 WPS:   620.86 BLK_R:   758.73 BLK_W:     1.10
20140617 10:42:       site: RPS:  1482.59 WPS:  1304.47 BLK_R:  1482.80 BLK_W:     0.80
20140617 10:43:       site: RPS:  1699.67 WPS:  3165.47 BLK_R:  1699.67 BLK_W:     1.13
20140617 10:44:       site: RPS:  1072.45 WPS:  1193.67 BLK_R:  1072.67 BLK_W:     1.05
20140617 10:45:       site: RPS:   941.51 WPS:   594.07 BLK_R:   941.67 BLK_W:     0.82
20140617 10:46:       site: RPS:   622.42 WPS:   358.07 BLK_R:   622.73 BLK_W:     0.82
20140617 10:47:       site: RPS:   781.18 WPS:  2038.07 BLK_R:   781.13 BLK_W:     0.98
20140617 10:48:       site: RPS:  1582.35 WPS:   730.27 BLK_R:  1582.40 BLK_W:     0.93 (cache cleared)
20140617 10:49:       site: RPS:  7954.96 WPS:  1713.60 BLK_R:  7955.87 BLK_W:     1.10
20140617 10:50:       site: RPS: 10837.16 WPS:  1065.13 BLK_R: 10839.73 BLK_W:     1.05
20140617 10:51:       site: RPS: 27859.84 WPS:   846.22 BLK_R: 27863.53 BLK_W:     1.02
20140617 10:52:       site: RPS: 23309.31 WPS:  1650.86 BLK_R: 23312.67 BLK_W:     1.37
20140617 10:53:       site: RPS: 28450.93 WPS:   220.83 BLK_R: 28454.00 BLK_W:     0.78
20140617 10:54:       site: RPS: 14312.11 WPS:   272.51 BLK_R: 14313.60 BLK_W:     1.18
20140617 10:55:       site: RPS: 29515.76 WPS:   157.33 BLK_R: 29516.00 BLK_W:     0.58
20140617 10:56:       site: RPS: 22423.51 WPS:   683.07 BLK_R: 22426.80 BLK_W:     0.73

site2 - WPS went down after the problem, and even further down after cache was cleared. 
Definitely an effect not a cause. 
DB was hung and so, fewer writes. 
Again, no pattern that I can see leading up to the issue:

20140617 10:36:      site2: RPS:  1199.19 WPS:  1366.67 BLK_R:  1199.47 BLK_W:     1.17
20140617 10:37:      site2: RPS:  1241.60 WPS:  1450.93 BLK_R:  1241.60 BLK_W:     1.27
20140617 10:38:      site2: RPS:  1277.89 WPS:  1306.53 BLK_R:  1278.13 BLK_W:     1.20
20140617 10:39:      site2: RPS:  1254.00 WPS:  1174.40 BLK_R:  1254.00 BLK_W:     1.07
20140617 10:40:      site2: RPS:  1106.41 WPS:  1206.39 BLK_R:  1106.53 BLK_W:     0.87 (problem)
20140617 10:41:      site2: RPS:   360.63 WPS:   898.53 BLK_R:   360.67 BLK_W:     0.87
20140617 10:42:      site2: RPS:  1290.44 WPS:   418.27 BLK_R:  1290.67 BLK_W:     0.93
20140617 10:43:      site2: RPS:  1804.53 WPS:  1614.93 BLK_R:  1804.53 BLK_W:     1.27
20140617 10:44:      site2: RPS:   815.21 WPS:  1029.73 BLK_R:   815.47 BLK_W:     0.73
20140617 10:45:      site2: RPS:   731.70 WPS:   640.67 BLK_R:   732.00 BLK_W:     1.33
20140617 10:46:      site2: RPS:   415.36 WPS:   362.80 BLK_R:   415.47 BLK_W:     1.13
20140617 10:47:      site2: RPS:   461.93 WPS:   268.13 BLK_R:   462.27 BLK_W:     1.13
20140617 10:48:      site2: RPS:   559.47 WPS:   175.07 BLK_R:   559.47 BLK_W:     1.07
20140617 10:49:      site2: RPS:   578.12 WPS:   306.53 BLK_R:   578.27 BLK_W:     1.20
20140617 10:50:      site2: RPS:   419.14 WPS:   155.60 BLK_R:   419.20 BLK_W:     1.10
20140617 10:51:      site2: RPS:   407.66 WPS:   152.80 BLK_R:   407.73 BLK_W:     1.13
20140617 10:52:      site2: RPS:   316.07 WPS:   204.62 BLK_R:   316.00 BLK_W:     1.07
20140617 10:53:      site2: RPS:  4957.84 WPS:   122.80 BLK_R:  4958.00 BLK_W:     1.13
20140617 10:54:      site2: RPS:  5731.85 WPS:   236.40 BLK_R:  5732.40 BLK_W:     0.90
20140617 10:55:      site2: RPS: 69626.67 WPS:     0.67 BLK_R: 69626.67 BLK_W:     0.17 (cache cleared)
20140617 10:56:      site2: RPS: 77967.18 WPS:     2.93 BLK_R: 77980.80 BLK_W:     0.27
20140617 10:57:      site2: RPS: 92866.05 WPS:    12.79 BLK_R: 92880.93 BLK_W:     3.13
20140617 10:58:      site2: RPS: 84538.00 WPS:    22.53 BLK_R: 84538.00 BLK_W:     4.20
20140617 10:59:      site2: RPS: 37730.09 WPS:    72.26 BLK_R: 37737.33 BLK_W:     2.93
20140617 11:00:      site2: RPS: 82512.67 WPS:    13.60 BLK_R: 82512.67 BLK_W:     2.27
20140617 11:01:      site2: RPS: 78838.10 WPS:    23.33 BLK_R: 78850.67 BLK_W:     4.17

data - A minute after the issue, spurt in reads (could be a one-off). 
Writes drop (effect, not cause). 
After cache is cleared, reads from disk increase, writes decrease (expected)

20140617 10:11:       data: RPS:     0.00 WPS:  1342.90 BLK_R:     0.00 BLK_W:     3.73
20140617 10:12:       data: RPS:     0.00 WPS:  1187.73 BLK_R:     0.00 BLK_W:     4.57
20140617 10:13:       data: RPS:     0.00 WPS:  1318.91 BLK_R:     0.00 BLK_W:     4.23
20140617 10:14:       data: RPS:     0.00 WPS:  1350.00 BLK_R:     0.00 BLK_W:     4.17
20140617 10:15:       data: RPS:     0.00 WPS:  1539.12 BLK_R:     0.00 BLK_W:     3.83
20140617 10:16:       data: RPS:     0.00 WPS:  1255.47 BLK_R:     0.00 BLK_W:     4.47
20140617 10:17:       data: RPS:     0.00 WPS:  1734.58 BLK_R:     0.00 BLK_W:     4.20
20140617 10:18:       data: RPS:     0.00 WPS:  2082.13 BLK_R:     0.00 BLK_W:     4.13
20140617 10:19:       data: RPS:     0.00 WPS:  1560.90 BLK_R:     0.00 BLK_W:     4.70
20140617 10:20:       data: RPS:     0.00 WPS:  1413.73 BLK_R:     0.00 BLK_W:     3.57
20140617 10:21:       data: RPS:     0.00 WPS:  1677.33 BLK_R:     0.00 BLK_W:     3.90
20140617 10:22:       data: RPS:     0.27 WPS:  1454.13 BLK_R:     0.27 BLK_W:     3.97
20140617 10:23:       data: RPS:     0.00 WPS:  1439.16 BLK_R:     0.00 BLK_W:     4.00
20140617 10:24:       data: RPS:     0.00 WPS:  1323.47 BLK_R:     0.00 BLK_W:     4.13
20140617 10:25:       data: RPS:     0.00 WPS:  1650.56 BLK_R:     0.00 BLK_W:     3.27
20140617 10:26:       data: RPS:     0.00 WPS:  1361.20 BLK_R:     0.00 BLK_W:     3.87
20140617 10:27:       data: RPS:     0.00 WPS:  1360.07 BLK_R:     0.00 BLK_W:     3.93
20140617 10:28:       data: RPS:     0.00 WPS:  1260.13 BLK_R:     0.00 BLK_W:     3.97
20140617 10:29:       data: RPS:     0.00 WPS:  1513.86 BLK_R:     0.00 BLK_W:     3.50
20140617 10:30:       data: RPS:     0.00 WPS:  1499.73 BLK_R:     0.00 BLK_W:     3.90
20140617 10:31:       data: RPS:     0.00 WPS:  1394.80 BLK_R:     0.00 BLK_W:     3.47
20140617 10:32:       data: RPS:     0.00 WPS:  1690.27 BLK_R:     0.00 BLK_W:     3.90
20140617 10:33:       data: RPS:     0.00 WPS:  1543.55 BLK_R:     0.00 BLK_W:     3.57
20140617 10:34:       data: RPS:     0.00 WPS:  1457.47 BLK_R:     0.00 BLK_W:     3.80
20140617 10:35:       data: RPS:     0.00 WPS:  1414.00 BLK_R:     0.00 BLK_W:     4.50
20140617 10:36:       data: RPS:     0.00 WPS:  1557.14 BLK_R:     0.00 BLK_W:     4.23
20140617 10:37:       data: RPS:     0.00 WPS:  1675.60 BLK_R:     0.00 BLK_W:     3.77
20140617 10:38:       data: RPS:     0.00 WPS:  1473.56 BLK_R:     0.00 BLK_W:     4.57
20140617 10:39:       data: RPS:     0.00 WPS:  1424.00 BLK_R:     0.00 BLK_W:     3.67
20140617 10:40:       data: RPS:     0.00 WPS:  4113.20 BLK_R:     0.00 BLK_W:     4.13 (problem)

* (anomaly, RPS goes to 2.40 from 0.0. One-off?) *
20140617 10:41:       data: RPS:     2.40 WPS:   425.04 BLK_R:     2.40 BLK_W:     3.90 
20140617 10:42:       data: RPS:     0.00 WPS:  1402.49 BLK_R:     0.00 BLK_W:     4.50
20140617 10:43:       data: RPS:     0.00 WPS:  1619.47 BLK_R:     0.00 BLK_W:     3.47
20140617 10:44:       data: RPS:     0.00 WPS:   715.64 BLK_R:     0.00 BLK_W:     3.53
20140617 10:45:       data: RPS:     0.27 WPS:   547.02 BLK_R:     0.27 BLK_W:     3.53
20140617 10:46:       data: RPS:     0.00 WPS:   163.31 BLK_R:     0.00 BLK_W:     4.00
20140617 10:47:       data: RPS:     0.13 WPS:   815.87 BLK_R:     0.13 BLK_W:     4.30
20140617 10:48:       data: RPS:    74.00 WPS:   663.07 BLK_R:    74.00 BLK_W:     3.90 (cache cleared)
20140617 10:49:       data: RPS:   448.53 WPS:   628.63 BLK_R:   448.53 BLK_W:     4.33
20140617 10:50:       data: RPS:    39.20 WPS:   208.50 BLK_R:    39.20 BLK_W:     4.03
20140617 10:51:       data: RPS:    35.73 WPS:   228.12 BLK_R:    35.73 BLK_W:     4.13
20140617 10:52:       data: RPS:    15.20 WPS:   148.78 BLK_R:    15.20 BLK_W:     3.63
20140617 10:53:       data: RPS:    24.27 WPS:   162.37 BLK_R:    24.27 BLK_W:     3.40
20140617 10:54:       data: RPS:   136.26 WPS:   139.74 BLK_R:   136.27 BLK_W:     1.90
20140617 10:55:       data: RPS:    93.33 WPS:  2253.87 BLK_R:    93.33 BLK_W:     4.37
20140617 10:56:       data: RPS:    20.27 WPS:   837.60 BLK_R:    20.27 BLK_W:     4.57
20140617 10:57:       data: RPS:    11.46 WPS:   391.42 BLK_R:    11.47 BLK_W:     4.30
20140617 10:58:       data: RPS:    21.07 WPS:   348.27 BLK_R:    21.07 BLK_W:     3.77
20140617 10:59:       data: RPS:    65.83 WPS:   388.87 BLK_R:    65.87 BLK_W:     3.43
20140617 11:00:       data: RPS:    45.87 WPS:  1159.20 BLK_R:    45.87 BLK_W:     3.37

archive -After the problem started, WPS dropped very low for some time. 
Is that expected? Is it because wal logs were written at a lower pace, and so archiving had less to do?

20140617 10:34:    archive: RPS:     0.00 WPS:  1648.27 BLK_R:     0.00 BLK_W:     0.93
20140617 10:35:    archive: RPS:     0.00 WPS:   553.07 BLK_R:     0.00 BLK_W:     0.57
20140617 10:36:    archive: RPS:     0.00 WPS:  1648.10 BLK_R:     0.00 BLK_W:     1.30
20140617 10:37:    archive: RPS:     0.00 WPS:  1102.00 BLK_R:     0.00 BLK_W:     1.17
20140617 10:38:    archive: RPS:     0.00 WPS:  1098.39 BLK_R:     0.00 BLK_W:     0.73
20140617 10:39:    archive: RPS:     0.00 WPS:  1102.27 BLK_R:     0.00 BLK_W:     0.60
20140617 10:40:    archive: RPS:     0.00 WPS:  3285.07 BLK_R:     0.00 BLK_W:     0.73 (problem)
20140617 10:41:    archive: RPS:     0.00 WPS:  1102.27 BLK_R:     0.00 BLK_W:     1.10
20140617 10:42:    archive: RPS:     0.00 WPS:   551.73 BLK_R:     0.00 BLK_W:     0.73
20140617 10:43:    archive: RPS:     0.00 WPS:  1101.07 BLK_R:     0.00 BLK_W:     1.07
20140617 10:44:    archive: RPS:     0.00 WPS:  1103.47 BLK_R:     0.00 BLK_W:     1.13
20140617 10:45:    archive: RPS:     0.00 WPS:   553.20 BLK_R:     0.00 BLK_W:     0.70
20140617 10:46:    archive: RPS:     0.00 WPS:     7.07 BLK_R:     0.00 BLK_W:     0.73
20140617 10:47:    archive: RPS:     0.00 WPS:   553.84 BLK_R:     0.00 BLK_W:     0.87
20140617 10:48:    archive: RPS:     0.00 WPS:   554.53 BLK_R:     0.00 BLK_W:     0.87 (cache cleared)
20140617 10:49:    archive: RPS:     0.00 WPS:   551.31 BLK_R:     0.00 BLK_W:     0.50
20140617 10:50:    archive: RPS:     0.00 WPS:     8.53 BLK_R:     0.00 BLK_W:     0.87
20140617 10:51:    archive: RPS:     0.00 WPS:   553.73 BLK_R:     0.00 BLK_W:     0.73
20140617 10:52:    archive: RPS:     0.00 WPS:     3.87 BLK_R:     0.00 BLK_W:     0.87
20140617 10:53:    archive: RPS:     0.00 WPS:     6.67 BLK_R:     0.00 BLK_W:     0.83
20140617 10:54:    archive: RPS:     0.00 WPS:   554.27 BLK_R:     0.00 BLK_W:     0.77
20140617 10:55:    archive: RPS:   848.80 WPS:   555.87 BLK_R:   848.80 BLK_W:     1.03
20140617 10:56:    archive: RPS:     0.27 WPS:  1099.20 BLK_R:     0.27 BLK_W:     0.87
20140617 10:57:    archive: RPS:     0.00 WPS:     1.33 BLK_R:     0.00 BLK_W:     0.17
20140617 10:58:    archive: RPS:     0.00 WPS:     1.33 BLK_R:     0.00 BLK_W:     0.17
20140617 10:59:    archive: RPS:     0.00 WPS:  1110.13 BLK_R:     0.00 BLK_W:     0.13
20140617 11:00:    archive: RPS:     0.00 WPS:   553.87 BLK_R:     0.00 BLK_W:     0.37
20140617 11:01:    archive: RPS:     0.00 WPS:  1102.13 BLK_R:     0.00 BLK_W:     0.80
20140617 11:02:    archive: RPS:     0.00 WPS:  1647.87 BLK_R:     0.00 BLK_W:     1.03
20140617 11:03:    archive: RPS:     0.00 WPS:  2191.60 BLK_R:     0.00 BLK_W:     0.60
20140617 11:04:    archive: RPS:     0.00 WPS:  2194.27 BLK_R:     0.00 BLK_W:     0.93
20140617 11:05:    archive: RPS:     0.00 WPS:  2193.33 BLK_R:     0.00 BLK_W:     1.33
20140617 11:06:    archive: RPS:     0.00 WPS:     7.84 BLK_R:     0.00 BLK_W:     0.70
20140617 11:07:    archive: RPS:  3229.33 WPS:     4.27 BLK_R:  3229.33 BLK_W:     0.43
20140617 11:08:    archive: RPS:  3281.07 WPS:     6.40 BLK_R:  3281.07 BLK_W:     0.73
20140617 11:09:    archive: RPS: 19427.33 WPS:   554.27 BLK_R: 19427.33 BLK_W:     0.77
20140617 11:10:    archive: RPS:     0.00 WPS:  2191.47 BLK_R:     0.00 BLK_W:     0.63
20140617 11:11:    archive: RPS:     0.00 WPS:  1102.27 BLK_R:     0.00 BLK_W:     0.60
20140617 11:12:    archive: RPS:     0.00 WPS:  1102.00 BLK_R:     0.00 BLK_W:     0.67
20140617 11:13:    archive: RPS:     0.00 WPS:  1101.60 BLK_R:     0.00 BLK_W:     0.77
20140617 11:14:    archive: RPS:     0.00 WPS:  3829.47 BLK_R:     0.00 BLK_W:     0.80
20140617 11:15:    archive: RPS:     0.00 WPS:  1104.13 BLK_R:     0.00 BLK_W:     0.97
20140617 11:16:    archive: RPS:     0.00 WPS:  1101.87 BLK_R:     0.00 BLK_W:     0.87
20140617 11:17:    archive: RPS:  1339.07 WPS:  1100.40 BLK_R:  1339.07 BLK_W:     0.90
20140617 11:18:    archive: RPS:     0.00 WPS:  1100.67 BLK_R:     0.00 BLK_W:     0.83


Load average information leading up to the problem:
-----------------------------------------------------------------
20140617 10:30:01 top - 10:30:08 up 161 days, 19:46,  5 users,  load average: 1.30, 1.22, 1.20
20140617 10:30:01 Tasks: 1347 total,  15 running, 1332 sleeping,   0 stopped,   0 zombie
--
20140617 10:31:01 top - 10:31:08 up 161 days, 19:47,  5 users,  load average: 1.89, 1.35, 1.25
20140617 10:31:01 Tasks: 1339 total,   6 running, 1333 sleeping,   0 stopped,   0 zombie
--
20140617 10:32:01 top - 10:32:09 up 161 days, 19:48,  5 users,  load average: 1.53, 1.33, 1.25
20140617 10:32:01 Tasks: 1359 total,   5 running, 1354 sleeping,   0 stopped,   0 zombie
--
20140617 10:33:01 top - 10:33:08 up 161 days, 19:49,  5 users,  load average: 1.10, 1.25, 1.22
20140617 10:33:01 Tasks: 1333 total,  10 running, 1323 sleeping,   0 stopped,   0 zombie
--
20140617 10:34:01 top - 10:34:08 up 161 days, 19:50,  5 users,  load average: 1.01, 1.23, 1.22
20140617 10:34:01 Tasks: 1368 total,   8 running, 1360 sleeping,   0 stopped,   0 zombie
--
20140617 10:35:01 top - 10:35:09 up 161 days, 19:51,  5 users,  load average: 1.21, 1.26, 1.22
20140617 10:35:01 Tasks: 1359 total,   6 running, 1353 sleeping,   0 stopped,   0 zombie
--
20140617 10:36:01 top - 10:36:08 up 161 days, 19:52,  5 users,  load average: 1.19, 1.25, 1.22
20140617 10:36:01 Tasks: 1356 total,   9 running, 1347 sleeping,   0 stopped,   0 zombie
--
20140617 10:37:01 top - 10:37:08 up 161 days, 19:53,  5 users,  load average: 0.88, 1.15, 1.19
20140617 10:37:01 Tasks: 1339 total,  13 running, 1326 sleeping,   0 stopped,   0 zombie
--
20140617 10:38:01 top - 10:38:08 up 161 days, 19:54,  5 users,  load average: 1.51, 1.24, 1.21
20140617 10:38:01 Tasks: 1356 total,   9 running, 1347 sleeping,   0 stopped,   0 zombie
--
20140617 10:39:01 top - 10:39:08 up 161 days, 19:55,  5 users,  load average: 1.75, 1.35, 1.25
20140617 10:39:01 Tasks: 1339 total,  12 running, 1327 sleeping,   0 stopped,   0 zombie
--
20140617 10:40:01 top - 10:40:08 up 161 days, 19:56,  5 users,  load average: 1.27, 1.27, 1.23
20140617 10:40:01 Tasks: 1359 total,   4 running, 1355 sleeping,   0 stopped,   0 zombie

* Start of problem *
--
20140617 10:41:02 top - 10:41:12 up 161 days, 19:57,  5 users,  load average: 14.96, 4.47, 2.29
20140617 10:41:02 Tasks: 1831 total,  59 running, 1772 sleeping,   0 stopped,   0 zombie
--
20140617 10:43:01 top - 10:43:09 up 161 days, 19:59,  5 users,  load average: 16.23, 10.94, 5.07
20140617 10:43:01 Tasks: 2203 total,  49 running, 2154 sleeping,   0 stopped,   0 zombie
--
20140617 10:44:01 top - 10:44:09 up 161 days, 20:00,  5 users,  load average: 8.61, 9.60, 4.98
20140617 10:44:01 Tasks: 1942 total,  36 running, 1906 sleeping,   0 stopped,   0 zombie
--
20140617 10:45:01 top - 10:45:10 up 161 days, 20:01,  5 users,  load average: 26.00, 14.28, 6.86
20140617 10:45:01 Tasks: 2226 total,  49 running, 2177 sleeping,   0 stopped,   0 zombie
--
20140617 10:46:02 top - 10:46:22 up 161 days, 20:02,  6 users,  load average: 121.92, 47.56, 19.11
20140617 10:46:02 Tasks: 2248 total, 102 running, 2145 sleeping,   0 stopped,   1 zombie
--
20140617 10:47:01 top - 10:47:15 up 161 days, 20:03,  7 users,  load average: 119.38, 58.95, 24.52
20140617 10:47:01 Tasks: 2257 total,  83 running, 2171 sleeping,   0 stopped,   2 zombie
--
20140617 10:48:01 top - 10:48:12 up 161 days, 20:04,  9 users,  load average: 78.64, 57.52, 26.14
20140617 10:48:01 Tasks: 2247 total,  95 running, 2152 sleeping,   0 stopped,   0 zombie
--
20140617 10:55:01 top - 10:55:08 up 161 days, 20:11, 13 users,  load average: 55.85, 64.71, 41.02
20140617 10:55:01 Tasks: 1275 total,   4 running, 1271 sleeping,   0 stopped,   0 zombie
--
20140617 10:56:01 top - 10:56:08 up 161 days, 20:12, 13 users,  load average: 306.87, 132.30, 65.5
20140617 10:56:01 Tasks: 2007 total,   1 running, 2006 sleeping,   0 stopped,   0 zombie
--
20140617 10:57:01 top - 10:57:09 up 161 days, 20:13, 13 users,  load average: 591.73, 241.81, 107.
20140617 10:57:01 Tasks: 2209 total,   2 running, 2207 sleeping,   0 stopped,   0 zombie
--
20140617 10:58:01 top - 10:58:08 up 161 days, 20:14, 13 users,  load average: 858.24, 378.84, 162.
20140617 10:58:01 Tasks: 2266 total,   3 running, 2263 sleeping,   0 stopped,   0 zombie
--
20140617 10:59:02 top - 10:59:09 up 161 days, 20:15, 13 users,  load average: 1068.68, 525.91, 226
20140617 10:59:02 Tasks: 2263 total,   1 running, 2262 sleeping,   0 stopped,   0 zombie
--
20140617 11:00:01 top - 11:00:08 up 161 days, 20:16, 13 users,  load average: 1104.86, 639.53, 284
20140617 11:00:01 Tasks: 2279 total,   2 running, 2277 sleeping,   0 stopped,   0 zombie
--
20140617 11:01:01 top - 11:01:08 up 161 days, 20:17, 13 users,  load average: 664.81, 602.11, 294.
20140617 11:01:01 Tasks: 2275 total,   6 running, 2269 sleeping,   0 stopped,   0 zombie
--
20140617 11:02:01 top - 11:02:09 up 161 days, 20:18, 13 users,  load average: 485.78, 561.31, 299.
20140617 11:02:01 Tasks: 2272 total,   9 running, 2263 sleeping,   0 stopped,   0 zombie
--
20140617 11:03:01 top - 11:03:08 up 161 days, 20:19, 13 users,  load average: 422.34, 529.23, 304.
20140617 11:03:01 Tasks: 2118 total,   4 running, 2114 sleeping,   0 stopped,   0 zombie
--
20140617 11:04:01 top - 11:04:09 up 161 days, 20:20, 13 users,  load average: 371.39, 495.59, 307.
20140617 11:04:01 Tasks: 1953 total,  10 running, 1943 sleeping,   0 stopped,   0 zombie
--
20140617 11:05:01 top - 11:05:08 up 161 days, 20:21, 13 users,  load average: 331.89, 461.73, 307.
20140617 11:05:01 Tasks: 1844 total,   8 running, 1836 sleeping,   0 stopped,   0 zombie
--
20140617 11:06:01 top - 11:06:12 up 161 days, 20:22, 13 users,  load average: 213.18, 403.72, 297.
20140617 11:06:01 Tasks: 1913 total, 110 running, 1803 sleeping,   0 stopped,   0 zombie
--
20140617 11:10:01 top - 11:10:08 up 161 days, 20:26, 14 users,  load average: 90.95, 229.57, 249.9
20140617 11:10:01 Tasks: 1326 total,  11 running, 1314 sleeping,   1 stopped,   0 zombie
--
20140617 11:11:01 top - 11:11:08 up 161 days, 20:27, 14 users,  load average: 83.22, 202.67, 239.4
20140617 11:11:01 Tasks: 1544 total,   6 running, 1537 sleeping,   1 stopped,   0 zombie
--
20140617 11:12:01 top - 11:12:08 up 161 days, 20:28, 14 users,  load average: 76.15, 179.03, 229.0
20140617 11:12:01 Tasks: 1486 total,  12 running, 1473 sleeping,   1 stopped,   0 zombie
--
20140617 11:13:01 top - 11:13:08 up 161 days, 20:29, 14 users,  load average: 70.00, 158.61, 218.8
20140617 11:13:01 Tasks: 1518 total,   3 running, 1514 sleeping,   1 stopped,   0 zombie
--
20140617 11:14:01 top - 11:14:08 up 161 days, 20:30, 14 users,  load average: 67.06, 141.63, 209.2
20140617 11:14:01 Tasks: 1439 total,  39 running, 1399 sleeping,   1 stopped,   0 zombie
--
20140617 11:15:01 top - 11:15:08 up 161 days, 20:31, 14 users,  load average: 72.38, 128.89, 200.6
20140617 11:15:01 Tasks: 2279 total,   9 running, 2269 sleeping,   1 stopped,   0 zombie
--
20140617 11:16:01 top - 11:16:08 up 161 days, 20:32, 14 users,  load average: 63.85, 116.17, 191.7
20140617 11:16:01 Tasks: 2002 total,  18 running, 1983 sleeping,   1 stopped,   0 zombie
--
20140617 11:17:01 top - 11:17:08 up 161 days, 20:33, 14 users,  load average: 54.34, 103.79, 182.7
20140617 11:17:01 Tasks: 1759 total,  12 running, 1746 sleeping,   1 stopped,   0 zombie
--
20140617 11:18:01 top - 11:18:08 up 161 days, 20:34, 14 users,  load average: 56.49, 95.20, 174.84
20140617 11:18:01 Tasks: 1630 total,  13 running, 1616 sleeping,   1 stopped,   0 zombie
--
20140617 11:19:01 top - 11:19:08 up 161 days, 20:35, 15 users,  load average: 53.37, 87.20, 167.10
20140617 11:19:01 Tasks: 1501 total,  11 running, 1489 sleeping,   1 stopped,   0 zombie
--
20140617 11:20:01 top - 11:20:08 up 161 days, 20:36, 15 users,  load average: 47.54, 79.38, 159.41
20140617 11:20:01 Tasks: 1415 total,   8 running, 1406 sleeping,   1 stopped,   0 zombie
--

3) know if transaction rates spiking going into (as opposed to during) the load is interesting.  We can capture tps
ratesby logging the output of: select * from pg_stat_database. >>
 
We noticed that tps rate was not abnormal before, during and immediately after the issue happens. Old sessions were
still"processing" and we had new sessions come in (like normal). When legitimate connections from our site came in
whilethe existing old connections were still running in the database (either active/processing or active/waiting
state),we are hitting the max_connections limit.
 

Few more information to add:
This is what we see happening almost all the time when we have the issue:
- At least from 5 mins before the issue (max_connections reached) happens, we see queries slowing down considerably
(simpleselects using indexes taking 10-15 mins, and fewer dmls)
 
- When the issue happens user cpu maxes out, i/o wait goes almost to zero, and load goes really high. Simple selects
takingforever, and ton of dmls waiting. 
 

Things we changed in the environment so far:
 1. Disabled THP, and tcp offloading (GRP/GSO/TRO/TSO). They helped reduce the frequency of the issue but didn’t fix
it
 2. We restarted postgres with "numactl --interleave=all" last Thrusday (06-12-14) to alleviate the NUMA specific
memoryoffloading issues. 
 
 3. We also dropped the disk cache a couple times till now.

Questions:
1) As mentioned earlier, we are in CentOS 6.5 and Kernel version 2.6.32. Are there any known issues with these
versions?We read in some forums that downgrading to CentOS 5.8 and kernel 2.6.18 have helped with similar issues.  Is
itworth trying?
 
2) Should we look into changing our I/O scheduler from CFQ to deadline?
(http://www.cybertec.at/postgresql-linux-kernel-io-tuning/)

Thank You
Ramya

В списке pgsql-general по дате отправления:

Предыдущее
От: ogromm
Дата:
Сообщение: \COPY from CSV ERROR: unterminated CSV quoted field
Следующее
От: Edson Richter
Дата:
Сообщение: Re: Re : Global value/global variable?