Thread: BUG #12183: Memory leak in long running sessions


BUG #12183: Memory leak in long running sessions

From: valgog@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      12183
Logged by:          Valentine Gogichashvili
Email address:      valgog@gmail.com
PostgreSQL version: 9.3.4
Operating system:   Ubuntu Precise (3.13.0-30)
Description:

Hello,

It will be difficult to debug, but I will start with this report and provide
further information as requested.

The situation is as follows:

Postgres 9.3.4 is running on a physical machine with 125GB of RAM.

A Java application opens connections through a connection pool and reuses
them until it is redeployed, so depending on the deployment cycle a
connection can grow up to two weeks old.

The application makes a lot of stored procedure calls, including ones whose
SQL statements consume a lot of memory.

We experience situations in which some of the sessions (in our case the
oldest ones) do not give the memory back.

The "solution" in our case is to kill the oldest idle connections, but of
course this is not a nice solution.
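
For illustration only, that cleanup amounts to something like the following
sketch (assuming psycopg2 and sufficient privileges; the seven-day cutoff and
the DSN are placeholders, not our real values):

# Rough illustration only -- not our production tooling. Terminates idle
# backends older than a placeholder cutoff using pg_terminate_backend().
import psycopg2

TERMINATE_OLD_IDLE = """
    SELECT pid, pg_terminate_backend(pid)
      FROM pg_stat_activity
     WHERE state = 'idle'
       AND backend_start < now() - interval '7 days'
       AND pid <> pg_backend_pid()
"""

def terminate_old_idle_backends(dsn):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(TERMINATE_OLD_IDLE)
            # return the pids that were signalled
            return [row[0] for row in cur.fetchall()]

if __name__ == "__main__":
    # placeholder DSN
    print(terminate_old_idle_backends("dbname=prod_wh_ef_main_db user=postgres"))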

An excerpt of the "ps fauxw" output is the following:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres   959  0.0  0.0 8736052 99316 ?       S    Jun22 111:57 /server/postgres/9.3.4/bin/postgres -D /data/postgres/pgsql_whdbef/9.3/data
postgres   960  0.0  0.0 102544  7160 ?        Ss   Jun22 113:32  \_ postgres: logger process
postgres 44057  0.3  6.4 8744884 8531256 ?     Ss   Sep10 425:46  \_ postgres: checkpointer process
postgres 44058  0.2  6.4 8740288 8504052 ?     Ss   Sep10 267:35  \_ postgres: writer process
postgres 44059  0.6  0.0 8740152 18648 ?       Ss   Sep10 870:12  \_ postgres: wal writer process
postgres 44060  0.0  0.0 8741060 4328 ?        Ss   Sep10   8:44  \_ postgres: autovacuum launcher process
postgres 44061  0.0  0.0  98512  1904 ?        Ss   Sep10  16:14  \_ postgres: archiver process   last was 000000010000386C000000E1
postgres 44062  1.3  0.0 100304  3020 ?        Ss   Sep10 1804:45  \_ postgres: stats collector process
postgres 11544  1.0  0.0 8741376 3380 ?        Ss   Nov04 518:15  \_ postgres: wal sender process standby 10.228.2.29(56620) streaming 386C/E2FEEB30
postgres 41851  1.0  0.0 8741376 3024 ?        Ss   Nov04 523:59  \_ postgres: wal sender process standby 10.228.2.34(35268) streaming 386C/E2FEEB30
postgres 39234  1.8  6.7 9059376 8884356 ?     Ss   Dec03 138:16  \_ postgres: zomcat_p4820 prod_wh_ef_main_db 10.228.2.41(39735) idle
postgres 54021  5.3  7.5 10397052 9927416 ?    Ss   08:57  33:45  \_ postgres: zomcat_p4820 prod_wh_ef_main_db 10.228.2.40(57772) idle
postgres 31659  0.1  0.2 8766116 278368 ?      Ss   19:15   0:01  \_ postgres: zomcat_p4810 prod_wh_ef_main_db 10.228.2.41(46032) idle
... more processes

The result of the "select * from pg_stat_activity order by backend_start
limit 5" is the following:

-[ RECORD 1 ]----+----------------------------------------------------------
datid            | 16414
datname          | prod_wh_ef_main_db
pid              | 39234
usesysid         | 18361
usename          | zomcat_p4820
application_name |
client_addr      | 10.228.2.41
client_hostname  |
client_port      | 39735
backend_start    | 2014-12-03 13:01:44.983123+01
xact_start       |
query_start      | 2014-12-08 19:28:46.091008+01
state_change     | 2014-12-08 19:28:46.091531+01
waiting          | f
state            | idle
query            | select * from very_fast_sproc($1) as result
-[ RECORD 2 ]----+----------------------------------------------------------
datid            | 16414
datname          | prod_wh_ef_main_db
pid              | 54021
usesysid         | 18361
usename          | zomcat_p4820
application_name |
client_addr      | 10.228.2.40
client_hostname  |
client_port      | 57772
backend_start    | 2014-12-08 08:57:05.976879+01
xact_start       |
query_start      | 2014-12-08 19:28:46.135191+01
state_change     | 2014-12-08 19:28:46.213875+01
waiting          | f
state            | idle
query            | SELECT * FROM fast_sproc ( $1,$2,$3,$4,$5,$6,$7,$8,$9 )
-[ RECORD 3 ]----+----------------------------------------------------------
datid            | 16414
datname          | prod_wh_ef_main_db
pid              | 31659
usesysid         | 25509794
usename          | zomcat_p4810
application_name |
client_addr      | 10.228.2.41
client_hostname  |
client_port      | 46032
backend_start    | 2014-12-08 19:15:51.126914+01
xact_start       |
query_start      | 2014-12-08 19:28:29.45225+01
state_change     | 2014-12-08 19:28:29.452271+01
waiting          | f
state            | idle
query            | SELECT pg_advisory_unlock(42)
-[ RECORD 4 ]----+----------------------------------------------------------
datid            | 16414
datname          | prod_wh_ef_main_db
pid              | 31662
usesysid         | 25509794
usename          | zomcat_p4810
application_name |
client_addr      | 10.228.2.41
client_hostname  |
client_port      | 46034
backend_start    | 2014-12-08 19:15:51.210825+01
xact_start       |
query_start      | 2014-12-08 19:28:44.452563+01
state_change     | 2014-12-08 19:28:44.452578+01
waiting          | f
state            | idle
query            | SELECT pg_advisory_unlock(42)


Does anybody have any ideas?

Regards,

-- Valentine

Re: BUG #12183: Memory leak in long running sessions

From: Tom Lane
Date:
valgog@gmail.com writes:
> We experience situations in which some of the sessions (in our case the
> oldest ones) do not give the memory back.

You have not shown any evidence of an actual problem.  In particular,
if you are looking at ps RSS output and claiming that there's a leak,
you are probably simply wrong.  The output shown here looks like normal
behavior of the RSS stat: it does not count shared memory pages for a
particular process until that process has touched the individual pages.
So the usual behavior of long-lived PG processes is that the reported
RSS starts small and gradually grows until it includes all of shared
memory ... and that looks like what you've got here, especially since
the larger RSS numbers are pretty similar to the VSZ numbers which are
nearly common across all the backends.

If you had some individual processes with RSS/VSZ greatly exceeding
your shared memory allocation, then I'd believe you had a leak problem.

            regards, tom lane

Re: BUG #12183: Memory leak in long running sessions

From: Valentine Gogichashvili
Date:
Hello Tom,

the biggest evidence for me is that Committed_AS (in /proc/meminfo) grows to
the CommitLimit at some point, no memory is left to allocate on the system,
and sessions start to throw 'out of memory' exceptions.

Killing those old running sessions reduces the Committed_AS value (we get up
to 60GB free again). So I do not have any explanation other than a leak.
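
For reference, the two values can be read straight out of /proc/meminfo, for
example with a trivial sketch like this:

# Trivial sketch: read Committed_AS and CommitLimit (both reported in kB)
# from /proc/meminfo and print them in GB.
def commit_usage():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            values[key] = int(rest.split()[0])   # first field is the kB value
    return values["Committed_AS"], values["CommitLimit"]

committed_kb, limit_kb = commit_usage()
print("Committed_AS: %.1f GB of CommitLimit: %.1f GB"
      % (committed_kb / 1048576.0, limit_kb / 1048576.0))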

We have added more monitoring that samples the /proc/PID/statm values and
records "resident" minus "shared" for such processes.

We see the value grow and can correlate the growth with calls to one of our
stored procedures, which is quite heavy and processes a relatively large set
of data. It also creates a 50MB temp file.
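
The sampler is essentially the following (a simplified sketch; the 4 kB page
size is an assumption about our hosts, and the pid is just an example taken
from the ps output above):

# Simplified sketch of the sampler: every 30 seconds, read /proc/<pid>/statm
# and log "resident - shared" converted from pages to kB.
import sys
import time

PAGE_KB = 4  # assumption: 4 kB pages (check with `getconf PAGESIZE`)

def resident_minus_shared_kb(pid):
    with open("/proc/%d/statm" % pid) as f:
        # statm fields are in pages: size, resident, shared, ...
        size, resident, shared = (int(x) for x in f.read().split()[:3])
    return (resident - shared) * PAGE_KB

def sample(pid, interval=30):
    while True:
        print(int(time.time()), resident_minus_shared_kb(pid))
        sys.stdout.flush()
        time.sleep(interval)

if __name__ == "__main__":
    sample(39234)  # example: one long-lived backend from the ps excerpt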

Here is an example of the memory growth:

timestamp (epoch)   size (kB)
1418187057 79888
1418187087 80192
1418187117 85976
1418187147 86100
1418187177 88292
1418187207 88740
1418187237 380524
1418187267 719960
1418187297 555292
1418187327 560488
1418187357 563500
1418187387 569868
1418187417 573800
1418187447 576692
1418187477 582300
1418187507 584240
1418187537 586036
1418187567 586508
1418187597 586852
1418187628 587284
1418187658 589092
1418187688 589392
1418187718 601164
1418187748 602124
1418187778 605472
1418187808 606520
1418187838 608196
1418187868 609612
1418187898 612588
1418187928 614740
1418187958 616572
1418187988 630092
1418188018 630696
1418188048 632240
1418188078 634220
1418188108 636636
1418188138 637192
1418188168 638120
1418188198 640940
1418188229 642532
1418188259 645040
1418188289 827404
1418188319 662092
1418188349 662676
1418188379 662748
1418188409 663364
1418188439 663408
1418188469 663416
1418188499 663600
1418188529 664224
1418188559 664620
1418188589 666088
1418188619 667176
1418188649 669252
1418188680 669732
1418188710 670176
1418188740 670224
1418188770 676104
1418188800 684256
1418188830 804900
1418188860 1151124
1418188890 1151124
1418188920 949300
1418188950 950456
1418188980 950492
1418189010 950764
1418189040 951756
1418189070 953068

Can we collect some more information that would give more hints on what is
going on there?

Regards,

-- Valentine Gogichashvili

On Mon, Dec 8, 2014 at 8:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> valgog@gmail.com writes:
> > We experience situations in which some of the sessions (in our case the
> > oldest ones) do not give the memory back.
>
> You have not shown any evidence of an actual problem.  In particular,
> if you are looking at ps RSS output and claiming that there's a leak,
> you are probably simply wrong.  The output shown here looks like normal
> behavior of the RSS stat: it does not count shared memory pages for a
> particular process until that process has touched the individual pages.
> So the usual behavior of long-lived PG processes is that the reported
> RSS starts small and gradually grows until it includes all of shared
> memory ... and that looks like what you've got here, especially since
> the larger RSS numbers are pretty similar to the VSZ numbers which are
> nearly common across all the backends.
>
> If you had some individual processes with RSS/VSZ greatly exceeding
> your shared memory allocation, then I'd believe you had a leak problem.
>
>                         regards, tom lane
>

Re: BUG #12183: Memory leak in long running sessions

From: Tom Lane
Date:
Valentine Gogichashvili <valgog@gmail.com> writes:
> We see the value grow and can correlate the growth with calls to one of our
> stored procedures, which is quite heavy and processes a relatively large set
> of data. It also creates a 50MB temp file.

So, now that you've identified something you think is leaking, how about a
reproducible test case?
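
Even a bare skeleton along these lines would do (all names, sizes, and the
psycopg2 driver here are only placeholders; the essential ingredients are one
long-lived connection, a heavy call repeated many times, and the backend's
memory sampled between calls):

# Skeleton of a reproducible test case (placeholder names and sizes).
import psycopg2

SETUP = """
    DROP TABLE IF EXISTS bug12183_data;
    CREATE TABLE bug12183_data AS
        SELECT g AS id, repeat('x', 1000) AS payload
          FROM generate_series(1, 1000000) g;
    CREATE OR REPLACE FUNCTION bug12183_heavy() RETURNS bigint
        LANGUAGE sql AS
        $$ SELECT count(*) FROM bug12183_data a JOIN bug12183_data b USING (id) $$;
"""

def run(dsn, iterations=1000):
    conn = psycopg2.connect(dsn)   # one long-lived session, as in the report
    cur = conn.cursor()
    cur.execute(SETUP)
    conn.commit()
    for _ in range(iterations):
        cur.execute("SELECT bug12183_heavy()")
        conn.commit()
        # sample this backend's /proc/<pid>/statm here, as described above
    conn.close()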

            regards, tom lane

Re: BUG #12183: Memory leak in long running sessions

From: Kevin Grittner
Date:
Valentine Gogichashvili <valgog@gmail.com> wrote:

> Can we collect some more information that would give more hints
> on what is going on there?

As Tom already said, a reproducible test case (i.e., a script which
a developer can run starting from an empty database) which
demonstrates the problem would be ideal.  If you can't produce
that, install smem (if it's not already installed) and run this at
intervals (beginning, middle, and end would probably suffice):

smem -trs uss

The other thing that would give valuable information would be to
set overcommit_memory to 2 and overcommit_ratio to something like
80.  If there is a memory leak in PostgreSQL, it will probably dump
a memory map when an allocation attempt fails.  It would also be
great if you could enable core dumps for postgres and get a
backtrace from the point of failure in this configuration.
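
On Linux you can confirm the current overcommit settings by reading the
corresponding /proc/sys/vm files, for example:

# Quick check that the suggested overcommit settings are in effect.
for name in ("overcommit_memory", "overcommit_ratio"):
    with open("/proc/sys/vm/%s" % name) as f:
        print("vm.%s = %s" % (name, f.read().strip()))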

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company