Discussion: OOM Killing on Docker while ANALYZE running


OOM Killing on Docker while ANALYZE running

From: Jorge Daniel
Date:

Hi guys, I'm dealing with OOM killing on PostgreSQL 9.4.8 running on Docker.



Docker: version 17.05.0

"resources_template = Resources(cpu = 1,ram = 1 * GB, disk = 1 * GB)"


Using Ceph volumes as storage.
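
For reference, the kernel log further down shows the container's cgroup capped at 1 GiB of memory and 2 GiB of memory+swap. Assuming the scheduler template above maps onto the usual Docker memory flags, a plain docker run that reproduces the same limits would look roughly like this (the image name is only a placeholder):

$ docker run -d --name pg-oom-test \
      --memory=1g --memory-swap=2g \
      postgres:9.4    # placeholder image; any 9.4.x build behaves the same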



------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 5079992
max total shared memory (kbytes) = 5079992
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384


postgres@d90b66757bb1:~$ ipcs -a

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x0052e2c1 0          postgres   600        56         5
0x6b8b4567 32769      postgres   600        2028       5

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x0052e2c1 0          postgres   600        17
0x0052e2c2 32769      postgres   600        17
0x0052e2c3 65538      postgres   600        17
0x0052e2c4 98307      postgres   600        17
0x0052e2c5 131076     postgres   600        17
0x0052e2c6 163845     postgres   600        17
0x0052e2c7 196614     postgres   600        17

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages


The Postgres installation is:
 
PostgreSQL 9.4.8 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit


From postgresql.conf :

shared_buffers = 500MB                    # min 128kB
temp_buffers = 80MB                     # min 800kB
work_mem = 80MB                         # min 64kB
maintenance_work_mem = 80MB              # min 1MB
autovacuum_work_mem = 300MB             # min 1MB, or -1 to use maintenance_work_mem
effective_cache_size = 500MB
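
For completeness, the memory-related knobs (plus the statistics target that turns out to matter later in this thread) can be pulled in one query; this is just a convenience sketch, not part of the original report:

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'temp_buffers', 'work_mem',
               'maintenance_work_mem', 'autovacuum_work_mem',
               'effective_cache_size', 'default_statistics_target')
ORDER BY name;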





While running 

 /usr/bin/pgbench --initialize --fillfactor=100 --no-vacuum --scale=1000 pgbench
.........
100000000 of 100000000 tuples (100%) done (elapsed 189.29 s, remaining 0.00 s).
set primary keys...
done.

2018-01-25 10:19:19.652 PST rhost=[local] app=psql:user=postgres:db=postgres:LOG:  disconnection: session time: 0:00:14.824 user=postgres database=postgres host=[local]
2018-01-25 10:19:33.371 PST rhost=[local] app=[unknown]:user=[unknown]:db=[unknown]:LOG:  connection received: host=[local]
2018-01-25 10:19:33.371 PST rhost=[local] app=[unknown]:user=postgres:db=pgbench:LOG:  connection authorized: user=postgres database=pgbench
2018-01-25 10:24:55.640 PST rhost=[local] app=pgbench:user=postgres:db=pgbench:LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp812.0", size 1073741824
2018-01-25 10:24:55.640 PST rhost=[local] app=pgbench:user=postgres:db=pgbench:STATEMENT:  alter table pgbench_accounts add primary key (aid)
2018-01-25 10:24:55.677 PST rhost=[local] app=pgbench:user=postgres:db=pgbench:LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp812.1", size 928219136
2018-01-25 10:24:55.677 PST rhost=[local] app=pgbench:user=postgres:db=pgbench:STATEMENT:  alter table pgbench_accounts add primary key (aid)
2018-01-25 10:24:55.752 PST rhost=[local] app=pgbench:user=postgres:db=pgbench:LOG:  disconnection: session time: 0:05:22.381 user=postgres database=pgbench host=[local]
2018-01-25 10:26:29.937 PST rhost= app=:user=:db=:LOG:  server process (PID 2709) was terminated by signal 9: Killed
2018-01-25 10:26:29.937 PST rhost= app=:user=:db=:DETAIL:  Failed process was running: autovacuum: ANALYZE public.pgbench_accounts
2018-01-25 10:26:29.937 PST rhost= app=:user=:db=:LOG:  terminating any other active server processes
2018-01-25 10:26:29.937 PST rhost= app=:user=:db=:WARNING:  terminating connection because of crash of another server process
2018-01-25 10:26:29.937 PST rhost= app=:user=:db=:DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-01-25 10:26:29.937 PST rhost= app=:user=:db=:HINT:  In a moment you should be able to reconnect to the database and repeat your command.


pgbench=# select count(*) from pgbench_accounts ;
   count
-----------
 100000000
(1 row)

pgbench=# SELECT pg_size_pretty(pg_total_relation_size('pgbench_accounts'));
 pg_size_pretty
----------------
 15 GB
(1 row)


The kern.log shows: 

Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818867] watch invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=0-1, order=0, oom_score_adj=0
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818868] watch cpuset=d90b66757bb18658502690ae68d4ea365f922ed40855c15b5eb5d074be38c223 mems_allowed=0-1
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818873] CPU: 29 PID: 1911120 Comm: watch Tainted: G           OE   4.10.0-42-generic #46~16.04.1-Ubuntu
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818874] Hardware name: Supermicro SYS-F628R3-RTB+-SS30C/X10DRFR, BIOS 2.0c 04/13/2017
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818875] Call Trace:
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818881]  dump_stack+0x63/0x90
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818886]  dump_header+0x7b/0x1fd
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818891]  ? mem_cgroup_scan_tasks+0xc4/0xf0
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818895]  oom_kill_process+0x219/0x3e0
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818897]  out_of_memory+0x120/0x4b0
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818898]  mem_cgroup_out_of_memory+0x4b/0x80
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818900]  mem_cgroup_oom_synchronize+0x325/0x340
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818901]  ? memory_high_write+0xe0/0xe0
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818902]  pagefault_out_of_memory+0x36/0x80
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818907]  mm_fault_error+0x8f/0x190
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818909]  __do_page_fault+0x4b2/0x4e0
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818910]  do_page_fault+0x22/0x30
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818914]  page_fault+0x28/0x30
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818916] RIP: 0033:0x7fedbf439830
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818916] RSP: 002b:00007ffeb543c288 EFLAGS: 00010202
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818917] RAX: 00007fedbf7c1de0 RBX: 00007fedbf7c1de0 RCX: 00007fedbf4bef20
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818918] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00007fedbf57c209
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818918] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818919] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000001
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818919] R13: ffffffffffffffff R14: 00000000ffffffff R15: 00000000ffffffff
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818920] Task in /docker/d90b66757bb18658502690ae68d4ea365f922ed40855c15b5eb5d074be38c223 killed as a result of limit of /docker/d90b66757bb18658502690ae68d4ea365f922ed40855c15b5eb5d074be38c223
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818924] memory: usage 1048576kB, limit 1048576kB, failcnt 7864663
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818924] memory+swap: usage 1048576kB, limit 2097152kB, failcnt 0
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818925] kmem: usage 42116kB, limit 9007199254740988kB, failcnt 0
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818925] Memory cgroup stats for /docker/d90b66757bb18658502690ae68d4ea365f922ed40855c15b5eb5d074be38c223: cache:2344KB rss:1004116KB rss_huge:0KB mapped_file:256KB dirty:4KB writeback:0KB swap:0KB inactive_anon:518092KB active_anon:486012KB inactive_file:1436KB active_file:732KB unevictable:0KB
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.818940] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819358] [1877175]     0 1877175     4899       72      13       3        0             0 docker-entrypoi
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819361] [1877277]     0 1877277    12050       93      27       3        0             0 su
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819362] [1877279]    26 1877279   164354      335      51       4        0             0 postgres
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819363] [1877284]    26 1877284   164598      473      51       4        0             0 postgres
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819364] [1877285]    26 1877285   164417      335      49       4        0             0 postgres
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819366] [1877286]    26 1877286   164417      335      49       4        0             0 postgres
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819367] [1877287]    26 1877287   164517      388      50       4        0             0 postgres
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819368] [1877288]    26 1877288    26624      335      49       4        0             0 postgres
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819369] [1877289]    26 1877289    26659      351      50       4        0             0 postgres
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819385] [1878251]     0 1878251     4947      126      13       3        0             0 bash
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819388] [1911120]     0 1911120     3519      459      13       3        0             0 watch
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819404] [1943707]     0 1943707     4946      125      15       3        0             0 bash
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819411] [1946711]     0 1946711    12050       96      29       3        0             0 su
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819412] [1946712]    26 1946712     4545      122      14       3        0             0 bash
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819435] [2015235]    26 2015235   422880   249393     540       5        0             0 postgres
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.819443] Memory cgroup out of memory: Kill process 2015235 (postgres) score 953 or sacrifice child
Jan 25 10:26:29 lon-r10-u17 kernel: [1737264.828834] Killed process 2015235 (postgres) total-vm:1691520kB, anon-rss:997572kB, file-rss:0kB, shmem-rss:0kB
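
The "memory: usage 1048576kB, limit 1048576kB" line above is the container's cgroup hitting its hard cap. A quick way to confirm those limits from inside the container (cgroup v1 paths, as used by this kernel; the memsw file only exists when swap accounting is enabled):

$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes        # hard limit (1 GiB here)
$ cat /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes  # memory+swap limit (2 GiB here)
$ cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes    # high-water mark since the cgroup was created
$ cat /sys/fs/cgroup/memory/memory.failcnt               # how many times the limit was hit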

I replayed the scenario again, this time monitoring the memory used inside the container, and it really shows a jump when I fire the ANALYZE:

pgbench=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 1639345 of 1639345 pages, containing 100000000 live rows and 0 dead rows; 3000000 rows in sample, 100000000 estimated total rows
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Meanwhile:
$ while true; do cat /sys/fs/cgroup/memory/memory.usage_in_bytes; sleep 1; done

82620416
82821120
82780160
83017728
82903040
82993152
83054592
82894848
82964480
83017728
675348480  ------>>ANALYZE 
1073606656
1073700864
1073684480
1073479680
1073704960
1073598464
1073713152
1073590272
1073733632
1073610752
1073467392
1073729536
1073717248
1073582080
1073594368
1073598464
1073741824
1073700864
1073623040
1073700864
1073610752
1073487872
1073496064
1073565696
1073717248
1073238016
1073717248
1073717248
1073647616
1073676288
1073524736
1073692672
1073717248
1073631232
1073696768
1073496064
1073627136
1073627136
1073635328
1073561600
1073545216
710078464 ------> KILL
85487616
85274624
85237760

It's clear that the whole container memory was in fact filled up by something happening during the ANALYZE, but if work_mem is set, why is it breaking out of its memory limits? Has anyone seen this before?

The actual case is happening in my prod clusters, with large tables.
I replicated the issue with pgbench to have a simpler scenario.

Any thoughts?



Re: OOM Killing on Docker while ANALYZE running

From: Alvaro Herrera
Date:
Jorge Daniel wrote:
> Hi guys , I'm dealing with OOM killing on Postgresql 9.4.8 running on docker ,

What is your statistics target?  analyze is supposed to acquire samples
of the data, not the whole table ...

> pgbench=# analyze verbose pgbench_accounts;
> INFO:  analyzing "public.pgbench_accounts"
> INFO:  "pgbench_accounts": scanned 1639345 of 1639345 pages, containing 100000000 live rows and 0 dead rows; 3000000 rows in sample, 100000000 estimated total rows
 

Here it seems to be saying that it reads all 1.6 million pages ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: OOM Killing on Docker while ANALYZE running

From: Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Jorge Daniel wrote:
>> Hi guys , I'm dealing with OOM killing on Postgresql 9.4.8 running on docker ,

> What is your statistics target?  analyze is supposed to acquire samples
> of the data, not the whole table ...

>> pgbench=# analyze verbose pgbench_accounts;
>> INFO:  analyzing "public.pgbench_accounts"
>> INFO:  "pgbench_accounts": scanned 1639345 of 1639345 pages, containing 100000000 live rows and 0 dead rows; 3000000 rows in sample, 100000000 estimated total rows

The standard setting of default_statistics_target would only result in
trying to acquire a 30000-row sample, so the problem here is having
cranked that up beyond what the machine can sustain.  AFAIR we do not
consider maintenance_work_mem as imposing a limit on the sample size.

            regards, tom lane
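
To put numbers on that: ANALYZE collects 300 rows per unit of statistics target, so a target of 10000 means a 3,000,000-row sample held in memory at once (exactly the "3000000 rows in sample" reported above). With pgbench_accounts rows around 100 bytes on disk, plus the per-tuple copies and bookkeeping ANALYZE keeps for the sample, a back-of-the-envelope estimate already lands in the same ballpark as the 1 GiB container limit. The ~250 bytes/row figure below is an assumption for illustration, not a measurement:

SELECT 300 * 10000                                  AS sample_rows,
       pg_size_pretty((300 * 10000)::bigint * 250)  AS rough_sample_memory;  -- about 715 MB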


Re: OOM Killing on Docker while ANALYZE running

From: Jorge Daniel
Date:

Hi Alvaro:


These are our values:


default_statistics_target
---------------------------
 10000

 
 relpages | reltuples
----------+-----------
  1639345 |     1e+08


                      Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers | Storage  | Stats target | Description
----------+---------------+-----------+----------+--------------+-------------
 aid      | integer       | not null  | plain    |              |
 bid      | integer       |           | plain    |              |
 abalance | integer       |           | plain    |              |
 filler   | character(84) |           | extended |              |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
Options: fillfactor=100
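
The empty "Stats target" column in the \d+ output above means no per-column override is set, so every column follows default_statistics_target. The same information can be read straight from the catalog (a quick sketch; attstattarget = -1 means "use the default"):

SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'pgbench_accounts'::regclass
  AND attnum > 0
  AND NOT attisdropped;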



I guess you hit the nail on the head:


pgbench=# set default_statistics_target =10;
SET
pgbench=# analyze verbose pgbench_accounts ;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 3000 of 1639345 pages, containing 183000 live rows and 0 dead rows; 3000 rows in sample, 100000000 estimated total rows
ANALYZE
pgbench=# set default_statistics_target =100;
SET
pgbench=# analyze verbose pgbench_accounts ;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 30000 of 1639345 pages, containing 1830000 live rows and 0 dead rows; 30000 rows in sample, 100000001 estimated total rows
ANALYZE
pgbench=# set default_statistics_target =1000;
SET
pgbench=# analyze verbose pgbench_accounts ;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 300000 of 1639345 pages, containing 18300000 live rows and 0 dead rows; 300000 rows in sample, 100000008 estimated total rows
ANALYZE
pgbench=# set default_statistics_target =10000;
SET
pgbench=# analyze verbose pgbench_accounts ;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 1639345 of 1639345 pages, containing 100000000 live rows and 0 dead rows; 3000000 rows in sample, 100000000 estimated total rows
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Kind regards 


Jorge Daniel Fernandez 



From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Sent: Thursday, January 25, 2018 4:47 PM
To: Jorge Daniel
Cc: pgsql-admin@lists.postgresql.org; pgsql-admin@postgresql.org; dyuryeva@medallia.com
Subject: Re: OOM Killing on Docker while ANALYZE running
 
Jorge Daniel wrote:
> Hi guys , I'm dealing with OOM killing on Postgresql 9.4.8 running on docker ,

What is your statistics target?  analyze is supposed to acquire samples
of the data, not the whole table ...

> pgbench=# analyze verbose pgbench_accounts;
> INFO:  analyzing "public.pgbench_accounts"
> INFO:  "pgbench_accounts": scanned 1639345 of 1639345 pages, containing 100000000 live rows and 0 dead rows; 3000000 rows in sample, 100000000 estimated total rows

Here it seems to be saying that it reads all 1.6 million pages ...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: OOM Killing on Docker while ANALYZE running

From: Ray Stell
Date:
On 1/25/18 3:48 PM, Tom Lane wrote:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> Jorge Daniel wrote:
>>> Hi guys , I'm dealing with OOM killing on Postgresql 9.4.8 running on docker

Mike Coleman's talk at DockerCon17 is here:
http://training.play-with-docker.com/ops-stage1/
At about the 38-minute mark, on the topic of transitioning enterprise apps to Docker:
"define your guardrails, leave the db out in the VM." Mike is a Docker evangelist.



Re: OOM Killing on Docker while ANALYZE running

From: Mark Kirkwood
Date:
On 26/01/18 15:51, Ray Stell wrote:

> On 1/25/18 3:48 PM, Tom Lane wrote:
>
>> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>>> Jorge Daniel wrote:
>>>> Hi guys , I'm dealing with OOM killing on Postgresql 9.4.8 running 
>>>> on docker
>
> Mike Coleman's talk at DockerCon17 is here:
> http://training.play-with-docker.com/ops-stage1/
> At about the 38 minute mark, on the topic of transitioning enterprise 
> apps to docker:
> "define your guardrails,  leave the db out in the VM."  Mike is a 
> docker evangelist.
>
>

Hmm - I don't think Docker per se is the problem here - that default_statistics_target of 10000 is the more likely culprit. A minimal allowance of resources (e.g. memory in this case) combined with hugely over-spec'ed memory use (the default_statistics_target setting) is going to trigger OOM in any environment.

regards

Mark
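
A sketch of the obvious mitigation, assuming the global target of 10000 is not really needed on every column: drop the cluster-wide default back to something modest and raise it only where the planner genuinely needs finer statistics (the column below is just the pgbench example from this thread):

-- cluster-wide: back to the shipped default
ALTER SYSTEM SET default_statistics_target = 100;
SELECT pg_reload_conf();

-- per column, only where it is actually needed
ALTER TABLE pgbench_accounts ALTER COLUMN aid SET STATISTICS 1000;
ANALYZE VERBOSE pgbench_accounts;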