BUG #15438: Standby corruption after "Too many open files in system"error

Поиск
Список
Период
Сортировка
От PG Bug reporting form
Тема BUG #15438: Standby corruption after "Too many open files in system"error
Дата
Msg-id 15438-bbe4679a595d795f@postgresql.org
обсуждение исходный текст
Ответы Re: BUG #15438: Standby corruption after "Too many open files insystem" error  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Список pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15438
Logged by:          Juanjo Santamaría
Email address:      juanjo.santamaria@gmail.com
PostgreSQL version: 10.4
Operating system:   Redhat 7.3
Description:

After some time working under poor conditions, the open file limit was too
low for the postgres user, the streaming replica ended in a corrupted
state.

The primary was working the whole time and recreating the standby replica,
after configuring the user limits, seems to solve the issue.

The log is quite verbose, so I have tried to summarize at least one
occurrency for every error:

Oct 15 10:47:54 postgres[9507]: [9-1] LOG:  started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:48:38 postgres[9507]: [10-1] FATAL:  could not receive data from
WAL stream: ERROR:  epoll_create1 failed: Too many open files in system
Oct 15 10:48:38 postgres[40098]: [15-1] LOG:  record with incorrect
prev-link 6/A1A993C8 at 6/BEA99400
Oct 15 10:48:38 postgres[9550]: [9-1] FATAL:  could not connect to the
primary server: server closed the connection unexpectedly
Oct 15 10:48:38 postgres[9550]: [9-2] #011#011This probably means the server
terminated abnormally
Oct 15 10:48:38 postgres[9550]: [9-3] #011#011before or while processing the
request.

Oct 15 10:48:53 postgres[9586]: [9-1] LOG:  started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:49:32 postgres[9586]: [10-1] FATAL:  could not receive data from
WAL stream: ERROR:  epoll_create1 failed: Too many open f
iles in system
Oct 15 10:49:32 postgres[40098]: [16-1] LOG:  invalid record length at
6/BEABD0F8: wanted 24, got 0

Oct 15 10:50:31 postgres[9765]: [9-1] LOG:  started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:50:39 postgres[9765]: [10-1] FATAL:  could not receive data from
WAL stream: ERROR:  epoll_create1 failed: Too many open files in system
Oct 15 10:50:39 postgres[40098]: [19-1] LOG:  invalid resource manager ID
101 at 6/BEAE3708
Oct 15 10:51:01 postgres[9767]: [9-1] FATAL:  could not connect to the
primary server: server closed the connection unexpectedly
Oct 15 10:51:01 postgres[9767]: [9-2] #011#011This probably means the server
terminated abnormally
Oct 15 10:51:01 postgres[9767]: [9-3] #011#011before or while processing the
request.

Oct 15 10:51:06 postgres[9783]: [9-1] LOG:  started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:51:39 postgres[9783]: [10-1] FATAL:  could not receive data from
WAL stream: ERROR:  epoll_create1 failed: Too many open files in system
Oct 15 10:51:39 postgres[40098]: [20-1] LOG:  record with incorrect
prev-link 3E/746F6F72 at 6/BEC3D5E8

Oct 15 11:03:39 postgres[10850]: [9-1] FATAL:  could not connect to the
primary server: FATAL:  could not open relation mapping file
"global/pg_filenode.map": Too many open files in system

Oct 15 12:00:10 postgres[40101]: [3732-1] LOG:  restartpoint complete: wrote
61 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 re
cycled; write=6.159 s, sync=0.018 s, total=6.223 s; sync files=16,
longest=0.006 s, average=0.001 s; distance=451 kB, estimate=2215 kB
Oct 15 12:01:40 postgres[15182]: [10-1] FATAL:  could not receive data from
WAL stream: ERROR:  epoll_create1 failed: Too many open
files in system
Oct 15 12:01:40 postgres[40098]: [80-1] LOG:  invalid resource manager ID 64
at 6/BEE34F00
Oct 15 12:01:42 postgres[15711]: [9-1] LOG:  started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 12:05:38 postgres[15711]: [10-1] FATAL:  could not receive data from
WAL stream: ERROR:  epoll_create1 failed: Too many open files in system
Oct 15 12:05:38 postgres[40098]: [81-1] LOG:  unexpected pageaddr 6/A1E6E000
in log segment 0000000300000006000000BE, offset 15130624

Oct 16 01:34:15 postgres[18890]: [9-1] LOG:  started streaming WAL from
primary at 6/C3000000 on timeline 3
Oct 16 01:34:15 postgres[18890]: [10-1]  FATAL:  could not receive data from
WAL stream: ERROR:  could not open file "pg_wal/0000000300000006000000C3":
Too many open files in system

Oct 16 17:58:44 postgres[38030]: [9-1] FATAL:  the database system is
starting up
Oct 16 17:58:47 postgres[38029]: [9-1] LOG:  entering standby mode
Oct 16 17:58:47 postgres[38029]: [10-1] LOG:  redo starts at 6/CF9B3598
Oct 16 17:58:47 postgres[38029]: [11-1] LOG:  consistent recovery state
reached at 6/CF9C53B8
Oct 16 17:58:47 postgres[38026]: [8-1] LOG:  database system is ready to
accept read only connections
Oct 16 17:58:48 postgres[38029]: [12-1] FATAL:  invalid memory alloc request
size 1715044352
Oct 16 17:58:48 postgres[38026]: [9-1] LOG:  startup process (PID 38029)
exited with exit code 1
Oct 16 17:58:48 postgres[38026]: [10-1] LOG:  terminating any other active
server processes
Oct 16 17:58:48 postgres[38026]: [11-1] LOG:  database system is shut down


В списке pgsql-bugs по дате отправления:

Предыдущее
От: Euler Taveira
Дата:
Сообщение: Re: parallel queries hang
Следующее
От: Tom Lane
Дата:
Сообщение: Re: NOTIFY does not work as expected