Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

From: Achilleas Mantzios
Subject: Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint
Date:
Msg-id: 91736c68-3837-cf51-19cb-11fe0f096cef@matrix.gatewaynet.com
In response to: Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-admin
On 6/11/18 4:58 PM, Tom Lane wrote:
> Achilleas Mantzios <achill@matrix.gatewaynet.com> writes:
>> Remember : the postgresql checkpointer decided to remove 5000+ files before shutdown. If any conditions were keeping
>> those files afloat, they should also hold at this point, right?
>> The question is why didn't PostgreSQL remove them earlier.
> WAL files get removed/recycled after completion of a checkpoint.  So
> apparently, checkpoints were not finishing during normal operation,
> but the shutdown checkpoint managed to terminate normally.  That
> eliminates a lot of the usual theories about why checkpoints might
> not be succeeding (like a dirty buffer that always fails to be
> written, say as a result of broken permissions on its file).
>
> The only theory that comes to mind is that the checkpointer process
> was stuck somehow, but just "soft" stuck, in a way that allowed the
> postmaster's time-to-shut-down-please signal to unstick it.  No,
> I have no idea how that could happen exactly.  If it happens again,
> it'd be really interesting to attach to the checkpointer with a
> debugger and collect a stack trace.

Hello Tom,
I enabled debugging as you suggested.
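For next time, what I have in mind is roughly the following checks (pg_stat_bgwriter and pg_stat_activity.backend_type are
standard in 10.x; watching the counters and then attaching gdb to the reported PID is just my plan, nothing I have run yet):

-- are checkpoints completing at all? these counters should keep increasing
SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;

-- PID of the checkpointer, to attach the debugger to (gdb -p <pid>, then "bt")
SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'checkpointer';
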
In the meantime, yesterday (also a Sunday) we had an incident with identical symptoms, only that in this (yesterday's) case
I fully explained it, and it is reported here:

https://www.postgresql.org/message-id/ae8812c3-d138-73b7-537a-a273e15ef6e1%40matrix.gatewaynet.com
What happened is that the walsender died on Friday, the replication slot got stuck, the server kept WAL files, the pg_wal
disk filled up (on Sunday morning), and the system PANIC'ed:

10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG:  terminating walsender process due to replication timeout
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT:  slot "data_for_testsmadb_pub", output plugin "pgoutput", in the change callback, associated LSN 13DF/393BF7F0
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.066 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG:  disconnection: session time: 49:47:17.937 user=repmgr database=dynacom host=10.9.0.77 port=48650

(so this problem appeared 2 days after the walsender died)
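
For context, the way I check how much WAL a slot is holding back on the primary is roughly the query below (the view and
function names are standard 10.x; the query is just my own sanity check, not something taken from the incident logs):

SELECT slot_name, slot_type, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;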

So, I tried to find evidence that would connect the two incidents (this Sunday's and the previous Sunday's) but could not
find anything in either the publisher (primary) or the subscriber logs that would match a wal sender or logical replication
worker terminating or exiting. *But* I found this in the primary's log back from 2018-11-01 :

10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.424 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG:  could not send data to client: Connection reset by peer
10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.424 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT:  slot "data_for_testsmadb_pub", output plugin "pgoutput", in the commit callback, associated LSN 1290/6ADB518
10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.434 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG:  disconnection: session time: 29:50:03.592 user=repmgr database=dynacom host=10.9.0.77 port=47489
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.476 EET [unknown] [unknown]@[unknown] line:1 LOG:  connection received: host=10.9.0.77 port=47612
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.477 EET [unknown] repmgr@dynacom line:2 LOG:  replication connection authorized: user=repmgr
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.478 EET data_for_testsmadb_pub repmgr@dynacom line:3 LOG:  received replication command: IDENTIFY_SYSTEM
......
the above pattern repeats 9 more times
(so this problem appeared 3 days after this connection reset)
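
As a side note, the quick sanity check I would run after a connection reset like the above, just to confirm the walsender
came back and is streaming again, is something like this (standard pg_stat_replication columns in 10.x; nothing here is
taken from the incident itself):

SELECT pid, application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, backend_start
FROM pg_stat_replication;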

In this case the wal sender did not terminate, but I wonder if it caused this "soft" stuck state in the replication slot?
On this primary server we have only one replication slot (the one to the logical subscriber node), and archiving has worked
without problems for years. So I cannot think of another reason why WALs were accumulating in pg_wal.
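
(To rule out archiving I look at pg_stat_archiver, a standard view in 10.x; treating a growing failed_count or a stale
last_archived_time as a sign of stuck archiving is my own reading of it:

SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

Here everything looks normal.)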

>
>             regards, tom lane
>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt


