Thread: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint


PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

From:
Achilleas Mantzios
Date:
On Sunday morning I started getting email warnings about running out of space on our pg_wal dir (which is a symlink to a separate disk).

As the size was about to reach 100% of the disk (100GB) I shut down the database in order to investigate without the pressure of a potentially worse situation. Only to see that the restart solved the
problem. I had not kept track of the total number of files in pg_wal during the previous days, but normally they are about 550-600 files (see relevant settings below), so since the info in the log
regarding checkpoints refers to the changes done by those checkpoints I am not sure about the number of files before the 85% threshold at which our monitoring system starts to send out alerts. But
doing the calculations, this 85% is reached when the number of files exceeds 5000.
The removal of wal files (according to checkpoint log entries) stopped some time in late Friday.

  [782] 5bbc6127.30e 2018-11-02 22:59:20.015 EET  @ line:5560 LOG: checkpoint complete: wrote 7758 buffers (0.7%); 0 WAL file(s) added, 5 removed, 21 recycled; write=777.496 s, sync=0.006 s, total=777.554 s; sync files=152, longest=0.002 s, average=0.000 s; distance=108549 kB, estimate=412619 kB
  [782] 5bbc6127.30e 2018-11-02 23:13:14.493 EET  @ line:5562 LOG: checkpoint complete: wrote 7104 buffers (0.7%); 0 WAL file(s) added, 13 removed, 18 recycled; write=712.311 s, sync=0.007 s, total=712.379 s; sync files=190, longest=0.001 s, average=0.000 s; distance=473908 kB, estimate=473908 kB
  [782] 5bbc6127.30e 2018-11-02 23:26:10.157 EET  @ line:5564 LOG: checkpoint complete: wrote 7623 buffers (0.7%); 0 WAL file(s) added, 0 removed, 20 recycled; write=638.089 s, sync=0.006 s, total=638.152 s; sync files=245, longest=0.006 s, average=0.000 s; distance=716742 kB, estimate=716742 kB
  [782] 5bbc6127.30e 2018-11-02 23:41:24.253 EET  @ line:5566 LOG: checkpoint complete: wrote 7277 buffers (0.7%); 0 WAL file(s) added, 0 removed, 0 recycled; write=652.984 s, sync=0.002 s, total=653.014 s; sync files=195, longest=0.001 s, average=0.000 s; distance=326408 kB, estimate=677709 kB
  [782] 5bbc6127.30e 2018-11-02 23:47:25.473 EET  @ line:5568 LOG: checkpoint complete: wrote 1138 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=114.098 s, sync=0.002 s, total=114.110 s; sync files=189, longest=0.002 s, average=0.000 s; distance=84617 kB, estimate=618400 kB

No deletions happened on Saturday, nor on Sunday until the shutdown. Then the shutdown gave:
  [782] 5bbc6127.30e 2018-11-04 10:49:55.571 EET  @ line:5916 LOG: checkpoint starting: shutdown immediate
....
  [782] 5bbc6127.30e 2018-11-04 10:50:04.838 EET  @ line:5917 LOG: checkpoint complete: wrote 2206 buffers (0.2%); 0 WAL file(s) added, 5128 removed, 105 recycled; write=0.038 s, sync=0.023 s, total=9.268 s; sync files=129, longest=0.014 s, average=0.000 s; distance=261973 kB, estimate=620287 kB

The removal of those 5128 files freed 80GB of space (and the used space is still low as we speak, after about 180GB of traffic since the incident).

Our current settings are :

wal_keep_segments = 512
max_wal_size = 2GB
min_wal_size = 1GB

Our setup is as follows :

primary (smadb) <--> (no replication slot) physical hot standby (smadb2) (managed via repmgr) <--> (replication slot) barman
                 ^--> (replication slot) logical subscriber (testsmadb)
                 ^--> wal archiving to host (sma) (via /usr/bin/rsync -a --delay-updates %p sma:/smadb/pgsql/pitr/%f )

I checked for any errors in communications or delays, but could not find any. No ERRORs indicating any problem with the archive command in the logs, nor any abnormal lag in either the physical standby,
the logical subscriber, or any replication slot. And none of those two servers had any outages. We have two alternate methods of backups, one old with pg_start/stop_backup and one with barman. I
checked that all backups were successfully completed with pg_stop_backup. (barman does not interface the primary)
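
(For reference, the kind of check I mean is roughly the following, against the PostgreSQL 10 catalogs; this is a sketch, not our exact monitoring queries:)

-- Standby / subscriber lag as seen from the primary
SELECT application_name, state, sent_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- Archiver status: last archived vs. last failed segment
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;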

This is the first time we are facing this. No change happened in the load or configuration since the deployment of 10 on 1st July, besides the upgrade to 10.5 (in September).

Any thoughts on this?

-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt



Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

From:
Rui DeSousa
Date:

> On Nov 5, 2018, at 6:24 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>
>
> Our current settings are :
>
> wal_keep_segments = 512
> max_wal_size = 2GB
> min_wal_size = 1GB
>
> Our setup is as follows :

The settings seem counterintuitive; if you're using standard 16MB WAL files then the keep parameter is at 8GB but max_wal_size is at 2GB; that seems counterproductive to me and would cause more checkpoints than needed.

How often are your checkpoints occurring and why, time or log? What’s your checkpoint_timeout set to?
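
For instance, something along these lines would show both the relevant settings and whether recent checkpoints were triggered by timeout or by WAL volume (just a sketch, run on the primary):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'max_wal_size', 'min_wal_size', 'wal_keep_segments');

-- checkpoints_timed vs. checkpoints_req hints at what is driving them
SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;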


> primary (smadb) <--> (no replication slot) physical hot standby (smadb2) (managed via repmgr) <--> (replication slot) barman
>                 ^--> (replication slot) logical subscriber (testsmadb)
>                 ^--> wal archiving to host (sma) (via /usr/bin/rsync -a --delay-updates %p sma:/smadb/pgsql/pitr/%f )


Did you check the status of both the replication slots and archiving status?

> No ERRORs indicating any problem with the archive command in the logs,


Postgres is not going to log an error if the archive command fails; I believe it is up to your archive command to log the error.

I would suspect it might have been your archive command.  Could you verify that you have all the WAL files? I've seen a case in a 9.2 environment where the startup removed files that were not yet archived, thus losing WAL files and breaking the backup.

It would be great if you could double-check to see if you have all the WAL files (no gaps) and report back.
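
For example, on a 10 server something like this lists the segments currently in pg_wal in sequence, so a gap in the hex numbering stands out; the same idea applies to the archive directory listing (a sketch, needs superuser):

SELECT name
FROM pg_ls_waldir()
WHERE name ~ '^[0-9A-F]{24}$'   -- plain WAL segments only; skips .history/.backup files
ORDER BY name;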




Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

From:
Achilleas Mantzios
Date:
On 5/11/18 6:48 μ.μ., Rui DeSousa wrote:

On Nov 5, 2018, at 6:24 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:


Our current settings are :

wal_keep_segments = 512
max_wal_size = 2GB
min_wal_size = 1GB

Our setup is as follows :
The settings seem counterintuitive; 
This email was not sent in order to discuss WAL GUC settings but rather the actual problem in hand.
We are using postgresql since 2001 and this is the first time we faced such a (rather serious) issue.
if you're using standard 16MB WAL files then the keep parameter is at 8GB but max_wal_size is at 2GB; that seems counterproductive to me and would cause more checkpoints than needed.
wal_keep_segments covers the case where one has to provide some safety (by keeping at least this number of WALs) for replication clients when replication slots are not in use.
checkpoint_timeout / max_wal_size control checkpoints. All of them, plus other conditions, feed the algorithm which decides how many files to keep in pg_wal. That's what I am trying to figure out here.
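
(To see concretely how much WAL the slot part of that algorithm is pinning at any given moment, a query like this works on 10; just a sketch:)

SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;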

How often are your checkpoints occurring and why, time or log? What’s your checkpoint_timeout set to? 


primary (smadb) <--> (no replication slot) physical hot standby (smadb2) (managed via repmgr) <--> (replication slot) barman
                 ^--> (replication slot) logical subscriber (testsmadb)
                 ^--> wal archiving to host (sma) (via /usr/bin/rsync -a --delay-updates %p sma:/smadb/pgsql/pitr/%f )

Did you check the status of both the replication slots and archiving status? 

No ERRORs indicating any problem with the archive command in the logs,

Postgres is not going to log an error if the archive command fails; I believe it is up to your archive command to log the error.
No, PostgreSQL will complain. Normally (in 10) you get something like: LOG:  archive command failed with exit code .....
In previous versions the log level was even more severe, IIRC.
I would suspect it might have been your archive command.  Could you verify that you have all the WAL files? I’ve seen a case in a 9.2 environment where the startup removed files that were not yet archived thus losing WAL files and breaking the backup.  

It would be great if you could double-check to see if you have all the WAL files (no gaps) and report back.
Absolute continuity.

Remember: the PostgreSQL checkpointer decided to remove 5000+ files before shutdown. Any conditions that were keeping those files afloat should also hold at this point, right?
The question is why PostgreSQL didn't remove them earlier.





-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

From:
Laurenz Albe
Date:
Achilleas Mantzios wrote:
> On Sunday morning I started getting email warnings about running out of space on our pg_wal dir (which is a symlink to a separate disk).
> 
> As the size was about to reach 100% of the disk (100GB) I shut down the database in order to investigate without the pressure of a potentially worse situation. Only to see that the restart solved the
> problem. I had not kept track of the total number of files in pg_wal during the previous days, but normally they are about 550-600 files (see relevant settings below), so since the info in the log
> regarding checkpoints refers to the changes done by those checkpoints I am not sure about the number of files before the 85% threshold at which our monitoring system starts to send out alerts. But
> doing the calculations, this 85% is reached when the number of files exceeds 5000.
> The removal of wal files (according to checkpoint log entries) stopped some time in late Friday.
> 
>   [782] 5bbc6127.30e 2018-11-02 22:59:20.015 EET  @ line:5560 LOG: checkpoint complete: wrote 7758 buffers (0.7%); 0 WAL file(s) added, 5 removed, 21 recycled; write=777.496 s, sync=0.006 s, total=777.554 s; sync files=152, longest=0.002 s, average=0.000 s; distance=108549 kB, estimate=412619 kB
>   [782] 5bbc6127.30e 2018-11-02 23:13:14.493 EET  @ line:5562 LOG: checkpoint complete: wrote 7104 buffers (0.7%); 0 WAL file(s) added, 13 removed, 18 recycled; write=712.311 s, sync=0.007 s, total=712.379 s; sync files=190, longest=0.001 s, average=0.000 s; distance=473908 kB, estimate=473908 kB
>   [782] 5bbc6127.30e 2018-11-02 23:26:10.157 EET  @ line:5564 LOG: checkpoint complete: wrote 7623 buffers (0.7%); 0 WAL file(s) added, 0 removed, 20 recycled; write=638.089 s, sync=0.006 s, total=638.152 s; sync files=245, longest=0.006 s, average=0.000 s; distance=716742 kB, estimate=716742 kB
>   [782] 5bbc6127.30e 2018-11-02 23:41:24.253 EET  @ line:5566 LOG: checkpoint complete: wrote 7277 buffers (0.7%); 0 WAL file(s) added, 0 removed, 0 recycled; write=652.984 s, sync=0.002 s, total=653.014 s; sync files=195, longest=0.001 s, average=0.000 s; distance=326408 kB, estimate=677709 kB
>   [782] 5bbc6127.30e 2018-11-02 23:47:25.473 EET  @ line:5568 LOG: checkpoint complete: wrote 1138 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=114.098 s, sync=0.002 s, total=114.110 s; sync files=189, longest=0.002 s, average=0.000 s; distance=84617 kB, estimate=618400 kB
> 
> No deletions happened on Saturday, nor on Sunday until the shutdown. Then the shutdown gave:
>   [782] 5bbc6127.30e 2018-11-04 10:49:55.571 EET  @ line:5916 LOG: checkpoint starting: shutdown immediate
> ....
>   [782] 5bbc6127.30e 2018-11-04 10:50:04.838 EET  @ line:5917 LOG: checkpoint complete: wrote 2206 buffers (0.2%); 0 WAL file(s) added, 5128 removed, 105 recycled; write=0.038 s, sync=0.023 s, total=9.268 s; sync files=129, longest=0.014 s, average=0.000 s; distance=261973 kB, estimate=620287 kB
> 
> The removal of those 5128 files freed 80GB of space (and the used space is still low as we speak, after about 180GB of traffic since the incident).
> 
> Our current settings are :
> 
> wal_keep_segments = 512
> max_wal_size = 2GB
> min_wal_size = 1GB

If you set log_min_messages to DEBUG2, you should get a log message like

   attempting to remove WAL segments older than log file ...

during each checkpoint.  That would help to determine what is going on.
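
For example (this raises the level cluster-wide, so remember to reset it once you have captured a few checkpoints):

ALTER SYSTEM SET log_min_messages = 'debug2';
SELECT pg_reload_conf();

-- and afterwards:
ALTER SYSTEM RESET log_min_messages;
SELECT pg_reload_conf();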

The traditional things to hold back WAL segment removal are wal_keep_segments and
replication slots, but that should also hold for a shutdown checkpoint.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com



Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

From:
Achilleas Mantzios
Date:
On 6/11/18 2:41 μ.μ., Laurenz Albe wrote:
> Achilleas Mantzios wrote:
>
> If you set log_min_messages to DEBUG2, you should get a log message like
>
>     attempting to remove WAL segments older than log file ...
>
> during each checkpoint.  That would help to determine what is going on.
Thank you Laurenz. Does DEBUG2 cause a great deal of excessive log lines to be written? Or is there any way to set log_min_messages to DEBUG2 only for the checkpointer (e.g. by kill -HUP of the checkpointer)?
> The traditional things to hold back WAL segment removal are wal_keep_segments and
> replication slots, but that should also hold for a shutdown checkpoint.
wal_keep_segments is unchanged. We have only one replication slot running for the subscriber (testsmadb) (which is usually ultra verbose about replication problems) and there was nothing in the logs
prior to the incident (or many hours before the incident) in the primary (smadb) or the subscriber to indicate anything regarding those two.
Moreover, we have monitoring in place to take metrics on the replication lag. No mail alert relevant to this.
No emails/log msgs indicating anything abnormal on smadb2 either (the non-replication-slot physical standby) .

But the connection from smadb2 to the barman host went downhill this morning; I don't know if this was still the case on Sunday or if the two incidents are connected. Today we had a much more serious
situation. Our traffic increased by a factor of almost 10 and the network speed also fell by a factor of 10. But at least, today I explained everything (I hope).
 

>
> Yours,
> Laurenz Albe


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt



Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

From:
Tom Lane
Date:
Achilleas Mantzios <achill@matrix.gatewaynet.com> writes:
> Remember: the PostgreSQL checkpointer decided to remove 5000+ files before shutdown. Any conditions that were keeping those files afloat should also hold at this point, right?
> The question is why PostgreSQL didn't remove them earlier.

WAL files get removed/recycled after completion of a checkpoint.  So
apparently, checkpoints were not finishing during normal operation,
but the shutdown checkpoint managed to terminate normally.  That
eliminates a lot of the usual theories about why checkpoints might
not be succeeding (like a dirty buffer that always fails to be
written, say as a result of broken permissions on its file).
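
A low-impact cross-check, for what it's worth: pg_control_checkpoint() (available since 9.6) reports the last completed checkpoint, so its checkpoint_time should keep advancing during normal operation. For example:

SELECT checkpoint_lsn, redo_lsn, checkpoint_time
FROM pg_control_checkpoint();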

The only theory that comes to mind is that the checkpointer process
was stuck somehow, but just "soft" stuck, in a way that allowed the
postmaster's time-to-shut-down-please signal to unstick it.  No,
I have no idea how that could happen exactly.  If it happens again,
it'd be really interesting to attach to the checkpointer with a
debugger and collect a stack trace.

            regards, tom lane


Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

From:
Achilleas Mantzios
Date:
On 6/11/18 4:58 μ.μ., Tom Lane wrote:
> Achilleas Mantzios <achill@matrix.gatewaynet.com> writes:
>> Remember: the PostgreSQL checkpointer decided to remove 5000+ files before shutdown. Any conditions that were keeping those files afloat should also hold at this point, right?
>> The question is why PostgreSQL didn't remove them earlier.
> WAL files get removed/recycled after completion of a checkpoint.  So
> apparently, checkpoints were not finishing during normal operation,
> but the shutdown checkpoint managed to terminate normally.  That
> eliminates a lot of the usual theories about why checkpoints might
> not be succeeding (like a dirty buffer that always fails to be
> written, say as a result of broken permissions on its file).
>
> The only theory that comes to mind is that the checkpointer process
> was stuck somehow, but just "soft" stuck, in a way that allowed the
> postmaster's time-to-shut-down-please signal to unstick it.  No,
> I have no idea how that could happen exactly.  If it happens again,
> it'd be really interesting to attach to the checkpointer with a
> debugger and collect a stack trace.

Thank you Tom. Should I also re-run configure --enable-debug && make clean install ? Initially PostgreSQL was built without --enable-debug, so currently it does not show source code line numbers in gdb bt.

>
>             regards, tom lane
>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt



Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

From:
Tom Lane
Date:
Achilleas Mantzios <achill@matrix.gatewaynet.com> writes:
> On 6/11/18 4:58 μ.μ., Tom Lane wrote:
>> I have no idea how that could happen exactly.  If it happens again,
>> it'd be really interesting to attach to the checkpointer with a
>> debugger and collect a stack trace.

> Thank you Tom. Should I also re-run configure --enable-debug && make clean install ? Initially PostgreSQL was built without --enable-debug, so currently it does not show source code line numbers in gdb bt.

Yeah, if you do not have debug symbols installed, the stack trace
will be of very little use.

            regards, tom lane


Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

From:
Achilleas Mantzios
Date:
On 6/11/18 4:58 μ.μ., Tom Lane wrote:
> Achilleas Mantzios <achill@matrix.gatewaynet.com> writes:
>> Remember: the PostgreSQL checkpointer decided to remove 5000+ files before shutdown. Any conditions that were keeping those files afloat should also hold at this point, right?
>> The question is why PostgreSQL didn't remove them earlier.
> WAL files get removed/recycled after completion of a checkpoint.  So
> apparently, checkpoints were not finishing during normal operation,
> but the shutdown checkpoint managed to terminate normally.  That
> eliminates a lot of the usual theories about why checkpoints might
> not be succeeding (like a dirty buffer that always fails to be
> written, say as a result of broken permissions on its file).
>
> The only theory that comes to mind is that the checkpointer process
> was stuck somehow, but just "soft" stuck, in a way that allowed the
> postmaster's time-to-shut-down-please signal to unstick it.  No,
> I have no idea how that could happen exactly.  If it happens again,
> it'd be really interesting to attach to the checkpointer with a
> debugger and collect a stack trace.

Hello Tom,
I enabled debugging as you suggested.
In the meantime yesterday (also a Sunday) we had an identical incident as far as symptoms are concerned, only that in this (yesterday's) case I fully explained it, and it is reported here:
https://www.postgresql.org/message-id/ae8812c3-d138-73b7-537a-a273e15ef6e1%40matrix.gatewaynet.com
What happened is that the walsender died on Friday, the replication slot got stuck, the server kept WAL files, the pg_wal disk became full (on Sunday morning), and the system PANIC'ed:

10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG: terminating walsender process due to replication timeout
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT: slot "data_for_testsmadb_pub", output plugin "pgoutput", in the change callback, associated LSN 13DF/393BF7F0
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.066 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG: disconnection: session time: 49:47:17.937 user=repmgr database=dynacom host=10.9.0.77 port=48650

(so this problem appeared 2 days after the walsender died)

So, I tried to find evidence that would connect the two incidents (this and the previous Sunday) but could not find anything in either the publisher (primary) or the subscriber logs that would match a
walsender or logical replication worker terminating or exiting. *But* I found this in the primary's log back from 2018-11-01:

10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.424 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG:  could not send data to client: Connection reset by peer
10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.424 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT: slot "data_for_testsmadb_pub", output plugin "pgoutput", in the commit callback, associated LSN 1290/6ADB518
10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.434 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG:  disconnection: session time: 29:50:03.592 user=repmgr database=dynacom host=10.9.0.77 port=47489
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.476 EET [unknown] [unknown]@[unknown] line:1 LOG:  connection received: host=10.9.0.77 port=47612
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.477 EET [unknown] repmgr@dynacom line:2 LOG:  replication connection authorized: user=repmgr
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.478 EET data_for_testsmadb_pub repmgr@dynacom line:3 LOG:  received replication command: IDENTIFY_SYSTEM
......
above pattern repeats 9 more times
(so this problem appeared 3 days after this connection reset)

In this case the walsender did not terminate, but I wonder if it caused this "soft" stuck state in the replication slot?
On this primary server we have only one replication slot (the one for the logical subscriber node), and archiving has worked without problems for years. So I cannot think of another reason why WALs were
accumulating in pg_wal.
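
Going forward, a periodic check along these lines (a sketch; the 8GB threshold is arbitrary) should catch a slot whose restart_lsn stops advancing, e.g. after a walsender death, well before pg_wal fills up:

SELECT s.slot_name,
       s.active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn)) AS retained_wal,
       r.state AS walsender_state
FROM pg_replication_slots s
LEFT JOIN pg_stat_replication r ON r.pid = s.active_pid
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) > pg_size_bytes('8GB');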

>
>             regards, tom lane
>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt