Thread: [ADMIN] Too many WAL file created


[ADMIN] Too many WAL file created

From: Gabriel Fernández Martínez
Date:

Hi,

 

I am working on a project that uses PostgreSQL, and our application crashes from time to time due to lack of disk space, because pg_xlog has consumed all the available space.

 

I have ruled out a problem with the archiving process, and the main problem appears to be related to the number of WAL files created per minute. We are creating more than two WAL files per minute, as you can see:

[DEV] [root@ip-10-175-17-84 /data/node1/pg_xlog]# ls -lrt

total 82208

-rw-------. 1 postgres postgres 16777216 mar 10 14:15 000000010000001C000000D7

-rw-------. 1 postgres postgres 16777216 mar 10 14:15 000000010000001C000000D8

-rw-------. 1 postgres postgres 16777216 mar 10 14:15 000000010000001C000000D9

-rw-------. 1 postgres postgres 16777216 mar 10 14:16 000000010000001C000000DA

drwx------. 2 postgres postgres   286720 mar 10 14:16 archive_status

-rw-------. 1 postgres postgres 16777216 mar 10 14:17 000000010000001C000000DB

But I cannot identify the transaction responsible for this growth.
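For scale, a rough back-of-the-envelope calculation, assuming the default 16 MB segment size (which matches the 16777216-byte files in the listing above):

```shell
# Each WAL segment is 16 MB; at roughly 2 segments per minute:
segments_per_min=2
mb_per_min=$((segments_per_min * 16))
gb_per_day=$((mb_per_min * 60 * 24 / 1024))
echo "${mb_per_min} MB/min, ~${gb_per_day} GB/day"   # prints: 32 MB/min, ~45 GB/day
```

So at this rate the cluster writes on the order of 45 GB of WAL per day, which explains how the volume fills up.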

 

WAL parameters with non-default values in place in the database are:

wal_level = 'archive'

archive_mode = on

archive_command = '/usr/bin/true'

 

In order to identify the transaction responsible for this behavior I have tried two approaches:

1º Increase database logging to “all”: I collected 97,906 SELECTs in two minutes but no INSERT/UPDATE/DELETE. The log parameters in use were:

log_destination = 'syslog,csvlog'

logging_collector = on                  # Enable capturing of stderr and csvlog

log_directory = 'pg_log'                # directory where log files are written,

log_filename = 'postgresql-%Y-%m-%d.log'

log_line_prefix = '< %m >'                      # special values:

log_timezone = 'Europe/Madrid'

log_statement = 'ddl'

                an example of the captured SELECT statements:

2017-03-10 13:59:57.319 CET,"app","app",11919,"10.175.17.84:33895",58c2a02f.2e8f,1,"SELECT",2017-03-10 13:46:39 CET,4/4352895,0,LOG,00000,"execute S_1: SELECT

        id,

        locked_ts

        FROM table

        WHERE

        id = $1","parameters: $1 = 'TOPIC_CONSUMER'",,,,,,,,""

2º Use pg_xlogdump; I gathered output like this:

rmgr: Heap        len (rec/tot):     14/   240, tx:   54704944, lsn: 1C/B9000038, prev 1C/B8FFFFD8, desc: UPDATE off 1 xmax 54704944 ; new off 13 xmax 0, blkref #0: rel 16431/598016/598033 blk 2709

rmgr: Btree       len (rec/tot):      2/    72, tx:   54704944, lsn: 1C/B9000128, prev 1C/B9000038, desc: INSERT_LEAF off 26, blkref #0: rel 16431/598016/598040 blk 3560

rmgr: Btree       len (rec/tot):      2/    72, tx:   54704944, lsn: 1C/B9000170, prev 1C/B9000128, desc: INSERT_LEAF off 103, blkref #0: rel 16431/598016/598042 blk 727

rmgr: Btree       len (rec/tot):      2/    64, tx:   54704944, lsn: 1C/B90001B8, prev 1C/B9000170, desc: INSERT_LEAF off 268, blkref #0: rel 16431/598016/598043 blk 1441

rmgr: Transaction len (rec/tot):      8/    34, tx:   54704944, lsn: 1C/B90001F8, prev 1C/B90001B8, desc: COMMIT 2017-03-10 13:58:57.494289 CET

rmgr: Heap2       len (rec/tot):      8/    56, tx:          0, lsn: 1C/B9000220, prev 1C/B90001F8, desc: CLEAN remxid 54702701, blkref #0: rel 16431/598016/598033 blk 15966

rmgr: Heap        len (rec/tot):     14/   241, tx:   54704945, lsn: 1C/B9000258, prev 1C/B9000220, desc: UPDATE off 33 xmax 54704945 ; new off 8 xmax 0, blkref #0: rel 16431/598016/598033 blk 15966

rmgr: Btree       len (rec/tot):      2/  3589, tx:   54704945, lsn: 1C/B9000350, prev 1C/B9000258, desc: INSERT_LEAF off 3, blkref #0: rel 16431/598016/598040 blk 1567 FPW

rmgr: Btree       len (rec/tot):      2/    72, tx:   54704945, lsn: 1C/B9001158, prev 1C/B9000350, desc: INSERT_LEAF off 103, blkref #0: rel 16431/598016/598042 blk 727

rmgr: Btree       len (rec/tot):      2/    64, tx:   54704945, lsn: 1C/B90011A0, prev 1C/B9001158, desc: INSERT_LEAF off 269, blkref #0: rel 16431/598016/598043 blk 1441

rmgr: Transaction len (rec/tot):      8/    34, tx:   54704945, lsn: 1C/B90011E0, prev 1C/B90011A0, desc: COMMIT 2017-03-10 13:58:57.546523 CET

 

Any suggestion to address this problem is welcome.

 

Regards

 

Gabriel

Re: [ADMIN] Too many WAL file created

From: Albe Laurenz
Date:
Gabriel Fernández Martínez wrote:
> I am working on a project that uses PostgreSQL, and our application crashes from time to time due to lack of
> disk space, because pg_xlog has consumed all the available space.
> 
> I have discarded a problem with the archiving process, and the main problem appear to be related with
> the amount of WAL files created per minute. I can show that we are creating more than two WAL files
> per minute like you can see:
> 
> [DEV] [root@ip-10-175-17-84 /data/node1/pg_xlog]# ls -lrt
> total 82208
> -rw-------. 1 postgres postgres 16777216 mar 10 14:15 000000010000001C000000D7
> -rw-------. 1 postgres postgres 16777216 mar 10 14:15 000000010000001C000000D8
> -rw-------. 1 postgres postgres 16777216 mar 10 14:15 000000010000001C000000D9
> -rw-------. 1 postgres postgres 16777216 mar 10 14:16 000000010000001C000000DA
> drwx------. 2 postgres postgres   286720 mar 10 14:16 archive_status
> -rw-------. 1 postgres postgres 16777216 mar 10 14:17 000000010000001C000000DB
> But I cannot identify the transaction responsible for this growth.
> 
> WAL parameters with non-default values in place in the database are:
> wal_level = 'archive'
> archive_mode = on
> archive_command = '/usr/bin/true'

You are right that you have discarded the archiving problem, because no archives
are written any more.

What is your backup strategy? If you are using anything other than "pg_dump",
this move has probably destroyed your backup process.

> In order to try to identify the transaction responsible for this behavior I have try to follow two
> approach:
> 
> 1º Increase database login to “all”, I collect 97.906 selects in two minutes but no
> insert/update/delete, log parameters in use has been:
>
> log_destination = 'syslog,csvlog'
> logging_collector = on                  # Enable capturing of stderr and csvlog
> log_directory = 'pg_log'                # directory where log files are written,
> log_filename = 'postgresql-%Y-%m-%d.log'
> log_line_prefix = '< %m >'                      # special values:
> log_timezone = 'Europe/Madrid'
> log_statement = 'ddl'

You should set the parameter "log_statement" to "all", then everything will be logged.

The way it is now, only statements that change tables and other objects will be logged.

>                 an example of the selct capture could be:
> 2017-03-10 13:59:57.319 CET,"app","app",11919,"10.175.17.84:33895",58c2a02f.2e8f,1,"SELECT",2017-03-10
> 13:46:39 CET,4/4352895,0,LOG,00000,"execute S_1: SELECT
>         id,
>         locked_ts
>         FROM table
>         WHERE
>         id = $1","parameters: $1 = 'TOPIC_CONSUMER'",,,,,,,,""

A SELECT statement like this will not cause a change to the database, and WAL is only
written when something in the database has changed.
So SELECT statements like this one cannot be the cause of your problem.

I have three suggestions:

1) Change "log_statement" to "all" and activate the changes with "pg_ctl reload".
   That will log all statements.
   You should find data modifying statements in there.
   Sometimes a SELECT statement can also change data, particularly when a function
   is called that modifies the database.

2) There might be a long running statement that gets logged only once, but runs for
   hours or days on end.  To find such a statement, run a query like

      SELECT datname, usename, query_start, query
      FROM pg_stat_activity
      WHERE state <> 'idle';

  If you find an INSERT, UPDATE or DELETE statement that has been running for a long
  time, you have found your problem.

3) Set "log_autovacuum_min_duration" to 0 and see if anything gets logged.
   It may be that there is an autovacuum background process that is busy cleaning up
   old rows in a large table.
   If you are on PostgreSQL 9.6, you can use the "pg_stat_progress_vacuum" view to see
   currently running autovacuum processes, on older versions you have to look at the
   processes on the operating system level.
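For suggestion 3, the setting can be changed without a restart; as a sketch (ALTER SYSTEM is available from PostgreSQL 9.4 on; on older versions, edit postgresql.conf and reload instead):

```sql
ALTER SYSTEM SET log_autovacuum_min_duration = 0;  -- log every autovacuum run
SELECT pg_reload_conf();                           -- apply without a restart
```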

> 2º Use the pg_xlogdump, and I gather something like that:
> 
> rmgr: Heap        len (rec/tot):     14/   240, tx:   54704944, lsn: 1C/B9000038, prev 1C/B8FFFFD8,
> desc: UPDATE off 1 xmax 54704944 ; new off 13 xmax 0, blkref #0: rel 16431/598016/598033 blk 2709
> rmgr: Btree       len (rec/tot):      2/    72, tx:   54704944, lsn: 1C/B9000128, prev 1C/B9000038,
> desc: INSERT_LEAF off 26, blkref #0: rel 16431/598016/598040 blk 3560
> rmgr: Btree       len (rec/tot):      2/    72, tx:   54704944, lsn: 1C/B9000170, prev 1C/B9000128,
> desc: INSERT_LEAF off 103, blkref #0: rel 16431/598016/598042 blk 727
> rmgr: Btree       len (rec/tot):      2/    64, tx:   54704944, lsn: 1C/B90001B8, prev 1C/B9000170,
> desc: INSERT_LEAF off 268, blkref #0: rel 16431/598016/598043 blk 1441
> rmgr: Transaction len (rec/tot):      8/    34, tx:   54704944, lsn: 1C/B90001F8, prev 1C/B90001B8,
> desc: COMMIT 2017-03-10 13:58:57.494289 CET
> rmgr: Heap2       len (rec/tot):      8/    56, tx:          0, lsn: 1C/B9000220, prev 1C/B90001F8,
> desc: CLEAN remxid 54702701, blkref #0: rel 16431/598016/598033 blk 15966
> rmgr: Heap        len (rec/tot):     14/   241, tx:   54704945, lsn: 1C/B9000258, prev 1C/B9000220,
> desc: UPDATE off 33 xmax 54704945 ; new off 8 xmax 0, blkref #0: rel 16431/598016/598033 blk 15966
> rmgr: Btree       len (rec/tot):      2/  3589, tx:   54704945, lsn: 1C/B9000350, prev 1C/B9000258,
> desc: INSERT_LEAF off 3, blkref #0: rel 16431/598016/598040 blk 1567 FPW
> rmgr: Btree       len (rec/tot):      2/    72, tx:   54704945, lsn: 1C/B9001158, prev 1C/B9000350,
> desc: INSERT_LEAF off 103, blkref #0: rel 16431/598016/598042 blk 727
> rmgr: Btree       len (rec/tot):      2/    64, tx:   54704945, lsn: 1C/B90011A0, prev 1C/B9001158,
> desc: INSERT_LEAF off 269, blkref #0: rel 16431/598016/598043 blk 1441
> rmgr: Transaction len (rec/tot):      8/    34, tx:   54704945, lsn: 1C/B90011E0, prev 1C/B90011A0,
> desc: COMMIT 2017-03-10 13:58:57.546523 CET

That tells you that the table where the changes are taking place is base/598016/598033 in the
PostgreSQL data directory.  You can use the "oid2name" utility to find out which table that is.
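Pulling those identifiers out of the pg_xlogdump output can be scripted; a sketch using the sample line above (the triple after "rel" is tablespace/database/filenode):

```shell
# Parse "blkref #0: rel <tablespace>/<database>/<filenode>" from a pg_xlogdump line
line='rmgr: Heap len (rec/tot): 14/240, tx: 54704944, blkref #0: rel 16431/598016/598033 blk 2709'
dboid=$(echo "$line" | sed -n 's|.*rel [0-9]*/\([0-9]*\)/[0-9]*.*|\1|p')
filenode=$(echo "$line" | sed -n 's|.*rel [0-9]*/[0-9]*/\([0-9]*\).*|\1|p')
echo "database OID: $dboid, relation filenode: $filenode"
```

The filenode can then be mapped back to a table name with oid2name, or from SQL in the right database with `SELECT relname FROM pg_class WHERE relfilenode = 598033;` (the database itself can be resolved via `SELECT datname FROM pg_database WHERE oid = 598016;`).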

Yours,
Laurenz Albe

Re: [ADMIN] Too many WAL file created

From: Gabriel Fernández Martínez
Date:
Hi Albe,

Thank you for your suggestions, they have been very useful. Regarding your tips:
1º Backup: You are right; in the meantime we are using pg_dump for backups until the issue is solved.
2º Logging: You are right, I had posted the current setup, not the log configuration used to gather the stats. I used
the suggested parameter (log_statement = 'all') and I only captured SELECT statements.
3º I agree, a SELECT cannot increase the size of the WAL files.

Regarding your suggestions:
1) Change "log_statement" to "all" and activate the changes with "pg_ctl reload": already done, and no UPDATE captured.
2) There might be a long running statement that gets logged only once: already checked, no long running transaction found.
3) Autovacuum: I had not tested this before. I enabled log_autovacuum_min_duration = 0, and no background process has
been identified.

I have used pg_class and pg_database to identify the relation that has been updated; thank you for pointing out
oid2name, I have learned another approach. With this information I have confirmed that all the operations were executed
on the same relation (22M updates in 3 days on the table plus three indexes), and I am working with the development
team on why the application shows this behavior.
 

I have found another approach to identify the relations that undergo the most changes: the statistics view
pg_stat_all_tables.
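As a sketch of that approach (the columns come from the pg_stat_all_tables view; the LIMIT is arbitrary):

```sql
-- Rank tables by write activity since the statistics were last reset
SELECT relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_all_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
```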

Best regards

-----Original Message-----
From: Albe Laurenz [mailto:laurenz.albe@wien.gv.at]
Sent: Friday, 10 March 2017 15:33
To: Gabriel Fernández Martínez <gfernandez@hotelbeds.com>; pgsql-admin@postgresql.org
Subject: RE: Too many WAL file created


Re: [ADMIN] Too many WAL file created

From: Albe Laurenz
Date:
Gabriel Fernández Martínez wrote:
> Thank you for your suggestions has been so usefull. Regarding your tips:
> 1º Backup: You are right, we are using meanwhile a pg_dump to have a backup till the issue could be
> solved.
> 2º Logging: You are right, I include the current setup, not the log conf in use to gather the stats, I
> used the suggested parameter (log_statement = all) and I only capture SELECT statements
> 3º I agree, a Select cannot increase the size of the WAL file.
> 
> Regarding your suggestions:
> 1) Change "log_statement" to "all" and activate the changes with "pg_ctl reload": already done an no
> update captured.
> 2) There might be a long running statement that gets logged only once: already done, no long running
> transaction
> 3) Autovacuum, I have not tested before, I have enabled log_autovacuum_min_duration = 0, no background
> process has been identified.

Then you have conclusively proved that WAL is not growing for this database cluster.

The only remaining options are:
- You are looking at the wrong PostgreSQL database cluster.
- You missed the statements in the log.
  Note that statements other than UPDATE also create WAL, such as INSERT, DELETE and COPY.
  Also note that SQL is case insensitive.

Yours,
Laurenz Albe