Обсуждение: archive falling behind

Поиск
Список
Период
Сортировка

archive falling behind

От
German Becker
Дата:
Hi I am working with version 9.1.9. The dabase is in archiving mode. The archives are then sended to a hotstandby replication sever.
I have seen that under heavy load, the archive process will fall behind the WAL generation and thuse some WAL segments won't get archived so the replication stops. To recover from this I need to do a new cold backup and send it to the replication server. Is there a way to limit the WAL generation (i.e slowing down insterts and deletes) so as to prevent the archive from falling behind?

Thanks!!

Re: archive falling behind

От
Strahinja Kustudić
Дата:
How can the archive process fall behind? Postgres will never reuse WAL files which are not yet archived.

Regarding your question about slowing down WAL generation, that is not possible to do, unless you slow down the application which is doing the writing into the database.

Regards,
Strahinja

On Mon, Apr 22, 2013 at 6:03 PM, German Becker <german.becker@gmail.com> wrote:
Hi I am working with version 9.1.9. The dabase is in archiving mode. The archives are then sended to a hotstandby replication sever.
I have seen that under heavy load, the archive process will fall behind the WAL generation and thuse some WAL segments won't get archived so the replication stops. To recover from this I need to do a new cold backup and send it to the replication server. Is there a way to limit the WAL generation (i.e slowing down insterts and deletes) so as to prevent the archive from falling behind?

Thanks!!

Re: archive falling behind

От
German Becker
Дата:
Strahinja,

From my experience, postgres will delete WAL (after checkpoint) regardless if they have been archived. Are you saying this is abnormal?


On Thu, Apr 25, 2013 at 7:45 PM, Strahinja Kustudić <strahinjak@nordeus.com> wrote:
How can the archive process fall behind? Postgres will never reuse WAL files which are not yet archived.

Regarding your question about slowing down WAL generation, that is not possible to do, unless you slow down the application which is doing the writing into the database.

Regards,
Strahinja

On Mon, Apr 22, 2013 at 6:03 PM, German Becker <german.becker@gmail.com> wrote:
Hi I am working with version 9.1.9. The dabase is in archiving mode. The archives are then sended to a hotstandby replication sever.
I have seen that under heavy load, the archive process will fall behind the WAL generation and thuse some WAL segments won't get archived so the replication stops. To recover from this I need to do a new cold backup and send it to the replication server. Is there a way to limit the WAL generation (i.e slowing down insterts and deletes) so as to prevent the archive from falling behind?

Thanks!!


Re: archive falling behind

От
Albe Laurenz
Дата:
German Becker wrote:
> From my experience, postgres will delete WAL (after checkpoint) regardless if they have been archived.
> Are you saying this is abnormal?

That would be quite abnormal.
Could it be that your archive_command has exit status 0
even if something goes wrong?

What are the archive settings?

Yours,
Laurenz Albe


Re: archive falling behind

От
German Becker
Дата:
Here is the archive part of the config:

archive_mode = on               # allows archiving to be done
                                # (change requires restart)
archive_command = '/var/lib/postgresql/scripts/archive_copy.sh %p %f'           # command to use to archive a logfile segment
#archive_timeout = 0            # force a logfile segment switch after this
                                # number of seconds; 0 disables


The archive coommand makes a local copy and then it copies to the backup server via ssh. Both copies are md5-checked and retried up to 3 times in case of failure. 

I have seen under heavy load that some WALs are skipped, some have less size, some are corrupted (i,e, the loop fails 3 times). 
I'm not sure about the return value (checking it). What is the expected behaviour of the archiver? Will it retry de archive if archive command returns differnt than 0? Will it retain the WAL segment until it is succesfuly archived?


On Fri, Apr 26, 2013 at 9:53 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
German Becker wrote:
> From my experience, postgres will delete WAL (after checkpoint) regardless if they have been archived.
> Are you saying this is abnormal?

That would be quite abnormal.
Could it be that your archive_command has exit status 0
even if something goes wrong?

What are the archive settings?

Yours,
Laurenz Albe

Re: archive falling behind

От
Federico
Дата:
The shell script call used in the archive_command is a bad idea. 

Each time a new segment is archived a new shell is started and this add a be massive overload, then and you have an extra overload for the ssh transfer.....

I suggest you to stick with the simple cp command with the test option from the manual then transfer the archived segments in a second time using a more reliable system like rsync.

Cheers
Federico


On 26 April 2013 14:06, German Becker <german.becker@gmail.com> wrote:
Here is the archive part of the config:

archive_mode = on               # allows archiving to be done
                                # (change requires restart)
archive_command = '/var/lib/postgresql/scripts/archive_copy.sh %p %f'           # command to use to archive a logfile segment
#archive_timeout = 0            # force a logfile segment switch after this
                                # number of seconds; 0 disables


The archive coommand makes a local copy and then it copies to the backup server via ssh. Both copies are md5-checked and retried up to 3 times in case of failure. 

I have seen under heavy load that some WALs are skipped, some have less size, some are corrupted (i,e, the loop fails 3 times). 
I'm not sure about the return value (checking it). What is the expected behaviour of the archiver? Will it retry de archive if archive command returns differnt than 0? Will it retain the WAL segment until it is succesfuly archived?


On Fri, Apr 26, 2013 at 9:53 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
German Becker wrote:
> From my experience, postgres will delete WAL (after checkpoint) regardless if they have been archived.
> Are you saying this is abnormal?

That would be quite abnormal.
Could it be that your archive_command has exit status 0
even if something goes wrong?

What are the archive settings?

Yours,
Laurenz Albe




--
Federico Campoli
DE MATERIALIZING, UK, Planet Earth, The Milky Way Galaxy
/*******************************
There's no point being grown-up if you can't be childish sometimes.
(The fourth Doctor)
http://www.pgdba.co.uk
*******************************/

Re: archive falling behind

От
Albe Laurenz
Дата:
German Becker wrote:
> Here is the archive part of the config:
>
> archive_mode = on               # allows archiving to be done
>                                 # (change requires restart)
> archive_command = '/var/lib/postgresql/scripts/archive_copy.sh %p %f'           # command to use to
> archive a logfile segment
> #archive_timeout = 0            # force a logfile segment switch after this
>                                 # number of seconds; 0 disables

So the problem might be in that script.

> The archive coommand makes a local copy and then it copies to the backup server via ssh. Both copies
> are md5-checked and retried up to 3 times in case of failure.

archive_command should not retry the operation, but rather
return a non-zero return code.

> I have seen under heavy load that some WALs are skipped, some have less size, some are corrupted (i,e,
> the loop fails 3 times).
> I'm not sure about the return value (checking it). What is the expected behaviour of the archiver?
> Will it retry de archive if archive command returns differnt than 0? Will it retain the WAL segment
> until it is succesfuly archived?

See http://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL

archive_command should exit with zero only if the
WAL segment was archived successfully.
PostgreSQL will retry and retain the WAL segment until
archival succeeds.

Yours,
Laurenz Albe


Re: archive falling behind

От
Strahinja Kustudić
Дата:
What does your shell script do that you need a script for archiving? The archive command is usually a cp/scp/rsync command, you usually don't need more than that.

Regards,
Strahinja


On Fri, Apr 26, 2013 at 3:55 PM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
German Becker wrote:
> Here is the archive part of the config:
>
> archive_mode = on               # allows archiving to be done
>                                 # (change requires restart)
> archive_command = '/var/lib/postgresql/scripts/archive_copy.sh %p %f'           # command to use to
> archive a logfile segment
> #archive_timeout = 0            # force a logfile segment switch after this
>                                 # number of seconds; 0 disables

So the problem might be in that script.

> The archive coommand makes a local copy and then it copies to the backup server via ssh. Both copies
> are md5-checked and retried up to 3 times in case of failure.

archive_command should not retry the operation, but rather
return a non-zero return code.

> I have seen under heavy load that some WALs are skipped, some have less size, some are corrupted (i,e,
> the loop fails 3 times).
> I'm not sure about the return value (checking it). What is the expected behaviour of the archiver?
> Will it retry de archive if archive command returns differnt than 0? Will it retain the WAL segment
> until it is succesfuly archived?

See http://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL

archive_command should exit with zero only if the
WAL segment was archived successfully.
PostgreSQL will retry and retain the WAL segment until
archival succeeds.

Yours,
Laurenz Albe

Re: archive falling behind

От
German Becker
Дата:
Hi I have reverted to cp as archive command, but know under heavy load (> 150 WAL segments  in a minute) it happens that some wal segments gets corrupted:

postgres@lemur:~/9.1/main/pg_xlog$ md5sum 000000010000001000000049
f1906d2745224430f811496df466203f  000000010000001000000049
postgres@lemur:~/9.1/main/pg_xlog$ md5sum ~/backups/wal/000000010000001000000049
7e73fe759e41e427497360a815f9d3e1  /var/lib/postgresql/backups/wal/000000010000001000000049





On Fri, Apr 26, 2013 at 10:55 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
German Becker wrote:
> Here is the archive part of the config:
>
> archive_mode = on               # allows archiving to be done
>                                 # (change requires restart)
> archive_command = '/var/lib/postgresql/scripts/archive_copy.sh %p %f'           # command to use to
> archive a logfile segment
> #archive_timeout = 0            # force a logfile segment switch after this
>                                 # number of seconds; 0 disables

So the problem might be in that script.

> The archive coommand makes a local copy and then it copies to the backup server via ssh. Both copies
> are md5-checked and retried up to 3 times in case of failure.

archive_command should not retry the operation, but rather
return a non-zero return code.

> I have seen under heavy load that some WALs are skipped, some have less size, some are corrupted (i,e,
> the loop fails 3 times).
> I'm not sure about the return value (checking it). What is the expected behaviour of the archiver?
> Will it retry de archive if archive command returns differnt than 0? Will it retain the WAL segment
> until it is succesfuly archived?

See http://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL

archive_command should exit with zero only if the
WAL segment was archived successfully.
PostgreSQL will retry and retain the WAL segment until
archival succeeds.

Yours,
Laurenz Albe

Re: archive falling behind

От
German Becker
Дата:
Actually this seems like a very strange filesystem /hw problem. The wal segments keep "changing" even after I stoped the database and noone is supposly accesing it:

root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# md5sum 000000010000001000000049
6fd36722641dc2857bb950437c052fa3  000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# md5sum 000000010000001000000049
26e9c82d123513528824bdf9815dbd2b  000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# md5sum 000000010000001000000049
649111a77ac7ec26f4ddeed18e039faa  000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# lsof 000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# md5sum 000000010000001000000049
ac9ba79e672bc5df2c126044e9054ff7  000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# md5sum 000000010000001000000049
8956e59a4542599e8ded7450b7cab5a6  000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# md5sum 000000010000001000000049
514dccfe7f5df4c55747e14e6c13268f  000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# md5sum 000000010000001000000049
f2c53795afcbc7c150443a3cdd3550bb  000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# md5sum 000000010000001000000049
79687effd43c0e51a127a677e14a815c  000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# md5sum 000000010000001000000049
51b66cd72ed3fb11aa57fab244696e0f  000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# md5sum 000000010000001000000049
bf1a2ec5847c40a0b9200769cff601e4  000000010000001000000049

root@lemur:/var/lib/postgresql/9.1/main/pg_xlog# lsof 000000010000001000000049
root@lemur:/var/lib/postgresql/9.1/main/pg_xlog#


Maybe this is off-topic but has anyone seen something like this? I'm on Ubuntu 12.04. This is the hard drive mount line (the hard drive is used exclusivly for the pg_xlog directory):

/dev/sdb1 on /storage/sdb1 type ext4 (rw,noatime,errors=remount-ro)

Thanks!


On Fri, Apr 26, 2013 at 4:25 PM, German Becker <german.becker@gmail.com> wrote:
Hi I have reverted to cp as archive command, but know under heavy load (> 150 WAL segments  in a minute) it happens that some wal segments gets corrupted:

postgres@lemur:~/9.1/main/pg_xlog$ md5sum 000000010000001000000049
f1906d2745224430f811496df466203f  000000010000001000000049
postgres@lemur:~/9.1/main/pg_xlog$ md5sum ~/backups/wal/000000010000001000000049
7e73fe759e41e427497360a815f9d3e1  /var/lib/postgresql/backups/wal/000000010000001000000049





On Fri, Apr 26, 2013 at 10:55 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
German Becker wrote:
> Here is the archive part of the config:
>
> archive_mode = on               # allows archiving to be done
>                                 # (change requires restart)
> archive_command = '/var/lib/postgresql/scripts/archive_copy.sh %p %f'           # command to use to
> archive a logfile segment
> #archive_timeout = 0            # force a logfile segment switch after this
>                                 # number of seconds; 0 disables

So the problem might be in that script.

> The archive coommand makes a local copy and then it copies to the backup server via ssh. Both copies
> are md5-checked and retried up to 3 times in case of failure.

archive_command should not retry the operation, but rather
return a non-zero return code.

> I have seen under heavy load that some WALs are skipped, some have less size, some are corrupted (i,e,
> the loop fails 3 times).
> I'm not sure about the return value (checking it). What is the expected behaviour of the archiver?
> Will it retry de archive if archive command returns differnt than 0? Will it retain the WAL segment
> until it is succesfuly archived?

See http://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL

archive_command should exit with zero only if the
WAL segment was archived successfully.
PostgreSQL will retry and retain the WAL segment until
archival succeeds.

Yours,
Laurenz Albe


Re: archive falling behind

От
"David F. Skoll"
Дата:
From: German Becker <german.becker@gmail.com>
Subject: Re: archive falling behind

> Actually this seems like a very strange filesystem /hw problem. The
> wal segments keep "changing" even after I stoped the database and
> noone is supposly accesing it:

I've seen this before.  It was bad RAM.  I bet it's a hardware problem
for you as well... either bad RAM or a flaky disk.

In my case, I found it out because I got syntax errors in the
PostgreSQL logs: SELGCT was not a valid keyword.  It turns out
that the flaky RAM was flipping bit 1 occasionally, so it would
randomly add or subtract 2 from a byte.

Regards,

David.


Re: archive falling behind

От
German Becker
Дата:
David, You were absolutly right. I broght the server down and it was a faulty RAM indeed.Thank you very much for your help!


On Sat, Apr 27, 2013 at 5:51 PM, David F. Skoll <dfs@roaringpenguin.com> wrote:

From: German Becker <german.becker@gmail.com>
Subject: Re: archive falling behind

> Actually this seems like a very strange filesystem /hw problem. The
> wal segments keep "changing" even after I stoped the database and
> noone is supposly accesing it:

I've seen this before.  It was bad RAM.  I bet it's a hardware problem
for you as well... either bad RAM or a flaky disk.

In my case, I found it out because I got syntax errors in the
PostgreSQL logs: SELGCT was not a valid keyword.  It turns out
that the flaky RAM was flipping bit 1 occasionally, so it would
randomly add or subtract 2 from a byte.

Regards,

David.


--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin