Re: [GENERAL] WAL Archive command.

Поиск
Список
Период
Сортировка
От John Britto
Тема Re: [GENERAL] WAL Archive command.
Дата
Msg-id CAExAUcVUWa+=1D3bDuh9nFokbpzKTxmLYSnvQFWWQ68_+vP88g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [GENERAL] WAL Archive command.  (Scott Mead <scottm@openscg.com>)
Список pgsql-general
Hi,

I am running streaming replication with the archive.

As you can see below that Master pg_xlog is at WAL: 000000010000013300000093. But archive_status shows way behind: 000000010000013300000088.done
What could be the reason behind this? How should I let the PostgreSQL archive the WAL from 000000010000013300000089 to 000000010000013300000092.

pg_xlog/
----------------------------------
-rw------- 1 postgres postgres 16777216 Sep 27 23:30 000000010000013300000082
-rw------- 1 postgres postgres 16777216 Sep 27 23:31 000000010000013300000083
-rw------- 1 postgres postgres 16777216 Sep 27 23:31 000000010000013300000084
-rw------- 1 postgres postgres 16777216 Sep 27 23:31 000000010000013300000085
-rw------- 1 postgres postgres 16777216 Sep 27 23:51 000000010000013300000086
-rw------- 1 postgres postgres 16777216 Sep 28 00:47 000000010000013300000087
-rw------- 1 postgres postgres 16777216 Sep 28 01:55 000000010000013300000088
-rw------- 1 postgres postgres 16777216 Sep 28 02:47 000000010000013300000089
-rw------- 1 postgres postgres 16777216 Sep 27 18:04 00000001000001330000008A
-rw------- 1 postgres postgres 16777216 Sep 27 18:02 00000001000001330000008B
-rw------- 1 postgres postgres 16777216 Sep 27 18:02 00000001000001330000008C
-rw------- 1 postgres postgres 16777216 Sep 27 18:05 00000001000001330000008D
-rw------- 1 postgres postgres 16777216 Sep 27 18:03 00000001000001330000008E
-rw------- 1 postgres postgres 16777216 Sep 27 18:02 00000001000001330000008F
-rw------- 1 postgres postgres 16777216 Sep 27 18:02 000000010000013300000090
-rw------- 1 postgres postgres 16777216 Sep 27 18:37 000000010000013300000091
-rw------- 1 postgres postgres 16777216 Sep 27 20:21 000000010000013300000092
-rw------- 1 postgres postgres 16777216 Sep 27 21:00 000000010000013300000093

pg_xlog/archive_status
----------------------------------
-rw------- 1 postgres postgres 0 Sep 27 23:30 000000010000013300000081.done
-rw------- 1 postgres postgres 0 Sep 27 23:30 000000010000013300000082.done
-rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000083.done
-rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000084.done
-rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000085.done
-rw------- 1 postgres postgres 0 Sep 27 23:51 000000010000013300000086.done
-rw------- 1 postgres postgres 0 Sep 28 00:47 000000010000013300000087.done
-rw------- 1 postgres postgres 0 Sep 28 01:55 000000010000013300000088.done

postgresql.conf
#----------------------------------
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 32
archive_mode = on
archive_command = '/opt/wal_archive.sh "%p" "%f" > archive_command.log 2>&1'

wal_archive.sh
------------------------------------
#!/bin/bash -xv
PG_XLOG="$1"
PG_XLOG_FILENAME="$2"
HOST=hostname
ARCHIVE_DIR="/pg_archive/master"
ARCHIVE_TO_KEEP="3"     #No of days of archive logs to keep
EMAIL="a@b.com"
ERROR_COUNT_FILE="/tmp/replication_archive_error_count.txt"

OLD_COUNT=`cat ${ERROR_COUNT_FILE}`
DNOW=`date +%u`
hour=$(date +%H)
D=`date`

#Do the cleanup if the day is Monday or Thursday and time is between 11 p.m. UTC and 22 hrs UTC
if [ "$DNOW" -eq "1" -o "$DNOW" -eq "4" -a "$hour" -ge 11 -a "$hour" -lt 22 ]; then
    find "${ARCHIVE_DIR}"/ -type f -mtime +"${ARCHIVE_TO_KEEP}" -exec rm -f {} +
    if [ "$?" -eq "1" ]; then
        echo "The wal_archive script could not cleanup the archive directory of $HOST" | mail -s "ERROR - WAL Archive for $HOST" "$EMAIL"
    fi
fi

if [ ! -f "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}" ]; then
    cp "${PG_XLOG}" "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}"
    /usr/bin/rsync -W -az "${PG_XLOG}" postgres@standby:"${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}"
    if [ "$?" -ne "0" ]; then
        #If rsync fails, then remove the copied file from master, increase the error count, and retry.
        rm -rf "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}"
        NEW_COUNT=`expr $OLD_COUNT + 1`
        if [ "$NEW_COUNT" -ge "100" ]; then
            echo -e "${D}""\n""Rsync could not transfer the WAL file from Master to slave." | mail -s "ALERT - WAL Archive for $HOST" "$EMAIL"
            echo "0" > $ERROR_COUNT_FILE
        else
            echo "$NEW_COUNT" > $ERROR_COUNT_FILE
        fi
        exit 1
    else
        echo "0" > $ERROR_COUNT_FILE
        exit 0
    fi
else
    exit 0
fi


John Britto, M.Sc.
DevOps Engineer
Sodexis, Inc.
www.sodexis.com
M: +91-0-8012186991



On Thu, Sep 28, 2017 at 1:05 AM, Scott Mead <scottm@openscg.com> wrote:


On Wed, Sep 27, 2017 at 2:55 PM, Jerry Sievers <gsievers19@comcast.net> wrote:
John Britto <john@sodexis.com> writes:

> Hello,
>
> I have a streaming replication setup along with WAL archive.
>
> archive_command = ‘test ! -f /var/pg_archive/%f && cp %p <archive
> location>%f && scp %p postgres@192.168.0.123:<archive location>/%f'
>
> When the SCP command fails, the master repeatedly tries to send the
> archived WAL to standby. But during this time, the pg_xlog directly
> grew with newer WAL files.
>
> The streaming replication hasn't had the problem because on my check,
> the WAL write location on the primary was same with the last WAL
> location received/replayed in standby.
>
> Since the pg_xlog in the master had few newer WAL files, the master
> archive is lagging to pick the current pg_xlog WAL file.  When a new
> WAL occur in the pg_xlog, Master picks the old WAL file to send to
> the standby.

Yup Pg is going to handle the unshipped WALs one at a time and it will
do them in order, oldest (lowest file name) first.

> How should I force the PostgreSQL to batch copy the lagging WAL files
> to pg_archive and then send to standby.  Can I do this manually using
> rsync?  I wonder how PostgreSQL knows the changes because it
> maintains info in archive_status with extension as .ready and .done.

I suggest you fix your basic archiving routine to complete and exit with
success to postgres.

+1

scp %p host:/archive/%f
if [ "$?" -ne 0 ]
then
    echo "`date`:FAILED:%p" >> /var/log/failed_xlog.log
    cp %p /localarchive/%f
    exit 0
fi

  Essentially, always make sure that you are returning a 0 to postgres.  If there is a failure, either log it or handle it separately.  This code snippet is NOT COMPLETE, there's a lot more to do in order to make it production ready and recoverable.  The biggest issue I've had with scp is that you have to set and enforce a timeout and trap the timeout.  Note, the above only works until your local machine (or the /localarchive partition) runs out of space.  It's really important that you have ultra solid logging and alerting around this.




And as for archive command scripts in general, simpler is better.

If you want to manually ship them in bulk, you may do so but then will
need to remove the corresponding archive_status/$foo.ready file so that
postgres won't keep trying to ship the same one.

I'm a huge fan of this strategy, especially if you're sending to a remote datacenter.  

archive_command.sh:

cp %p /localarchive/%f
if [ "$?" -ne 0 ]
then
    echo "`date`:FAILED:%p" >> /var/log/failed_xlog.log
    exit 0
fi

send_archive_to_remote.sh
rsync -avzP /localarchive/* host:/archive/


Of course, now you have to deal with files that are removed from the slave and making sure they get removed from the master appropriately, but, this is fairly straightforward.

--Scott




HTH

> Please assist.
>
> Thanks,
>
> John Britto
>
>
>

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



--
--
Scott Mead
Sr. Architect
OpenSCG

В списке pgsql-general по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: [GENERAL] Logical Replication - test_decoding -unchanged-toast-datum
Следующее
От: francis cherat
Дата:
Сообщение: [GENERAL] many many open files on pgsql_tmp with size 0