Discussion: root cause of corruption in hot standby

root cause of corruption in hot standby

From:
Mike Broers
Date:
So I have discovered corruption in a postgres 9.5.12 read replica, yay checksums!

2018-09-06 12:00:53 CDT [1563]: [4-1] user=postgres,db=production WARNING:  page verification failed, calculated checksum 3482 but expected 32232

2018-09-06 12:00:53 CDT [1563]: [5-1] user=postgres,db=production ERROR:  invalid page in block 15962 of relation base/16384/464832386

The rest of the log is clean and just has the usual monitoring queries, as this isn't a heavily used db.

This corruption isn't occurring on the primary or on a second replica, so I'm not freaking out exactly, but I'm not sure how I can further diagnose what the root cause of the corruption might be.

There were no power outages.  This is a streaming hot standby replica that looks like it was connected fine to its primary xlog at the time, and not falling back on rsync'ed WALs or anything.  We run off an SSD SAN that is allocated using LVM, and I've noticed documentation that states that can be problematic, but I'm unclear on how to diagnose what might have been the root cause, and now I'm somewhat uncomfortable with this environment's reliability in general.

Does anyone have advice for what to check further to determine a possible root cause?  This is a CentOS 7 VM running on Hyper-V.

Thanks for any assistance, greatly appreciated!
Mike

Re: root cause of corruption in hot standby

From:
Mike Broers
Date:
Well, I've verified my primary backups are working, and I think my plan is to patch to 9.5.14, re-prime a replica in the same environment, and see how it goes, unless someone has an idea of something to check on the host to avoid future corruption...






Re: root cause of corruption in hot standby

From:
Mike Broers
Date:
A fresh replica created using pg_basebackup on the same system generated similar errors:

cp: cannot stat ‘/mnt/backup/pgsql/9.5/archive/00000002.history’: No such file or directory
2018-09-19 08:36:23 CDT [57006]: [179-1] user=,db= LOG:  restored log file "0000000100007CFC00000040" from archive
2018-09-19 08:36:23 CDT [57006]: [180-1] user=,db= LOG:  incorrect resource manager data checksum in record at 7CFC/405ED198

I am going to run file system checks now.  Maybe the backup volume the archived WALs get rsync'ed to has problems and is corrupting them, which then shows up on replay?  No checksum alerts on the primary or on an additional replica.

If I can supply additional info that would help get some advice, please let me know.

Postgres 9.5.14, CentOS 7, ext4 filesystem, Hyper-V VM


Re: root cause of corruption in hot standby

From:
Mike Broers
Date:
I ran fsck and reran mkfs.ext4 in case there was something wrong with the underlying file systems, reran a pg_basebackup, and got the same problem.  It seems to be choking on an archived WAL that was rsync'ed during the nightly vacuum.

2018-09-20 09:48:22 CDT [21017]: [5305-1] user=,db= LOG:  restored log file "0000000100007D260000000E" from archive
2018-09-20 09:48:22 CDT [21017]: [5306-1] user=,db= LOG:  incorrect resource manager data checksum in record at 7D26/EC04550
2018-09-20 09:48:22 CDT [28296]: [1-1] user=,db= LOG:  started streaming WAL from primary at 7D26/E000000 on timeline 1
2018-09-20 09:48:22 CDT [28296]: [2-1] user=,db= FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100007D260000000E has already been removed

I don't know if this is a good test, but I attempted to diff the file from the primary against the file that got rsynced, and they 'differ':

-bash-4.2$ diff 0000000100007D260000000E 0000000100007D260000000E.sch1-db8 

Binary files 0000000100007D260000000E and 0000000100007D260000000E.sch1-db8 differ


It seems like my archived WAL rsync strategy is flawed?  I'm just running a cron job to rsync from the primary server to the replica every 15 minutes:

rsync -az -B 16384 --rsh=ssh ${PGPRIMARYHOST}/archive ${PGBACKUP}/archive


Although it's strange to me that, if it's rsyncing a file before writes complete, why are the two files the same size?

-rw------- 1 postgres postgres 16777216 Sep 20 03:57 0000000100007D260000000E

-bash-4.2$ ls -l 0000000100007D260000000E.sch1-db8

-rw------- 1 postgres postgres 16777216 Sep 20 10:05 0000000100007D260000000E.sch1-db8 

I will read the docs on how it is advised to copy the WALs and see if I am missing something obvious, but any help is always appreciated :)

Mike




Re: root cause of corruption in hot standby

From:
Mike Broers
Date:
It's strange to finally hit the race condition after using this approach for so long on so many different databases, but I can accept that my understanding was flawed.  I'm trying to decide on a path to fix this going forward.

If I switch to using a script to archive, and have the local copy as well as the rsync in the script, then I'm assuming that would be the best approach, since Postgres only runs the archive command AFTER the file has been fully written?

Something akin to:
archive_command = 'cp "%p" /mnt/backup/pgsql/9.5/archive/"%f" && rsync -a "%p" postgres@destination_host:/mnt/backup/pgsql/archive/"%f"'

I don't really like the idea of archive_command needing the rsync to succeed in order to rotate through pg_xlog, but perhaps I can wire that into a bash script so it exits with 0 but notifies me in some other way.
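
Roughly what I have in mind, just as a sketch (the paths, the destination host, and the log file are placeholders):

#!/bin/bash
# archive_wal.sh - meant to be called as: archive_command = 'archive_wal.sh %p %f'
# The local archive copy must succeed; the rsync to the standby host is best
# effort and only leaves a note for monitoring if it fails, so WAL recycling
# on the primary is never blocked by the remote copy.
set -u

WAL_PATH="$1"                                  # %p: path to the WAL segment
WAL_FILE="$2"                                  # %f: file name only
LOCAL_ARCHIVE=/mnt/backup/pgsql/9.5/archive
REMOTE=postgres@destination_host:/mnt/backup/pgsql/archive

# copy locally under a temporary name, then rename into place once complete
cp "$WAL_PATH" "$LOCAL_ARCHIVE/$WAL_FILE.tmp" \
    && mv "$LOCAL_ARCHIVE/$WAL_FILE.tmp" "$LOCAL_ARCHIVE/$WAL_FILE" \
    || exit 1

# best-effort remote copy; a non-zero rsync exit only gets logged, not returned to Postgres
if ! rsync -a "$LOCAL_ARCHIVE/$WAL_FILE" "$REMOTE/"; then
    echo "$(date): rsync of $WAL_FILE to standby failed" >> /var/log/pgsql/wal_rsync_failures.log
fi

exit 0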

Thanks for commenting, Rui!

On Thu, Sep 20, 2018 at 2:00 PM Rui DeSousa <rui@crazybean.net> wrote:
Hi Mike,

Using rsync on an active file will lead to a "corrupted" file, which is normal; however, there is also a race condition with rsync and the filesystem metadata.  This means that on subsequent rsync runs the file may not get rsync'd when it really should.  That is what the -c ("checksum") option in rsync is for; it will checksum the files to determine whether they need to be rsync'd, instead of relying only on filesystem metadata.

Using the -c option should fix your issue; however, I don’t think you really want to use it as that would cause a lot of unnecessary I/O to compare all the already archived files.  A better approach would be to archive the file with a temporary name and when it’s fully archived rename it to the final name.
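
For example, something along these lines ($WAL_SRC, $WAL_NAME, and the archive path are placeholders):

# write under a temporary name, then rename into place once the copy is complete
cp "$WAL_SRC" /archive/"$WAL_NAME".tmp && mv /archive/"$WAL_NAME".tmp /archive/"$WAL_NAME"

The mv is a rename on the same filesystem, so the file only ever appears under its final name once it is complete, and anything scanning the directory (rsync included) can't pick up a half-written copy.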

What are you using for archiving, a script or just copy?





Re: root cause of corruption in hot standby

From:
Mike Broers
Date:
Ok so I have checksum errors in this replica AGAIN.

To remind the list of the scenario:
The primary is 9.5.14 on CentOS 7.4 with an ext4 filesystem, running on VMware in a colo hosting environment with an SSD SAN.
One replica in the same colo, with the same OS configuration, on a Hyper-V VM on the same SAN, is the one reporting checksum errors.
A second replica with the same OS configuration in AWS does not have this problem.  Both are connected directly to the primary as hot standbys using streaming replication, with a restore command and rsynced archive logs if streaming falls behind the primary xlog.

This occurred during a time of relatively low activity on the primary.  I verified in the log that streaming replication was connected and that it wasn't happening during the application of an rsync'ed transaction log.  Previously it was thought that maybe I was rsyncing transaction logs as they were being written and this was the cause.  I'm doubting that now, and I've changed the rsync command to base its file list on a find with -mmin +10, to avoid rsyncing files unless their mtime is older than 10 minutes.
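
The new cron job boils down to something like this (the host and paths are placeholders):

PRIMARY=postgres@primary-host
SRC=/var/lib/pgsql/9.5/archive      # archive directory on the primary
DST=/mnt/backup/pgsql/9.5/archive   # local archive directory on the replica

# list only files whose mtime is older than 10 minutes, then hand that list to rsync
ssh "$PRIMARY" "cd $SRC && find . -type f -mmin +10" \
    | rsync -az --files-from=- "$PRIMARY:$SRC/" "$DST/"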

The last time this happened I ran fsck (no errors) and even reformatted the volumes.  Is there any extensive testing I can do to verify that memory isn't the problem, or other tests to try to determine the root cause?  It's a Hyper-V VM, so I'm not sure how the virtual hardware abstraction is playing a role.  My next step is going to be destroying and recreating the VM, but I'm going to have a hard time being confident in the replica environment if I have no idea what is causing this.

Many other VMs share the Hyper-V environment, and no other hosts have complained about this kind of problem.

Thanks for reading; any guidance is appreciated.







Re: root cause of corruption in hot standby

From:
Rui DeSousa
Date:
> On Oct 9, 2018, at 1:04 PM, Mike Broers <mbroers@gmail.com> wrote:
>
> Ok so I have checksum errors in this replica AGAIN.

Mike,

I don't think you are dealing with a "Postgres" issue, but possibly bit rot from either faulty hardware or a misconfiguration in your stack.

If you recall, the archive WAL file was originally corrupted.  Replicating the WAL files is outside the functionality of Postgres, so this would be either a file replication issue, bit rot, or some other data corruption issue, but not a Postgres bug.

This leaves me with the following two points:

1. How was the replica instance instantiated? I would assume from your backup procedures, as your backups should be used to help validate them.
2. Are there currently any WAL files that are corrupt?  You can quickly check using rsync with the "--checksum" option; don't fix the files on the target, but instead use "--dry-run" just to identify which files might have changed.  I would check this every day until the issue is fully resolved.

 i.e. rsync --archive --checksum --verbose --dry-run {source_wals}  {replica_wals}
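
If you want to automate the daily check, a cron entry along these lines would do it (the host, paths, and mail address are placeholders, and note that segments archived since the last rsync run will also show up in the output):

#!/bin/sh
# compare the primary's archived WALs against the replica's copies; --dry-run means nothing is changed
OUT=$(rsync --archive --checksum --dry-run --itemize-changes \
      postgres@primary-host:/var/lib/pgsql/9.5/archive/ /mnt/backup/pgsql/9.5/archive/)

# any itemized output means a file differs or is missing and deserves a closer look
[ -n "$OUT" ] && echo "$OUT" | mail -s "WAL archive drift detected" dba@example.com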

Since you're confident that you resolved the potential rsync race condition in archiving the WAL files, we shouldn't see any differences between WALs that have already been transmitted.  If we do find WALs that are different, then you're dealing with data corruption on the replica and need to start looking into your stack and storage system.  However, if you don't find any corrupted WALs, then question 1 needs to be scrutinized and you really need to ensure your backups are rock solid.

I wouldn't bother rebuilding the VM instance until the problem is identified, unless you're moving it to an all-new hardware stack.

P.s. Is there any anti-virus software running on the server, or any other software that might modify files on your behalf?




Re: root cause of corruption in hot standby

From:
Mike Broers
Date:
The replica is instantiated with a pg_basebackup, and it seems to run fine for a few days before the checksum error presents itself.  Initially it ran a few months without issue.  The replica VM was created in May, and it ran until September without the checksum error.  This time it was 12 days after a fresh pg_basebackup.

I'll look into rsync checksums, but this corruption presented itself during a time when streaming replication was working fine and it wasn't restoring archived rsynced transaction logs, and hadn't done so for around 30 hours.  The table it complained about is accessed every minute with updates and monitoring, so I don't think it would have taken so long to surface if it was due to the application of a corrupted WAL.

I'd like to know if there are diagnostics I can use to validate the VM and its configuration.  Checking the usual logging in /var/log and dmesg isn't showing anything, and neither did fsck.





Re: root cause of corruption in hot standby

From:
Rui DeSousa
Date:
> On Oct 10, 2018, at 10:15 AM, Mike Broers <mbroers@gmail.com> wrote:
>
>
> I'll look into rsync checksums, but this corruption presented itself during a time when streaming replication was working fine and it wasn't restoring archived rsynced transaction logs, and hadn't done so for around 30 hours.  The table it complained about is accessed every minute with updates and monitoring, so I don't think it would have taken so long to surface if it was due to the application of a corrupted WAL.
>

I think you missed my point.  If you are dealing with some sort of bit rot and/or data corruption on your storage device, you need to prove it, which is very difficult to do.

You have WAL files on the primary and the same WAL files on the replica via your rsync copy job.  If you check and recheck all the WALs daily to see if any of the files are changing and find a difference, that proves there is some sort of corruption/bit rot occurring, as the WAL files are static.

I've seen this type of corruption before with RAID controllers that are overtaxed, where they would corrupt data periodically over time.  I ended up changing from a RAID configuration to JBOD and managing the disks via ZFS instead, and never again experienced data corruption on the exact same hardware.  ZFS also detects bit rot and corrects for it, and can scrub the pool to ensure the disks are not slowly rotting away.

What storage system is being used? Does it have any measures to prevent bit rot? What is the RAID configuration? I would not recommend RAID 5 for a database; under heavy load the performance degradation and increased likelihood of data corruption are not worth it.

It sounds like you have some sort of environmental issue that is corrupting your data, and it is not a Postgres issue.  The problem you face is that without some sort of definitive proof you'll enter the realm of finger pointing... it's a database issue, no, it's a storage issue, etc.

You have two replicas; one periodically fails and the other does not.  The only difference is the environment in which they operate.

Re: root cause of corruption in hot standby

From:
Mike Broers
Date:
Agreed, no good threads to pull for definite answers.  I've added a checksum to the rsync as a precaution.  The SAN is the same between the two hosts.  One is running on a VMware server, the other is on a Hyper-V host.  We have a few other Postgres database hosts on that same Hyper-V infrastructure and they all check out fine as far as I can tell.

The only thing I could guess at this point is that, although the primary and replica are both running ext4 under LVM, the replica data volumes were grown online at some point because we were intending to cut over into an environment with more space.  That happened a while back and doesn't correspond to any of the times the checksum errors occurred in the logs, and I've started with fresh pg_basebackups twice since then, so I'm skeptical that is a cause, but it's all I've got.

There may also be some additional Hyper-V fsync / write-through cache configuration I'm unaware of.  That is my fear at this point.

Thanks for following up again.



Re: root cause of corruption in hot standby

From:
Andres Freund
Date:
Hi,

On 2018-09-06 12:28:18 -0500, Mike Broers wrote:
> So I have discovered corruption in a postgres 9.5.12 read replica, yay
> checksums!
> 
> 2018-09-06 12:00:53 CDT [1563]: [4-1] user=postgres,db=production WARNING:
> page verification failed, calculated checksum 3482 but expected 32232
> 
> 2018-09-06 12:00:53 CDT [1563]: [5-1] user=postgres,db=production ERROR:
> invalid page in block 15962 of relation base/16384/464832386

Have you compared whether that block looks significantly different on
the primary from the standby with a hex editor?

Greetings,

Andres Freund


Re: root cause of corruption in hot standby

From:
Mike Broers
Date:
No, I haven't.  I'm not sure how to, or what it would reveal, but I'm willing to give it a try.
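
I'm guessing something like this would pull that block out on each side for comparison, assuming the default 8 kB block size and the relfilenode from the log above:

# run from the data directory on the primary:
dd if=base/16384/464832386 bs=8192 skip=15962 count=1 2>/dev/null | xxd > /tmp/block_15962.primary.hex

# run the same on the standby (writing /tmp/block_15962.standby.hex), copy one file across, then:
diff /tmp/block_15962.primary.hex /tmp/block_15962.standby.hex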
