Thread: 'replication checkpoint has wrong magic' on the newly cloned replicas

'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Alex Kliukin
Date:
Hello,

We have recently received four new servers to replace our existing
replicas of
one sharded database. Those correspond to four shards, let's name them
shard1 -
shard4 (each taking roughly 560GB of disk space), all of them running
PostgreSQL 9.6.5 on the masters.

The new servers are running HP ProLiant DL380 G9 with Intel(R) Xeon(R) CPU
E5-2667 v4 @ 3.20GHz, 188GB of ECC RAM and an HP Smart Array P440ar RAID
controller. We are running RAID10; the mirroring is done by the controller,
and the striping is handled by LVM. The uname reports:

Linux shard1 4.4.0-100-generic #123-Ubuntu SMP Thu Nov 2 10:16:13 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

The replicas being cloned are running PostgreSQL 9.6.6. The cloning process
on all shards took roughly 30-32 minutes. On shards 1 and 2 we got the
following errors (taken from shard1):

2017-11-15 13:15:45.767 CET,,,15154,,5a0c2ff1.3b32,1,,2017-11-15 13:15:45
CET,,0,LOG,00000,"database system was interrupted; last known up at
2017-11-15 13:12:19 CET",,,,,,,,,""
2017-11-15 13:15:45.767 CET,,,15154,,5a0c2ff1.3b32,2,,2017-11-15
13:15:45 CET,,0,LOG,00000,"creating missing WAL directory
""pg_xlog/archive_status""",,,,,,,,,""
2017-11-15 13:15:46.506 CET,,,15154,,5a0c2ff1.3b32,3,,2017-11-15
13:15:45 CET,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2017-11-15 13:15:46.666 CET,,,15154,,5a0c2ff1.3b32,4,,2017-11-15
13:15:45 CET,,0,LOG,00000,"restored log file
""000000010000640F000000F4"" from archive",,,,,,,,,""
2017-11-15 13:15:46.673 CET,,,15154,,5a0c2ff1.3b32,5,,2017-11-15
13:15:45 CET,,0,PANIC,XX000,"replication checkpoint has wrong magic
5714534 instead of 307747550",,,,,,,,,""
2017-11-15 13:15:46.674 CET,,,15152,,5a0c2ff1.3b30,2,,2017-11-15
13:15:45 CET,,0,LOG,00000,"startup process (PID 15154) was terminated by
signal 6: Aborted",,,,,,,,,""

On the other shards the restore process worked as expected:

2017-11-15 13:16:52.985 CET,,,13901,,5a0c3034.364d,2,,2017-11-15
13:16:52 CET,,0,LOG,00000,"creating missing WAL directory
""pg_xlog/archive_status""",,,,,,,,,""
2017-11-15 13:16:53.790 CET,,,13901,,5a0c3034.364d,3,,2017-11-15
13:16:52 CET,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2017-11-15 13:16:53.968 CET,,,13901,,5a0c3034.364d,4,,2017-11-15
13:16:52 CET,,0,LOG,00000,"restored log file
""000000010000629100000080"" from archive",,,,,,,,,""
2017-11-15 13:16:54.057 CET,,,13901,,5a0c3034.364d,5,,2017-11-15
13:16:52 CET,1/0,0,LOG,00000,"redo starts at 6291/800B6C60",,,,,,,,,""

The second attempt to create the replicas for shards 1 and 2 was
successful,
producing normal replicas.

The cloning itself is done by copying a compressed image via ssh,
running the
following command from the replica:
"""ssh {master} 'cd {master_datadir} && tar -lcp --exclude "*.conf" \        --exclude "recovery.done" \
--exclude"pacemaker_instanz" \        --exclude "dont_start" \        --exclude "pg_log" \        --exclude "pg_xlog" \
      --exclude "postmaster.pid" \        --exclude "recovery.done" \          * | pigz -1 -p 4' | pigz -d -p 4 | tar
-xpmUv-C          {slave_datadir}""
 

The WAL archiving starts before the copy starts, as the script that clones
the replica checks that WAL archiving is running before the cloning begins.
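
For illustration, a minimal sketch of such a pre-flight check (host names,
paths, and the exact queries are assumptions, not the actual script): ask the
primary's pg_stat_archiver for the last archived segment and compare the
archived copy's md5 checksum with the primary's, as described later in this
thread.

    #!/bin/bash
    # Hypothetical pre-clone check: is archiving alive, and does the archived
    # copy of the last WAL segment match what the master produced?
    MASTER=shard1-master                 # assumed hostname
    MASTER_DATADIR=/home/postgres/data   # assumed data directory
    WAL_ARCHIVE=/nfs/wal_archive         # assumed NFS mount on the replica

    LAST_WAL=$(psql -h "$MASTER" -Atc "SELECT last_archived_wal FROM pg_stat_archiver")
    [ -n "$LAST_WAL" ] || { echo "archiver has not archived anything yet"; exit 1; }

    # assumes the segment has not yet been recycled out of pg_xlog
    SUM_MASTER=$(ssh "$MASTER" "md5sum $MASTER_DATADIR/pg_xlog/$LAST_WAL" | cut -d' ' -f1)
    SUM_ARCHIVE=$(md5sum "$WAL_ARCHIVE/$LAST_WAL" | cut -d' ' -f1)
    [ "$SUM_MASTER" = "$SUM_ARCHIVE" ] || { echo "archived copy of $LAST_WAL differs"; exit 1; }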

We have cloned hundreds of replicas with that procedure and never saw any
issues, and never saw the "replication checkpoint has wrong magic" error, so
we are wondering what could be the possible reason behind this failure. We
also saw a disk error on another shard not long after the initial copy (but
not on those that had the "replication checkpoint" error), so hardware
issues are on our list as well (but then how come both had the same wrong
value for the "wrong magic"?)
-- 
Sincerely,
Alex


Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Stephen Frost
Date:
Greetings,

* Alex Kliukin (alexk@hintbits.com) wrote:
> The cloning itself is done by copying a compressed image via ssh,
> running the
> following command from the replica:
>
>  """ssh {master} 'cd {master_datadir} && tar -lcp --exclude "*.conf" \
>          --exclude "recovery.done" \
>          --exclude "pacemaker_instanz" \
>          --exclude "dont_start" \
>          --exclude "pg_log" \
>          --exclude "pg_xlog" \
>          --exclude "postmaster.pid" \
>          --exclude "recovery.done" \
>            * | pigz -1 -p 4' | pigz -d -p 4 | tar -xpmUv -C
>            {slave_datadir}""
>
> The WAL archiving starts before the copy starts, as the script that
> clones the
> replica checks that the WALs archiving is running before the cloning.

Maybe you're doing it and haven't mentioned it, but you have to use
pg_start/stop_backup because otherwise PG is going to think it's doing
crash recovery from the last checkpoint written, rather than having to
go back to when the backup started and replay all of the WAL from that
point.

Basically, this process is entirely broken unless you're actually taking
a filesystem-level atomic snapshot first (and that has to be atomic
across all tablespaces too).  Perhaps that's what you meant when you
mentioned a snapshot, but if not, then this definitely isn't good.

Note that if you use pg_start/stop_backup, you need to make sure to wait
for the replica to be all the way caught up with where the
'pg_start_backup' was issued on the primary before you start copying
files on the replica.
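
As an illustration only (host names and the polling loop are assumptions; the
function names are the 9.x ones), that wait could look roughly like this when
the files are copied from a replica:

    # Hypothetical sketch: start the backup on the primary, then poll the
    # replica until it has replayed past the returned start LSN before
    # copying any files from it.
    START_LSN=$(psql -h primary -Atc "SELECT pg_start_backup('clone', true)")
    until psql -h replica -Atc \
        "SELECT pg_last_xlog_replay_location() >= '$START_LSN'::pg_lsn" | grep -qx t
    do
        sleep 5    # replica not caught up yet, keep waiting
    done
    # ... only now copy the data directory from the replica ...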

> We have cloned hundreds of replicas with that procedure and never saw
> any
> issues, also never saw the "replication checkpoint has wrong magic"
> error, so
> we are wondering what could be the possible reason behind that failure?
> We also
> saw the disk error on another shard not long after the initial copy (but
> not on
> those that had the "replication checkpoint error"), so hardware issues
> are on
> our list as well (but then how come both had the same wrong value for
> the
> "wrong magic"?)

If you've not seen any other corruption due to this, I'd call you
extremely lucky.  I'd strongly suggest you look at some of the existing
tools for doing backup/recovery of PG and use them to build out
replicas.

Thanks!

Stephen

Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Oleksii Kliukin
Date:
Hi Stephen,

> On 29. Nov 2017, at 15:54, Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Alex Kliukin (alexk@hintbits.com) wrote:
>> The cloning itself is done by copying a compressed image via ssh,
>> running the
>> following command from the replica:
>>
>> """ssh {master} 'cd {master_datadir} && tar -lcp --exclude "*.conf" \
>>         --exclude "recovery.done" \
>>         --exclude "pacemaker_instanz" \
>>         --exclude "dont_start" \
>>         --exclude "pg_log" \
>>         --exclude "pg_xlog" \
>>         --exclude "postmaster.pid" \
>>         --exclude "recovery.done" \
>>           * | pigz -1 -p 4' | pigz -d -p 4 | tar -xpmUv -C
>>           {slave_datadir}""
>>
>> The WAL archiving starts before the copy starts, as the script that
>> clones the
>> replica checks that the WALs archiving is running before the cloning.
>
> Maybe you're doing it and haven't mentioned it, but you have to use
> pg_start/stop_backup

Sorry for not mentioning it, as it seemed obvious, but we are calling pg_start_backup and pg_stop_backup at the right
time.

Cheers,
Alex


Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Stephen Frost
Date:
Greetings,

On Wed, Nov 29, 2017 at 12:41 Oleksii Kliukin wrote:

> > Maybe you're doing it and haven't mentioned it, but you have to use
> > pg_start/stop_backup
>
> Sorry for not mentioning it, as it seemed obvious, but we are calling
> pg_start_backup and pg_stop_backup at the right time.

Ah, not something I can assume, heh.

Then it depends on which version of PG and if you're able to run
start/stop on the replica or not. If you can't run it on the replica and
have to run it on the primary (prior to 9.6) then you need to make sure to
wait for things to happen on the primary and for that to be replicated
before you can start.

If you're on 9.6 and using non-exclusive backup, you need to be sure to
capture the contents of the stop backup and write it into backup_label
before you start the system back up.

This is off the top of my head, so be sure to go double check the docs, of
course.

Thanks!

Stephen

Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Stephen Frost
Date:
Greetings,

On Wed, Nov 29, 2017 at 12:52 Stephen Frost wrote:

> Then it depends on which version of PG and if you're able to run
> start/stop on the replica or not. If you can't run it on the replica and
> have to run it on the primary (prior to 9.6) then you need to make sure to
> wait for things to happen on the primary and for that to be replicated
> before you can start.
>
> If you're on 9.6 and using non-exclusive backup, you need to be sure to
> capture the contents of the stop backup and write it into backup_label
> before you start the system back up.

Oh, and you also need to make sure to copy pg_control last when you're
doing a non-exclusive backup on a replica, which should also be covered in
the docs.

Even so though, really, I'd recommend you consider using one of the
existing tools that knows about how to do all of this correctly.

Thanks!

Stephen
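
A rough sketch of that ordering (paths, variables, and the use of rsync are
illustrative assumptions, not the poster's script): copy everything except
pg_control first, then fetch global/pg_control as the very last step.

    # Hypothetical sketch: file copy for a non-exclusive backup taken from a
    # replica, with global/pg_control deliberately copied last.
    rsync -a \
        --exclude=global/pg_control \
        --exclude=pg_xlog --exclude=postmaster.pid \
        "$REPLICA:$DATADIR/" "$TARGET_DATADIR/"
    rsync -a "$REPLICA:$DATADIR/global/pg_control" "$TARGET_DATADIR/global/pg_control"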

Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Alex Kliukin
Date:

> On 29. Nov 2017, at 18:52, Stephen Frost wrote:
>
> Then it depends on which version of PG and if you're able to run
> start/stop on the replica or not. If you can't run it on the replica and
> have to run it on the primary (prior to 9.6) then you need to make sure to
> wait for things to happen on the primary and for that to be replicated
> before you can start.

We are using exclusive backups from the master. First, the script checks
that WAL files are shipped to the NFS, where the replica expects to find
them (we check the md5 checksum of the file in order to make sure that the
NFS actually delivers the file that the master has archived). Then
pg_start_backup runs on the master and its status is checked. On success,
the copy command runs. When the copy command finishes, pg_stop_backup is
executed. Once pg_stop_backup finishes successfully, the replica
configuration files (postgresql.conf, pg_hba.conf, pg_ident.conf) are
linked from their location in the repository and the replica is started.

This is a fairly typical procedure, which, I believe, is also well
described in the docs.

> If you're on 9.6 and using non-exclusive backup, you need to be sure to
> capture the contents of the stop backup and write it into backup_label
> before you start the system back up.

We don't use non-exclusive backups at all.

Cheers,
Alex
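
For readers following along, here is a condensed sketch of that sequence
(host names, the backup label, and error handling are placeholders; the tar
pipeline abbreviates the command quoted earlier in the thread):

    # Hypothetical outline of the exclusive low-level backup flow described above.
    psql -h "$MASTER" -Atc "SELECT pg_start_backup('clone_$(date +%s)', true)"

    ssh "$MASTER" "cd $MASTER_DATADIR && tar -lcp --exclude '*.conf' --exclude pg_xlog \
          --exclude pg_log --exclude postmaster.pid * | pigz -1 -p 4" \
        | pigz -d -p 4 | tar -xpmU -C "$SLAVE_DATADIR"

    psql -h "$MASTER" -Atc "SELECT pg_stop_backup()"

    # link postgresql.conf / pg_hba.conf / pg_ident.conf into place, create
    # recovery.conf pointing at the WAL archive, then start the replica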

Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Stephen Frost
Date:
Greetings,

On Wed, Nov 29, 2017 at 13:33 Alex Kliukin wrote:

> We are using exclusive backups from the master. First, the script checks
> that WAL files are shipped to the NFS, where the replica expects to find
> them (we check the md5 checksum of the file in order to make sure that the
> NFS actually delivers the file that the master has archived). Then
> pg_start_backup runs on the master and its status is checked. On success,
> the copy command runs. When the copy command finishes, pg_stop_backup is
> executed. Once pg_stop_backup finishes successfully, the replica
> configuration files (postgresql.conf, pg_hba.conf, pg_ident.conf) are
> linked from their location in the repository and the replica is started.

No, you must wait until the replica has moved forward far enough and you
have to copy the backup_label file from the primary as well, otherwise PG
won't realize you're doing a backup-based recovery.

> This is a fairly typical procedure, which, I believe, is also well
> described in the docs.

Please provide a link to where that is because if that's the case then we
need to correct it or remove it. This is absolutely not safe without
additional checks being done and various other magic happening (like
copying the backup_label off the primary where it's created).

> > If you're on 9.6 and using non-exclusive backup, you need to be sure to
> > capture the contents of the stop backup and write it into backup_label
> > before you start the system back up.
>
> We don't use non-exclusive backups at all.

All the more likely that your procedure is causing more corruption than
you realize then.

Seriously, again, this is not easy to get right, especially when you're
doing things that weren't explicitly documented and supported. Using
existing tools from those versed in why the processes used are safe and
have written lots of tests to verify that it is safe is really the
recommendation that you should take away from this.

At least with 9.6 there's proper documentation on how to run a
non-exclusive backup on a replica properly and if you very carefully
follow the procedure then you may get it right, but you will still want to
test extensively.

Thanks!

Stephen

Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Alex Kliukin
Date:

> On 29. Nov 2017, at 19:44, Stephen Frost wrote:
>
> No, you must wait until the replica has moved forward far enough and you
> have to copy the backup_label file from the primary as well, otherwise PG
> won't realize you're doing a backup-based recovery.

Are you talking about the exclusive base backup from the master (the
master being the source for the backup)? At least the backup label is
written by pg_start_backup to the data directory and is copied together
with the data directory. The necessary WAL files are archived once
pg_stop_backup returns, and the replica cannot move anywhere in recovery
without being started.

> Please provide a link to where that is because if that's the case then we
> need to correct it or remove it. This is absolutely not safe without
> additional checks being done and various other magic happening (like
> copying the backup_label off the primary where it's created).

https://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP-EXCLUSIVE

It has been there for years and I don't think there is anything terribly
wrong there.

> All the more likely that your procedure is causing more corruption than
> you realize then.

How do exclusive backups make the procedure more prone to corruption?

> Seriously, again, this is not easy to get right, especially when you're
> doing things that weren't explicitly documented and supported. Using
> existing tools from those versed in why the processes used are safe and
> have written lots of tests to verify that it is safe is really the
> recommendation that you should take away from this.

I believe what we are doing is rather simple and well documented by the
link above.

> At least with 9.6 there's proper documentation on how to run a
> non-exclusive backup on a replica properly and if you very carefully
> follow the procedure then you may get it right, but you will still want
> to test extensively.

We are not doing non-exclusive backups from the replica.

Cheers,
Alex

Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Stephen Frost
Date:
Greetings,

On Wed, Nov 29, 2017 at 14:12 Alex Kliukin wrote:

> Are you talking about the exclusive base backup from the master (the
> master being the source for the backup)?

Hrmpf. I could have sworn there was a comment somewhere that you were
backing up from the replica, not the primary. If you're doing
pg_start/stop_backup on the primary *and* copying the files from the
primary, then that's much better.

> At least the backup label is written by pg_start_backup to the data
> directory and is copied together with the data directory. The necessary
> WAL files are archived once pg_stop_backup returns, and the replica
> cannot move anywhere in recovery without being started.

Ok, yes, if you're getting the backup_label and it's included in the copy
of the data directory then that's reasonable.

> https://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP-EXCLUSIVE
>
> It has been there for years and I don't think there is anything terribly
> wrong there.

I had somehow understood you to be copying the files off the replica and
not the primary, though I'm not entirely clear why now.

> How do exclusive backups make the procedure more prone to corruption?

An exclusive backup can't be run on a replica, which is what I was getting
at with the above comment.

> I believe what we are doing is rather simple and well documented by the
> link above.

If you're copying it all from the primary and using start/stop backup,
then yes.

> We are not doing non-exclusive backups from the replica.

Apologies for the confusion. Offhand (and off my phone) probably best if I
don't try to guess further at what the issue is that you're running into. :)

Thanks!

Stephen

Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Alvaro Herrera
Date:
Alex Kliukin wrote:

> 2017-11-15 13:15:46.673 CET,,,15154,,5a0c2ff1.3b32,5,,2017-11-15
> 13:15:45 CET,,0,PANIC,XX000,"replication checkpoint has wrong magic
> 5714534 instead of 307747550",,,,,,,,,""

Uhh ... I had never heard of this "replication checkpoint" thing.  It is
part of the replication origins feature, which is fairly new stuff (see
src/backend/replication/logical/origin.c).  I'd bet this problem is
related to a bug in logical replication "origins" code rather than any
procedural problems in your base-backup taking setup ...

I wonder if there is some truncation of the 0x1257DADE value that
produces the 5714534 value you're seeing -- something related to a
pg_logical/replorigin_checkpoint file being written partially while the
backup is being taken.

Another point towards not including pg_logical/ contents when taking a
base backup, I guess ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Andres Freund
Date:
Hi,

On 2017-11-29 20:22:43 -0300, Alvaro Herrera wrote:
> Alex Kliukin wrote:
>
> > 2017-11-15 13:15:46.673 CET,,,15154,,5a0c2ff1.3b32,5,,2017-11-15
> > 13:15:45 CET,,0,PANIC,XX000,"replication checkpoint has wrong magic
> > 5714534 instead of 307747550",,,,,,,,,""
>
> Uhh ... I had never heard of this "replication checkpoint" thing.

It contains information about how far logical-replication-like solutions
have replayed from other systems.


> It is part of replication origins feature, which is fairly new stuff
> (see src/backend/replication/logical/origin.c).  I'd bet this problem
> is related to a bug in logical replication "origins" code rather than
> any procedural problems in your base-backup taking setup ...

Possible.

What's the max_replication_origins setting? Is the system receiving
logical replication data? Could you describe the setup a bit? Any chance
the system's partially been running without fsync? Could you attach both
a corrupt and a non-corrupt state file?

It's a bit weird to see such an error because normally the state file's
just written to a temporary file and then renamed into place,
overwriting the old file.
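
In shell terms, that pattern is roughly the following (an illustration of the
general write-then-rename idiom, not the actual PostgreSQL code, which does
this in C with an explicit fsync):

    # Illustrative only: write the new state to a temporary file, flush it,
    # then atomically rename it over the old file, so a reader only ever sees
    # either the complete old contents or the complete new contents.
    printf '%s' "$NEW_STATE" > replorigin_checkpoint.tmp
    sync replorigin_checkpoint.tmp                       # flush the temp file to disk
    mv replorigin_checkpoint.tmp replorigin_checkpoint   # atomic rename into place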


> I wonder if there is some truncation of the 0x1257DADE value that
> produces the 5714534 value you're seeing -- something related to a
> pg_logical/replorigin_checkpoint file being written partially while the
> backup is being taken.
>
> Another point towards not including pg_logical/ contents when taking a
> base backup, I guess ...

You'd cause corruption if logical replication is in use, so no, please
don't.

- Andres


Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Alex Kliukin
Date:
On Thu, Nov 30, 2017, at 00:22, Alvaro Herrera wrote:
> Alex Kliukin wrote:
> 
> > 2017-11-15 13:15:46.673 CET,,,15154,,5a0c2ff1.3b32,5,,2017-11-15
> > 13:15:45 CET,,0,PANIC,XX000,"replication checkpoint has wrong magic
> > 5714534 instead of 307747550",,,,,,,,,""
> 
> Uhh ... I had never heard of this "replication checkpoint" thing.  It is
> part of replication origins feature, which is fairly new stuff (see
> src/backend/replication/logical/origin.c).  I'd bet this problem is
> related to a bug in logical replication "origins" code rather than any
> procedural problems in your base-backup taking setup ...

We are not using logical replication or logical decoding on those hosts.
On the master, pg_replication_origin is empty, as is pg_replication_slots.

Those masters were fairly recently (around 2 months ago) upgraded from
9.3.

> 
> I wonder if there is some truncation of the 0x1257DADE value that
> produces the 5714534 value you're seeing -- something related to a
> pg_logical/replorigin_checkpoint file being written partially while the
> backup is being taken.

307747550 = 0x1257DADE
0001 0010 0101 0111 1101 1010 1101 1110

5714534 = 0x573266 = "W2f" in ASCII
0000 0000 0101 0111 0011 0010 0110 0110

I see no patterns here.

What is interesting is that 0x573266 is actually mentioned in relcache.c:

#define RELCACHE_INIT_FILENAME   "pg_internal.init"
#define RELCACHE_INIT_FILEMAGIC  0x573266    /* version ID value */

It's the file magic for the relcache init files, but given that the copy
is performed by just compressing and decompressing the original files, I
don't see how those two could be confused by software.
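
One quick way to check which magic a copied file actually carries (a
hypothetical diagnostic along the lines of the od output shown later in the
thread):

    # Read the first 4 bytes of the file as one little-endian 32-bit word.
    # A healthy replication-origin checkpoint starts with 0x1257DADE; the
    # relcache init file magic would show up as 0x00573266 instead.
    od -An -tx4 -N4 pg_logical/replorigin_checkpoint
    # expected output on a healthy file: 1257dade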

> 
> Another point towards not including pg_logical/ contents when taking a
> base backup, I guess ...

In our case wouldn't it just mask the real issue?
-- 
Sincerely,
Alex


Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Alex Kliukin
Date:

On Thu, Nov 30, 2017, at 01:41, Andres Freund wrote:
> 
> > It is part of replication origins feature, which is fairly new stuff
> > (see src/backend/replication/logical/origin.c).  I'd bet this problem
> > is related to a bug in logical replication "origins" code rather than
> > any procedural problems in your base-backup taking setup ...
> 
> Possible.
> 
> What's the max_replication_origins setting? Is the system receiving
> logical replication data? Could you describe the setup a bit? Any chance
> the system's partially been running without fsync? Could you attach both
> a corrupt and a non-corrupt state file?

max_replication_slots is 5, and logical replication is not used there at
all. fsync is always turned on; the other configuration settings from the
master are attached.

The replica configuration is almost identical to the master's (we
decreased random_page_cost for systems running on SSDs).

diff /tmp/settings_master.txt /tmp/settings_replica.txt
115c115
< krb_server_keyfile    FILE:/server/postgres/9.6.5/etc/krb5.keytab
---
> krb_server_keyfile    FILE:/server/postgres/9.6.6/etc/krb5.keytab
186c186
< random_page_cost      3
---
> random_page_cost    1.5
194,195c194,195
< server_version        9.6.5
< server_version_num    90605
---
> server_version    9.6.6
> server_version_num    90606
222c222
< tcp_keepalives_interval       75
---
> tcp_keepalives_interval    90
239c239
< transaction_read_only off
---
> transaction_read_only    on
273c273

The system is a typical OLTP one; the master normally has a single streaming
physical replica and one delayed one. At the time the issue happened, the
replica in question was the second physical replica; after it had been
created, the previous replica was decommissioned.

Unfortunately, I don't have a 'corrupt' file from the replica, as the
data has been reinitialized since. I will try to reproduce the issue by
cloning it a couple more times. The replorigin_checkpoint from the
master is attached, but its magic seems to be fine:

od -x replorigin_checkpoint
0000000 dade 1257 b236 6a00
0000010

The same file from the current replica is identical.

-- 
Sincerely,
Alex



Attachments

Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Andres Freund
Date:
Hi,

On 2017-11-30 11:44:29 +0100, Alex Kliukin wrote:
> max_replication_slots is 5 and logical replication is not used
> altogether there.

Can you show the results of
SELECT * FROM pg_replication_origin;
SELECT * FROM pg_replication_origin_status ;
?

Greetings,

Andres Freund


Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From:
Alex Kliukin
Date:

On Thu, Nov 30, 2017, at 16:36, Andres Freund wrote:
> Hi,
> 
> On 2017-11-30 11:44:29 +0100, Alex Kliukin wrote:
> > max_replication_slots is 5 and logical replication is not used
> > altogether there.
> 
> Can you show the results of
> SELECT * FROM pg_replication_origin;
> SELECT * FROM pg_replication_origin_status ;

SELECT * FROM pg_replication_origin;
 roident | roname
---------+--------
(0 rows)

SELECT * FROM pg_replication_origin_status ;
 local_id | external_id | remote_lsn | local_lsn
----------+-------------+------------+-----------
(0 rows)

-- 
Sincerely,
Alex