Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From Stephen Frost
Subject Re: 'replication checkpoint has wrong magic' on the newly cloned replicas
Date
Msg-id CAOuzzgqeJfq=049Pikii7GjS6DFpo6j9Z6TYSo6V9eyMaOi-YA@mail.gmail.com
In reply to Re: 'replication checkpoint has wrong magic' on the newly cloned replicas (Alex Kliukin <oleksii@fastmail.com>)
List pgsql-admin
Greetings,

On Wed, Nov 29, 2017 at 14:12 Alex Kliukin wrote:

>
> On 29. Nov 2017, at 19:44, Stephen Frost wrote:
>
> Greetings,
>
> On Wed, Nov 29, 2017 at 13:33 Alex Kliukin wrote:
>
>>
>> On 29. Nov 2017, at 18:52, Stephen Frost wrote:
>>
>> Greetings,
>>
>> On Wed, Nov 29, 2017 at 12:41 Oleksii Kliukin
>> wrote:
>>
>>> Hi Stephen,
>>>
>>> > On 29. Nov 2017, at 15:54, Stephen Frost wrote:
>>> >
>>> > Greetings,
>>> >
>>> > * Alex Kliukin (alexk@hintbits.com) wrote:
>>> >> The cloning itself is done by copying a compressed image via ssh,
>>> >> running the
>>> >> following command from the replica:
>>> >>
>>> >> """ssh {master} 'cd {master_datadir} && tar -lcp --exclude "*.conf" \
>>> >> --exclude "recovery.done" \
>>> >> --exclude "pacemaker_instanz" \
>>> >> --exclude "dont_start" \
>>> >> --exclude "pg_log" \
>>> >> --exclude "pg_xlog" \
>>> >> --exclude "postmaster.pid" \
>>> >> --exclude "recovery.done" \
>>> >> * | pigz -1 -p 4' | pigz -d -p 4 | tar -xpmUv -C
>>> >> {slave_datadir}""
>>> >>
>>> >> The WAL archiving starts before the copy starts, as the script that
>>> >> clones the
>>> >> replica checks that WAL archiving is running before the cloning.
>>> >
>>> > Maybe you're doing it and haven't mentioned it, but you have to use
>>> > pg_start/stop_backup
>>>
>>> Sorry for not mentioning it, as it seemed obvious, but we are calling
>>> pg_start_backup and pg_stop_backup at the right time.
>>
>>
>> Ah, not something I can assume, heh.
>>
>> Then it depends on which version of PG and if you’re able to run
>> start/stop on the replica or not. If you can’t run it on the replica and
>> have to run it on the primary (prior to 9.6) then you need to make sure to
>> wait for things to happen on the primary and for that to be replicated
>> before you can start.
>>
>>
>> We are using exclusive backups from the master. First, the script checks
>> that WAL files are shipped to the NFS, where the replica expects to find
>> them (we check the md5 checksum of the file in order to make sure that the
>> NFS actually delivers the file that the master has archived). Then
>> pg_start_backup runs on the master and its status is checked. On success,
>> the copy command runs. When the copy command finishes, pg_stop_backup is
>> executed. Once pg_stop_backup finishes successfully, replica configuration
>> files (postgresql.conf, pg_hba.conf, pg_ident.conf) are linked from their
>> location in the repository and the replica is started.
>>
>
> No, you must wait until the replica has moved forward far enough and you
> have to copy the backup_label file from the primary as well, otherwise PG
> won’t realize you’re doing a backup-based recovery.
>
>
>
> Are you talking about the exclusive base backup from the master (the
> master being the source for the backup)?
>

Hrmpf. I could have sworn there was a comment somewhere that you were backing up from the replica, not the primary. If you’re doing pg_start/stop_backup on the primary *and* copying the files from the primary, then that’s much better.

> At least the backup label is written by pg_start_backup to the data
> directory and is being copied together with the data directory. The
> necessary WAL files are archived once pg_stop_backup returns, and the
> replica cannot move anywhere in recovery without being started.
>

Ok, yes, if you’re getting the backup_label and it’s included in the copy of the data directory then that’s reasonable.

>> This is a fairly typical procedure, which, I believe, is also well
>> described in the docs.
>>
>
> Please provide a link to where that is because if that’s the case then we
> need to correct it or remove it. This is absolutely not safe without
> additional checks being done and various other magic happening (like
> copying the backup_label off the primary where it’s created).
>
>
>
> https://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP-EXCLUSIVE
>
> It has been there for years and I don’t think there is anything terribly
> wrong there.
>

I had somehow understood you to be copying the files off the replica and not the primary, though I’m not entirely clear why now.

>> If you’re on 9.6 and using non-exclusive backup, you need to be sure to
>> capture the contents of the stop backup and write it into backup_label
>> before you start the system back up.
>>
>>
>> We don’t use non-exclusive backups altogether.
>>
>
> All the more likely that your procedure is causing more corruption than
> you realize then.
>
>
> How do exclusive backups make the procedure more prone to corruption?
>

An exclusive backup can’t be run on a replica, which is what I was getting at with the above comment.

> Seriously, again, this is not easy to get right, especially when you’re
> doing things that weren’t explicitly documented and supported. Using
> existing tools from those versed in why the processes used are safe and
> have written lots of tests to verify that it is safe is really the
> recommendation that you should take away from this.
>
>
> I believe what we are doing is rather simple and well documented by the
> link above.
>

If you’re copying it all from the primary and using start/stop backup, then yes.

> At least with 9.6 there’s proper documentation on how to run a
> non-exclusive backup on a replica properly and if you very carefully follow
> the procedure then you may get it right, but you will still want to test
> extensively.
>
>
> We are not doing non-exclusive backups from the replica.
>

Apologies for the confusion. Offhand (and off my phone) probably best if I don’t try to guess further at what the issue is that you’re running into. :)

Thanks!

Stephen
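
Editorial sketch (not part of the original message): the exclusive backup sequence agreed on in this thread, taken from the primary, could be ordered roughly as below. This is a minimal illustration under stated assumptions, not the poster's actual script. The function name `clone_from_primary`, its three callables, and the default label are all hypothetical. Only `pg_start_backup`, `pg_stop_backup`, and the `backup_label` file come from the thread, and the exclusive form of these functions applies to the PostgreSQL releases discussed here.

```python
import os
from typing import Callable


def clone_from_primary(
    run_sql_on_primary: Callable[[str], None],  # hypothetical: executes SQL on the primary
    copy_datadir: Callable[[], None],           # hypothetical: e.g. the ssh/tar/pigz pipeline
    wal_archive_ok: Callable[[], bool],         # hypothetical: e.g. the md5 check on the NFS archive
    replica_datadir: str,
    label: str = "replica_clone",               # hypothetical label; assumed to contain no quotes
) -> None:
    """Run the exclusive low-level backup steps in the order described in the thread."""
    # 1. Refuse to start unless archived WAL is reaching the replica's archive location.
    if not wal_archive_ok():
        raise RuntimeError("WAL archiving check failed; not starting the backup")

    # 2. Enter backup mode on the primary; this writes backup_label into its data directory.
    run_sql_on_primary(f"SELECT pg_start_backup('{label}')")
    try:
        # 3. Copy the data directory (backup_label included) to the replica.
        copy_datadir()
    finally:
        # 4. Always leave backup mode, even if the copy failed.
        run_sql_on_primary("SELECT pg_stop_backup()")

    # 5. Without backup_label the replica would not know it is recovering from a backup.
    if not os.path.exists(os.path.join(replica_datadir, "backup_label")):
        raise RuntimeError("backup_label missing from the copied data directory")
```

Passing the SQL runner and the copy step in as callables keeps the ordering visible and lets it be exercised without a running server. The replica would only be configured and started after this function returns without raising.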
