Обсуждение: PostgreSQL 14.8 - server fails to start even though all transaction logs with base backup are restored

Поиск
Список
Период
Сортировка

Hi Team,

 

We are following https://www.postgresql.org/docs/14/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP

Making An Exclusive Low-Level Backup.

 

After restoring, we see the server fails to start with below error:

 

PostgreSQL server log:

2023-08-04 16:47:47.227 IST [40582] LOG:  starting PostgreSQL 14.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit

2023-08-04 16:47:47.228 IST [40582] LOG:  listening on IPv4 address "0.0.0.0", port 5414

2023-08-04 16:47:47.228 IST [40582] LOG:  listening on IPv6 address "::", port 5414

2023-08-04 16:47:47.231 IST [40582] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5414"

2023-08-04 16:47:47.236 IST [40582] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5414"

2023-08-04 16:47:47.243 IST [40584] LOG:  database system was interrupted; last known up at 2023-08-04 14:48:29 IST

2023-08-04 16:47:49.254 IST [40584] LOG:  starting archive recovery

cp: cannot stat ‘/var/lib/pgsql/14/wal/00000004.history’: No such file or directory

2023-08-04 16:47:49.275 IST [40584] LOG:  restored log file "000000040000000200000041" from archive

2023-08-04 16:47:49.290 IST [40584] LOG:  redo starts at 2/41000028

2023-08-04 16:47:49.310 IST [40584] LOG:  restored log file "000000040000000200000042" from archive

2023-08-04 16:47:49.344 IST [40584] LOG:  restored log file "000000040000000200000043" from archive

2023-08-04 16:47:49.368 IST [40584] LOG:  restored log file "000000040000000200000044" from archive

2023-08-04 16:47:49.395 IST [40584] LOG:  restored log file "000000040000000200000045" from archive

2023-08-04 16:47:49.423 IST [40584] LOG:  restored log file "000000040000000200000046" from archive

2023-08-04 16:47:49.460 IST [40584] LOG:  restored log file "000000040000000200000047" from archive

2023-08-04 16:47:50.649 IST [40584] LOG:  restored log file "000000040000000200000048" from archive

2023-08-04 16:47:51.219 IST [40584] LOG:  restored log file "000000040000000200000049" from archive

cp: cannot stat ‘/var/lib/pgsql/14/wal/00000004000000020000004A’: No such file or directory

2023-08-04 16:47:51.603 IST [40584] LOG:  redo done at 2/49013500 system usage: CPU: user: 0.00 s, system: 0.19 s, elapsed: 2.31 s

2023-08-04 16:47:51.603 IST [40584] LOG:  last completed transaction was at log time 2023-08-04 14:58:51.021414+05:30

2023-08-04 16:47:51.628 IST [40584] LOG:  restored log file "000000040000000200000049" from archive

2023-08-04 16:47:51.973 IST [40584] FATAL:  WAL ends before end of online backup

2023-08-04 16:47:51.973 IST [40584] HINT:  Online backup started with pg_start_backup() must be ended with pg_stop_backup(), and all WAL up to that point must be available at recovery.

2023-08-04 16:47:51.975 IST [40582] LOG:  startup process (PID 40584) exited with exit code 1

2023-08-04 16:47:51.975 IST [40582] LOG:  terminating any other active server processes

2023-08-04 16:47:51.975 IST [40582] LOG:  shutting down due to startup process failure

2023-08-04 16:47:51.977 IST [40582] LOG:  database system is shut down

 

 

Backup_label from data directory:

 

[root@gkstandby2 data]# cat backup_label

START WAL LOCATION: 2/41000028 (file 000000040000000200000041)

CHECKPOINT LOCATION: 2/41000060

BACKUP METHOD: pg_start_backup

BACKUP FROM: primary

START TIME: 2023-08-04 14:48:29 IST

LABEL: pgida_backup_5414_108625_1691140709

START TIMELINE: 4

[root@gkstandby2 data]#

 

 

Archive log directory:

 

[root@gkstandby2 wal]# ls

00000002000000020000003B                  00000002000000020000003E.00000028.backup  000000040000000200000040.00000028.backup  000000040000000200000044                  000000040000000200000048

00000002000000020000003C                  00000002.history                          000000040000000200000041                  000000040000000200000045                  000000040000000200000049

00000002000000020000003C.00000028.backup  0000000300000005000000E4                  000000040000000200000042                  000000040000000200000045.00005AE0.backup 

00000002000000020000003D                  00000004000000020000003F                  000000040000000200000043                  000000040000000200000046

00000002000000020000003E                  000000040000000200000040                  000000040000000200000043.00000028.backup  000000040000000200000047

 

 

[root@gkstandby2 wal]# /usr/pgsql-14/bin/pg_waldump 000000040000000200000046 | grep BACKUP

rmgr: XLOG        len (rec/tot):     34/    34, tx:          0, lsn: 2/46000110, prev 2/46000098, desc: BACKUP_END 2/45005AE0

[root@gkstandby2 wal]#

 

 

We are restoring all the transaction logs required for base backup to be consistent.

BACKUP_END is present in the 000000040000000200000046 segment. Why did recovery fail to find this?

 

 

Regards,

Meera

 


"26.3.3.2. Making An Exclusive Low-Level Backup

Note

The exclusive backup method is deprecated and should be avoided. Prior to PostgreSQL 9.6, this was the only low-level method available, but it is now recommended that all users upgrade their scripts to use non-exclusive backups."
Use PgBackRest, instead.


On 8/8/23 02:39, Meera Nair wrote:
@font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face {font-family:Consolas; panose-1:2 11 6 9 2 2 4 3 2 4;}p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; font-size:11.0pt; font-family:"Calibri",sans-serif;}a:link, span.MsoHyperlink {mso-style-priority:99; color:#0563C1; text-decoration:underline;}span.EmailStyle19 {mso-style-type:personal-compose; font-family:"Calibri",sans-serif; color:windowtext;}.MsoChpDefault {mso-style-type:export-only; font-size:10.0pt; font-family:"Calibri",sans-serif; mso-ligatures:none;}div.WordSection1 {page:WordSection1;}

Hi Team,

 

We are following https://www.postgresql.org/docs/14/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP

Making An Exclusive Low-Level Backup.

 

After restoring, we see the server fails to start with below error:

 

PostgreSQL server log:

2023-08-04 16:47:47.227 IST [40582] LOG:  starting PostgreSQL 14.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit

2023-08-04 16:47:47.228 IST [40582] LOG:  listening on IPv4 address "0.0.0.0", port 5414

2023-08-04 16:47:47.228 IST [40582] LOG:  listening on IPv6 address "::", port 5414

2023-08-04 16:47:47.231 IST [40582] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5414"

2023-08-04 16:47:47.236 IST [40582] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5414"

2023-08-04 16:47:47.243 IST [40584] LOG:  database system was interrupted; last known up at 2023-08-04 14:48:29 IST

2023-08-04 16:47:49.254 IST [40584] LOG:  starting archive recovery

cp: cannot stat ‘/var/lib/pgsql/14/wal/00000004.history’: No such file or directory

2023-08-04 16:47:49.275 IST [40584] LOG:  restored log file "000000040000000200000041" from archive

2023-08-04 16:47:49.290 IST [40584] LOG:  redo starts at 2/41000028

2023-08-04 16:47:49.310 IST [40584] LOG:  restored log file "000000040000000200000042" from archive

2023-08-04 16:47:49.344 IST [40584] LOG:  restored log file "000000040000000200000043" from archive

2023-08-04 16:47:49.368 IST [40584] LOG:  restored log file "000000040000000200000044" from archive

2023-08-04 16:47:49.395 IST [40584] LOG:  restored log file "000000040000000200000045" from archive

2023-08-04 16:47:49.423 IST [40584] LOG:  restored log file "000000040000000200000046" from archive

2023-08-04 16:47:49.460 IST [40584] LOG:  restored log file "000000040000000200000047" from archive

2023-08-04 16:47:50.649 IST [40584] LOG:  restored log file "000000040000000200000048" from archive

2023-08-04 16:47:51.219 IST [40584] LOG:  restored log file "000000040000000200000049" from archive

cp: cannot stat ‘/var/lib/pgsql/14/wal/00000004000000020000004A’: No such file or directory

2023-08-04 16:47:51.603 IST [40584] LOG:  redo done at 2/49013500 system usage: CPU: user: 0.00 s, system: 0.19 s, elapsed: 2.31 s

2023-08-04 16:47:51.603 IST [40584] LOG:  last completed transaction was at log time 2023-08-04 14:58:51.021414+05:30

2023-08-04 16:47:51.628 IST [40584] LOG:  restored log file "000000040000000200000049" from archive

2023-08-04 16:47:51.973 IST [40584] FATAL:  WAL ends before end of online backup

2023-08-04 16:47:51.973 IST [40584] HINT:  Online backup started with pg_start_backup() must be ended with pg_stop_backup(), and all WAL up to that point must be available at recovery.

2023-08-04 16:47:51.975 IST [40582] LOG:  startup process (PID 40584) exited with exit code 1

2023-08-04 16:47:51.975 IST [40582] LOG:  terminating any other active server processes

2023-08-04 16:47:51.975 IST [40582] LOG:  shutting down due to startup process failure

2023-08-04 16:47:51.977 IST [40582] LOG:  database system is shut down

 

 

Backup_label from data directory:

 

[root@gkstandby2 data]# cat backup_label

START WAL LOCATION: 2/41000028 (file 000000040000000200000041)

CHECKPOINT LOCATION: 2/41000060

BACKUP METHOD: pg_start_backup

BACKUP FROM: primary

START TIME: 2023-08-04 14:48:29 IST

LABEL: pgida_backup_5414_108625_1691140709

START TIMELINE: 4

[root@gkstandby2 data]#

 

 

Archive log directory:

 

[root@gkstandby2 wal]# ls

00000002000000020000003B                  00000002000000020000003E.00000028.backup  000000040000000200000040.00000028.backup  000000040000000200000044                  000000040000000200000048

00000002000000020000003C                  00000002.history                          000000040000000200000041                  000000040000000200000045                  000000040000000200000049

00000002000000020000003C.00000028.backup  0000000300000005000000E4                  000000040000000200000042                  000000040000000200000045.00005AE0.backup 

00000002000000020000003D                  00000004000000020000003F                  000000040000000200000043                  000000040000000200000046

00000002000000020000003E                  000000040000000200000040                  000000040000000200000043.00000028.backup  000000040000000200000047

 

 

[root@gkstandby2 wal]# /usr/pgsql-14/bin/pg_waldump 000000040000000200000046 | grep BACKUP

rmgr: XLOG        len (rec/tot):     34/    34, tx:          0, lsn: 2/46000110, prev 2/46000098, desc: BACKUP_END 2/45005AE0

[root@gkstandby2 wal]#

 

 

We are restoring all the transaction logs required for base backup to be consistent.

BACKUP_END is present in the 000000040000000200000046 segment. Why did recovery fail to find this?

 

 

Regards,

Meera

 


--
Born in Arizona, moved to Babylonia.
Greetings,

* Meera Nair (mnair@commvault.com) wrote:
> We are following https://www.postgresql.org/docs/14/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP
> Making An Exclusive Low-Level Backup.

Exclusive backups have been removed, so you really don't want to be
depending on it.

> After restoring, we see the server fails to start with below error:

Your archive doesn't seem to have all of the WAL which was generated
during the backup, which means that your backup wasn't complete and the
system cannot be restored from this backup.

> We are restoring all the transaction logs required for base backup to be consistent.

Evidently not, per the complaint from PG when it starts up.

> BACKUP_END is present in the 000000040000000200000046 segment. Why did recovery fail to find this?

It's possible to have multiple backups going concurrently.  Presumably,
this was from a different backup and that's why it wasn't picked up as
being the end for this backup.  You seem to have quite a few .backup
files in your archive directory which would seem to support this.

Not sure exactly what you're doing, but unless your goal is to spend a
great deal of effort developing a PG backup solution, you're really
better off using one of the existing solutions (eg: pg_basebackup or
pgBackRest).  If you are developing your own backup solution for PG, you
definitely want to be using the new APIs and not using the exclusive
backup method.  Please review the current (15) documentation:

https://www.postgresql.org/docs/current/continuous-archiving.html

Thanks,

Stephen

Вложения

Hi Stephen,

 

The backups we ran were all exclusive backups.

 

pg_start_backup(‘label’) returned 000000040000000200000041. Then we made a copy of data directory .

Then pg_stop_backup() was executed and it returned 000000040000000200000046.

 

After this 2 databases were dropped and recreated (using dropdb and createdb). Then we ran pg_switch_wal()

This returned 000000040000000200000049. Then we made a copy of the archive log directory.

 

To verify the copied data is restorable, we stopped server. Renamed data and archive log directories.

Replaced it with data and archive log directories copied earlier. In archive log directory logs from 41 to 49 is present.

Created recovery.signal and added restore_command to postgresql.conf file. And tried to start server. this failed with error in previous mail.

 

[root@gkstandby2 wal]# ls -ltr

total 262172

-rw------- 1 postgres postgres       42 Jun 13 01:08 00000002.history

-rw------- 1 postgres postgres 16777216 Jun 22 22:50 00000002000000020000003B

-rw------- 1 postgres postgres      357 Jun 22 22:51 00000002000000020000003C.00000028.backup

-rw------- 1 postgres postgres 16777216 Jun 22 22:51 00000002000000020000003C

-rw------- 1 postgres postgres 16777216 Jun 22 22:52 00000002000000020000003D

-rw------- 1 postgres postgres      357 Jun 22 22:52 00000002000000020000003E.00000028.backup

-rw------- 1 postgres postgres 16777216 Jun 22 22:52 00000002000000020000003E

-rw------- 1 postgres postgres 16777216 Aug  4 14:47 0000000300000005000000E4

-rw------- 1 postgres postgres 16777216 Aug  4 14:48 00000004000000020000003F

-rw------- 1 postgres postgres 16777216 Aug  4 14:48 000000040000000200000040

-rw------- 1 postgres postgres 16777216 Aug  4 14:49 000000040000000200000041

-rw------- 1 postgres postgres      351 Aug  4 14:49 000000040000000200000040.00000028.backup

-rw------- 1 postgres postgres 16777216 Aug  4 14:49 000000040000000200000042

-rw------- 1 postgres postgres 16777216 Aug  4 14:49 000000040000000200000043

-rw------- 1 postgres postgres 16777216 Aug  4 14:49 000000040000000200000044

-rw------- 1 postgres postgres 16777216 Aug  4 14:50 000000040000000200000045

-rw------- 1 postgres postgres      351 Aug  4 14:50 000000040000000200000043.00000028.backup

-rw------- 1 postgres postgres 16777216 Aug  4 14:50 000000040000000200000046

-rw------- 1 postgres postgres      358 Aug  4 14:50 000000040000000200000045.00005AE0.backup

-rw------- 1 postgres postgres 16777216 Aug  4 14:58 000000040000000200000047

-rw------- 1 postgres postgres 16777216 Aug  4 14:58 000000040000000200000048

-rw------- 1 postgres postgres 16777216 Aug  4 14:58 000000040000000200000049

[root@gkstandby2 wal]#

 

Regards,

Meera

 

-----Original Message-----
From: Stephen Frost <sfrost@snowman.net>
Sent: Tuesday, August 8, 2023 7:48 PM
To: Meera Nair <mnair@commvault.com>
Cc: pgsql-general@lists.postgresql.org; Punit Pranesh Koujalgi <pkoujalgi@commvault.com>
Subject: Re: PostgreSQL 14.8 - server fails to start even though all transaction logs with base backup are restored

 

Greetings,

 

* Meera Nair (mnair@commvault.com) wrote:

> We are following

> https://www.postgresql.org/docs/14/continuous-archiving.html#BACKUP-LO

> WLEVEL-BASE-BACKUP

> Making An Exclusive Low-Level Backup.

 

Exclusive backups have been removed, so you really don't want to be depending on it.

 

> After restoring, we see the server fails to start with below error:

 

Your archive doesn't seem to have all of the WAL which was generated during the backup, which means that your backup wasn't complete and the system cannot be restored from this backup.

 

> We are restoring all the transaction logs required for base backup to be consistent.

 

Evidently not, per the complaint from PG when it starts up.

 

> BACKUP_END is present in the 000000040000000200000046 segment. Why did recovery fail to find this?

 

It's possible to have multiple backups going concurrently.  Presumably, this was from a different backup and that's why it wasn't picked up as being the end for this backup.  You seem to have quite a few .backup files in your archive directory which would seem to support this.

 

Not sure exactly what you're doing, but unless your goal is to spend a great deal of effort developing a PG backup solution, you're really better off using one of the existing solutions (eg: pg_basebackup or pgBackRest).  If you are developing your own backup solution for PG, you definitely want to be using the new APIs and not using the exclusive backup method.  Please review the current (15) documentation:

 

https://www.postgresql.org/docs/current/continuous-archiving.html

 

Thanks,

 

Stephen

Greetings,

* Meera Nair (mnair@commvault.com) wrote:
> The backups we ran were all exclusive backups.

You should not use exclusive backups.  They've been deprecated for a
long, long time and have been removed in the most recent version of PG.

You really should not be trying to build your own tooling for this.
Please, use an existing tool where the authors have already worked
through how all of this works and understand how BACKUP_END works and
how to reliably perform a backup and restore of PG.  Despite the
documentation seeming to imply that it's trivial to do this on your own,
it really isn't.

> pg_start_backup('label') returned 000000040000000200000041. Then we made a copy of data directory .
> Then pg_stop_backup() was executed and it returned 000000040000000200000046.

I'm really not sure what you're doing exactly, but it sure looks like
the BACKUP_END in the 46 WAL file is associated with the .backup file
for the backup started in the 45 segment.  You have this file:

000000040000000200000045.00005AE0.backup

and then this is what you reported from the 46 file:

rmgr: XLOG        len (rec/tot):     34/    34, tx:          0, lsn: 2/46000110, prev 2/46000098, desc: BACKUP_END
2/45005AE0

Note that the 'BACKUP_END' data content there is the starting point of
the backup for which that is the BACKUP_END for- 2/45005AE0.  That
matches with a backup being started in the 45 file and matches the
start time of the .backup file that you have in the archive.  That's why
the restore of the backup started in the 41 segment isn't picking up on
the BACKUP_END in the 46 file as being the one matching that backup-
because that's from some other backup.

> To verify the copied data is restorable, we stopped server. Renamed data and archive log directories.
> Replaced it with data and archive log directories copied earlier. In archive log directory logs from 41 to 49 is
present.

The pg_wal dir can and should be empty at the start.

> Created recovery.signal and added restore_command to postgresql.conf file. And tried to start server. this failed
witherror in previous mail.
 

The restore_command should be copying WAL files from the location where
they were stored by the archive_command.

Thanks,

Stephen

Вложения