Обсуждение: Issue on restore / recover

Поиск
Список
Период
Сортировка

Issue on restore / recover

От
flumbador@virgilio.it
Дата:

Hello

I've a question about postgresql restore / recover.

I have restored and recovered a Postgresql 9.4.9 from an hot backup. The backup is a filesystem copy taken while the Postgrsql is in backup mode (I mean between start and stop backup).

During the restore 3 files were missing; these three files belong to a table with high transaction workload, and for sure during the backup many transactions had modified this table and those missing files. What is surprising to me is that even if the files were missing the recover phase ended successfully. I expect an error (for example file not found) raised when postgresql try to apply the wal entries related to this table and those files. After the recover I find that these three file has been created during recover but when I try to query the table I get the error:

db4=# select count(*) from pgbench_accounts ;

ERROR: could not read block 1999996 in file "pg_tblspc/16471/PG_9.4_201409291/16474/16593.15": read only 0 of 8192 bytes

This error confirm that my database is corrupted! The question is: why during the recover phase Postgresql doesn't throw any errors? I think that is better to know immediately that we have restored our database from a corrupted backup rather then discover the issue after maybe a long time when a query is executed on the corrupted table.

Thanks in advance

Regards

Gabriele

Re: Issue on restore / recover

От
Ben Primrose
Дата:
How was the backup taken, and what platform are you on?

Are these tables in unlogged mode?  Check pg_class.relpersistence to find out.

On Fri, Dec 29, 2017 at 10:21 AM, <flumbador@virgilio.it> wrote:

Hello

I've a question about postgresql restore / recover.

I have restored and recovered a Postgresql 9.4.9 from an hot backup. The backup is a filesystem copy taken while the Postgrsql is in backup mode (I mean between start and stop backup).

During the restore 3 files were missing; these three files belong to a table with high transaction workload, and for sure during the backup many transactions had modified this table and those missing files. What is surprising to me is that even if the files were missing the recover phase ended successfully. I expect an error (for example file not found) raised when postgresql try to apply the wal entries related to this table and those files. After the recover I find that these three file has been created during recover but when I try to query the table I get the error:

db4=# select count(*) from pgbench_accounts ;

ERROR: could not read block 1999996 in file "pg_tblspc/16471/PG_9.4_201409291/16474/16593.15": read only 0 of 8192 bytes

This error confirm that my database is corrupted! The question is: why during the recover phase Postgresql doesn't throw any errors? I think that is better to know immediately that we have restored our database from a corrupted backup rather then discover the issue after maybe a long time when a query is executed on the corrupted table.

Thanks in advance

Regards

Gabriele




--

Ben Primrose  |  Postgres DBA |  TraceLink Inc.
400 Riverpark Dr.
Floor 2, Suite 200
North Reading, MA 01864
o: +1.978.396.6507
e: bprimrose@tracelink.com

Re: Issue on restore / recover

От
Stephen Frost
Дата:
Greetings,

* flumbador@virgilio.it (flumbador@virgilio.it) wrote:
> I have restored and recovered a Postgresql 9.4.9 from an hot backup. The backup is a filesystem copy taken while the
Postgrsqlis in backup mode (I mean between start and stop backup). 
>
> During the restore 3 files were missing; these three files belong to a table with high transaction workload, and for
sureduring the backup many transactions had modified this table and those missing files. What is surprising to me is
thateven if the files were missing the recover phase ended successfully. I expect an error (for example file not found)
raisedwhen postgresql try to apply the wal entries related to this table and those files. After the recover I find that
thesethree file has been created during recover but when I try to query the table I get the error: 
>
> db4=# select count(*) from pgbench_accounts ;
>
> ERROR: could not read block 1999996 in file "pg_tblspc/16471/PG_9.4_201409291/16474/16593.15": read only 0 of 8192
bytes
>
> This error confirm that my database is corrupted! The question is: why during the recover phase Postgresql doesn't
throwany errors? I think that is better to know immediately that we have restored our database from a corrupted backup
ratherthen discover the issue after maybe a long time when a query is executed on the corrupted table. 

How are you handling WAL archiving?  Based on the symptoms, my first
guess is that you aren't putting the backup_label file in place (or
you're removing it) before PG starts doing WAL replay, which makes PG
think it's just doing crash recovery and so it does *some* WAL replay
but not *all* that it needs to do (it needs to start farther back, where
the backup started), which can certainly result in a corrupt backup.

I'd strongly suggest you look at the various backup tools which exist
for PostgreSQL and know how to properly do backups and restores instead
of trying to roll your own.  Personally, I'd suggest pgBackRest but
there are other tools out there such as barman and WAL-E/G that you
might consider.

Thanks!

Stephen

Вложения

Re: Issue on restore / recover

От
Stephen Frost
Дата:
Greetings Ben,

* Ben Primrose (bprimrose@tracelink.com) wrote:
> How was the backup taken, and what platform are you on?

Good questions, I've asked them also.

> Are these tables in unlogged mode?  Check pg_class.relpersistence to find
> out.
[...]
> > db4=# select count(*) from pgbench_accounts ;
> >
> > ERROR: could not read block 1999996 in file "pg_tblspc/16471/PG_9.4_201409291/16474/16593.15":
> > read only 0 of 8192 bytes

PostgreSQL knows how to handle unlogged tables when doing backup/restore
(they'll get zero'd out during the restore process) and so you wouldn't
(shouldn't anyway) see an error like this even if the table is unlogged
(instead, it would just be empty).

Thanks!

Stephen

Вложения

Re: Issue on restore / recover

От
Alvaro Herrera
Дата:
flumbador@virgilio.it wrote:

> During the restore 3 files were missing; these three files belong to a table with high transaction workload, and for
sureduring the backup many transactions had modified this table and those missing files. What is surprising to me is
thateven if the files were missing the recover phase ended successfully. I expect an error (for example file not found)
raisedwhen postgresql try to apply the wal entries related to this table and those files. After the recover I find that
thesethree file has been created during recover but when I try to query the table I get the error:
 
> 
> db4=# select count(*) from pgbench_accounts ;
> 
> ERROR: could not read block 1999996 in file "pg_tblspc/16471/PG_9.4_201409291/16474/16593.15": read only 0 of 8192
bytes

Can you please get a backtrace from this error?  I suspect that the
problem here is that the planner is trying to estimate cardinality by
descending a btree index, and failing because the index points to a part
of the table that has been truncated.  But if the table has been
truncated because it's unlogged, then the index should have been
truncated too; and if it's a permanent table, then it has been truncated
improperly (because no index pointer to the truncated part should
remain).

What files do exist for pg_tblspc/16471/PG_9.4_201409291/16474/16593* ?
(In particular I'm curious about whether 16593.14 exists and what size
it is)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Issue on restore / recover

От
flumbador@virgilio.it
Дата:

Hi postgresql is 9.4.9 on SUSE 11sp4 ... The backup is a filesystem backup with Networker.
I confirm that I handle WAL archiving and backup_label file was in place when I started the recovery. The table is Persistent (Relpersistence = 'p').
The postgresql was succesfully restored from another backup and now is up and running; i'm not able to provide further datails.
I try to reproduce the issue on a test environment.
I'll update the thread with further details as soon as I reproduce the issue.
Thanks to all
Best Regards
Gabriele

Il 2 gennaio 2018 alle 16.38 Alvaro Herrera <alvherre@alvh.no-ip.org> ha scritto:

flumbador@virgilio.it wrote:

During the restore 3 files were missing; these three files belong to a table with high transaction workload, and for sure during the backup many transactions had modified this table and those missing files. What is surprising to me is that even if the files were missing the recover phase ended successfully. I expect an error (for example file not found) raised when postgresql try to apply the wal entries related to this table and those files. After the recover I find that these three file has been created during recover but when I try to query the table I get the error:

db4=# select count(*) from pgbench_accounts ;

ERROR: could not read block 1999996 in file "pg_tblspc/16471/PG_9.4_201409291/16474/16593.15": read only 0 of 8192 bytes

Can you please get a backtrace from this error? I suspect that the
problem here is that the planner is trying to estimate cardinality by
descending a btree index, and failing because the index points to a part
of the table that has been truncated. But if the table has been
truncated because it's unlogged, then the index should have been
truncated too; and if it's a permanent table, then it has been truncated
improperly (because no index pointer to the truncated part should
remain).

What files do exist for pg_tblspc/16471/PG_9.4_201409291/16474/16593* ?
(In particular I'm curious about whether 16593.14 exists and what size
it is)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Issue on restore / recover

От
flumbador@virgilio.it
Дата:

Hi All, 

I reproduced the issue with the following test case:


1) Create a table for testing porpouse:

pdbabe=# CREATE TABLE test_corrupt (order_date timestamptz, item char(1024));
CREATE TABLE

2) Insert some data in the table

pdbabe=# INSERT INTO test_corrupt (order_date, item)
SELECT x, 'generic item'
FROM generate_series('2018-01-01 00:00:00'::timestamptz, '2018-01-01 00:10:00'::timestamptz,'60 seconds'::interval) a(x);
INSERT 0 11


pdbabe=# select order_date, substring(item, 1, 20) from test_corrupt ;
order_date | substring
------------------------+--------------
2018-01-01 00:00:00+01 | generic item
2018-01-01 00:01:00+01 | generic item
2018-01-01 00:02:00+01 | generic item
2018-01-01 00:03:00+01 | generic item
2018-01-01 00:04:00+01 | generic item
2018-01-01 00:05:00+01 | generic item
2018-01-01 00:06:00+01 | generic item
2018-01-01 00:07:00+01 | generic item
2018-01-01 00:08:00+01 | generic item
2018-01-01 00:09:00+01 | generic item
2018-01-01 00:10:00+01 | generic item


3) Check the file that contain the table:

pdbabe=# SELECT pg_relation_filepath('test_corrupt');
pg_relation_filepath
----------------------------------------------
pg_tblspc/16385/PG_9.4_201409291/16386/49391
(1 row)


pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 16384 Jan 9 14:13 49391
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm


4) At 2.40 A.M. Start the backup of the database (filesystem backup):

pg_start_backup
-----------------
17/30000028
(1 row)


5) At 2:57 A.M.  new record is inserted in the table.

insert into test_corrupt values (current_timestamp, 'new item') ;

After the insert the content of the table is:

pdbabe=# select order_date, substring(item, 1, 20) from test_corrupt ;
order_date | substring
-------------------------------+--------------
2018-01-01 00:00:00+01 | generic item
2018-01-01 00:01:00+01 | generic item
2018-01-01 00:02:00+01 | generic item
2018-01-01 00:03:00+01 | generic item
2018-01-01 00:04:00+01 | generic item
2018-01-01 00:05:00+01 | generic item
2018-01-01 00:06:00+01 | generic item
2018-01-01 00:07:00+01 | generic item
2018-01-01 00:08:00+01 | generic item
2018-01-01 00:09:00+01 | generic item
2018-01-01 00:10:00+01 | generic item
2018-01-10 02:57:01.106676+01 | new item
(12 rows)

No other insert / update / delete are executed against the table until the restore / recover.


6) The backup of filesystem on which resides the Postgresql, ended at 4:01 A.M

pg_stop_backup | now
----------------+-------------------------------
17/7EFE3900 | 2018-01-10 04:01:53.496869+01
(1 row)


8) Restore of the postgresql database, by executung the restore of the previous filesystem backup.


9) Contets of backup_label restored:

pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe> more backup_label
START WAL LOCATION: 17/30000028 (file 000000070000001700000030)
CHECKPOINT LOCATION: 17/30000060
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2018-01-10 02:40:00 CET
LABEL: bckdb


10) Delete of the file containg the table "test_corrupt" to simulate the file missing in the backup:

pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe> cd $PGDATA/pg_tblspc/16385/PG_9.4_201409291/16386
pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 16384 Jan 10 02:57 49391
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm
pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> rm ./49391
pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm
pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386>


12) Create file recovery.conf:

restore_command = 'cp /pgsql/pdbabe/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /pgsql/pdbabe/archive %r'


13) Start Postgresql to recover the database ... Recover is successfull ... NO error in logfile:

2018-01-10 12:05:25 CET [30299]: [1-1] user=,db=,app=,client= LOG: database system was interrupted; last known up at 2018-01
-10 02:50:00 CET
2018-01-10 12:05:25 CET [30299]: [2-1] user=,db=,app=,client= LOG: creating missing WAL directory "pg_xlog/archive_status"
2018-01-10 12:05:25 CET [30299]: [3-1] user=,db=,app=,client= LOG: starting archive recovery
2018-01-10 12:05:25 CET [30299]: [4-1] user=,db=,app=,client= LOG: restored log file "00000007.history" from archive
2018-01-10 12:05:25 CET [30299]: [5-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000030" from archive
2018-01-10 12:05:25 CET [30299]: [6-1] user=,db=,app=,client= LOG: redo starts at 17/30000028
2018-01-10 12:05:25 CET [30299]: [7-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000031" from archive
2018-01-10 12:05:26 CET [30299]: [8-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000032" from archive
2018-01-10 12:05:26 CET [30299]: [9-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000033" from archive
......
2018-01-10 12:05:46 CET [30299]: [81-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007B" from archiv
e
2018-01-10 12:05:48 CET [30299]: [82-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007C" from archiv
e
2018-01-10 12:05:49 CET [30299]: [83-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007D" from archiv
e
2018-01-10 12:05:50 CET [30299]: [84-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007E" from archiv
e
2018-01-10 12:05:50 CET [30299]: [85-1] user=,db=,app=,client= LOG: consistent recovery state reached at 17/7EFE3900
2018-01-10 12:05:50 CET [30297]: [3-1] user=,db=,app=,client= LOG: database system is ready to accept read only connections
2018-01-10 12:05:50 CET [30299]: [86-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007F" from archiv
e
2018-01-10 12:05:50 CET [30299]: [87-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000080" from archiv
e
cp: cannot stat `/pgsql/pdbabe/archive/000000070000001700000081': No such file or directory
2018-01-10 12:05:50 CET [30299]: [88-1] user=,db=,app=,client= LOG: redo done at 17/800875F0
2018-01-10 12:05:50 CET [30299]: [89-1] user=,db=,app=,client= LOG: last completed transaction was at log time 2018-01-10 09
:17:01.870329+01
2018-01-10 12:05:53 CET [30299]: [90-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000080" from archiv
e
cp: cannot stat `/pgsql/pdbabe/archive/00000008.history': No such file or directory
2018-01-10 12:05:53 CET [30299]: [91-1] user=,db=,app=,client= LOG: selected new timeline ID: 8
2018-01-10 12:05:53 CET [30299]: [92-1] user=,db=,app=,client= LOG: restored log file "00000007.history" from archive
2018-01-10 12:05:53 CET [30299]: [93-1] user=,db=,app=,client= LOG: archive recovery complete
2018-01-10 12:05:53 CET [30302]: [1-1] user=,db=,app=,client= LOG: checkpoint starting: end-of-recovery immediate wait
2018-01-10 12:05:57 CET [30397]: [1-1] user=[unknown],db=[unknown],app=[unknown],client=10.151.201.1 LOG: connection receive
d: host=10.151.201.1 port=51057
2018-01-10 12:05:57 CET [30397]: [2-1] user=patrol,db=pdbabe,app=[unknown],client=10.151.201.1 FATAL: no pg_hba.conf entry f
or host "10.151.201.1", user "patrol", database "pdbabe", SSL off
2018-01-10 12:06:57 CET [30410]: [1-1] user=[unknown],db=[unknown],app=[unknown],client=10.151.201.1 LOG: connection receive
d: host=10.151.201.1 port=51058
2018-01-10 12:06:57 CET [30410]: [2-1] user=patrol,db=pdbabe,app=[unknown],client=10.151.201.1 FATAL: no pg_hba.conf entry f
or host "10.151.201.1", user "patrol", database "pdbabe", SSL off
2018-01-10 12:07:15 CET [30302]: [2-1] user=,db=,app=,client= LOG: checkpoint complete: wrote 254519 buffers (99.4%); 0 tran
saction log file(s) added, 0 removed, 0 recycled; write=71.245 s, sync=10.241 s, total=81.566 s; sync files=39, longest=10.21
4 s, average=0.262 s
2018-01-10 12:07:15 CET [30299]: [94-1] user=,db=,app=,client= LOG: MultiXact member wraparound protections are now enabled
2018-01-10 12:07:15 CET [30297]: [4-1] user=,db=,app=,client= LOG: database system is ready to accept connections
2018-01-10 12:07:15 CET [30421]: [1-1] user=,db=,app=,client= LOG: autovacuum launcher started



14) After the recover the file deleted is in place:

pgpdbabe@elaawddbad00:/pgsql/pdbabe/admin/log> cd $PGDATA/pg_tblspc/16385/PG_9.4_201409291/16386
pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 16384 Jan 10 12:10 49391         <-----------
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm



15) Let's check the content of the table ... no errors raised but the table has 5 records and not 12 records (as it should have).

pgpdbabe@elaawddbad00:/pgsql/pdbabe/admin/log> psql
psql (9.4.9)
Type "help" for help.

pdbabe=# select order_date, substring(item, 1, 20) from test_corrupt ;
order_date | substring
-------------------------------+--------------
2018-01-01 00:07:00+01 | generic item
2018-01-01 00:08:00+01 | generic item
2018-01-01 00:09:00+01 | generic item
2018-01-01 00:10:00+01 | generic item
2018-01-10 02:57:01.106676+01 | new item
(5 rows)

In conclusion, the recover operation was successfully but the database is corrupted. The point is: when the recover find a missing file it
should exit with an error instead of going ahead and terminating succesfully. What do you think about this opinion? Am I missing something?

Thanks & Regards
Gabriele

Il 8 gennaio 2018 alle 14.53 flumbador@virgilio.it ha scritto:

Hi postgresql is 9.4.9 on SUSE 11sp4 ... The backup is a filesystem backup with Networker.
I confirm that I handle WAL archiving and backup_label file was in place when I started the recovery. The table is Persistent (Relpersistence = 'p').
The postgresql was succesfully restored from another backup and now is up and running; i'm not able to provide further datails.
I try to reproduce the issue on a test environment.
I'll update the thread with further details as soon as I reproduce the issue.
Thanks to all
Best Regards
Gabriele

Il 2 gennaio 2018 alle 16.38 Alvaro Herrera <alvherre@alvh.no-ip.org> ha scritto:

flumbador@virgilio.it wrote:

During the restore 3 files were missing; these three files belong to a table with high transaction workload, and for sure during the backup many transactions had modified this table and those missing files. What is surprising to me is that even if the files were missing the recover phase ended successfully. I expect an error (for example file not found) raised when postgresql try to apply the wal entries related to this table and those files. After the recover I find that these three file has been created during recover but when I try to query the table I get the error:

db4=# select count(*) from pgbench_accounts ;

ERROR: could not read block 1999996 in file "pg_tblspc/16471/PG_9.4_201409291/16474/16593.15": read only 0 of 8192 bytes

Can you please get a backtrace from this error? I suspect that the
problem here is that the planner is trying to estimate cardinality by
descending a btree index, and failing because the index points to a part
of the table that has been truncated. But if the table has been
truncated because it's unlogged, then the index should have been
truncated too; and if it's a permanent table, then it has been truncated
improperly (because no index pointer to the truncated part should
remain).

What files do exist for pg_tblspc/16471/PG_9.4_201409291/16474/16593* ?
(In particular I'm curious about whether 16593.14 exists and what size
it is)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Issue on restore / recover

От
Rui DeSousa
Дата:
You’ve just corrupted your system…  not sure what you actually trying to reproduced.

10) Delete of the file containg the table "test_corrupt" to simulate the file missing in the backup:


What is being used to backup the files? What are your file system mount options? Sound’s like your backup is not backing up all the files.


On Jan 10, 2018, at 11:28 AM, flumbador@virgilio.it wrote:

10) Delete of the file containg the table "test_corrupt" to simulate the file missing in the backup:


Re: Issue on restore / recover

От
flumbador@virgilio.it
Дата:

Hi,

I want to show that postgresql, while is recovering the database applying the WAL file, don't raise any error if it can't find a missing file.  It continue the recovery that end successfully. I know that the source of the problem is that I have to guarantee that the backup is complete and consistent but it sound me strange that postgresql don't signal any errors when it can't find a file .... I hope that is clear the scope of the testcase.

Thanks & Regards

Gabriele

 

Il 10 gennaio 2018 alle 18.04 Rui DeSousa <rui.desousa@icloud.com> ha scritto:

You’ve just corrupted your system…  not sure what you actually trying to reproduced.

10) Delete of the file containg the table "test_corrupt" to simulate the file missing in the backup:


What is being used to backup the files? What are your file system mount options? Sound’s like your backup is not backing up all the files.


On Jan 10, 2018, at 11:28 AM, flumbador@virgilio.it wrote:

10) Delete of the file containg the table "test_corrupt" to simulate the file missing in the backup:


Re: Issue on restore / recover

От
Alvaro Herrera
Дата:
flumbador@virgilio.it wrote:

> I want to show that postgresql, while is recovering the database
> applying the WAL file, don't raise any error if it can't find a
> missing file.  It continue the recovery that end successfully. I know
> that the source of the problem is that I have to guarantee that the
> backup is complete and consistent but it sound me strange that
> postgresql don't signal any errors when it can't find a file .... I
> hope that is clear the scope of the testcase.

It seems clear to me.  Our current answer is that we just don't check
for this.  It seems a very difficult situation to check for -- firstly
because, as you must have realized by now, the recovery process creates
some files that are missing when recovery starts, so just checking "do
all files exist" prior to starting recovery is not the right thing to
check for.  Maybe you could figure out a check to run just at the point
where consistent point is reached, but (secondly) that would be a very
expensive way to check for a condition in the "should not happen" camp
(namely, that your backup tool is randomly losing files).

Maybe one thing that is happening is that if you lose a whole file, and
later the recovery process fills page 2 with the correct tuples, then
page 0 and page 1 are initialized too to the empty state as a side effect.
If this is indeed what is happening, I have to say that it's not great.
But again, with a working backup system you wouldn't have lost the file
in the first place, so the case doesn't arise.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Issue on restore / recover

От
flumbador@virgilio.it
Дата:

Hi

I agree with You!

The most important thing is that the backup is consistent and don't loose any files. However my opinion  is that postgresql should raise an error when it find a missing file during recovery, and not try to initialize it with empty pages ....

Thanks to all for the support

Best Regards

Gabriele 


Il 12 gennaio 2018 alle 13.52 Alvaro Herrera <alvherre@alvh.no-ip.org> ha scritto:

flumbador@virgilio.it wrote:

I want to show that postgresql, while is recovering the database
applying the WAL file, don't raise any error if it can't find a
missing file. It continue the recovery that end successfully. I know
that the source of the problem is that I have to guarantee that the
backup is complete and consistent but it sound me strange that
postgresql don't signal any errors when it can't find a file .... I
hope that is clear the scope of the testcase.

It seems clear to me. Our current answer is that we just don't check
for this. It seems a very difficult situation to check for -- firstly
because, as you must have realized by now, the recovery process creates
some files that are missing when recovery starts, so just checking "do
all files exist" prior to starting recovery is not the right thing to
check for. Maybe you could figure out a check to run just at the point
where consistent point is reached, but (secondly) that would be a very
expensive way to check for a condition in the "should not happen" camp
(namely, that your backup tool is randomly losing files).

Maybe one thing that is happening is that if you lose a whole file, and
later the recovery process fills page 2 with the correct tuples, then
page 0 and page 1 are initialized too to the empty state as a side effect.
If this is indeed what is happening, I have to say that it's not great.
But again, with a working backup system you wouldn't have lost the file
in the first place, so the case doesn't arise.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services