Re: Issue on restore / recover

Поиск

Список

Период

Сортировка

От	flumbador@virgilio.it
Тема	Re: Issue on restore / recover
Дата	10 января 2018 г. 22:28:32
Msg-id	858243188.823822.1515601712820@mail.virgilio.it обсуждение исходный текст
Ответ на	Re: Issue on restore / recover (flumbador@virgilio.it)
Ответы	Re: Issue on restore / recover
Список	pgsql-admin

Дерево обсуждения

Hi All,

I reproduced the issue with the following test case:

1) Create a table for testing porpouse:

pdbabe=# CREATE TABLE test_corrupt (order_date timestamptz, item char(1024));
CREATE TABLE

2) Insert some data in the table

pdbabe=# INSERT INTO test_corrupt (order_date, item)
SELECT x, 'generic item'
FROM generate_series('2018-01-01 00:00:00'::timestamptz, '2018-01-01 00:10:00'::timestamptz,'60 seconds'::interval) a(x);
INSERT 0 11

3) Check the file that contain the table:

pdbabe=# SELECT pg_relation_filepath('test_corrupt');
pg_relation_filepath
----------------------------------------------
pg_tblspc/16385/PG_9.4_201409291/16386/49391
(1 row)

pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 16384 Jan 9 14:13 49391
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm

4) At 2.40 A.M. Start the backup of the database (filesystem backup):

pg_start_backup
-----------------
17/30000028
(1 row)

5) At 2:57 A.M. new record is inserted in the table.

insert into test_corrupt values (current_timestamp, 'new item') ;

After the insert the content of the table is:

No other insert / update / delete are executed against the table until the restore / recover.

6) The backup of filesystem on which resides the Postgresql, ended at 4:01 A.M

pg_stop_backup | now
----------------+-------------------------------
17/7EFE3900 | 2018-01-10 04:01:53.496869+01
(1 row)

8) Restore of the postgresql database, by executung the restore of the previous filesystem backup.

9) Contets of backup_label restored:

pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe> more backup_label
START WAL LOCATION: 17/30000028 (file 000000070000001700000030)
CHECKPOINT LOCATION: 17/30000060
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2018-01-10 02:40:00 CET
LABEL: bckdb

10) Delete of the file containg the table "test_corrupt" to simulate the file missing in the backup:

pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe> cd $PGDATA/pg_tblspc/16385/PG_9.4_201409291/16386
pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 16384 Jan 10 02:57 49391
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm
pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> rm ./49391
pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm
pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386>

12) Create file recovery.conf:

restore_command = 'cp /pgsql/pdbabe/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /pgsql/pdbabe/archive %r'

13) Start Postgresql to recover the database ... Recover is successfull ... NO error in logfile:

2018-01-10 12:05:25 CET [30299]: [1-1] user=,db=,app=,client= LOG: database system was interrupted; last known up at 2018-01
-10 02:50:00 CET
2018-01-10 12:05:25 CET [30299]: [2-1] user=,db=,app=,client= LOG: creating missing WAL directory "pg_xlog/archive_status"
2018-01-10 12:05:25 CET [30299]: [3-1] user=,db=,app=,client= LOG: starting archive recovery
2018-01-10 12:05:25 CET [30299]: [4-1] user=,db=,app=,client= LOG: restored log file "00000007.history" from archive
2018-01-10 12:05:25 CET [30299]: [5-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000030" from archive
2018-01-10 12:05:25 CET [30299]: [6-1] user=,db=,app=,client= LOG: redo starts at 17/30000028
2018-01-10 12:05:25 CET [30299]: [7-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000031" from archive
2018-01-10 12:05:26 CET [30299]: [8-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000032" from archive
2018-01-10 12:05:26 CET [30299]: [9-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000033" from archive
......
2018-01-10 12:05:46 CET [30299]: [81-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007B" from archiv
e
2018-01-10 12:05:48 CET [30299]: [82-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007C" from archiv
e
2018-01-10 12:05:49 CET [30299]: [83-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007D" from archiv
e
2018-01-10 12:05:50 CET [30299]: [84-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007E" from archiv
e
2018-01-10 12:05:50 CET [30299]: [85-1] user=,db=,app=,client= LOG: consistent recovery state reached at 17/7EFE3900
2018-01-10 12:05:50 CET [30297]: [3-1] user=,db=,app=,client= LOG: database system is ready to accept read only connections
2018-01-10 12:05:50 CET [30299]: [86-1] user=,db=,app=,client= LOG: restored log file "00000007000000170000007F" from archiv
e
2018-01-10 12:05:50 CET [30299]: [87-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000080" from archiv
e
cp: cannot stat `/pgsql/pdbabe/archive/000000070000001700000081': No such file or directory
2018-01-10 12:05:50 CET [30299]: [88-1] user=,db=,app=,client= LOG: redo done at 17/800875F0
2018-01-10 12:05:50 CET [30299]: [89-1] user=,db=,app=,client= LOG: last completed transaction was at log time 2018-01-10 09
:17:01.870329+01
2018-01-10 12:05:53 CET [30299]: [90-1] user=,db=,app=,client= LOG: restored log file "000000070000001700000080" from archiv
e
cp: cannot stat `/pgsql/pdbabe/archive/00000008.history': No such file or directory
2018-01-10 12:05:53 CET [30299]: [91-1] user=,db=,app=,client= LOG: selected new timeline ID: 8
2018-01-10 12:05:53 CET [30299]: [92-1] user=,db=,app=,client= LOG: restored log file "00000007.history" from archive
2018-01-10 12:05:53 CET [30299]: [93-1] user=,db=,app=,client= LOG: archive recovery complete
2018-01-10 12:05:53 CET [30302]: [1-1] user=,db=,app=,client= LOG: checkpoint starting: end-of-recovery immediate wait
2018-01-10 12:05:57 CET [30397]: [1-1] user=[unknown],db=[unknown],app=[unknown],client=10.151.201.1 LOG: connection receive
d: host=10.151.201.1 port=51057
2018-01-10 12:05:57 CET [30397]: [2-1] user=patrol,db=pdbabe,app=[unknown],client=10.151.201.1 FATAL: no pg_hba.conf entry f
or host "10.151.201.1", user "patrol", database "pdbabe", SSL off
2018-01-10 12:06:57 CET [30410]: [1-1] user=[unknown],db=[unknown],app=[unknown],client=10.151.201.1 LOG: connection receive
d: host=10.151.201.1 port=51058
2018-01-10 12:06:57 CET [30410]: [2-1] user=patrol,db=pdbabe,app=[unknown],client=10.151.201.1 FATAL: no pg_hba.conf entry f
or host "10.151.201.1", user "patrol", database "pdbabe", SSL off
2018-01-10 12:07:15 CET [30302]: [2-1] user=,db=,app=,client= LOG: checkpoint complete: wrote 254519 buffers (99.4%); 0 tran
saction log file(s) added, 0 removed, 0 recycled; write=71.245 s, sync=10.241 s, total=81.566 s; sync files=39, longest=10.21
4 s, average=0.262 s
2018-01-10 12:07:15 CET [30299]: [94-1] user=,db=,app=,client= LOG: MultiXact member wraparound protections are now enabled
2018-01-10 12:07:15 CET [30297]: [4-1] user=,db=,app=,client= LOG: database system is ready to accept connections
2018-01-10 12:07:15 CET [30421]: [1-1] user=,db=,app=,client= LOG: autovacuum launcher started

14) After the recover the file deleted is in place:

pgpdbabe@elaawddbad00:/pgsql/pdbabe/admin/log> cd $PGDATA/pg_tblspc/16385/PG_9.4_201409291/16386
pgpdbabe@elaawddbad00:/pgsql/pdbabe/data/pdbabe/pg_tblspc/16385/PG_9.4_201409291/16386> ls -la | grep 49391
-rw------- 1 pgpdbabe pdbabe 16384 Jan 10 12:10 49391 <-----------
-rw------- 1 pgpdbabe pdbabe 24576 Jan 9 14:13 49391_fsm

15) Let's check the content of the table ... no errors raised but the table has 5 records and not 12 records (as it should have).

pgpdbabe@elaawddbad00:/pgsql/pdbabe/admin/log> psql
psql (9.4.9)
Type "help" for help.

pdbabe=# select order_date, substring(item, 1, 20) from test_corrupt ;
order_date | substring
-------------------------------+--------------
2018-01-01 00:07:00+01 | generic item
2018-01-01 00:08:00+01 | generic item
2018-01-01 00:09:00+01 | generic item
2018-01-01 00:10:00+01 | generic item
2018-01-10 02:57:01.106676+01 | new item
(5 rows)

In conclusion, the recover operation was successfully but the database is corrupted. The point is: when the recover find a missing file it
should exit with an error instead of going ahead and terminating succesfully. What do you think about this opinion? Am I missing something?

Thanks & Regards
Gabriele

Il 8 gennaio 2018 alle 14.53 flumbador@virgilio.it ha scritto:

Hi postgresql is 9.4.9 on SUSE 11sp4 ... The backup is a filesystem backup with Networker.
I confirm that I handle WAL archiving and backup_label file was in place when I started the recovery. The table is Persistent (Relpersistence = 'p').
The postgresql was succesfully restored from another backup and now is up and running; i'm not able to provide further datails.
I try to reproduce the issue on a test environment.
I'll update the thread with further details as soon as I reproduce the issue.
Thanks to all
Best Regards
Gabriele
Il 2 gennaio 2018 alle 16.38 Alvaro Herrera <alvherre@alvh.no-ip.org> ha scritto:
flumbador@virgilio.it wrote:
During the restore 3 files were missing; these three files belong to a table with high transaction workload, and for sure during the backup many transactions had modified this table and those missing files. What is surprising to me is that even if the files were missing the recover phase ended successfully. I expect an error (for example file not found) raised when postgresql try to apply the wal entries related to this table and those files. After the recover I find that these three file has been created during recover but when I try to query the table I get the error:
db4=# select count(*) from pgbench_accounts ;
ERROR: could not read block 1999996 in file "pg_tblspc/16471/PG_9.4_201409291/16474/16593.15": read only 0 of 8192 bytes
Can you please get a backtrace from this error? I suspect that the
problem here is that the planner is trying to estimate cardinality by
descending a btree index, and failing because the index points to a part
of the table that has been truncated. But if the table has been
truncated because it's unlogged, then the index should have been
truncated too; and if it's a permanent table, then it has been truncated
improperly (because no index pointer to the truncated part should
remain).
What files do exist for pg_tblspc/16471/PG_9.4_201409291/16474/16593* ?
(In particular I'm curious about whether 16593.14 exists and what size
it is)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

В списке pgsql-admin по дате отправления:

Предыдущее

От: "David G. Johnston"
Дата: 10 января 2018 г., 22:13:21
Сообщение: Re: set not null column using existing constraint

Следующее

От: Rui DeSousa
Дата: 10 января 2018 г., 23:04:02
Сообщение: Re: Issue on restore / recover

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Issue on restore / recover

Предыдущее

Следующее