Re: Issues Outstanding for Point In Time Recovery (PITR)

From Patrick Macdonald
Subject Re: Issues Outstanding for Point In Time Recovery (PITR)
Date
Msg-id 3D27B93A.75A0C35F@redhat.com
In reply to Issues Outstanding for Point In Time Recovery (PITR)  ("J. R. Nield" <jrnield@usol.com>)
Responses Re: Issues Outstanding for Point In Time Recovery (PITR)  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
J.R.,

Nice first draft and a good read. Was going to comment 
in-line but thought this method would be easier to follow. 
The comments/suggestions below assume that PIT recovery is
being performed at the cluster level with a data backup 
image created by a tar-like utility. 

As noted, one of the main problems is knowing where to begin
in the log.  This can be handled by having backup processing 
update the control file with the first lsn and log file 
required.  At the time of the backup, this information is or 
can be made available.  The control file can be the last file
added to the tar and can contain information spanning the entire
backup process.

For data consistency, since the backup is being performed on 
an active cluster, we have to make sure to mark the end of the 
backup.  On restore, to make the cluster consistent, you have 
to force the user to perform forward recovery past the point
of the backup completion marker in the (archived) log.  This 
can be handled using a backup end log record.  The backup end
log record would have to contain an identifier unique to this 
backup.  If a user requests to stop PIT recovery before this 
log record is encountered, consistency is not guaranteed. 
PIT should either disallow the action or warn of possible / 
impending doom.

The necessary logging for rtree (and others) insertions/deletions
can be added to the base code.  Not much of a worry but I would
expect to encounter other missing log items during testing.

The idea of using the last LSN on the page to detect a partial
write is used by other DBMSs.  You already have that information
available, so there is no overhead in computing it.  Nothing wrong
with CRC though.

As for the DB_SHUTDOWNED state, this could be handled by having
the backup processing update the control file field to 
DB_PIT_REQUIRED (or some such identifier).  After a restore,
users would be blocked from connecting to the cluster's databases 
until a forward recovery past the backup end log record has
completed successfully.  

At the end of normal crash recovery, the user has to go digging
to identify in-flight transactions still in the system and abort
them manually.  It would be nice if PIT recovery automatically
aborted all in-flight transactions at the end. 

As PostgreSQL heads towards forward recovery functionality, it
may be wise to add headers to the log files.  As the logs from
any cluster are identically named, the header would allow unique
identification of the file and contents (cluster name, unique 
log id, id of the prior log file for chaining purposes, lsn 
ranges, etc). Most helpful.

Just a few notes from the administrative side. PIT recovery
should probably offer the user the following actions:

. forward recover to end of logs [and stop]
  Process log files located in the current directory until you read
  through the last one.  Allow the user the option to stop or not, just
  in case the logs are archived.  Send back the timestamp of the last
  encountered commit log record and the series of log files scanned.

. forward recover to PIT [and stop]
  Similar to that described above but use the commit timestamps to
  gauge PIT progress.

. forward recover query
  Send back the log series covered and the last commit timestamp
  encountered.

. forward recover stop
  Stop the current forward recovery session.  Undo all in-flight
  transactions and bring the databases down in a consistent state.  No
  other external user actions should be required.

Looking forward to reading draft 2.

Cheers,
Patrick
--
Patrick Macdonald 
Red Hat Canada 

"J. R. Nield" wrote:
> 
> Hello:
> 
> I've got the logging system to the point where I can take a shutdown
> consistent copy of a system, and play forward through multiple
> checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE
> properly, and things are moving forward well. Recovery to an arbitrary
> point-in-time should be just as easy, but will need some administrative
> interface for it.
> 
> At this point, some input would be useful on how I should handle things.
> 
> The most important questions that need answering are in sections 2 & 5,
> since they impact the most other parts of the system. They will also
> require good documentation for sysadmins.
> 
> Issues Outstanding for Point In Time Recovery (PITR)
> 
>     $Date: 2002/07/04 14:23:37 $
> 
>     $Revision: 1.4 $
> 
>     J.R. Nield
> 
>     (Enc: ISO 8859-15 Latin-9)
> 
> §0 - Introduction
> 
>     This file is where I'm keeping track of all the issues I run into while
>     trying to get PITR to work properly. Hopefully it will evolve into a
>     description of how PITR actually works once it is implemented.
> 
>     I will also try to add feedback as it comes in.
> 
>     The big items so-far are:
>         §1 - Logging Relation file creation, truncation, and removal
>             This is mostly done. Can do infinite play-forward from
>             online logs.
>         §2 - Partial-Write and Bad Block detection
>             Need input before starting. Migration issues.
>         §3 - Detecting Shutdown Consistent System Recovery
>             Mostly done.
>         §4 - Interactive Play-Forward Recovery for an Entire System
>             Need input before starting.
>         §5 - Individual file consistent recovery
>             Need input. Semi-Major changes required.
> 
> §1 - Logging Relation file creation, truncation, and removal
> 
>   §1.1 - Problem:
> 
>     Without file creation in the log, we can't replay committed
>     transactions that create relations.
> 
>     The current code assumes that any transaction reaching commit has already
>     ensured its files exist, and that those files will never be removed. This
>     is true now, but not for log-replay from an old backup database system.
> 
>     The current XLOG code silently ignores block-write requests for
>     non-existent files, and assumes that the transaction generating those
>     requests must have aborted.
> 
>     Right now a crash during TRUNCATE TABLE will leave the table in an
>     inconsistent state (partially truncated). This would not work when doing
>     replay from before the last checkpoint.
> 
>     §1.1.1 - CREATE DATABASE is also unlogged
> 
>       This will cause the same replay problems as above.
> 
>   §1.2 - Proposal:
> 
>     a) Augment the SMGR code to log relation file operations, and to handle
>     redo requests properly. This is simple in the case of create. Drop must be
>     logged only IN the commit record. For truncate see (b).
> 
>     The 'struct f_smgr' needs new operations 'smgr_recreate', 'smgr_reunlink',
>     and 'smgr_retruncate'. smgr_recreate should accept a RelFileNode instead
>     of a Relation.
> 
>     Transactions that abort through system failure (ie.  unlogged aborts)
>     will simply continue to leak files.
> 
>     b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table
>     may be used in a future command, and a replay-recovered database may
>     end-up with different data than the original.
> 
>     WAL must be flushed before truncate as well.
> 
>     WAL does not need to be flushed before create, if we don't mind
>     leaking files sometimes.
> 
>     c) Redo code should treat writes to non-existent files as an error.
>     Changes affect heap & nbtree AM's. [Check others]
> 
>     d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of
>       a database should mark all the rtree indices as corrupt.
>       [ actually we should do that now, are we? ]
> 
>     e) CREATE DATABASE must be logged properly, not use system(cp...)
> 
>   §1.3 - Status:
> 
>     All logged SMGR operations are now in a START_CRIT_SECTION()/
>     END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying fs
>     operations.
> 
>     Code has been added to smgr and xact modules to log:
>         create (no XLogFlush)
>         truncate (XLogFlush)
>         pending deletes on commit record
>         files to delete on abort record
> 
>     Code added to md.c to support redo ops
> 
>     Code added to smgr for RMGR redo/desc callbacks
> 
>     Code added to xact RMGR callbacks for redo/desc
> 
>     Database will do infinite shutdown consistent system recovery from the
>     online logs, if you manually munge the control file to set state ==
>     DB_IN_PRODUCTION instead of DB_SHUTDOWNED.
> 
>     Still need to do:
>         Item (c), recovery cleanup in all AM's
>         Item (d), logging in other index AM's
>         Item (e), CREATE DATABASE stuff
> 
> §2 - Partial-Write and Bad Block detection
> 
>   §2.1 - Problem:
> 
>     In order to protect against partial writes without logging pages
>     twice, we need to detect partial pages in system files and report them
>     to the system administrator. We also might want to be able to detect
>     damaged pages from other causes, like memory corruption, OS errors,
>     etc. or in the case where the disk doesn't report bad blocks, but
>     returns bad data.
> 
>     We should also decide what should happen when a file is marked as
>     containing corrupt pages, and requires log-archive recovery from a
>     backup.
> 
>   §2.2 - Proposal:
> 
>     Add a 1 byte 'pd_flags' field to PageHeaderData, with the following
>     flag definitions:
> 
>         PD_BLOCK_CHECKING           (1)
>         PD_BC_METHOD_BIT            (1<<1)
> 
>         PageHasBlockChecking(page)     ((page)->pd_flags & PD_BLOCK_CHECKING)
>         PageBCMethodIsCRC64(page)      ((page)->pd_flags & PD_BC_METHOD_BIT)
>         PageBCMethodIsLSNLast(page)    (!PageBCMethodIsCRC64(page))
> 
>     The last 64 bits of a page are reserved for use by the block checking
>     code.
> 
>     [ Is it worth the trouble to allow the last 8 bytes of a
>       page to contain data when block checking is turned off for a Page?
> 
>       This proposal does not allow that. ]
> 
>     If the block checking method is CRC64, then that field will contain
>     the CRC64 of the block computed at write time.
> 
>     If the block checking method is LSNLast, then the field contains a
>     duplicate of the pd_lsn field.
> 
>     §2.2.1 - Changes to Page handling routines
> 
>      All the page handling routines need to understand that
>      pd_special == (pd_special - (specialSize + 8))
> 
>      Change header comment in bufpage.h to reflect this.
> 
>     §2.2.2 - When Reading a Page
> 
>      Block corruption is detected on read in the obvious way with CRC64.
> 
>      In the case of LSNLast, we check to see if pd_lsn == the lsn in the
>      last 64 bits of the page. If not, we assume the page is corrupt from
>      a partial write (although it could be something else).
> 
>      IMPORTANT ASSUMPTION:
>         The OS/disk device will never write both the first part and
>         last part of a block without writing the middle as well.
>         This might be wrong in some cases, but at least it's fast.
> 
>     §2.2.4 - GUC Variables
> 
>      The user should be able to configure what method is used:
> 
>        block_checking_write_method  = [ checksum | torn_page_flag | none ]
> 
>           Which method should be used for blocks we write?
> 
>        check_blocks_on_read      = [ true | false ]
> 
>           When true, verify that the blocks we read are not corrupt, using
>           whatever method is in the block header.
> 
>           When false, ignore the block checking information.
> 
>   §2.3 - Status:
> 
>     Waiting for input from pgsql-hackers.
> 
>     Questions:
> 
>         Should we allow the user to have more detailed control over
>         which parts of a database use block checking?
> 
>         For example: use 'checksum' on all system catalogs in all databases,
>         'torn_page_flag' on the non-catalog parts of the production database,
>         and 'none' on everything else?
> 
> §3 - Detecting Shutdown Consistent System Recovery
> 
>     §3.1 - Problem:
> 
>      How to notice that we need to do log-replay for a system backup, when the
>      restored control file points to a shutdown checkpoint record that is
>      before the most recent checkpoint record in the log, and may point into
>      an archived file.
> 
>     §3.2 - Proposal:
> 
>      At startup, after reading the ControlFile, scan the log directory to
>      get the list of active log files, and find the lowest logId and
>      logSeg of the files. Ensure that the files cover a contiguous range
>      of LSN's.
> 
>      There are three cases:
> 
>       1) ControlFile points to the last valid checkpoint (either
>          checkPoint or prevCheckPoint, but one of them is the greatest
>          valid checkpoint record in the log stream).
> 
>       2) ControlFile points to a valid checkpoint record in an active
>          log file, but there are more valid checkpoint records beyond
>          it.
> 
>       3) ControlFile points to a checkpoint record that should be in the
>          archive logs, and is presumably valid.
> 
>      Case 1 is what we handle now.
> 
>      Cases 2 and 3 would result from restoring an entire system from
>      backup in preparation to do a play-forward recovery.
> 
>      We need to:
> 
>         Detect cases 2 and 3.
> 
>         Alert the administrator and abort startup.
>         [Question: Is this always the desired behavior?  We can
>          handle case 2 without intervention. ]
> 
>         Let the administrator start a standalone backend, and
>         perform a play-forward recovery for the system.
> 
>     §3.3 - Status:
> 
>        In progress.
> 
> §4 - Interactive Play-Forward Recovery for an Entire System
> 
>     Play-Forward File Recovery from a backup file must be interactive,
>     because not all log files that we need are necessarily in the
>     archive directory. It may be possible that not all the archive files
>     we need can even fit on disk at one time.
> 
>     The system needs to be able to prompt the system administrator to feed
>     it more log files.
> 
>     TODO: More here
> 
> §5 - Individual file consistent recovery
> 
>   §5.1 - Problem:
> 
>     If a file detects corruption, and we restore it from backup, how do
>     we know what archived files we need for recovery?
> 
>     Should file corruption (partial write, bad disk block, etc.) outside
>     the system catalog cause us to abort the system, or should we just
>     take the relation or database off-line?
> 
>     Given a backup file, how do we determine the point in the log
>     where we should start recovery for the file? What is the highest LSN
>     we can use that will fully recover the file?
> 
>   §5.2 - Proposal:
> 
>     Put a file header on each file, and update that header to the last
>     checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or
>     at least once every dbsize/'file_lsn_log_slack' megabytes of log
>     written, where dbsize is the estimated size of the database. Have
>     these values be settable from the config file. These updates would be
>     distributed throughout the hour, or interspersed between regular
>     amounts of log generation.
> 
>     If we have a database backup program or command, it can update the
>     header on the file before backup to the greatest value it can assure
>     to be safe.
> 
>   §5.3 - Status:
> 
>     Waiting for input from pgsql-hackers.
> 
>     Questions:
> 
>         There are alternate methods than using a file header to get a
>         known-good LSN lower bound for the starting point to recover a backup
>         file. Is this the best way?
> 
> A) The Definitions
> 
>     This stuff is obtuse, but I need it here to keep track of what I'm
>     saying. Someday I should use it consistently in the rest of this
>     document.
> 
>     "system" or "database system":
> 
>         A collection of postgres "databases" in one $PGDATA directory,
>         managed by one postmaster instance at a time (and having one WAL
>         log, etc.)
> 
>         All the files composing such a system, as a group.
> 
>     "up to date" or "now" or "current" or "current LSN":
> 
>         The most recent durable LSN for the system.
> 
>     "block consistent copy":
> 
>         When referring to a file:
> 
>         A copy of a file, which may be written to during the process of
>         copying, but where each BLCKSZ size block is copied atomically.
> 
>         When referring to multiple files (in the same system):
> 
>         A copy of all the files, such that each is independently a "block
>         consistent copy"
> 
>     "file consistent copy":
> 
>         When referring to a file:
> 
>         A copy of a file that is not written to between the start and end
>         of the copy operation.
> 
>         When referring to multiple files (in the same system):
> 
>         A copy of all the files, such that each is independently a "file
>         consistent copy"
> 
>     "system consistent copy":
> 
>         When referring to a file:
> 
>         A copy of a file, where the entire system of which it is a member
>         is not written to during the copy.
> 
>         When referring to multiple files (in the same system):
> 
>         A copy of all the files, where the entire system of which they are
>         members was not written to between the start and end of the
>         copying of all the files, as a group.
> 
>     "shutdown consistent copy":
> 
>         When referring to a file:
> 
>         A copy of a file, where the entire system of which it is a member
>         had been cleanly shutdown before the start of and for the duration
>         of the copy.
> 
>         When referring to multiple files (in the same system):
> 
>         A copy of all the files, where the entire system of which they are
>         members had been cleanly shutdown before the start of and for the
>         duration of the copying of all the files, as a group.
> 
>     "consistent copy":
> 
>         A block, file, system, or shutdown consistent copy.
> 
>     "known-good LSN lower bound"
>     or "LSN lower bound"
>     or "LSN-LB":
> 
>         When referring to a group of blocks, a file, or a group of files:
> 
>         An LSN known to be old enough that no log entries before it are needed
>         to bring the blocks or files up-to-date.
> 
>     "known-good LSN greatest lower bound"
>     or "LSN greatest lower bound"
>     or "LSN-GLB":
> 
>         When referring to a group of blocks, a file, or a group of files:
> 
>         The greatest possible LSN that is a known-good LSN lower bound for
>         the group.
> 
>     "backup file":
> 
>         A consistent copy of a data file used by the system, for which
>         we have a known-good LSN lower bound.
> 
>     "optimal backup file":
> 
>         A backup file, for which we have the known-good LSN greatest lower
>         bound.
> 
>     "backup system":
> 
> 
>     "Play-Forward File Recovery" or "PFFR":
> 
>         The process of bringing an individual backup file up to date.
> 
> 
> --
> J. R. Nield
> jrnield@usol.com
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly



