redolog - for discussion
Hi,
here are some details I would like to discuss before I am so
deep into the implementation that it would be hard to change
them later.
Point 1 is an extension of the pg_database relation that is
required to see the current redolog state of a database. The
new fields are:
lastbackup datetime
redomode int4
redoseq1 int4
lastbackup is the time the last successful full backup was
taken. More precisely, it is the time when pg_dump switched the
backend into online backup mode.
redomode is defined as 0=none, 1=async-logging, 2=sync-
logging, 4=restore, 5=recover, 6=error.
redoseq1 is the sequence number of the redolog file that was
started when pg_dump switched to online backup mode (this
command implies a logfile switch).
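For illustration, the redolog state of a database could then be
checked directly from the catalog; a minimal sketch, assuming
the three new columns above exist (the database name 'mydb' is
made up):

    -- hypothetical query against the extended pg_database
    SELECT datname, lastbackup, redomode, redoseq1
        FROM pg_database
        WHERE datname = 'mydb';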
Point 2 is the extension of the query language. All the
statements are restricted to superusers or the owner of the
database. The SQL statements to control the whole process
are:
ALTER DATABASE REDOMODE {NONE | ASYNCHRONOUS | SYNCHRONOUS};
Turns logging for the database on or off. The database must
be in normal operation mode for this (not restore or
recover mode).
ALTER DATABASE BEGIN BACKUP;
Issued by pg_dump before doing anything else.
The command stops ALL other activity in the database, so
pg_dump has time to pull out at least the information
about sequences (actually it does this while getting
tables; this might require some changes there so the database
becomes accessible again soon).
ALTER DATABASE ONLINE BACKUP;
Issued by pg_dump when it has finished the things that
require totally exclusive database access.
At this time, a logfile switch is done (only if the
current database is really logged) and the sequence number
of the new logfile plus the current datetime are remembered.
The behaviour of pg_dump's backend changes. It will see a
snapshot of this time (implemented in tqual code) in any
subsequent command and it is totally unable to do
anything that would update the database.
Until the final END BACKUP is given, no VACUUM or DROP
TABLE etc. commands can be run. If they are issued, the
command will be delayed until pg_dump has finished.
ALTER DATABASE END BACKUP;
This reverts the special behaviour of pg_dump's
backend. Additionally, the remembered time and redolog
sequence are stored in pg_database. pg_dump can read
them out for the final statement in the dump output (see
below).
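Taken together, the backup half of a pg_dump run would issue
roughly this sequence (a sketch of the proposed syntax above,
none of which exists yet):

    ALTER DATABASE BEGIN BACKUP;   -- exclusive phase, e.g. sequences
    ALTER DATABASE ONLINE BACKUP;  -- logfile switch, snapshot begins
    -- ... dump schema and data here ...
    ALTER DATABASE END BACKUP;     -- backup time and redoseq1 stored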
ALTER DATABASE BEGIN RESTORE;
This command checks that the current database has just been
created and that not a single command has been executed
before. It is the first command in pg_dump's output if
the database being dumped is a logged one.
It switches the database into restore mode. In this mode,
the first command on a new database connection must be
the special command
RECOVER DATABASE AS USER 'uname'
or an
ALTER DATABASE END RESTORE ...;
When doing the ACL stuff, pg_dump must output a reconnect
(\c) to the database without the additional username and
then issue the special command.
ALTER DATABASE END RESTORE [RECOVERY FROM redolog_seq];
This ends the restore mode. The additional RECOVERY FROM
is put in by pg_dump for logged databases only; pg_dump
reads out this information after END BACKUP. If not
given, the database is switched into normal operation
mode without logging. But if given, the sequence number
is stored in pg_database and the database is put into
recover mode. In that mode, only RECOVER commands can be
issued.
RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
The database must be in recover mode. If RESET is given,
the mode is switched to ASYNC logging, the lastbackup
field is set to 'epoch' and redoseq1 is set to 0. This resets
the database to the state at the backup snapshot time.
For the others, the backend starts the recovery program
which reads the redolog files, establishes database
connections as required and reruns all the commands in
them. If a required logfile isn't found, it tells the
backend and waits for the reply. The backend tells the
user what happened on error (redolog file with seq n
required but not found ...). So the user can put back the
required redolog files and let the recovery resume (actually
the user pgsql or root must put them back :-).
If the recovery is interrupted (controlling backend
terminates), the database is set into error mode and only
a RECOVER DATABASE RESET will help.
If the recovery finally succeeds, the same as for RESET
happens. The database is online in async log mode.
Since the "destroydb" is also remembered in the redolog,
recovery will stop at the latest when it hits the one for the
database currently being recovered. This is to prevent faulty
recovery, which could occur if someone destroys one
database, creates a new one with the same name but
different contents that is logged, destroys it again and
then wants to restore and recover the first.
RECOVER DATABASE CONTINUE;
After being told to put back some more redolog files,
this command lets the recovery resume.
RECOVER DATABASE AS USER 'uname';
A special command used in restore and recover mode only.
This is restricted to superusers with the usecatupd right
(not the db owner) and modifies the current username in the
backend. It's ugly, but the problem is that ordinary
users should not be able to use the database while it is
in restore or recover mode. So the connection cannot be
established like with '\c - username'.
For restore and recover this means that a real
superuser with unrestricted access is needed to restore a
database that was dumped with ACL info. Otherwise,
someone with createdb but not superuser rights could put a
CREATE USER into a file, create a new database and
"restore" that as user pgsql. I'm sure we don't want
this.
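To tie the restore side together, the frame of a dump script
for a logged database, and the recovery run afterwards, would
look roughly like this (a sketch; the redolog sequence number
42 is made up):

    ALTER DATABASE BEGIN RESTORE;
    RECOVER DATABASE AS USER 'pgsql';  -- first command per connection
    -- ... CREATE TABLE, COPY and ACL statements from the dump ...
    ALTER DATABASE END RESTORE RECOVERY FROM 42;
    -- the database is now in recover mode:
    RECOVER DATABASE ALL;              -- or UNTIL 'datetime' / RESET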
###
Whew - hopefully I didn't forget anything.
All that might look very complicated, but the only commands
someone would really use manually are
ALTER DATABASE REDOMODE ...;
and
RECOVER DATABASE ...;
Anything else is used by pg_dump and the recovery program.
What I'm trying to implement with this is behaviour that
makes it possible to back up, restore and recover a database
on a production system without running a closed shop. Doing so
for one database will not affect the others in the instance,
since no modification of the hba conf or anything else will
be required. Only the database actually being restored is
closed for normal usage.
Hmmm - I just had the next idea right now: one more field in
pg_database could tell that the db is shut down or restricted.
Someone could disable a single database or restrict usage to
superusers/dbowner for some time, make the database read-only etc.
Anyway - does anybody see problems with the above? Do we need
more functionality? Oh yeah - another utility that follows the
log and replicates databases onto other systems on the fly.
But let me get this all running first please :-).
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#======================================== jwieck@debis.com (Jan Wieck) #
Jan Wieck wrote:
>
> ALTER DATABASE BEGIN BACKUP;
>
> Issued by pg_dump before doing anything else.
>
> The command stops ALL other activity in the database, so
> pg_dump has time to pull out at least the information
> about sequences (actually it does this while getting
> tables; this might require some changes there so the database
> becomes accessible again soon).
>
> ALTER DATABASE ONLINE BACKUP;
>
> Issued by pg_dump when it has finished the things that
> require totally exclusive database access.
>
> At this time, a logfile switch is done (only if the
> current database is really logged) and the sequence number
> of the new logfile plus the current datetime are remembered.
> The behaviour of pg_dump's backend changes. It will see a
> snapshot of this time (implemented in tqual code) in any
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that I'm implementing multi-version concurrency control
(MVCC) for 6.5: pg_dump will have to run all queries in one
transaction in SERIALIZED mode to get the snapshot of the
transaction's begin time...

> subsequent command and it is totally unable to do
> anything that would update the database.
>
> Until the final END BACKUP is given, no VACUUM or DROP
> TABLE etc. commands can be run. If they are issued, the
> command will be delayed until pg_dump has finished.

Vacuum will not delete records in which any active
backend is interested - don't worry.

...

> All that might look very complicated, but the only commands
^^^^^^^^^^^^^^^^

Yes -:)
We could copy/move pg_dump's stuff into the backend...
This way pg_dump would just execute one command

    ALTER DATABASE ONLINE BACKUP;    -- as I understand it

- the backend will do all that it needs to and pg_dump will
just write the backend's output to a file.

I think it would be nice to have code in the backend to
generate CREATE statements from the catalog and to extend
EXPLAIN to handle something like EXPLAIN TABLE xxx etc.
We could call EXPLAIN for all \dXXXX in psql and
when dumping the schema in pg_dump.

Comments?

Vadim
Vadim wrote:
>
> Jan Wieck wrote:
> >
> > At this time, a logfile switch is done (only if the
> > current database is really logged) and the sequence number
> > of the new logfile plus the current datetime are remembered.
> > The behaviour of pg_dump's backend changes. It will see a
> > snapshot of this time (implemented in tqual code) in any
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Note that I'm implementing multi-version concurrency control
> (MVCC) for 6.5: pg_dump will have to run all queries
> in one transaction in SERIALIZED mode to get the snapshot of
> the transaction's begin time...
Sounds good and would make things easier. I'll keep my hands
off the tqual code and wait for that.
But what about sequence values while in SERIALIZED
transaction mode? Sequences get overwritten in place! And for
a dump/restore/recover it is important that the sequences
get restored ALL at once in the state they were in.
>
> > subsequent command and it is totally unable to do
> > anything that would update the database.
> >
> > Until the final END BACKUP is given, no VACUUM or DROP
> > TABLE etc. commands can be run. If they are issued, the
> > command will be delayed until pg_dump has finished.
>
> Vacuum will not delete records in which any active
> backend is interested - don't worry.
That's the vacuum part, but I still need to delay DROP
TABLE/VIEW/SEQUENCE until the backup is complete.
>
> ...
>
> >
> > All that might look very complicated, but the only commands
> ^^^^^^^^^^^^^^^^
> Yes -:)
> We could copy/move pg_dump's stuff into the backend...
> This way pg_dump would just execute one command
>
>     ALTER DATABASE ONLINE BACKUP;    -- as I understand it
>
> - the backend will do all that it needs to and pg_dump will
> just write the backend's output to a file.
>
> I think it would be nice to have code in the backend to
> generate CREATE statements from the catalog and to extend
> EXPLAIN to handle something like EXPLAIN TABLE xxx etc.
> We could call EXPLAIN for all \dXXXX in psql and
> when dumping the schema in pg_dump.
>
> Comments?
Indeed :-)
If we have a serialized transaction that covers sequences,
only BEGIN and END BACKUP must remain: BEGIN to force the
logfile switch, and END to flag that the dump is complete and
the backend can update pg_database.
So you want to put major parts of pg_dump's functionality
into the backend. Hmmm - that would be cool. And it would give
us a chance to include tests for most of the dump-related code
in the regression suite.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#======================================== jwieck@debis.com (Jan Wieck) #
Jan Wieck wrote:
>
> > > At this time, a logfile switch is done (only if the
> > > current database is really logged) and the sequence number
> > > of the new logfile plus the current datetime are remembered.
> > > The behaviour of pg_dump's backend changes. It will see a
> > > snapshot of this time (implemented in tqual code) in any
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Note that I'm implementing multi-version concurrency control
> > (MVCC) for 6.5: pg_dump will have to run all queries
> > in one transaction in SERIALIZED mode to get the snapshot of
> > the transaction's begin time...
>
> Sounds good and would make things easier. I'll keep my hands
> off the tqual code and wait for that.
>
> But what about sequence values while in SERIALIZED
> transaction mode? Sequences get overwritten in place! And for
> a dump/restore/recover it is important that the sequences
> get restored ALL at once in the state they were in.

It's time to re-implement sequences! When they were implemented
~ 1.5 years ago there was no GRANT/REVOKE on VIEWs, so I had to
create a table for each sequence.

There should be one system table - pg_sequence. One record
for each sequence will be inserted into this table and
one VIEW will be created:

    CREATE VIEW _seqname_ AS
        SELECT * FROM pg_sequence WHERE sequence_name = '_seqname_';

GRANT/REVOKE on the sequence's VIEW will control the rights to
read the sequence using SELECT and to change it using
nextval/setval.

Having _one_ sequences table, it will be easy to lock all
sequences at once and read all values.

> > Vacuum will not delete records in which any active
> > backend is interested - don't worry.
>
> That's the vacuum part, but I still need to delay DROP
> TABLE/VIEW/SEQUENCE until the backup is complete.

Yes. And ALTER too.

Vadim
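Spelled out for one hypothetical sequence myseq, the proposal
would amount to something like this (a sketch; neither the
pg_sequence layout nor the automatic view exists yet):

    CREATE VIEW myseq AS
        SELECT * FROM pg_sequence WHERE sequence_name = 'myseq';
    GRANT  SELECT ON myseq TO PUBLIC;   -- anyone may read the state
    REVOKE UPDATE ON myseq FROM PUBLIC; -- nextval/setval follow the ACL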
> It's time to re-implement sequences! When they were implemented
> ~ 1.5 years ago there was no GRANT/REVOKE on VIEWs, so I had to
> create a table for each sequence.
> There should be one system table - pg_sequence. One record
> for each sequence will be inserted into this table and
> one VIEW will be created:

I thought you wanted a single table to prevent concurrent
access/update contention?

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Sat, 12 Dec 1998, Bruce Momjian wrote:
> > It's time to re-implement sequences! When they were
> > implemented ~ 1.5 years ago there was no GRANT/REVOKE on
> > VIEWs, so I had to create a table for each sequence.
> > There should be one system table - pg_sequence. One record
> > for each sequence will be inserted into this table and
> > one VIEW will be created:
>
> I thought you wanted a single table to prevent concurrent
> access/update contention?
let's revise what vadim stated *grin*
"we should re-implement sequences for v6.5, since row-level
locking will exist at that time"? *rofl*
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
> On Sat, 12 Dec 1998, Bruce Momjian wrote:
>
> > > It's time to re-implement sequences! When they were
> > > implemented ~ 1.5 years ago there was no GRANT/REVOKE on
> > > VIEWs, so I had to create a table for each sequence.
> > > There should be one system table - pg_sequence. One record
> > > for each sequence will be inserted into this table and
> > > one VIEW will be created:
> >
> > I thought you wanted a single table to prevent concurrent
> > access/update contention?
>
> let's revise what vadim stated *grin*
>
>     "we should re-implement sequences for v6.5, since row-level
>      locking will exist at that time"?  *rofl*

Oh, now I understand.  :-)  Vadim, don't be so modest.  :-)

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Jan Wieck wrote:
>
> RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
>
...
>
> For the others, the backend starts the recovery program
> which reads the redolog files, establishes database
> connections as required and reruns all the commands in
^^^^^^^^^^^^^^^^^^^^^^^^^^
> them. If a required logfile isn't found, it tells the
  ^^^^^
I foresee problems with using _commands_ logging for
recovery/replication -:((
Let's consider two concurrent updates in READ COMMITTED mode:
update test set x = 2 where y = 1;
and
update test set x = 3 where y = 1;
The result of both committed transactions will be x = 2
if the 1st transaction updated the row _after_ the 2nd
transaction, and x = 3 if the 2nd transaction got to the row
after the 1st one. The order of updates is not defined by the
order in which the commands began, and so the order in which
commands should be rerun will be unknown...
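To make the race concrete, one possible interleaving (same
table and values as in the statements above):

    -- session A:
    UPDATE test SET x = 2 WHERE y = 1;  -- begins first, blocks on B's row lock
    -- session B:
    UPDATE test SET x = 3 WHERE y = 1;  -- begins second, commits first
    -- A now applies its update: the committed result is x = 2,
    -- but a redolog replayed in begin-order would end with x = 3.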
Comments?
Vadim
Vadim wrote:
>
> Jan Wieck wrote:
> >
> > RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> >
> ...
> >
> > For the others, the backend starts the recovery program
> > which reads the redolog files, establishes database
> > connections as required and reruns all the commands in
> ^^^^^^^^^^^^^^^^^^^^^^^^^^
> > them. If a required logfile isn't found, it tells the
> ^^^^^
>
> I foresee problems with using _commands_ logging for
> recovery/replication -:((
>
> Let's consider two concurrent updates in READ COMMITTED mode:
>
> update test set x = 2 where y = 1;
>
> and
>
> update test set x = 3 where y = 1;
>
> The result of both committed transactions will be x = 2
> if the 1st transaction updated the row _after_ the 2nd
> transaction, and x = 3 if the 2nd transaction got to the row
> after the 1st one. The order of updates is not defined by the
> order in which the commands began, and so the order in which
> commands should be rerun will be unknown...
Yepp, the order in which commands began is absolutely not of
interest. Locking could already delay the execution of one
command until another one started later has finished and
released the lock. It's a classic race condition.
Thus, my plan was to log the queries just before the call to
CommitTransactionCommand() in tcop. This has the advantage
that queries which bail out with errors don't get into the
log at all and need not be rerun. And I can set a static
flag to false before starting the command, which is set to
true in the buffer manager when a buffer is written (marked
dirty), so filtering out queries that do no updates at all is
easy.
Unfortunately query level logging gets hit by the current
implementation of sequence numbers. If a query that gets
aborted somewhere in the middle (maybe by a trigger) called
nextval() for rows processed earlier, the sequence number
isn't advanced at recovery time, because the query is
suppressed entirely. And sequences aren't locked, so for
concurrently running queries getting numbers from the same
sequence the results aren't reproducible. If some
application selects a value resulting from a sequence and
uses that later in another query, how could the redolog know
that this has changed? It's a Const in the logged query, and
all that corrupts the whole thing.
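The non-transactional behaviour can be seen from any client
today; a minimal sketch (the sequence s and table t are made
up, and t must already exist):

    CREATE SEQUENCE s;
    BEGIN;
    INSERT INTO t VALUES (nextval('s'));  -- advances s to 1
    ABORT;                                -- the INSERT is rolled back ...
    SELECT nextval('s');                  -- ... but this returns 2, not 1

Since the aborted query would never reach the redolog, a replay
would hand out 1 here instead of 2.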
All that is painful, and I don't see another solution yet
other than to hook into nextval(), log the numbers generated
in normal operation and hand back the same numbers in redo
mode.
The whole thing gets more and more complicated :-(
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#======================================== jwieck@debis.com (Jan Wieck) #
Jan Wieck wrote:
>
> > I foresee problems with using _commands_ logging for
> > recovery/replication -:((
> >
> ...
>
> Yepp, the order in which commands began is absolutely not of
> interest. Locking could already delay the execution of one
> command until another one started later has finished and
> released the lock. It's a classic race condition.
>
> Thus, my plan was to log the queries just before the call to
> CommitTransactionCommand() in tcop. This has the advantage

Oh, I see - you're right!

...

> Unfortunately query level logging gets hit by the current
> implementation of sequence numbers. If a query that gets

...

> All that is painful, and I don't see another solution yet
> other than to hook into nextval(), log the numbers generated

Not so bad, if these numbers are buffered in memory...

> in normal operation and hand back the same numbers in redo
> mode.
>
> The whole thing gets more and more complicated :-(

As usual -:))

Vadim
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Vadim Mikheev
> Sent: Thursday, December 17, 1998 5:03 PM
> To: Jan Wieck
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] redolog - for discussion
>
> Jan Wieck wrote:
> >
> > > I foresee problems with using _commands_ logging for
> > > recovery/replication -:((
> > >
> > ...
> >
> > Yepp, the order in which commands began is absolutely not of
> > interest. Locking could already delay the execution of one
> > command until another one started later has finished and
> > released the lock. It's a classic race condition.
> >
> > Thus, my plan was to log the queries just before the call to
> > CommitTransactionCommand() in tcop. This has the advantage
>
> Oh, I see - you're right!

If image level logging is used, it's probably OK. But what if
query (command) level logging is used?

If the isolation level of all transactions is SERIALIZABLE, it's
probably OK, because they are serializable in the order in which
they were committed. But if there are transactions whose
isolation level is READ COMMITTED, they are not serializable. So
must commands be rerun according to the original order in which
they were issued? If the same locking mechanism is used at
recovery time, will the order of locks caused by the rerun
commands be the same? I'm not confident.

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp