Обсуждение: redolog - for discussion

Поиск
Список
Период
Сортировка

redolog - for discussion

От
jwieck@debis.com (Jan Wieck)
Дата:
Hi,

    here  are  some details I would like to discuss before beeing
    too deep in the implementation that it's hard to change  them
    later.

    Point  1  is an extension of the pg_database relation that is
    required to see the actual redolog state of a  database.  New
    fields are:

        lastbackup    datetime
        redomode      int4
        redoseq1      int4

    lastbackup  is  the time, the last successful full backup was
    taken.  More precise, the  time  when  pg_dump  switched  the
    backend into online backup mode.

    redomode  is  defined  as  0=none,  1=async-logging,  2=sync-
    logging, 4=restore, 5=recover, 6=error.

    redoseq1 is the sequence number of  the  redolog  file  began
    when  pg_dump  switched  to  online backup mode (this command
    implies a logfile switch).

    Point 2 is  the  extension  of  the  querylanguage.  All  the
    statements  are  restricted to superusers or the owner of the
    database. The SQL statements to  control  the  whole  process
    are:

    ALTER DATABASE REDOMODE {NONE | ASYNCHRONOUS | SYNCHRONOUS};

        Turns  logging  for the database on or off. Database must
        be in normal  operation  mode  for  it  (not  restore  or
        recover mode).

    ALTER DATABASE BEGIN BACKUP;

        Issued by pg_dump before doing anything else.

        The  command stops ALL other activity in the database, so
        pg_dump has time to pull out  at  least  the  information
        about  sequences  (actually  it  does  this while getting
        tables, might require some changes there so the  database
        get's back accessible soon).

    ALTER DATABASE ONLINE BACKUP;

        Issued  by  pg_dump  when  it  finished  the  things that
        require total exclusive database access.

        At this time, a logfile  switch  is  done  (only  if  the
        actual database is really logged) and the sequence number
        of the new logfile plus the current datetime  remembered.
        The behaviour of pg_dump's backend changes. It will see a
        snapshot of this time (implemented in tqual code) in  any
        subsequent  command  and  it  is  totally  unable  to  do
        anything that would update the database.

        Until the final END BACKUP is given, no  VACUUM  or  DROP
        TABLE  etc.  commands can be run. If they are issued, the
        command will be delayed until pg_dump finished.

    ALTER DATABASE END BACKUP;

        This  turns  back  the  special  behaviour  of  pg_dump's
        backend.  Additionally  the  remembered  time and redolog
        sequence are stored in  pg_database.   pg_dump  can  read
        them  out for the final statement in the dump output (see
        below).

    ALTER DATABASE BEGIN RESTORE;

        This command checks that  the  actual  database  is  just
        created  and  not  one  single  command has been executed
        before. It is the first command in  pg_dump's  output  if
        the database beeing dumped is a logged one.

        It switches the database into restore mode. In this mode,
        the first command on a new database  connection  must  be
        the special command

            RECOVER DATABASE AS USER 'uname'

        or an

            ALTER DATABASE END RESTORE ...;

        When doing the ACL stuff, pg_dump must output a reconnect
        (\c) to the database without the additional username  and
        then issue the special command.

    ALTER DATABASE END RESTORE [RECOVERY FROM redolog_seq];

        This  ends the restore mode. The additional RECOVERY FROM
        is put into by pg_dump  for  logged  databases  only.  It
        reads  out  this  information  after  END  BACKUP. If not
        given, the database is  switched  into  normal  operation
        mode  without  logging. But if given, the sequence number
        is stored in pg_database and the  database  is  put  into
        recover  mode. In that mode, only RECOVER commands can be
        issued.

    RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};

        The database must be in recover mode. If RESET is  given,
        the  mode  is  switched  to ASYNC logging, The lastbackup
        field is set to 'epoch' and redoseq1 set to 0. It  resets
        the database to the state at the backup snapshot time.

        For  the  others, the backend starts the recovery program
        which  reads  the  redolog  files,  establishes  database
        connections  as  required  and reruns all the commands in
        them. If a required logfile isn't  found,  it  tells  the
        backend  and  waits for the reply.  The backend tells the
        user what happened on error  (redolog  file  with  seq  n
        required but not found ...). So the user can put back the
        required redolog files and let recover  resume  (actually
        the user pgsql or root must put them back :-).

        If  the  recovery  is  interrupted  (controlling  backend
        terminates), the database is set into error mode and only
        a RECOVER DATABASE RESET will help.

        If  the  recovery finally succeeds, the same as for RESET
        happens.  The database is online in async logmode.

        Since the "destroydb" is also remembered in the  redolog,
        recovery  will  stop  at  least  if it hit's that for the
        database actually recoverd.  This is  to  prevent  faulty
        recovery  which  could  occure  if  someone destroy's one
        database, creates a  new  one  with  the  same  name  but
        different contents that is logged, destroy's it again and
        then want's to restore and recover the first.

    RECOVER DATABASE CONTINUE;

        After beeing told to restore  some  more  redolog  files,
        this command let's the recovery resume.

    RECOVER DATABASE AS USER 'uname';

        A  special command used in restore and recover mode only.
        This is restricted to  superusers  with  usecatupd  right
        (not  db  owner) and modifies the current username in the
        backend. It's ugly, but  the  problem  is  that  ordinary
        users  should not be able to use the database while it is
        in restore or recover mode.  So the connection cannot  be
        established like with '\c - username'.

        For  the  restore  and  recover  this  means, that a real
        superuser with unrestricted access is needed to restore a
        database  that  was  dumped with ACL info.  But otherwise
        one with createdb but not superuser rights  could  put  a
        CREATE  USER  into  a  file,  create  a  new database and
        "restore" that as user pgsql.  I'm  sure  we  don't  want
        this.

    ###

    Whow - hopefully I didn't forgot anything.

    All  that  might look very complicated, but the only commands
    someone would really use manually will be

        ALTER DATABASE REDOMODE ...;

    and

        RECOVER DATABASE ...;

    Anything else is used by pg_dump and the recovery program.

    What I'm trying to implement with  it  is  a  behaviour  that
    makes  it  possible to backup, restore and recover a database
    on a production system without running closed shop. Doing  so
    for  one database will not affect the others in the instance,
    since no modification of the hba conf or anything  else  will
    be  required.  Only  the database actually beeing restored is
    closed for normal usage.

    Hmmm - just have the next idea right now. One more  field  in
    pg_database could tell that the db is shut down or restricted
    Someone could disable a single database or restrict usage  to
    superusers/dbowner for some time, make database readonly etc.

    Anyway - does anybody see problems with the above? Do we need
    more  functionality?  Oh  yeah - another utility that follows
    log and replicates databases onto other systems on  the  fly.
    But let me get this all running first please :-).


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

Re: [HACKERS] redolog - for discussion

От
Vadim Mikheev
Дата:
Jan Wieck wrote:
> 
>     ALTER DATABASE BEGIN BACKUP;
> 
>         Issued by pg_dump before doing anything else.
> 
>         The  command stops ALL other activity in the database, so
>         pg_dump has time to pull out  at  least  the  information
>         about  sequences  (actually  it  does  this while getting
>         tables, might require some changes there so the  database
>         get's back accessible soon).
> 
>     ALTER DATABASE ONLINE BACKUP;
> 
>         Issued  by  pg_dump  when  it  finished  the  things that
>         require total exclusive database access.
> 
>         At this time, a logfile  switch  is  done  (only  if  the
>         actual database is really logged) and the sequence number
>         of the new logfile plus the current datetime  remembered.
>         The behaviour of pg_dump's backend changes. It will see a
>         snapshot of this time (implemented in tqual code) in  any
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note, that I'm implementing multi-version concurrency control 
(MVCC) for 6.5: pg_dump will have to run all queries
in one transaction in SERIALIZED mode to get snapshot of
transaction' begin time...

>         subsequent  command  and  it  is  totally  unable  to  do
>         anything that would update the database.
> 
>         Until the final END BACKUP is given, no  VACUUM  or  DROP
>         TABLE  etc.  commands can be run. If they are issued, the
>         command will be delayed until pg_dump finished.

Vacuum will not be delete records in which any active
backend is interested - don't worry.

...

> 
>     All  that  might look very complicated, but the only commands                           ^^^^^^^^^^^^^^^^
Yes -:)
We could copy/move pg_dump' stuff into backend...
This way pg_dump will just execute one command

ALTER DATABASE ONLINE BACKUP; -- as I understand

- backend will do all what it need and pg_dump just
write backend' output to a file.

I think that it would be nice to have code in backend to 
generate CREATE statements from catalog and extend EXPLAIN
to handle something like EXPLAIN TABLE xxx etc.
We could call EXPLAIN for all \dXXXX in psql and
when dumping schema in pg_dump.

Comments?

Vadim


Re: [HACKERS] redolog - for discussion

От
jwieck@debis.com (Jan Wieck)
Дата:
Vadim wrote:

>
> Jan Wieck wrote:
> >
> >         At this time, a logfile  switch  is  done  (only  if  the
> >         actual database is really logged) and the sequence number
> >         of the new logfile plus the current datetime  remembered.
> >         The behaviour of pg_dump's backend changes. It will see a
> >         snapshot of this time (implemented in tqual code) in  any
>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Note, that I'm implementing multi-version concurrency control
> (MVCC) for 6.5: pg_dump will have to run all queries
> in one transaction in SERIALIZED mode to get snapshot of
> transaction' begin time...

    Sounds  good and would make things easier. I'll keep my hands
    off from the tqual code and wait for that.

    But  what  about  sequence   values   while   in   SERIALIZED
    transaction mode. Sequences get overwritten in place! And for
    a dump/restore/recover it is important,  that  the  sequences
    get restored ALL at once in the state they where.

>
> >         subsequent  command  and  it  is  totally  unable  to  do
> >         anything that would update the database.
> >
> >         Until the final END BACKUP is given, no  VACUUM  or  DROP
> >         TABLE  etc.  commands can be run. If they are issued, the
> >         command will be delayed until pg_dump finished.
>
> Vacuum will not be delete records in which any active
> backend is interested - don't worry.

    That's  the  vacuum  part,  but  I  still  need to delay DROP
    TABLE/VIEW/SEQUENCE until the backup is complete.

>
> ...
>
> >
> >     All  that  might look very complicated, but the only commands
>                             ^^^^^^^^^^^^^^^^
> Yes -:)
> We could copy/move pg_dump' stuff into backend...
> This way pg_dump will just execute one command
>
> ALTER DATABASE ONLINE BACKUP; -- as I understand
>
> - backend will do all what it need and pg_dump just
> write backend' output to a file.
>
> I think that it would be nice to have code in backend to
> generate CREATE statements from catalog and extend EXPLAIN
> to handle something like EXPLAIN TABLE xxx etc.
> We could call EXPLAIN for all \dXXXX in psql and
> when dumping schema in pg_dump.
>
> Comments?

    Indeed :-)

    If we have serialized transaction that covers sequences, only
    BEGIN  and END BACKUP must remain. BEGIN to force the logfile
    switch and END to flag that dump is complete and backend  can
    update pg_database.

    So  you  want  to  put major parts of pg_dump's functionality
    into the backend. Hmmm - would be cool. And it would give  us
    a  chance  to include tests for most of the dump related code
    in regression.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

Re: [HACKERS] redolog - for discussion

От
Vadim Mikheev
Дата:
Jan Wieck wrote:
> 
> > >
> > >         At this time, a logfile  switch  is  done  (only  if  the
> > >         actual database is really logged) and the sequence number
> > >         of the new logfile plus the current datetime  remembered.
> > >         The behaviour of pg_dump's backend changes. It will see a
> > >         snapshot of this time (implemented in tqual code) in  any
> >           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Note, that I'm implementing multi-version concurrency control
> > (MVCC) for 6.5: pg_dump will have to run all queries
> > in one transaction in SERIALIZED mode to get snapshot of
> > transaction' begin time...
> 
>     Sounds  good and would make things easier. I'll keep my hands
>     off from the tqual code and wait for that.
> 
>     But  what  about  sequence   values   while   in   SERIALIZED
>     transaction mode. Sequences get overwritten in place! And for
>     a dump/restore/recover it is important,  that  the  sequences
>     get restored ALL at once in the state they where.

It's time to re-implement sequences! When they were implemented
~ 1.5 year ago there was no GRANT/REVOKE on VIEWs and so
I had to create table for each sequence.
There should be one system table - pg_sequence. One record
for each sequence will be inserted into this table and
one VIEW will be created: 

CREATE VIEW _seqname_ AS
SELECT * FROM pg_sequence WHERE sequence_name = '_seqname_';

GRANT/REVOKE on sequnece' VIEW will control rights to read sequence
using SELECT and rights to change sequence using nextval/setval.

Having _one_ sequences table there will be easy to lock
all sequences at once and read all values.

> >
> > Vacuum will not be delete records in which any active
> > backend is interested - don't worry.
> 
>     That's  the  vacuum  part,  but  I  still  need to delay DROP
>     TABLE/VIEW/SEQUENCE until the backup is complete.

Yes. And ALTER too.

Vadim


Re: [HACKERS] redolog - for discussion

От
Bruce Momjian
Дата:
> It's time to re-implement sequences! When they were implemented
> ~ 1.5 year ago there was no GRANT/REVOKE on VIEWs and so
> I had to create table for each sequence.
> There should be one system table - pg_sequence. One record
> for each sequence will be inserted into this table and
> one VIEW will be created: 

I thought you wanted a single table to prevent concurrent access/update
contension?


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] redolog - for discussion

От
The Hermit Hacker
Дата:
On Sat, 12 Dec 1998, Bruce Momjian wrote:

> > It's time to re-implement sequences! When they were implemented
> > ~ 1.5 year ago there was no GRANT/REVOKE on VIEWs and so
> > I had to create table for each sequence.
> > There should be one system table - pg_sequence. One record
> > for each sequence will be inserted into this table and
> > one VIEW will be created: 
> 
> I thought you wanted a single table to prevent concurrent access/update
> contension?
let's revise what vadim stated *grin*
"we should re-implement sequences for v6.5, since row-level
locking will exist at that time"? *rofl*

Marc G. Fournier                                
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [HACKERS] redolog - for discussion

От
Bruce Momjian
Дата:
> On Sat, 12 Dec 1998, Bruce Momjian wrote:
> 
> > > It's time to re-implement sequences! When they were implemented
> > > ~ 1.5 year ago there was no GRANT/REVOKE on VIEWs and so
> > > I had to create table for each sequence.
> > > There should be one system table - pg_sequence. One record
> > > for each sequence will be inserted into this table and
> > > one VIEW will be created: 
> > 
> > I thought you wanted a single table to prevent concurrent access/update
> > contension?
> 
>     let's revise what vadim stated *grin*
> 
>     "we should re-implement sequences for v6.5, since row-level
> locking will exist at that time"? *rofl*

Oh, now I understand.  :-)  Vadim, don't be so modest.  :-)

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] redolog - for discussion

От
Vadim Mikheev
Дата:
Jan Wieck wrote:
> 
>     RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> 
...
>
>         For  the  others, the backend starts the recovery program
>         which  reads  the  redolog  files,  establishes  database
>         connections  as  required  and reruns all the commands in
^^^^^^^^^^^^^^^^^^^^^^^^^^
>         them. If a required logfile isn't  found,  it  tells  the         ^^^^^

I foresee problems with using _commands_ logging for
recovery/replication -:((

Let's consider two concurrent updates in READ COMMITTED mode:

update test set x = 2 where y = 1;
and

update test set x = 3 where y = 1;

The result of both committed transaction will be x = 2
if the 1st transaction updated row _after_ 2nd transaction
and x = 3 if the 2nd transaction gets row after 1st one.
Order of updates is not defined by order in which commands
begun and so order in which commands should be rerun
will be unknown...

Comments?

Vadim


Re: [HACKERS] redolog - for discussion

От
jwieck@debis.com (Jan Wieck)
Дата:
Vadim wrote:

>
> Jan Wieck wrote:
> >
> >     RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> >
> ...
> >
> >         For  the  others, the backend starts the recovery program
> >         which  reads  the  redolog  files,  establishes  database
> >         connections  as  required  and reruns all the commands in
>                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
> >         them. If a required logfile isn't  found,  it  tells  the
>           ^^^^^
>
> I foresee problems with using _commands_ logging for
> recovery/replication -:((
>
> Let's consider two concurrent updates in READ COMMITTED mode:
>
> update test set x = 2 where y = 1;
>
>    and
>
> update test set x = 3 where y = 1;
>
> The result of both committed transaction will be x = 2
> if the 1st transaction updated row _after_ 2nd transaction
> and x = 3 if the 2nd transaction gets row after 1st one.
> Order of updates is not defined by order in which commands
> begun and so order in which commands should be rerun
> will be unknown...

    Yepp,  the order in which commands begun is absolutely not of
    interest. Locking could already delay the  execution  of  one
    command  until  another  one  started  later has finished and
    released the lock.  It's a classic race condition.

    Thus, my plan was to log the queries just before the call  to
    CommitTransactionCommand()  in  tcop. This has the advantage,
    that queries which bail out with errors don't  get  into  the
    log  at  all  and  must not get rerun. And I can set a static
    flag to false before starting the command, which  is  set  to
    true  in  the buffer manager when a buffer is written (marked
    dirty), so filtering out queries that do no updates at all is
    easy.

    Unfortunately  query  level  logging get's hit by the current
    implementation of sequence numbers. If  a  query  that  get's
    aborted  somewhere  in the middle (maybe by a trigger) called
    nextval() for rows processed  earlier,  the  sequence  number
    isn't  advanced  at  recovery  time,  because  the  query  is
    suppressed at all.   And  sequences  aren't  locked,  so  for
    concurrently  running  queries  getting numbers from the same
    sequence,  the  results   aren't   reproduceable.   If   some
    application  selects  a  value  resulting from a sequence and
    uses that later in another query, how could the redolog  know
    that  this has changed? It's a Const in the query logged, and
    all that corrupts the whole thing.

    All that is painful and I don't see another solution yet than
    to  hook  into  nextval(),  log  out the numbers generated in
    normal operation and getting back the same  numbers  in  redo
    mode.

    The whole thing gets more and more complicated :-(


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

Re: [HACKERS] redolog - for discussion

От
Vadim Mikheev
Дата:
Jan Wieck wrote:
> 
> >
> > I foresee problems with using _commands_ logging for
> > recovery/replication -:((
> >
...
> 
>     Yepp,  the order in which commands begun is absolutely not of
>     interest. Locking could already delay the  execution  of  one
>     command  until  another  one  started  later has finished and
>     released the lock.  It's a classic race condition.
> 
>     Thus, my plan was to log the queries just before the call  to
>     CommitTransactionCommand()  in  tcop. This has the advantage,

Oh, I see - you right!

...
> 
>     Unfortunately  query  level  logging get's hit by the current
>     implementation of sequence numbers. If  a  query  that  get's
...
> 
>     All that is painful and I don't see another solution yet than
>     to  hook  into  nextval(),  log  out the numbers generated in

Not so bad, having buffering these numbers in memory...

>     normal operation and getting back the same  numbers  in  redo
>     mode.
> 
>     The whole thing gets more and more complicated :-(

As usual -:))

Vadim


RE: [HACKERS] redolog - for discussion

От
"Hiroshi Inoue"
Дата:
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Vadim Mikheev
> Sent: Thursday, December 17, 1998 5:03 PM
> To: Jan Wieck
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] redolog - for discussion
> 
> 
> Jan Wieck wrote:
> > 
> > >
> > > I foresee problems with using _commands_ logging for
> > > recovery/replication -:((
> > >
> ...
> > 
> >     Yepp,  the order in which commands begun is absolutely not of
> >     interest. Locking could already delay the  execution  of  one
> >     command  until  another  one  started  later has finished and
> >     released the lock.  It's a classic race condition.
> > 
> >     Thus, my plan was to log the queries just before the call  to
> >     CommitTransactionCommand()  in  tcop. This has the advantage,
> 
> Oh, I see - you right!
>

If   image level logging is used,probably it's OK.
But if query(command) level logging is used ???

If   the isolation level of all transactions is SERIARIZABLE,it's probably 
OK because they are serializable  order by the time when they are 
committed.
But if there are transactions whose isolation level is READ COMMITTED,
they are not serializable.
So commands must be issued according to the original order when they 
were issued ?

If   the same mechanism of locking is used at recovery time,the order 
of locks caused by commands(rerun) will be same ?????
I'm not confident.   
Thanks.

Hiroshi inoue
Inoue@tpf.co.jp