Обсуждение: Proposal for DROP TABLE rollback mechanism

Поиск
Список
Период
Сортировка

Proposal for DROP TABLE rollback mechanism

От
Tom Lane
Дата:
(This is mostly directed at Vadim, but kibitzing is welcome.)

Here's what I plan to do to make DROP TABLE rollbackable and clean up
the handling of CREATE TABLE rollback.  Comments?


Overview:

1. smgrcreate() will create the file for the relation same as it does now,
and will add the rel's RelFileNode information to an smgr-private list of
rels created in the current transaction.

2. smgrunlink() normally will NOT immediately delete the file; instead it
will perform smgrclose() and then add the rel's RelFileNode information to
an smgr-private list of rels to be deleted at commit.  However, if the
file appears in the list created by smgrcreate() --- ie, the rel is being
created and deleted in the same xact --- then we can delete it
immediately.  In this case we remove the file from the smgrcreate list
and do not put it on the unlink list.

3. smgrcommit() will delete all the files mentioned in the list created
by smgrunlink, then discard both lists.

4. smgrabort() will delete all the files mentioned in the list created
by smgrcreate, then discard both lists.

Points 1 and 4 will replace the existing relcache-based mechanism for
deleting files created in the current xact when the xact aborts.


Various details:

To support deleting files at xact commit/abort, we will need something
like an "mdblindunlink" entrypoint to md.c.  I am inclined to simply
redefine mdunlink to take a RelFileNode instead of a complete Relation,
rather than supporting two entrypoints --- I don't think there'll be any
future use for the existing mdunlink.  Objections?

bufmgr.c's ReleaseRelationBuffers drops any dirty buffers for the target
rel, and therefore it must NOT be called inside the transaction (else,
rollback would mean we'd lost data).  I am inclined to let it continue
to behave that way, but to call it from smgrcommit/smgrabort, not from
anywhere else.  This would mean changing its API to take a RelFileNode,
but that seems no big problem.  This way, dirty buffers for a doomed
relation will be allowed to live until transaction commit, in the hopes
that we will be able to discard them unwritten.

Will remove notices in DROP TABLE etc. warning that these operations
are not rollbackable.  Note that CREATE/DROP DATABASE is still not
rollback-able, and so those two ops will continue to elog(ERROR) when
called in a transaction block.  Ditto for VACUUM; probably also ditto
for REINDEX, though I haven't looked closely at that yet.

The temp table name mapper will need to be modified so that it can
undo all current-xact changes to its name mapping list at xact abort.
Currently I think it only handles undoing additions, not
deletions/renames.  This does not need to be WAL-aware, does it?


WAL:

AFAICS, things will behave properly if calls to smgrcreate/smgrunlink
are logged as WAL events.  For redo, they are executed just the same
as normal, except they shouldn't complain if the target file already
exists (or already doesn't exist, for unlink).  Undo of smgrcreate
is just immediate mdunlink; undo of smgrunlink is a no-op.

I have not studied the WAL code enough to be prepared to add the
logging/undo/redo code, and it looks like you haven't implemented that
anyway yet for smgr.c, so I will leave that part to you, OK?
        regards, tom lane


Re: Proposal for DROP TABLE rollback mechanism

От
"Vadim Mikheev"
Дата:
> 1. smgrcreate() will create the file for the relation same as it does now,
> and will add the rel's RelFileNode information to an smgr-private list of
> rels created in the current transaction.
>
> 2. smgrunlink() normally will NOT immediately delete the file; instead it
> will perform smgrclose() and then add the rel's RelFileNode information to        ^^^^^^^^^^^^^^^
Seems that smgrclose still need in Relation (where rd_fd lives and this is
bad) and you're going to put just file node to smgrunlink (below). Shouldn't
relation be removed from relcache (and smgrclose called from there)
before smgrunlink?

> an smgr-private list of rels to be deleted at commit.  However, if the
> file appears in the list created by smgrcreate() --- ie, the rel is being
> created and deleted in the same xact --- then we can delete it
> immediately.  In this case we remove the file from the smgrcreate list
> and do not put it on the unlink list.

(This wouldn't work for savepoints but ok for now.)

> 3. smgrcommit() will delete all the files mentioned in the list created
> by smgrunlink, then discard both lists.
>
> 4. smgrabort() will delete all the files mentioned in the list created
> by smgrcreate, then discard both lists.
>
> Points 1 and 4 will replace the existing relcache-based mechanism for
> deleting files created in the current xact when the xact aborts.
>
>
> Various details:
>
> To support deleting files at xact commit/abort, we will need something
> like an "mdblindunlink" entrypoint to md.c.  I am inclined to simply
> redefine mdunlink to take a RelFileNode instead of a complete Relation,
> rather than supporting two entrypoints --- I don't think there'll be any
> future use for the existing mdunlink.  Objections?

No one. Actually I would like to see all smgr entries taking RelFileNode
instead of Relation. This would require smgr cache to map nodes to fd-es.
Having this we could get rid of all blind smgr entries: smgropen
would put file node & fd into smgr cache and all other smgrmethods
would act as blind ones do now - no file node found in cache then
open file, performe op, close file.
But probably it's too late to implement this now.

> bufmgr.c's ReleaseRelationBuffers drops any dirty buffers for the target
> rel, and therefore it must NOT be called inside the transaction (else,
> rollback would mean we'd lost data).  I am inclined to let it continue
> to behave that way, but to call it from smgrcommit/smgrabort, not from
> anywhere else.  This would mean changing its API to take a RelFileNode,
> but that seems no big problem.  This way, dirty buffers for a doomed
> relation will be allowed to live until transaction commit, in the hopes
> that we will be able to discard them unwritten.

Mmmm, why not call FlushRelationBuffers? Calling bufmgr from smgr
doesn't look like right thing. ?

> Will remove notices in DROP TABLE etc. warning that these operations
> are not rollbackable.  Note that CREATE/DROP DATABASE is still not
> rollback-able, and so those two ops will continue to elog(ERROR) when
> called in a transaction block.  Ditto for VACUUM; probably also ditto
> for REINDEX, though I haven't looked closely at that yet.
>
> The temp table name mapper will need to be modified so that it can
> undo all current-xact changes to its name mapping list at xact abort.
> Currently I think it only handles undoing additions, not
> deletions/renames.  This does not need to be WAL-aware, does it?
>
>
> WAL:
>
> AFAICS, things will behave properly if calls to smgrcreate/smgrunlink
> are logged as WAL events.  For redo, they are executed just the same

Yes, there will be logging for smgrcreate, but not for smgrunlink because
of we'll make real unlinking after commit. All what is needed for WAL
is list of file nodes to remove - I need to put this list into commit log
record to ensure that files are removed on redo and need in ability
to remove file immediately in this case.

> as normal, except they shouldn't complain if the target file already
> exists (or already doesn't exist, for unlink).  Undo of smgrcreate
> is just immediate mdunlink; undo of smgrunlink is a no-op.
>
> I have not studied the WAL code enough to be prepared to add the
> logging/undo/redo code, and it looks like you haven't implemented that
> anyway yet for smgr.c, so I will leave that part to you, OK?

Unfortunately, there will be no undo in 7.1 -:(
I've found that to undo index updates we would need in either compensation
records or in xmin/cmin in index tuples. So, we'll still live with dust in
storage -:(
Redo is much easy.

Vadim




Re: Proposal for DROP TABLE rollback mechanism

От
Tom Lane
Дата:
"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
>> 2. smgrunlink() normally will NOT immediately delete the file; instead it
>> will perform smgrclose() and then add the rel's RelFileNode information to
>          ^^^^^^^^^^^^^^^
> Seems that smgrclose still need in Relation (where rd_fd lives and this is
> bad) and you're going to put just file node to smgrunlink (below). Shouldn't
> relation be removed from relcache (and smgrclose called from there)
> before smgrunlink?

No, the way the existing higher-level code works is first to call
smgrunlink (NOT smgrclose) and then remove the relcache entry.  I don't
see a need to change that.  In the interval where we're waiting to
commit, there will be no relcache entry, only a RelFileNode value
sitting in smgr's list of files to delete at commit.

> Actually I would like to see all smgr entries taking RelFileNode
> instead of Relation.

I agree, but I think that's a project for a future release.  Not enough
time for it for 7.1.

>> bufmgr.c's ReleaseRelationBuffers drops any dirty buffers for the target
>> rel, and therefore it must NOT be called inside the transaction (else,
>> rollback would mean we'd lost data).  I am inclined to let it continue
>> to behave that way, but to call it from smgrcommit/smgrabort, not from
>> anywhere else.  This would mean changing its API to take a RelFileNode,
>> but that seems no big problem.  This way, dirty buffers for a doomed
>> relation will be allowed to live until transaction commit, in the hopes
>> that we will be able to discard them unwritten.

> Mmmm, why not call FlushRelationBuffers? Calling bufmgr from smgr
> doesn't look like right thing. ?

Yes, it's a little bit ugly, but if we call FlushRelationBuffers then we
will likely be doing some useless writes (to flush out pages that we are
only going to throw away anyway).  If we leave the buffers alone till
commit, then we'll only write out pages if we need to recycle a buffer
for another use during that transaction.

Also, I don't feel comfortable with the idea of doing
FlushRelationBuffers mid-transaction and then relying on the buffer
cache to still be empty of pages for that relation much later on when
we finally commit.  Sure, it *should* be empty, but I'll be happier
if we flush the buffer cache immediately before deleting the file.

What might make sense is to make a pass over the buffer cache at the
time of DROP (inside the transaction) to make sure there are no pinned
buffers for the rel --- if so, we want to elog() during the transaction
not after commit.  We could also release any non-dirty buffers at
that point.  Then after commit we know we don't care about the dirty
buffers anymore, so we come back and discard them.
        regards, tom lane


Re: Proposal for DROP TABLE rollback mechanism

От
"Vadim Mikheev"
Дата:
> > Mmmm, why not call FlushRelationBuffers? Calling bufmgr from smgr
> > doesn't look like right thing. ?
> 
> Yes, it's a little bit ugly, but if we call FlushRelationBuffers then we
> will likely be doing some useless writes (to flush out pages that we are
> only going to throw away anyway).  If we leave the buffers alone till
> commit, then we'll only write out pages if we need to recycle a buffer
> for another use during that transaction.

BTW, why do we force buffers to disk in FlushRelationBuffers at all?
Seems all what is required is to flush them *from* pool, not *to* disk
immediately. In WAL bufmgr version (temporary in xlog_bufmgr.c)
I've changed this. Actually I wouldn't worry about these writes at all -
drop relation is rare case, - but ok.

> Also, I don't feel comfortable with the idea of doing
> FlushRelationBuffers mid-transaction and then relying on the buffer
> cache to still be empty of pages for that relation much later on when
> we finally commit.  Sure, it *should* be empty, but I'll be happier
> if we flush the buffer cache immediately before deleting the file.

(It *must* be empty -:) Sorry, I don't understand "*immediately* before"
in multi-user environment. Relation must be excl locked and this must
be only guarantee for us, not time of doing anything with this relation.)

> What might make sense is to make a pass over the buffer cache at the
> time of DROP (inside the transaction) to make sure there are no pinned
> buffers for the rel --- if so, we want to elog() during the transaction
> not after commit.  We could also release any non-dirty buffers at
> that point.  Then after commit we know we don't care about the dirty
> buffers anymore, so we come back and discard them.

Please note that there is xlog_bufmgr.c If you'll add/change something in
bufmgr please let me know later.

Vadim




Re: Proposal for DROP TABLE rollback mechanism

От
Tom Lane
Дата:
"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> BTW, why do we force buffers to disk in FlushRelationBuffers at all?
> Seems all what is required is to flush them *from* pool, not *to* disk
> immediately.

Good point.  Seems like it'd be sufficient to do a standard async write
rather than write + fsync.

We'd still need some additional logic at commit time, however, because
we want to make sure there are no BufferDirtiedByMe bits set for the
file we're about to delete...

>> Also, I don't feel comfortable with the idea of doing
>> FlushRelationBuffers mid-transaction and then relying on the buffer
>> cache to still be empty of pages for that relation much later on when
>> we finally commit.  Sure, it *should* be empty, but I'll be happier
>> if we flush the buffer cache immediately before deleting the file.

> (It *must* be empty -:)

In the absence of bugs, yes, it must be empty for the case where we
are committing a DROP.  But I want a safety check to catch those bugs.
Besides, what of the case where we are aborting a CREATE?  There must be
a bufmgr call that will allow us to ensure there are no buffers left for
the relation we intend to delete.

How about this: change FlushRelationBuffers so that it does standard
async writes for dirty buffers and then removes all the rel's buffers
from the pool.  This is invoked inside the transaction, same as now.
Make DROP TABLE call this routine, *not* RemoveRelationBuffers.
Then call RemoveRelationBuffers from smgr during transaction commit or
abort.  In the commit case, there really shouldn't be any buffers for
the rel, so we can emit an elog NOTICE (it's too late for ERROR, no?)
if we find any.  But in the abort case, we'd not be surprised to find
buffers, even dirty buffers, and we just want to throw 'em away.  This
would also be the place to clean out the BufferDirtiedByMe state.

If you don't like the smgr->bufmgr call, the reason for that is we're
asking smgr to keep the list of pending relation deletes.  We could make
a new module, logically above smgr and bufmgr, to keep that list
instead.  But I don't think it's worth the trouble.

> Please note that there is xlog_bufmgr.c If you'll add/change something in
> bufmgr please let me know later.

Will keep you informed.
        regards, tom lane


Re: Proposal for DROP TABLE rollback mechanism

От
"Vadim Mikheev"
Дата:
> > BTW, why do we force buffers to disk in FlushRelationBuffers at all?
> > Seems all what is required is to flush them *from* pool, not *to* disk
> > immediately.
>
> Good point.  Seems like it'd be sufficient to do a standard async write
> rather than write + fsync.
>
> We'd still need some additional logic at commit time, however, because
> we want to make sure there are no BufferDirtiedByMe bits set for the
> file we're about to delete...

Note that there is no BufferDirtiedByMe in WAL bufmgr version.

> How about this: change FlushRelationBuffers so that it does standard
> async writes for dirty buffers and then removes all the rel's buffers

^^^^^^^^^^^^^^^^^^^^^^^^^
When it's used from vacuum no reason to do this.

> from the pool.  This is invoked inside the transaction, same as now.
> Make DROP TABLE call this routine, *not* RemoveRelationBuffers.
> Then call RemoveRelationBuffers from smgr during transaction commit or
> abort.  In the commit case, there really shouldn't be any buffers for
> the rel, so we can emit an elog NOTICE (it's too late for ERROR, no?)

^^^^^^^^^^^^^^^^^^^^^
Sure, but good time for Assert -:)

> if we find any.  But in the abort case, we'd not be surprised to find
> buffers, even dirty buffers, and we just want to throw 'em away.  This
> would also be the place to clean out the BufferDirtiedByMe state.

BTW, currently smgrcommit is called twice on commit *before* xact
status in pg_log updated and so you can't use it to remove files. In
WAL version smgrcommit isn't called at all now but we easy can
add smgrcommit call after commit record is logged.

Vadim




Re: Proposal for DROP TABLE rollback mechanism

От
Tom Lane
Дата:
"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> Please note that there is xlog_bufmgr.c If you'll add/change something in
> bufmgr please let me know later.

Per your request: I've changed bufmgr.c.  I think I made appropriate
changes in xlog_bufmgr, but please check.  The changes were:

1. Modify FlushRelationBuffers to do plain write, not flush (no fsync)
of dirty buffers.  This was per your suggestion.  FlushBuffer() now
takes an extra parameter indicating whether fsync is wanted.  I think
this change does not affect xlog_bufmgr at all.

2. Rename ReleaseRelationBuffers to DropRelationBuffers to make it more
clear what it's doing.

3. Add a DropRelFileNodeBuffers, which is just like DropRelationBuffers
except it takes a RelFileNode argument.  This is used by smgr to ensure
that the buffer cache is clear of buffers for a rel about to be deleted.

4. Update comments about usage of DropRelationBuffers and
FlushRelationBuffers. 

Rollback of DROP TABLE now works in non-WAL code, and seems to work in
WAL code too.  I did not add WAL logging, because I'm not quite sure
what to do, so rollforward probably does the wrong thing.  Could you
deal with that part?  smgr.c is the place that keeps the list of what
to delete at commit or abort.
        regards, tom lane


RE: Proposal for DROP TABLE rollback mechanism

От
"Mikheev, Vadim"
Дата:
> Rollback of DROP TABLE now works in non-WAL code, and seems to work in
> WAL code too.  I did not add WAL logging, because I'm not quite sure
> what to do, so rollforward probably does the wrong thing.  Could you
> deal with that part?  smgr.c is the place that keeps the list of what
> to delete at commit or abort.

Ok, thanks! I'll take list of relfilenodes to delete just before
commit and put it into commit record.

Vadim