Thread: Concurrent-update problem in bufmgr.c

Concurrent-update problem in bufmgr.c

From:
Tom Lane
Date:
The comments for bufmgr.c's BufferSync routine point out that it's a
bad thing for some other backend to be modifying a page while it is
written out.  The following text has gone unchanged since Postgres95:
*    Also, we need to be sure that no other transaction is
*    modifying the page as we flush it.  This is only a problem for objects
*    that use a non-two-phase locking protocol, like btree indices.  For
*    those objects, we would like to set a write lock for the duration of
*    our IO.  Another possibility is to code updates to btree pages
*    carefully, so that writing them out out of order cannot cause
*    any unrecoverable errors.
*
*    I don't want to think hard about this right now, so I will try
*    to come back to it later.

Unfortunately, the comment is wrong about this being a problem only for
indexes.  It's possible for an invalid state of a heap page to be
written out, as well.  PageAddItem() sets the item pointer for an added
tuple before it copies the tuple onto the page, so if it is recycling an
existing item pointer slot, there is a state where a valid item pointer
is pointing at a tuple that's wholly or partly not valid.  This doesn't
matter as far as active backends are concerned because we should be
holding BUFFER_LOCK_EXCLUSIVE on the page while modifying it.  But some
other backend could be in the process of writing out the page (if it had
previously dirtied the page), and so it's possible for this invalid
state to reach disk.  If the database is shut down before the new
update of the page can be written out, then we have a problem.
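To make the window concrete, here is a minimal toy sketch of that ordering
(the structures and names below are simplified stand-ins for illustration,
not the actual PageAddItem()/bufpage.c code):

    #include <stdint.h>
    #include <string.h>

    /* Toy model of a heap page; the real layout lives in bufpage.h. */
    typedef struct
    {
        uint16_t lp_off;            /* offset of tuple data within space[] */
        uint16_t lp_len;            /* tuple length; nonzero means "in use" */
    } ToyItemId;

    typedef struct
    {
        uint16_t  pd_upper;         /* start of tuple data in space[] */
        ToyItemId linp[64];         /* line-pointer array */
        char      space[8192];      /* tuple storage, filled top-down */
    } ToyPage;

    /*
     * Add a tuple, recycling line-pointer slot 'slot'.  The ordering mirrors
     * the problem described above: the pointer becomes valid before the
     * tuple bytes are on the page.
     */
    static void
    toy_page_add_item(ToyPage *page, int slot, const char *tuple, uint16_t len)
    {
        uint16_t off = page->pd_upper - len;

        /* (1) The line pointer now claims a live tuple at 'off'... */
        page->linp[slot].lp_off = off;
        page->linp[slot].lp_len = len;

        /*
         * <-- window: a backend that dirtied this page earlier could be
         * flushing the buffer to disk right now.  Without a shared lock on
         * the buffer during the write, the on-disk image gets a valid item
         * pointer aimed at whatever stale bytes sit at 'off'.
         */

        /* (2) ...and only now are the tuple bytes copied into place. */
        memcpy(page->space + off, tuple, len);
        page->pd_upper = off;
    }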

Normally, the new page state will be written out at transaction commit,
but what happens if the current transaction aborts?  In that case, the
dirty page just sits in shared memory.  It will get written the next
time a transaction modifies the page (and commits), or when some backend
decides to recycle the buffer to hold another page.  But if the
postmaster gets shut down before that happens, we lose; the dirty page
is never written at all, and when it's re-read after database restart,
the corrupted page state becomes visible.

The window of vulnerability is considerably wider in 7.0 than in prior
releases, because in prior releases *any* transaction commit will write
all dirty pages.  In 7.0 the dirtied page will not get written out until
we commit a transaction that modified that particular page (or decide to
recycle the buffer).  The odds of seeing a problem are still pretty
small, but the risk is definitely there.

I believe the correct fix for this problem is for bufmgr.c to grab
a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out.
A read lock is sufficient since there's no need to prevent other
backends from reading the page, we just need to prevent them from
changing it during the I/O.
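A sketch of what that would look like around the write call (not a verbatim
patch; the call that actually hands the page image to the storage manager is
elided, but LockBuffer() and the lock modes are the existing buffer-latch
interface):

    /* In the buffer write-out path (BufferSync()/FlushBuffer()): */
    LockBuffer(buffer, BUFFER_LOCK_SHARE);    /* readers OK, modifiers must wait */

    /* ... existing smgrwrite()/smgrflush() call for this buffer's page ... */

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);   /* drop the lock once the I/O call returns */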

Comments anyone?
        regards, tom lane


Re: Concurrent-update problem in bufmgr.c

From:
Hiroshi Inoue
Date:

Tom Lane wrote:

[snip]

>
> The window of vulnerability is considerably wider in 7.0 than in prior
> releases, because in prior releases *any* transaction commit will write
> all dirty pages.  In 7.0 the dirtied page will not get written out until
> we commit a transaction that modified that particular page (or decide to
> recycle the buffer).  The odds of seeing a problem are still pretty
> small, but the risk is definitely there.
>
> I believe the correct fix for this problem is for bufmgr.c to grab
> a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out.
> A read lock is sufficient since there's no need to prevent other
> backends from reading the page, we just need to prevent them from
> changing it during the I/O.
>
> Comments anyone?

This seems to be almost the same as what I posted 4 months ago
(smgrwrite() without LockBuffer (was RE: ...)).
Maybe Vadim would take care of it in the implementation of WAL.
The following was Vadim's reply to you and me.

>
> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > > As far as I see,PostgreSQL doesn't call LockBuffer() before
> > > calling smgrwrite(). This seems to mean that smgrwrite()
> > > could write buffers to disk which are being changed by
> > > another backend.  If the (other) backend was aborted for
> > > some reason, the buffer page would remain half-changed.
> >
> > Hmm ... looks fishy to me too.  Seems like we ought to hold
> > BUFFER_LOCK_SHARE on the buffer while dumping it out.  It
> > wouldn't matter under normal circumstances, but as you say
> > there could be trouble if the other backend crashed before
> > it could mark the buffer dirty again, or if we had a system
> > crash before the dirtied page got written again.
>
> Well, known issue. Buffer latches were implemented in 6.5 only
> and there was no time to think about subj hard -:)
> Yes, we have to shlock the buffer before writing, and this is what
> bufmgr will have to do for WAL anyway (to ensure that the last buffer
> changes are already logged)... but please check that the buffer is not
> exc-locked by the backend itself somewhere before smgrwrite()...
>
> Vadim
>
>
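For clarity, the caveat in the last paragraph of Vadim's reply amounts to
something like the following (a hypothetical sketch; BufferLockHeldByMe() is
an invented name for "this backend already holds the exclusive lock", not a
real bufmgr.c function):

    /*
     * Hypothetical: skip the extra share lock if this backend already holds
     * the buffer exclusively, since re-locking it would deadlock against
     * ourselves.
     */
    if (!BufferLockHeldByMe(buffer, BUFFER_LOCK_EXCLUSIVE))
    {
        LockBuffer(buffer, BUFFER_LOCK_SHARE);
        /* ... write the page image out ... */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    }
    else
    {
        /* Already exclusively locked by this backend: no one else can be
         * modifying the page, so it is safe to write without re-locking. */
        /* ... write the page image out ... */
    }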

Regards.

Hiroshi Inoue




Re: Concurrent-update problem in bufmgr.c

From:
Tom Lane
Date:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> This seems to be almost the same as what I posted 4 months ago
> (smgrwrite() without LockBuffer (was RE: ...)).

You are right, this was already a known issue (and I had it buried in
my to-do list, in fact).  I rediscovered it while puzzling over some
of the corrupted-data reports we've seen lately.

I'll go ahead and make the change in current sources.  Does anyone
have a strong feeling about whether or not to back-patch it into
the REL7_0 branch for the upcoming 7.0.3 release?
        regards, tom lane


RE: Concurrent-update problem in bufmgr.c

From:
"Mikheev, Vadim"
Date:
> I believe the correct fix for this problem is for bufmgr.c to grab
> a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out.
> A read lock is sufficient since there's no need to prevent other
> backends from reading the page, we just need to prevent them from
> changing it during the I/O.
> 
> Comments anyone?

Do it.

Vadim