Re: Race conditions, race conditions!
From | Tatsuo Ishii
---|---
Subject | Re: Race conditions, race conditions!
Date |
Msg-id | 20050508.113149.115902969.t-ishii@sra.co.jp
In reply to | Race conditions, race conditions! (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: Race conditions, race conditions! (Bruce Momjian <pgman@candle.pha.pa.us>)
List | pgsql-hackers
Are we going to put the fixes into 8.0.3 and so on? Or will they be
included in 8.0.4?
--
Tatsuo Ishii

> I promised an analysis of the problems Jan Wieck uncovered yesterday,
> so here it is.
>
> Jan had developed a testbed (which I hope he will post) that essentially
> has a bunch of client threads all doing instances of the same
> transaction in READ COMMITTED mode. The intended database invariant is
> that the number of rows in table t2 with a particular ID value is equal
> to the "cnt" field of the single row in t1 with that ID value:
>
> 	begin;
> 	select cnt from t1 where id = K for update;
> 	delete from t2 where id = K;
> 	-- check returned rowcount to see that exactly CNT rows were deleted
> 	insert into t2 values (K);	-- repeat this N times
> 	update t1 set cnt = N where id = K;
> 	commit;
>
> K is randomly chosen for each execution from the known set of t1 keys,
> and N is randomly chosen each time as a small positive integer. This
> should maintain the invariant, since at any time only one transaction
> can be holding the SELECT FOR UPDATE lock on a particular t1 row.
>
> The trick is to run this under astronomical load; 100 or so client
> threads on a garden-variety PC is about right.
>
> What Jan was seeing was that every so often, a client thread would
> error out, reporting that its DELETE had deleted zero rows rather
> than the expected number. But subsequent examination showed the t2
> rows as being there.
>
> Investigation showed that the connected backend had properly acquired
> FOR UPDATE lock on the t1 row, but the snapshot it was using for the
> subsequent DELETE showed the inserter of the t1 row as still running.
> This should be impossible, since the SELECT FOR UPDATE cannot lock an
> uncommitted row, and in READ COMMITTED mode we certainly will take a
> new snapshot for the DELETE.
>
> However, there is a race condition here. During transaction commit,
> xact.c first marks the transaction committed in pg_clog, and then
> clears its XID from PGPROC. This means there is a narrow window
> in which both TransactionIdDidCommit and TransactionIdIsInProgress
> will return true. (We cannot do it the other way around, because
> if neither one is returning true, onlookers are entitled to assume
> that the transaction has crashed.) However, the tqual.c routines
> will allow a row to be seen as committed as soon as
> TransactionIdDidCommit(xmin) returns true. So the scenario is:
>
> 1. Backend A does RecordTransactionCommit to mark itself committed
> in pg_clog, but then loses the CPU in the narrow window between
> doing that and clearing its PGPROC entry. Because of the ridiculous
> load, it doesn't get control back for awhile.
>
> 2. Backend B comes along to run the test transaction for the same K
> value. It inspects the t1 row, concludes it's committed, marks the
> row as locked FOR UPDATE, and returns the results to the client.
>
> 3. The client now issues the DELETE command. B takes a new snapshot,
> but because A is still not cleared out of PGPROC, A's transaction
> is shown as still running in the snapshot.
>
> 4. Now the DELETE will delete no rows, because it doesn't consider the
> t2 rows it should delete to be committed.
>
> AFAICS this race condition has always been there; certainly at
> least since Vadim put in MVCC, and it looks likely that the
> original Berkeley code had a form of the problem.
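[For concreteness, here is a minimal, hypothetical sketch of one iteration of such a client in C with libpq. Jan's actual testbed was not posted, so the connection string, schema, and error handling below are assumptions, not his code. A real run would start 100 or so of these, each looping with freshly randomized K and N, to create the load described above.]

    /*
     * Hypothetical sketch of one iteration of the testbed transaction.
     * Connection parameters and table names are assumptions.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    static void
    die(PGconn *conn, const char *msg)
    {
        fprintf(stderr, "%s: %s\n", msg, PQerrorMessage(conn));
        PQfinish(conn);
        exit(1);
    }

    static void
    run_transaction(PGconn *conn, int k, int n)
    {
        char        sql[128];
        PGresult   *res;
        int         cnt;
        int         i;

        PQclear(PQexec(conn, "BEGIN"));

        /* Lock the t1 row; only one client at a time gets past this point. */
        snprintf(sql, sizeof(sql), "SELECT cnt FROM t1 WHERE id = %d FOR UPDATE", k);
        res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_TUPLES_OK || PQntuples(res) != 1)
            die(conn, "SELECT FOR UPDATE failed");
        cnt = atoi(PQgetvalue(res, 0, 0));
        PQclear(res);

        /* The invariant says exactly cnt rows of t2 must go away here. */
        snprintf(sql, sizeof(sql), "DELETE FROM t2 WHERE id = %d", k);
        res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            die(conn, "DELETE failed");
        if (atoi(PQcmdTuples(res)) != cnt)
            fprintf(stderr, "invariant violated: expected %d deletions\n", cnt);
        PQclear(res);

        /* Re-insert N rows and record the new count. */
        snprintf(sql, sizeof(sql), "INSERT INTO t2 VALUES (%d)", k);
        for (i = 0; i < n; i++)
            PQclear(PQexec(conn, sql));
        snprintf(sql, sizeof(sql), "UPDATE t1 SET cnt = %d WHERE id = %d", n, k);
        PQclear(PQexec(conn, sql));

        PQclear(PQexec(conn, "COMMIT"));
    }

    int
    main(void)
    {
        /* dbname and key/count ranges are placeholders for illustration. */
        PGconn     *conn = PQconnectdb("dbname=test");

        if (PQstatus(conn) != CONNECTION_OK)
            die(conn, "connection failed");

        run_transaction(conn, 1 + rand() % 100, 1 + rand() % 10);

        PQfinish(conn);
        return 0;
    }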
>
> The correct fix is that the tqual.c routines should check commit status
> in the following way:
>
> 	if (TransactionIdIsInProgress(xid))
> 		// still in progress, don't touch tuple
> 	else if (TransactionIdDidCommit(xid))
> 		// committed, mark accordingly
> 	else
> 		// must be aborted or crashed, mark accordingly
>
> rather than what they have traditionally done:
>
> 	if (TransactionIdDidCommit(xid))
> 		// committed, mark accordingly
> 	else if (TransactionIdDidAbort(xid))
> 		// aborted, mark accordingly
> 	else
> 		// assume still in progress, don't touch tuple
>
> Vadim recognized that the former logic was necessary for VACUUM to use
> in deciding if it could clean up dead tuples, but apparently he never
> made the extrapolation that it should be used *everywhere* transaction
> status is tested.
>
> The other interesting thing we saw was an Assertion failure. The
> postmaster log showed
>
> WARNING: relation "t1" page 196 is uninitialized --- fixing
> TRAP: FailedAssertion("!((((PageHeader) ((PageHeader) pageHeader))->pd_upper == 0))", File: "hio.c", Line: 263)
> LOG: server process (PID 11296) was terminated by signal 6
>
> The WARNING could only have come from VACUUM. (Jan's testbed does
> launch a VACUUM every so often.) The Assert failure is in
> RelationGetBufferForTuple where it is adding a new page to a table.
>
> I interpret this as the guy doing RelationGetBufferForTuple added a
> page, but before he could initialize it, he lost control for long enough
> for a VACUUM to scan through the entire table, see the zeroed page, and
> "fix" it. Then when the first guy got control again, his Assert saying
> the page was zeroes failed. The window for this exists because bufmgr/smgr
> do physically extend the file, but when the new buffer is returned
> to the caller, it is only pinned, not locked. So it is possible for
> VACUUM to acquire lock on the new page before RelationGetBufferForTuple
> does. (This is exceedingly improbable, because VACUUM would have to
> scan through the entire relation first --- it wouldn't examine the
> new page at all unless it was included in RelationGetNumberOfBlocks
> at the start of VACUUM's scan. We have not been able to reproduce the
> crash, but we did see it once.)
>
> At first I thought this didn't have any consequences worse than an
> Assert, since after all both VACUUM and the original extender are just
> trying to set up the page as empty. But consider this scenario:
>
> 1. Process A adds a zero page and gets blocked ahead of initializing it.
>
> 2. Process B runs VACUUM ... sees the zero page ... fixes it ... puts it
> in FSM.
>
> 3. Process C takes the page from FSM and puts some new tuples in it.
>
> 4. Process A initializes the page, thus discarding C's tuples.
>
> This could not happen if Asserts are enabled, because A would Assert
> before re-initializing the page (and that's inside a lock so it is
> sure to happen that way). But with Asserts off, it's within the realm
> of possibility under sufficiently heavy load.
>
> This risk has been there since lazy VACUUM was created (it cannot
> happen with VACUUM FULL because that takes an exclusive lock on the
> whole table). I believe the best fix is:
>
> 1. Change RelationGetBufferForTuple to hold the "relation extension"
> lock until after it's exclusive-locked the new buffer.
>
> 2. Adjust VACUUM to not try to "fix" a page without getting the relation
> extension lock.
> If the page is still uninitialized after obtaining
> that lock, then it must be a really lost page, not one that someone
> is still in process of setting up.
>
> There is a similar risk in btree indexes, with a similar fix.
>
> We are currently testing candidate patches for these problems,
> and so far haven't seen any further failures.
>
> 			regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
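[As a rough illustration of fix #1 above (not the actual patch that was tested), the extension path could keep the "relation extension" lock until the new buffer is exclusive-locked. The sketch below is illustrative only; LockRelationForExtension/UnlockRelationForExtension are the lock-manager names used in later PostgreSQL releases and are assumptions here, not necessarily what the 8.0-era patch used.]

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"

    /*
     * Sketch of the fixed ordering: extend the relation and exclusive-lock
     * the new page while still holding the relation-extension lock, so a
     * concurrent VACUUM cannot lock and "fix" the still-zeroed page first.
     */
    static Buffer
    extend_relation_sketch(Relation relation)
    {
        Buffer      buffer;
        Page        page;

        /* Serialize physical extension of the relation. */
        LockRelationForExtension(relation, ExclusiveLock);

        /* Physically extend the file; the returned buffer is pinned only. */
        buffer = ReadBuffer(relation, P_NEW);

        /* Exclusive-lock the new page before anyone else can touch it ... */
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

        /* ... and only then let other backends (e.g. VACUUM) proceed. */
        UnlockRelationForExtension(relation, ExclusiveLock);

        page = BufferGetPage(buffer);
        Assert(PageIsNew(page));    /* still all-zeroes, as the Assert expects */
        PageInit(page, BufferGetPageSize(buffer), 0);

        return buffer;
    }

[With that ordering, VACUUM can never lock the page while it is still all-zeroes, so the Assert in RelationGetBufferForTuple cannot trip on a page someone else already initialized, and the FSM can never hand out a page that is about to be re-zeroed.]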