Re: Post-mortem: final 2PC patch
From | Tom Lane
---|---
Subject | Re: Post-mortem: final 2PC patch
Date |
Msg-id | 8641.1119133122@sss.pgh.pa.us
Response to | Re: Post-mortem: final 2PC patch (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses | Re: Post-mortem: final 2PC patch (Heikki Linnakangas <hlinnaka@iki.fi>)
List | pgsql-patches
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Sat, 18 Jun 2005, Tom Lane wrote:
>> I'm not totally satisfied with this --- it's OK from a correctness
>> standpoint but the performance leaves something to be desired.

> Ouch, that really hurts performance.

> In typical 2PC use, the state files live for a very short period of time,
> just long enough for the transaction manager to prepare all the resource
> managers participating in the global transaction, and then commit them.
> We're talking < 1 s. If we let checkpoint fsync the state files, we
> would only have to fsync those state files that happen to be alive when
> the checkpoint comes.

That's a good point --- I was thinking this was basically 4 fsyncs per
xact (counting the additional WAL fsync needed for COMMIT PREPARED)
versus 3, but if the average lifetime of a state file is short then it's
4 vs 2, and what's more the 2 are on WAL, which should be way cheaper
than fsyncing random files.

> And if we fsync the state files at the end of the
> checkpoint, after all the usual heap pages etc, it's very likely that
> even those rare state files that were alive when the checkpoint began,
> have already been deleted.

That argument is bogus, at least with respect to the way you were doing
it in the original patch, because what you were fsyncing was whatever
existed when CheckPointTwoPhase() started.  It could however be
interesting if we actually implemented something that checked the age of
the prepared xact.

> Can we figure out another way to solve the race condition? Would it
> in fact be ok for the checkpointer to hold the TwoPhaseStateLock,
> considering that it usually wouldn't be held for long, since usually the
> checkpoint would have very little work to do?

If you're concerned about throughput of 2PC xacts then we can't sit on
the TwoPhaseStateLock while doing I/O; that will block both preparation
and committal of all 2PC xacts for a pretty long period in CPU terms.
Here's a sketch of an idea inspired by your comment above:

1. In each gxact in shared memory, store the WAL offset of the PREPARE
record, which we will know before we are ready to mark the gxact "valid".

2. When CheckPointTwoPhase runs (which we'll put near the end of the
checkpoint sequence), the only gxacts that need to be fsync'd are those
that are marked valid and have a PREPARE WAL location older than the
checkpoint's redo horizon (anything newer will be replayed from WAL on
crash, so it doesn't need fsync to complete the checkpoint).  If you're
right that the lifespan of a state file is often shorter than the time
needed for a checkpoint, this wins big.  In any case we'll never have to
fsync state files that disappear before the next checkpoint.

3. One way to handle CheckPointTwoPhase is:

   * At start, take TwoPhaseStateLock (can be in shared mode) for just
     long enough to scan the gxact list and make a list of the XIDs of
     things that need fsync per the above rule.
   * Without the lock, try to open and fsync each item in the list.
     Success: remove from list.
     ENOENT failure on open: add to a list of not-there failures.
     Any other failure: ereport(ERROR).
   * If the failure list is not empty, again take TwoPhaseStateLock in
     shared mode, and check that each of the failures is now gone (or at
     least marked invalid); if so it's OK, otherwise ereport the ENOENT
     error.

Another possibility is to further extend the locking protocol for gxacts
so that the checkpointer can lock just the item it is fsyncing (which is
not possible at the moment because the checkpointer hasn't got an XID,
but probably we could think of another approach).  But that would
certainly delay attempts to commit the item being fsync'd, whereas the
above approach might not have to do so, depending on the filesystem
implementation.
Now there's a small problem with this approach, which is that we cannot
store the PREPARE WAL record location in the state files, since the
state file has to be completely computed before writing the WAL record.
However, we don't really need to do that: during recovery of a prepared
xact we know the thing has been fsynced (either originally, or when we
rewrote it during the WAL recovery sequence --- we can force an
immediate fsync in that one case).  So we can just put zero, or maybe
better the current end-of-WAL location, into the reconstructed gxact in
memory.

Thoughts?

			regards, tom lane