Discussion: lazy snapshots?
I had the following idea for an optimization. Feel free to tell me I'm nuts. Would it be possible to postpone the operation of taking a snapshot until we encounter an in-doubt tuple - that is, a tuple whose XMIN or XMAX is committed but not all-visible? It seems to me that there are many transactions that probably never look at any recently-modified data, and that the overhead (and contention) of scanning the ProcArray could be avoided for such transactions.

At the time when we currently take a snapshot, we could instead record an estimate of the oldest XID still running; I'll call this value the threshold XID. Ideally, this would be something we could read from shared memory in O(1) time. Subsequently, when we examine XMIN or XMAX, we may find that it's aborted (in which case we don't need a snapshot to decide what to do), or that the XID we're examining precedes the threshold XID (in which case we don't need a snapshot to decide what to do), or that the XID we're examining is our own (in which case we again don't need a snapshot to decide what to do). If none of those conditions hold, we take a snapshot. (Possibly, we could try rereading the threshold XID from shared memory, because it might have advanced far enough to get us out of the woods.)

It's necessary to convince ourselves not only that this has some performance benefit but that it's actually correct. It's easy to see that, if we never take a snapshot, all the tuple visibility decisions we make will be exactly identical to the ones that we would have made with a snapshot; the choice of snapshot in that case is arbitrary. But if we do eventually take a snapshot, we'll likely make different tuple visibility decisions than we would have made had we taken the snapshot earlier. However, the decisions that we make prior to taking the snapshot will be consistent with the snapshot, and we will certainly see the effects of all transactions that committed before we started.

We may also see the effects of some transactions that commit after we started, but that is OK: it is just as if our whole transaction had been started slightly later and then executed more quickly thereafter. It would be bad if we saw the effect of transaction A but not transaction B where transaction B committed after transaction A, but the way snapshots are taken prevents that regardless of exactly when we do it.

VACUUM can't remove any tuples with committed XMINs unless their XMAX precedes our threshold XID, but I think that's not any worse under this proposal than it is anyway. If we took a full snapshot instead of just writing down a threshold XID, we'd have the same problem.

OK, that's it. Comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
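A rough mock-up may make the decision procedure above concrete. This is a hedged sketch in Python, not PostgreSQL code: the names (LazySnapshotState, read_threshold_xid, take_snapshot) are invented for illustration, plain integer comparisons stand in for wraparound-aware XID comparisons, and transaction status is passed in rather than looked up in clog.

```python
# Hypothetical sketch of the lazy-snapshot decision logic described
# above. All names are invented; real XIDs wrap around, which this ignores.

ABORTED, COMMITTED, IN_PROGRESS = "aborted", "committed", "in_progress"

class LazySnapshotState:
    def __init__(self, threshold_xid, my_xid, read_threshold_xid, take_snapshot):
        self.threshold_xid = threshold_xid            # estimate of oldest running XID
        self.my_xid = my_xid
        self.read_threshold_xid = read_threshold_xid  # cheap O(1) re-read from shared memory
        self.take_snapshot = take_snapshot            # the expensive ProcArray scan
        self.snapshot = None

    def needs_snapshot(self, xid, status):
        """Decide whether examining this XMIN/XMAX forces a real snapshot."""
        if self.snapshot is not None:
            return True                   # already paid the cost once
        if status == ABORTED:
            return False                  # aborted: decidable without a snapshot
        if xid < self.threshold_xid:
            return False                  # precedes oldest running XID: all-visible
        if xid == self.my_xid:
            return False                  # our own transaction
        # The threshold may have advanced far enough to get us out of the woods.
        self.threshold_xid = self.read_threshold_xid()
        if xid < self.threshold_xid:
            return False
        self.snapshot = self.take_snapshot()   # in-doubt tuple: fall back to a snapshot
        return True
```

Only the final branch pays for a ProcArray scan; every other path is O(1), which is the whole point of the proposal.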
Robert Haas <robertmhaas@gmail.com> writes:

> It's necessary to convince ourselves not only that this has some performance benefit but that it's actually correct. It's easy to see that, if we never take a snapshot, all the tuple visibility decisions we make will be exactly identical to the ones that we would have made with a snapshot; the choice of snapshot in that case is arbitrary. But if we do eventually take a snapshot, we'll likely make different tuple visibility decisions than we would have made had we taken the snapshot earlier. However, the decisions that we make prior to taking the snapshot will be consistent with the snapshot, and we will certainly see the effects of all transactions that committed before we started. We may also see the effects of some transactions that commit after we started, but that is OK: it is just as if our whole transaction had been started slightly later and then executed more quickly thereafter.

I don't think this is going to be acceptable at all. You're assuming that clients have no independent means of determining what order transactions execute in, which isn't the case. It would be quite possible, for example, for a query submitted to one backend to see the effects of a transaction that was submitted to another backend long after the first query started. If the two clients involved interact at all, they're not going to be happy. Even if they just compare transaction timestamps, they're not going to be happy.

I'm less than convinced by the hypothesis that most transactions would avoid taking snapshots in this regime, anyway. It would only hold up if there's little locality of reference in terms of which tuples are getting examined/modified by concurrent transactions, and that's a theory that seems much more likely to be wrong than right.

I wonder whether we could do something involving WAL properties --- the current tuple visibility logic was designed before WAL existed, so it's not exploiting that resource at all. I'm imagining that the kernel of a snapshot is just a WAL position, ie the end of WAL as of the time you take the snapshot (easy to get in O(1) time). Visibility tests then reduce to "did this transaction commit with a WAL record located before the specified position?". You'd need some index datastructure that made it reasonably cheap to find out the commit locations of recently committed transactions, where "recent" means "back to recentGlobalXmin". That seems possibly do-able, though I don't have a concrete design in mind.

regards, tom lane
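A minimal sketch of that visibility test, assuming some mapping from XID to commit-record LSN exists. The commit_lsn dict below stands in for the index data structure Tom leaves unspecified, and all names are invented for illustration.

```python
# Hypothetical sketch of a WAL-position snapshot. The kernel of the
# snapshot is just the end-of-WAL LSN at the time it was taken; the
# commit_lsn mapping stands in for an index of commit locations back to
# recentGlobalXmin (XIDs absent from it have not committed).

class LSNSnapshot:
    def __init__(self, end_of_wal_lsn):
        self.lsn = end_of_wal_lsn          # O(1) to capture: just read end of WAL

def visible_to(snapshot, xid, commit_lsn):
    """Did this transaction commit with a WAL record before snapshot.lsn?"""
    lsn = commit_lsn.get(xid)
    return lsn is not None and lsn < snapshot.lsn
```

Note that taking the snapshot itself touches no shared per-backend state at all; the cost moves to maintaining and probing the XID-to-LSN index, which is exactly where the rest of the thread goes.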
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I don't think this is going to be acceptable at all. You're assuming that clients have no independent means of determining what order transactions execute in, which isn't the case. It would be quite possible, for example, for a query submitted to one backend to see the effects of a transaction that was submitted to another backend long after the first query started. If the two clients involved interact at all, they're not going to be happy. Even if they just compare transaction timestamps, they're not going to be happy.

I'm not sure they're entitled to rely on any other behavior. Couldn't the exact same thing happen in a non-MVCC database based on SS2PL?

> I'm less than convinced by the hypothesis that most transactions would avoid taking snapshots in this regime, anyway. It would only hold up if there's little locality of reference in terms of which tuples are getting examined/modified by concurrent transactions, and that's a theory that seems much more likely to be wrong than right.

There will certainly be workloads where most transactions acquire a snapshot, but just to take one obvious example, suppose we have a data warehouse where every night we bulk load the day's data, and then we run reporting queries all day. Except during the overnight bulk loads, there's no concurrent write activity at all, and thus no need for snapshots.

Or imagine a database where we store monitoring data. There's a continuous flow of monitoring data from multiple sources, and then people run reports. The users running reports will need snapshots, but the processes updating the monitoring data will presumably be touching discrete sets of tuples. They may be INSERT-only, and even if they do updates, the process monitoring resource A only needs to look at the rows for resource A, not the rows for resource B. If the tables are large enough that index scans are used and the threshold XID is updated sufficiently frequently, you might get away without snapshots. This isn't quite so clear a win as the first one but maybe it's worth thinking about.

One thing we could do is instrument the current code to track whether any field of the snapshot other than snapshot->xmin is ever used, and then run some benchmarks to see how often that happens.

> I wonder whether we could do something involving WAL properties --- the current tuple visibility logic was designed before WAL existed,

Wow.

> so it's not exploiting that resource at all. I'm imagining that the kernel of a snapshot is just a WAL position, ie the end of WAL as of the time you take the snapshot (easy to get in O(1) time). Visibility tests then reduce to "did this transaction commit with a WAL record located before the specified position?". You'd need some index datastructure that made it reasonably cheap to find out the commit locations of recently committed transactions, where "recent" means "back to recentGlobalXmin". That seems possibly do-able, though I don't have a concrete design in mind.

Interesting. O(1) snapshots would be great. I need to think about this more before commenting on it, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:

> There will certainly be workloads where most transactions acquire a snapshot, but just to take one obvious example, suppose we have a data warehouse where every night we bulk load the day's data, and then we run reporting queries all day. Except during the overnight bulk loads, there's no concurrent write activity at all, and thus no need for snapshots.

Well, yeah, but in this scenario there's also no contention involved in taking snapshots --- there are only readers of ProcArray and (IIRC) they only need shared locks on the array. If you want to make any meaningful improvement in this area, you need something that solves the ProcArray access contention caused by a heavy mixed read/write transaction load.

regards, tom lane
On Wed, Oct 20, 2010 at 8:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

>> I'm imagining that the kernel of a snapshot is just a WAL position, ie the end of WAL as of the time you take the snapshot (easy to get in O(1) time). Visibility tests then reduce to "did this transaction commit with a WAL record located before the specified position?". You'd need some index datastructure that made it reasonably cheap to find out the commit locations of recently committed transactions, where "recent" means "back to recentGlobalXmin". That seems possibly do-able, though I don't have a concrete design in mind.
>
> Interesting. O(1) snapshots would be great. I need to think about this more before commenting on it, though.

I spent a bunch of time thinking about this, and I don't see any way to get the memory usage requirements down to something reasonable. The problem is that RecentGlobalXmin might be arbitrarily far back in XID space, and you'll need to know the LSN of every commit from that point forward; whereas the ProcArray requires only constant space.

To put this another way, for any given snapshot, the "interesting" XIDs are the top-level XIDs between its xmin and xmax; the number of those that are invisible to the snapshot is bounded, but the total number is unbounded. To make this work, you'd need an inexpensive way of finding the set of transactions which committed after a given LSN with XIDs less than the snapshot's xmax.

Another idea for making that work is to keep a table of the <min-LSN, max-xmax> pairs for every snapshot in the system and only track exactly those commits that are relevant, but it seems hard to think about how to maintain that data structure, and hard also to prune the table of tracked commits as snapshots are released.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:

> I spent a bunch of time thinking about this, and I don't see any way to get the memory usage requirements down to something reasonable. The problem is that RecentGlobalXmin might be arbitrarily far back in XID space, and you'll need to know the LSN of every commit from that point forward; whereas the ProcArray requires only constant space.

That's like arguing that clog is no good because it doesn't fit in constant space. ISTM it would be entirely practical to remember the commit LSN positions of every transaction back to RecentGlobalXmin, using a data structure similar to pg_subtrans --- in fact, it'd require exactly twice as much working space as pg_subtrans, ie 64 bits per XID instead of 32. Now, it might be that access contention problems would make this unworkable (I think pg_subtrans works largely because we don't have to access it often) but it's not something that can be dismissed on space grounds.

[ thinks for a bit... ] But actually this probably ends up being a wash or a loss as far as contention goes. We're talking about a data structure that has to be updated during each commit, and read pretty frequently, and it's not obvious how that's any better than getting commit info from the ProcArray. Although neither commit nor reading would require a *global* lock, so maybe there's a way ...

regards, tom lane
On Thu, Nov 4, 2010 at 2:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> That's like arguing that clog is no good because it doesn't fit in constant space. ISTM it would be entirely practical to remember the commit LSN positions of every transaction back to RecentGlobalXmin, using a data structure similar to pg_subtrans --- in fact, it'd require exactly twice as much working space as pg_subtrans, ie 64 bits per XID instead of 32. Now, it might be that access contention problems would make this unworkable (I think pg_subtrans works largely because we don't have to access it often) but it's not something that can be dismissed on space grounds.

Maybe I didn't explain that very well. The point is not so much how much memory you're using in an absolute sense as how much of it you have to look at to construct a snapshot. If you store a giant array indexed by XID whose value is an LSN, you have to read a potentially unbounded number of entries from that array. You can either make a single read through the relevant portion of the array (snapshot xmin to snapshot xmax) or you can check each XID as you see it and try to build up a local cache, but either way there's no fixed limit on how many bytes must be read from the shared data structure. That compares unfavorably with the current design, where you do a one-time read of a bounded amount of data and you're done. I suspect your theory about pg_subtrans is correct.

> [ thinks for a bit... ] But actually this probably ends up being a wash or a loss as far as contention goes. We're talking about a data structure that has to be updated during each commit, and read pretty frequently, and it's not obvious how that's any better than getting commit info from the ProcArray. Although neither commit nor reading would require a *global* lock, so maybe there's a way ...

Well, if we're talking about the "giant XID array" design, or some variant of that, I would expect that nearly all of the contention would be on the last page or two, so I don't think it would be much better than a global lock. You might be able to get around that by not using an LWLock, and instead using something like lock xchg or LL/SC to atomically update entries, but I'm not sure there's a portable way to do such operations on anything larger than a 4-byte word. At any rate, I think the problem described in the preceding paragraph is the more serious one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
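The unbounded-read objection can be made concrete with a toy version of the giant XID-indexed array (hypothetical Python with invented names; a real structure would live in shared memory, like pg_subtrans, and XIDs would wrap).

```python
# Toy illustration of the point above: deciding visibility for all XIDs
# between a snapshot's xmin and xmax means one shared-structure read per
# XID in that range, which is unbounded, unlike the fixed-size ProcArray.

def invisible_xids(commit_lsn_array, xmin, xmax, snapshot_lsn):
    """Collect XIDs in [xmin, xmax) not visible to the snapshot, and
    count how many array entries had to be read to find them."""
    invisible, entries_read = set(), 0
    for xid in range(xmin, xmax):
        entries_read += 1                       # touches shared data every time
        lsn = commit_lsn_array.get(xid)
        if lsn is None or lsn >= snapshot_lsn:
            invisible.add(xid)                  # uncommitted, or committed after us
    return invisible, entries_read
```

The invisible set is bounded by the number of concurrently running transactions, but entries_read grows with xmax - xmin, which is exactly the asymmetry Robert is pointing at.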
On Wed, Oct 20, 2010 at 8:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Well, yeah, but in this scenario there's also no contention involved in taking snapshots --- there are only readers of ProcArray and (IIRC) they only need shared locks on the array. If you want to make any meaningful improvement in this area, you need something that solves the ProcArray access contention caused by a heavy mixed read/write transaction load.

There is certainly some improvement to be had even in eliminating shared ProcArrayLocks, and the work that must be done while holding them. For example, you have to hold an exclusive lock to end a transaction, so that's going to compete with any shared lockers.

But suppose we do have a heavy mixed read/write transaction load. Who is to say that requires a snapshot? It will require a snapshot if one transaction reads data that has been recently written by another transaction, but I'm not convinced that'll necessarily happen that often. For example, suppose you have a table with a million records and you have 200 concurrent database connections. Each connection repeatedly starts a transaction where it reads 10 records and then writes 10 records. When a new transaction starts, it overlaps 199 other transactions; if all of those transactions have done all of their reads and writes already, there are 3,980 records in the table for which we'll need a snapshot to determine tuple visibility. Assuming we look at no other tuples (all the accesses using index scans rather than sequential scans), the chance that we'll need to take a snapshot is only 1-((1-(3980/1000000))^20) = ~7.7%. That's pretty good, and of course if you have fewer overlapping transactions or fewer operations per transaction it gets better very quickly.

Now, of course, if you have a lot of locality of reference, things aren't going to be nearly so good. If we assume that the accesses are spread across only 100,000 records instead of 1,000,000, then each transaction has a better-than-even chance of needing a snapshot. However, in that situation, you're going to have other contention problems, too: there's a very significant probability that two backends will actually try to update the same tuple, and one will sleep until the other commits. So maybe the cost of taking snapshots won't be the biggest problem in that case anyway (he said hopefully).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
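The back-of-the-envelope numbers in this example can be checked directly; a quick verification in Python, using the parameters stated above:

```python
# Parameters from the example: 200 connections, each transaction touches
# 10 reads + 10 writes = 20 records out of a 1,000,000-row table.
connections, tuples_per_xact, table_size = 200, 20, 1_000_000

# 199 overlapping transactions leave 199 * 20 in-doubt records.
in_doubt = (connections - 1) * tuples_per_xact
assert in_doubt == 3980

# Probability that at least one of our 20 accesses hits an in-doubt tuple.
p_snapshot = 1 - (1 - in_doubt / table_size) ** tuples_per_xact
assert abs(p_snapshot - 0.077) < 0.001      # ~7.7%, as stated above

# With the same accesses spread over only 100,000 rows, the odds flip.
p_small = 1 - (1 - in_doubt / 100_000) ** tuples_per_xact
assert p_small > 0.5                        # better-than-even chance of a snapshot
```

Both claims in the message hold: about 7.7% for the million-row table, and a better-than-even chance once the working set shrinks to 100,000 rows.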