Discussion: Multixid hindsight design
I'd like to discuss how we should've implemented the infamous 9.3 multixid/row-locking stuff, and perhaps still should in 9.6. Hindsight is always 20/20 - I'll readily admit that I didn't understand the problems until well after the release - so this isn't meant to bash what's been done. Rather, let's think of the future.

The main problem with the infamous multixid changes was that it made pg_multixact a permanent, critical piece of data. Without it, you cannot decipher whether some rows have been deleted or not. The 9.3 changes uncovered pre-existing issues with vacuuming and wraparound, but the fact that multixids are now critical turned those otherwise relatively harmless bugs into data loss.

We have pg_clog, which is a similar critical data structure. That's a pain too - you need VACUUM and you can't easily move tables from one cluster to another, for example - but we've learned to live with it. But we certainly don't need any more such data structures.

So the lesson here is that having a permanent pg_multixact is not nice, and we should get rid of it. Here's how to do that:

Looking at the tuple header, the CID and CTID fields are only needed when either xmin or xmax is running. Almost: in a HOT-updated tuple, CTID is required even after xmax has committed, but since it's a HOT update, the new tuple is always on the same page, so you only need the offsetnumber part. That leaves us with 8 bytes that are always available for storing "ephemeral" information. By ephemeral, I mean that it is only needed while xmin or xmax is in progress. After that, e.g. after a shutdown, it's never looked at.

Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed by a 64-bit pointer, which means that it never wraps around. That 64-bit pointer is stored in the tuple header, in those 8 ephemeral bytes currently used for CID and CTID.
Whenever a tuple is deleted/updated and locked at the same time, a TED entry is created for it in the new SLRU, and the pointer to the entry is put on the tuple. In the TED entry, we can use as many bytes as we need to store the ephemeral data. It would include the CID (or possibly both CMIN and CMAX separately, now that we have the space), the CTID, and the locking XIDs. The list of locking XIDs could be stored there directly, replacing multixids completely, or we could store a multixid there and use the current pg_multixact system to decode them. Or we could store the multixact offset in the TED, replacing the multixact offset SLRU, but keep the multixact member SLRU as is.

The XMAX stored on the tuple header would always be a real transaction ID, not a multixid. Hence locked-only tuples don't need to be frozen afterwards.

The beauty of this would be that the TED entries can be zapped at restart, just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be WAL-logged, and we are free to change its on-disk layout even in a minor release.

Further optimizations are possible. If the TED entry fits in 8 bytes, it can be stored directly in the tuple header. Like today, if a tuple is locked but not deleted/updated, you only need to store the locker XID, and you can store the locking XID directly on the tuple. Or if it's deleted and locked, CTID is not needed, only CID and locker XID, so you can store those directly on the tuple. Plus some spare bits to indicate what is stored. And if the XMIN is older than global-xmin, you could also steal the XMIN field for storing TED data, making it possible to store 12 bytes directly in the tuple header. Plus some spare bits again to indicate that you've done that.

Now, given where we are, how do we get there? Upgrade is a pain, because even if we no longer generate any new multixids, we'll have to be able to decode them after pg_upgrade.
Perhaps condense pg_multixact into a simpler pg_clog-style bitmap at pg_upgrade, to make it small and simple to read, but it would nevertheless be a fair amount of code just to deal with pg_upgraded databases.

I think this is worth doing, even after we've fixed all the acute multixid bugs, because this would be more robust in the long run. It would also remove the need to do anti-wraparound multixid vacuums, and the newly-added tuning knobs related to that.

- Heikki
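[Editor's sketch: the "spare bits to indicate what is stored" idea above can be modelled as follows. This is a toy Python model, not PostgreSQL code; the format tags and field widths are invented for illustration.]

```python
import struct

# Invented format tags; in the real header these would live in spare
# infomask bits, while the payload occupies the 8 bytes freed from CID
# and the block-number half of CTID.
FMT_TED_POINTER = 0   # all 8 bytes: a 64-bit TED pointer (never wraps)
FMT_LOCKER_ONLY = 1   # locked but not updated: locker XID stored inline
FMT_CID_LOCKER  = 2   # deleted and locked: CID + locker XID inline, no CTID

def encode(fmt, *fields):
    if fmt == FMT_TED_POINTER:
        (ted_ptr,) = fields
        payload = struct.pack('<Q', ted_ptr)
    elif fmt == FMT_LOCKER_ONLY:
        (locker_xid,) = fields
        payload = struct.pack('<II', locker_xid, 0)
    else:  # FMT_CID_LOCKER
        cid, locker_xid = fields
        payload = struct.pack('<II', cid, locker_xid)
    assert len(payload) == 8   # must fit the ephemeral bytes exactly
    return fmt, payload

def decode(fmt, payload):
    if fmt == FMT_TED_POINTER:
        return struct.unpack('<Q', payload)[0]
    return struct.unpack('<II', payload)
```

The point of the sketch is only that one fixed 8-byte slot can carry either an out-of-line pointer or the common small cases inline, with a couple of tag bits selecting the interpretation.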
Heikki Linnakangas <hlinnaka@iki.fi> writes: > So the lesson here is that having a permanent pg_multixact is not nice, > and we should get rid of it. Here's how to do that: That would be cool, but ... > Looking at the tuple header, the CID and CTID fields are only needed, > when either xmin or xmax is running. Almost: in a HOT-updated tuple, > CTID is required even after xmax has committed, but since it's a HOT > update, the new tuple is always on the same page so you only need the > offsetnumber part. I think this is totally wrong. HOT update or not, you need the forward link represented by ctid not just until xmin/xmax commit, but until they're in the past according to all active snapshots. That's so that other transactions can chain forward to the "current" version of a tuple which they found according to their snapshots. It might be you can salvage the idea anyway, since it's still true that the forward links wouldn't be needed after a crash and restart. But the argument as stated is wrong. (There's also the question of forensic requirements, although I'm aware that it's barely worth bringing that up since nobody else here seems to put much weight on that.) regards, tom lane
On 05/12/2015 01:51 AM, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> So the lesson here is that having a permanent pg_multixact is not nice, >> and we should get rid of it. Here's how to do that: > > That would be cool, but ... > >> Looking at the tuple header, the CID and CTID fields are only needed, >> when either xmin or xmax is running. Almost: in a HOT-updated tuple, >> CTID is required even after xmax has committed, but since it's a HOT >> update, the new tuple is always on the same page so you only need the >> offsetnumber part. > > I think this is totally wrong. HOT update or not, you need the forward > link represented by ctid not just until xmin/xmax commit, but until > they're in the past according to all active snapshots. That's so that > other transactions can chain forward to the "current" version of a tuple > which they found according to their snapshots. > > It might be you can salvage the idea anyway, since it's still true that > the forward links wouldn't be needed after a crash and restart. But the > argument as stated is wrong. Ah yes, I stated that wrong. What I meant was that they are not needed after xmin and xmax are older than global xmin. > (There's also the question of forensic requirements, although I'm aware > that it's barely worth bringing that up since nobody else here seems to > put much weight on that.) I do care about that. In this scheme, you would always have the updater/deleter XMAX on the tuple itself, which IMO is more useful for forensic purposes than a multixid. You lose the CID and CTID in the tuple (for tuples that are updated and locked at the same time), but if you keep the TED around longer, you have all the information still there. On the whole, I don't think this is much worse than the current situation. - Heikki
On Mon, May 11, 2015 at 5:20 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > The main problem with the infamous multixid changes was that it made > pg_multixact a permanent, critical, piece of data. Without it, you cannot > decipher whether some rows have been deleted or not. The 9.3 changes > uncovered pre-existing issues with vacuuming and wraparound, but the fact > that multixids are now critical turned those the otherwise relatively > harmless bugs into data loss. Agreed. > We have pg_clog, which is a similar critical data structure. That's a pain > too - you need VACUUM and you can't easily move tables from one cluster to > another for example - but we've learned to live with it. But we certainly > don't need any more such data structures. Yes. > So the lesson here is that having a permanent pg_multixact is not nice, and > we should get rid of it. Here's how to do that: > > Looking at the tuple header, the CID and CTID fields are only needed, when > either xmin or xmax is running. Almost: in a HOT-updated tuple, CTID is > required even after xmax has committed, but since it's a HOT update, the new > tuple is always on the same page so you only need the offsetnumber part. > That leaves us with 8 bytes that are always available for storing > "ephemeral" information. By ephemeral, I mean that it is only needed when > xmin or xmax is in-progress. After that, e.g. after a shutdown, it's never > looked at. > > Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed by > a 64-bit pointer, which means that it never wraps around. That 64-bit > pointer is stored in the tuple header, in those 8 ephemeral bytes currently > used for CID and CTID. Whenever a tuple is deleted/updated and locked at the > same time, a TED entry is created for it, in the new SLRU, and the pointer > to the entry is put on the tuple. In the TED entry, we can use as many bytes > as we need to store the ephemeral data. 
It would include the CID (or > possibly both CMIN and CMAX separately, now that we have the space), CTID, > and the locking XIDs. The list of locking XIDs could be stored there > directly, replacing multixids completely, or we could store a multixid > there, and use the current pg_multixact system to decode them. Or we could > store the multixact offset in the TED, replacing the multixact offset SLRU, > but keep the multixact member SLRU as is. > > The XMAX stored on the tuple header would always be a real transaction ID, > not a multixid. Hence locked-only tuples don't need to be frozen afterwards. > > The beauty of this would be that the TED entries can be zapped at restart, > just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be > WAL-logged, and we are free to change its on-disk layout even in a minor > release. > > Further optimizations are possible. If the TED entry fits in 8 bytes, it can > be stored directly in the tuple header. Like today, if a tuple is locked but > not deleted/updated, you only need to store the locker XID, and you can > store the locking XID directly on the tuple. Or if it's deleted and locked, > CTID is not needed, only CID and locker XID, so you can store those direcly > on the tuple. Plus some spare bits to indicate what is stored. And if the > XMIN is older than global-xmin, you could also steal the XMIN field for > storing TED data, making it possible to store 12 bytes directly in the tuple > header. Plus some spare bits again to indicate that you've done that. > > Now, given where we are, how do we get there? Upgrade is a pain, because > even if we no longer generate any new multixids, we'll have to be able to > decode them after pg_upgrade. Perhaps condense pg_multixact into a simpler > pg_clog-style bitmap at pg_upgrade, to make it small and simple to read, but > it would nevertheless be a fair amount of code just to deal with pg_upgraded > databases. 
> I think this is worth doing, even after we've fixed all the acute multixid
> bugs, because this would be more robust in the long run. It would also
> remove the need to do anti-wraparound multixid vacuums, and the newly-added
> tuning knobs related to that.

One danger is that in rearranging all of this stuff we may introduce lots of new bugs. I do agree that making multixacts need to survive a server crash was not a good idea. We liked freezing xmin so much, we decided to freeze xmax, too? Uggh. As painful as that's been, though, we're 18 months into it at this point. If we do another reorganization, are we going to end up back at month 0, where pretty much everybody had corruption all the time rather than only some people on some workloads? Maybe not, but it's certainly something to worry about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > The beauty of this would be that the TED entries can be zapped at restart, > just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be > WAL-logged, and we are free to change its on-disk layout even in a minor > release. What about prepared transactions? They can lock rows FOR SHARE that survive server restarts. -- Thomas Munro http://www.enterprisedb.com
On Mon, Jun 1, 2015 at 8:53 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > What about prepared transactions? They can lock rows FOR SHARE that > survive server restarts. And they can make update chains that are still uncommitted after a restart. I think we should think separately about what information we want to store in the tuple ideally and only then worry about how to encode it concisely as an optimization exercise. If you just grow every tuple by 64-bits would this scheme be trivial? Perhaps it would be worth implementing that way first and then worrying about how to recuperate that space later? -- greg
On 1 June 2015 at 20:53, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > The beauty of this would be that the TED entries can be zapped at restart,
> > just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
> > WAL-logged, and we are free to change its on-disk layout even in a minor
> > release.
>
> What about prepared transactions? They can lock rows FOR SHARE that
> survive server restarts.
Interesting comment. I'm not aware that we do.
If we do support row locking that survives server restart, how did it work before 9.3?
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-06-05 10:45:09 +0100, Simon Riggs wrote: > On 1 June 2015 at 20:53, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > > On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> > > wrote: > > > The beauty of this would be that the TED entries can be zapped at > > restart, > > > just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be > > > WAL-logged, and we are free to change its on-disk layout even in a minor > > > release. > > > > What about prepared transactions? They can lock rows FOR SHARE that > > survive server restarts. > > > > Interesting comment. I'm not aware that we do. > > If we do support row locking that survives server restart, how did it work > before 9.3? Multixacts were persistent before 9.3 as well. A good number of the bugs existed then as well, but their effect was much more limited. The difference is that now multixacts don't just have to survive till the last locker isn't running anymore (which was determined by a horizon), but that they have to live till they're vacuumed away, since xmax might be stored in the multixact.
On 11 May 2015 at 22:20, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'd like to discuss how we should've implemented the infamous 9.3 multixid/row-locking stuff, and perhaps still should in 9.6. Hindsight is always 20/20 - I'll readily admit that I didn't understand the problems until well after the release - so this isn't meant to bash what's been done. Rather, let's think of the future.
>
> The main problem with the infamous multixid changes was that it made pg_multixact a permanent, critical piece of data. Without it, you cannot decipher whether some rows have been deleted or not. The 9.3 changes uncovered pre-existing issues with vacuuming and wraparound, but the fact that multixids are now critical turned those otherwise relatively harmless bugs into data loss.
>
> We have pg_clog, which is a similar critical data structure. That's a pain too - you need VACUUM and you can't easily move tables from one cluster to another for example - but we've learned to live with it. But we certainly don't need any more such data structures.
>
> So the lesson here is that having a permanent pg_multixact is not nice, and we should get rid of it. Here's how to do that:
I think we should think back to exactly what we are trying to store, why and for how long. A clear definition of what we are trying to achieve is essential to solving the problem.
In my understanding we need to store
* at most one xid - the Updating Xid
* potentially many Locking Xids
The current design has two SLRUs and puts all of those xids in the Members SLRU, causing it to need to be persistent.
The problems come from having significant numbers of locking xids. My understanding is that any change in the number of lockers requires the full array to be rewritten. So with N lockers we end up with 2N-1 arrays, each holding an average of N/2 members, or about N^2 entries in total, i.e. an O(N^2) algorithm, which makes it a bad thing to persist. Assuming that design continues mostly unchanged in its core points...
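[Editor's sketch: the O(N^2) behaviour described above can be checked with a toy model, assuming, as stated, that every change in the locker set rewrites the whole member array.]

```python
def member_rewrite_cost(n_lockers):
    """Count arrays and total member entries written when N lockers
    arrive one by one and then leave one by one, with every change
    rewriting the full member array."""
    arrays = entries = size = 0
    for _ in range(n_lockers):        # each arrival: new array, one member bigger
        size += 1
        arrays += 1
        entries += size
    for _ in range(n_lockers - 1):    # each departure: new array, one smaller
        size -= 1
        arrays += 1
        entries += size
    return arrays, entries
```

Arrivals write 1 + 2 + ... + N = N(N+1)/2 entries and departures write (N-1) + ... + 1 = N(N-1)/2 more, so the total is exactly N^2 entries across 2N-1 arrays, matching the figures above.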
An alternate proposal:
1. Store only the Locking xids in the Members SLRU
2. In the Offsets SLRU store: 1) the Updating Xid and 2) the offset to the Locking xids in the Members SLRU.
This means the Offsets SLRU will be around twice the size it was before BUT since we reduce the size of each Members array by one, there is a balanced saving there, so this change is disk-space-neutral.
That way if we need to make Offsets SLRU persistent it won't bloat.
We then leave the Members SLRU as non-persistent, just as it was <9.3
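[Editor's sketch: a back-of-envelope model of the disk-space argument. The entry sizes here are illustrative guesses (4-byte XIDs and member offsets, ~5 bytes per member including flag bits), not the real SLRU record layouts.]

```python
MEMBER_BYTES = 5   # illustrative: XID plus flag bits per member

def current_scheme(n_lockers):
    offsets_entry = 4                                # just the offset into Members
    members_array = (n_lockers + 1) * MEMBER_BYTES   # updater stored as a member,
    return offsets_entry + members_array             # so Members must persist

def proposed_scheme(n_lockers):
    offsets_entry = 4 + 4                            # member offset + Updating Xid
    members_array = n_lockers * MEMBER_BYTES         # lockers only, zappable
    return offsets_entry + members_array
```

Under these assumed sizes, the extra bytes added to each Offsets entry are roughly paid for by the member slot the updater no longer occupies, which is the sense in which the change is disk-space-neutral.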
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 5 June 2015 at 11:02, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-05 10:45:09 +0100, Simon Riggs wrote:
> On 1 June 2015 at 20:53, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> > On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas <hlinnaka@iki.fi>
> > wrote:
> > > The beauty of this would be that the TED entries can be zapped at
> > restart,
> > > just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
> > > WAL-logged, and we are free to change its on-disk layout even in a minor
> > > release.
> >
> > What about prepared transactions? They can lock rows FOR SHARE that
> > survive server restarts.
> >
>
> Interesting comment. I'm not aware that we do.
>
> If we do support row locking that survives server restart, how did it work
> before 9.3?
> Multixacts were persistent before 9.3 as well. A good number of the bugs
> existed then as well, but their effect was much more limited. The
> difference is that now multixacts don't just have to survive till the
> last locker isn't running anymore (which was determined by a horizon),
> but that they have to live till they're vacuumed away, since xmax might
> be stored in the multixact.
Phew! Had me worried for a minute.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 5, 2015 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > I think we should think back to exactly what we are trying to store, why and > for how long. A clear definition of what we are trying to achieve is > essential to solving the problem. > > In my understanding we need to store > * at most one xid - the Updating Xid > * potentially many Locking Xids > > The current design has two SLRUs and puts all of those xids in the Members > SLRU, causing it to need to be persistent. > > The problems come from having significant numbers of locking xids. My > understanding is that any change in the number of lockers requires the full > array to be rewritten. So with N lockers we end up with 2N-1 arrays, each > array has an average of N/2 members, or N^2 entries, i.e. an O(N^2) > algorithm, which makes it a bad thing to persist. Assuming that design > continues mostly unchanged in its core points... > > An alternate proposal: > > 1. Store only the Locking xids in the Members SLRU > 2. In the Offsets SLRU store: 1) the Updating Xid and 2) the offset to the > Locking xids in the Members SLRU. > > This means the Offsets SLRU will be around twice the size it was before BUT > since we reduce the size of each Members array by one, there is a balanced > saving there, so this change is disk-space-neutral. > > That way if we need to make Offsets SLRU persistent it won't bloat. > We then leave the Members SLRU as non-persistent, just as it was <9.3 Hmm, this is a neat idea. It would have been easier to implement if we'd thought of it before we released 9.3, though. At this point, I guess we'd have to either have a pg_upgrade compatibility break, or teach pg_upgrade to rejigger the old files into the new file format, or some other fix that's not immediately apparent to me. And it also sounds like a fair amount of work. But it might be worth it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndQuadrant.com> writes: > On 11 May 2015 at 22:20, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> So the lesson here is that having a permanent pg_multixact is not nice, >> and we should get rid of it. Here's how to do that: > An alternate proposal: > 1. Store only the Locking xids in the Members SLRU > 2. In the Offsets SLRU store: 1) the Updating Xid and 2) the offset to the > Locking xids in the Members SLRU. > This means the Offsets SLRU will be around twice the size it was before BUT > since we reduce the size of each Members array by one, there is a balanced > saving there, so this change is disk-space-neutral. > That way if we need to make Offsets SLRU persistent it won't bloat. > We then leave the Members SLRU as non-persistent, just as it was <9.3 I don't think you can do that, because it supposes that locking XIDs need not be remembered across a crash. Don't prepared transactions break that assumption? regards, tom lane
On Fri, Jun 5, 2015 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That way if we need to make Offsets SLRU persistent it won't bloat.
>> We then leave the Members SLRU as non-persistent, just as it was <9.3
>
> I don't think you can do that, because it supposes that locking XIDs need
> not be remembered across a crash. Don't prepared transactions break that
> assumption?

Well, that issue existed before 9.3, too. It's possible our old releases weren't entirely correct either, but the big change in 9.3 is that we have to keep MultiXacts around until they are frozen, rather than just until their member transactions are no longer running.

If I understand correctly, Simon's proposal would mean that pg_multixact/offsets would still need to survive until freezing, but pg_multixact/members would only need to survive until the member transactions were no longer running. That might span a crash or restart, in the case of prepared transactions, but we could clean up the member offsets when the prepared transactions were committed, rather than having to scan every table in the cluster first. That only eliminates half the need for multixact vacuuming, but it's something.

It would be a great deal nicer if we didn't have to keep ANY of the transactional data for a tuple around once it's all-visible. Heikki defined ephemeral as "only needed when xmin or xmax is in-progress", but if we extended that definition slightly to "only needed when xmin or xmax is in-progress or committed but not all-visible" then the amount of non-ephemeral data in the tuple header is 5 bytes (infomasks + t_hoff).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 5, 2015 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It would be a great deal nicer if we didn't have to keep ANY of the
> transactional data for a tuple around once it's all-visible. Heikki
> defined ephemeral as "only needed when xmin or xmax is in-progress",
> but if we extended that definition slightly to "only needed when xmin
> or xmax is in-progress or committed but not all-visible" then the
> amount of non-ephemeral data in the tuple header is 5 bytes (infomasks +
> t_hoff).

OK, I was wrong here: if you only have that stuff, you can't distinguish between a tuple that is visible to everyone and a tuple that is visible to no one. I think the minimal amount of data we need in order to distinguish visibility once no relevant transactions are in progress is one XID: either XMIN, if the tuple was never updated at all or only by the inserting transaction or one of its subxacts; or XMAX, if the inserting transaction committed. The other visibility information -- including (1) the other of XMIN and XMAX, (2) CMIN and CMAX, and (3) the CTID -- are only interesting until the transactions involved are no longer running and, if they committed, visible to all running transactions.

Heikki's proposal is basically to merge the 4-byte CID field and the first 4 bytes of the CTID that currently store the block number into one 8-byte field that can store a pointer into this new TED structure. And after mulling it over, that sounds pretty good to me. It's true (as has been pointed out by several people) that the TED will need to be persistent because of prepared transactions. But it would still be a big improvement over the status quo, because:

(1) We would no longer need to freeze MultiXacts. TED wouldn't need to be frozen either. You'd just truncate it whenever RecentGlobalXmin advances.

(2) If the TED becomes horribly corrupted, you can recover by committing or aborting any prepared transactions, shutting the system down, and truncating it, with no loss of data integrity. Nothing in the TED is required to determine whether tuples are visible to an unrelated transaction - you only need it (a) to determine whether tuples are visible to a particular command within a transaction that has inserted, updated, or deleted the tuple and (b) to determine whether tuples are locked.

(3) As a bonus, we'd eliminate combo CIDs, because the TED could have space to separately store CMIN and CMAX. Combo CIDs required special handling for logical decoding, and they are one of the nastier barriers to making parallelism support writes (because they are stored in backend-local memory of unbounded size and therefore can't easily be shared with workers), so it wouldn't be very sad if they went away.

I'm not quite sure how to decide whether something like this is worth (a) the work and (b) the risk of creating new bugs, but the more I think about it, the more the principle of the thing seems sound to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
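[Editor's sketch: a toy Python model of the TED entry discussed above, with invented field names, mainly to show why storing CMIN and CMAX separately removes the need for combo CIDs and why truncation is cheap.]

```python
from dataclasses import dataclass, field

@dataclass
class TedEntry:
    cmin: int                 # stored separately from cmax, so no
    cmax: int                 # combo-CID machinery is needed
    ctid: tuple               # (block, offset) forward link
    locker_xids: list = field(default_factory=list)  # replaces member list

ted = []                      # stand-in for the TED SLRU

def ted_alloc(entry):
    ted.append(entry)
    return len(ted) - 1       # a 64-bit position in real life, never wrapping

def ted_truncate():
    # Cheap cleanup/recovery: legal once RecentGlobalXmin has advanced
    # past every entry (and no prepared transaction still references one).
    ted.clear()
```

Because nothing needed for cross-transaction visibility lives here, throwing the whole structure away (as in point (2) above) loses no data integrity.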
On 24 June 2015 at 14:57, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 5, 2015 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > It would be a great deal nicer if we didn't have to keep ANY of the
> > transactional data for a tuple around once it's all-visible. Heikki
> > defined ephemeral as "only needed when xmin or xmax is in-progress",
> > but if we extended that definition slightly to "only needed when xmin
> > or xmax is in-progress or committed but not all-visible" then the
> > amount of non-ephemeral data in the tuple header is 5 bytes (infomasks +
> > t_hoff).
>
> OK, I was wrong here: if you only have that stuff, you can't
> distinguish between a tuple that is visible to everyone and a tuple
> that is visible to no one. I think the minimal amount of data we need
> in order to distinguish visibility once no relevant transactions are
> in progress is one XID: either XMIN, if the tuple was never updated at
> all or only by the inserting transaction or one of its subxacts; or
> XMAX, if the inserting transaction committed. The other visibility
> information -- including (1) the other of XMIN and XMAX, (2) CMIN and
> CMAX, and (3) the CTID -- are only interesting until the transactions
> involved are no longer running and, if they committed, visible to all
> running transactions.
>
> Heikki's proposal is basically to merge the 4-byte CID field and the
> first 4 bytes of the CTID that currently store the block number into
> one 8-byte field that can store a pointer into this new TED structure.
> And after mulling it over, that sounds pretty good to me. It's true
> (as has been pointed out by several people) that the TED will need to
> be persistent because of prepared transactions. But it would still be
> a big improvement over the status quo, because:
>
> (1) We would no longer need to freeze MultiXacts. TED wouldn't need
> to be frozen either. You'd just truncate it whenever RecentGlobalXmin
> advances.
>
> (2) If the TED becomes horribly corrupted, you can recover by
> committing or aborting any prepared transactions, shutting the system
> down, and truncating it, with no loss of data integrity. Nothing in
> the TED is required to determine whether tuples are visible to an
> unrelated transaction - you only need it (a) to determine whether
> tuples are visible to a particular command within a transaction that
> has inserted, updated, or deleted the tuple and (b) to determine whether
> tuples are locked.
>
> (3) As a bonus, we'd eliminate combo CIDs, because the TED could have
> space to separately store CMIN and CMAX. Combo CIDs required special
> handling for logical decoding, and they are one of the nastier
> barriers to making parallelism support writes (because they are stored
> in backend-local memory of unbounded size and therefore can't easily
> be shared with workers), so it wouldn't be very sad if they went away.
>
> I'm not quite sure how to decide whether something like this is worth (a)
> the work and (b) the risk of creating new bugs, but the more I think
> about it, the more the principle of the thing seems sound to me.
Splitting multitrans into persistent (xmax) and ephemeral (TED) is something I already proposed so I support the concept; TED is a much better suggestion, so I support TED.
Your addition of removing combocids is good also, since everything is public.
I think we need to see a detailed design and we also need to understand the size of this new beast. I'm worried it might become very big, very quickly causing problems for us in other ways. We would need to be certain that truncation can actually occur reasonably frequently and that there are no edge cases that cause it to bloat.
Though TED sounds nice, the way to avoid another round of on-disk bugs is by using a new kind of testing, not simply by moving the bits around.
It might be argued that we are increasing the diagnostic/forensic capabilities by making CIDs more public. We can use that...
The good thing I see from TED is it allows us to test the on-disk outcome of concurrent activity. Currently we have isolationtester, but that is not married in any way to the on-disk state, which allows a situation where isolationtester can pass yet we have corrupted on-disk state. We should specify the on-disk tuple representation as a state machine and work out how to recheck that the new on-disk state matches the state transition that we performed.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jun 24, 2015 at 11:30 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Though TED sounds nice, the way to avoid another round of on-disk bugs is by > using a new kind of testing, not simply by moving the bits around. I agree that we can do much better at testing than we traditionally have done, and I think pretty much everyone in the room for the developer unconference session on testing at PGCon was also in agreement with that. I really like the idea of taking purpose-built testing frameworks - like the one that Heikki created for the WAL format changes - and polishing them to the point where they can go into core. That's more work, of course, but very beneficial. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 24 June 2015 at 16:30, Simon Riggs <simon@2ndquadrant.com> wrote:
> Though TED sounds nice, the way to avoid another round of on-disk bugs is by
> using a new kind of testing, not simply by moving the bits around.
>
> The good thing I see from TED is it allows us to test the on-disk outcome of
> concurrent activity. We should specify the on-disk tuple representation as a
> state machine and work out how to recheck the new on-disk state matches the
> state transition that we performed.
To put some more flesh on this idea...
What I'm suggesting is moving from a 2-session isolationtester to a 3-session isolationtester:
1. Session 1
2. Session 2
3. After-action confirmation that the planned state change exists correctly on disk, rather than simply having the correct behaviour
The absence of the third step in our current testing is what has led us to various bugs in the past (IMHO).
A fourth step would be to define the isolationtests in such a way that we can run them as burn-in tests for billions of executions.
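A minimal sketch of how such a 3-step run with a burn-in loop could be structured, in Python. The sessions and the "page" are toy stand-ins invented for illustration - a real harness would drive SQL sessions and inspect actual relation pages (e.g. via pageinspect):

```python
# Sketch of a 3-step isolation run: two interleaved sessions followed by an
# after-action check that the on-disk state matches the expected transition,
# repeated many times as a burn-in test. The "page" here is a plain dict
# standing in for a real heap page.

def run_isolation_test(setup, session1, session2, verify_on_disk, iterations=3):
    """Run the interleaved scenario repeatedly (the burn-in loop) and
    verify the resulting on-disk state after every run."""
    for _ in range(iterations):
        page = setup()
        for step1, step2 in zip(session1, session2):
            step1(page)   # session 1 acts
            step2(page)   # session 2 acts
        # Step 3: confirm the planned state change exists on disk,
        # not merely that the visible behaviour was correct.
        verify_on_disk(page)

# Toy scenario: session 1 locks a row, then session 2 updates it.
def setup():
    return {"t1": {"state": "live", "xmax": None}}

s1 = [lambda p: p["t1"].update(state="locked", xmax=100)]
s2 = [lambda p: p["t1"].update(state="updated", xmax=101)]

def verify(page):
    t = page["t1"]
    assert t["state"] == "updated" and t["xmax"] == 101, t
```

Cranking `iterations` into the billions, as suggested above, turns the same scenario definition into a burn-in test without any change to the verification logic.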
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 25 June 2015 at 00:52, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that we can do much better at testing than we traditionally
> have done, and I think pretty much everyone in the room for the
> developer unconference session on testing at PGCon was also in
> agreement with that. I really like the idea of taking purpose-built
> testing frameworks - like the one that Heikki created for the WAL
> format changes - and polishing them to the point where they can go
> into core. That's more work, of course, but very beneficial.

Something that'd really help with that, IMO, would be weakening the requirement that everything be C or core Perl. Instead, require that configure detect whether or not the facilities for some tests are present, and have those tests fail with a clean warning at 'make' time indicating they were skipped for lack of prerequisites.

I don't see that as significantly different from having some buildfarm animals not bother to configure or test the PLs, SSL, etc. I understand why adding to the mix required for the core server build isn't acceptable, but hopefully separate test suites can be more flexible. A free-for-all of languages and tools doesn't make sense, but I'd like to see, say, Python and the 'unittest' module added, and to see acceptance of tests using psycopg2 to stream and decode WAL from a slot.

Thoughts?

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Nov 9, 2015 at 2:02 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Something that'd really help with that, IMO, would be weakening the
> requirement that everything be C or core Perl. Instead require that
> configure detect whether or not facilities for some tests are present,
> and have them fail with a clean warning indicating they were skipped
> for lack of pre-requisites at 'make' time.
>
> A free-for-all of languages and tools doesn't make sense,
> but I'd like to see, say, python and the 'unittest' module added, and
> to see acceptance of tests using psycopg2 to stream and decode WAL
> from a slot.
>
> Thoughts?

I actually kind of hate this sort of thing. It's already the case that some of the TAP tests are skipped if you don't have the prerequisites, and while that solves the problem of spurious test failures, it also means that the tests which have those dependencies are run in fewer and fewer places.

Now I'm not dead set against changing anything at all here, but I want to point out that we just added the TAP framework and already there are proposals (like Kevin's snapshot-too-old patch) to require DBD::Pg for some tests, which a lot of people aren't going to have, and I know you and probably a few other people would like Python, which is fair enough, but other people are going to want other things, and pretty soon we end up with a situation - if we're not already there - where for a developer or packager to get a machine to the point where it runs "all the tests" is a major undertaking.

Accepting more tools also increases the breadth of knowledge expected of committers and patch authors. If I commit a patch and it turns out something breaks, I'm expected to fix that. Now if it's in C or Perl, I can. If it's in Python, I can't. If we add Python to the list of things we test with, there will (I'm sure) be other committers who can fix the C and Python stuff, but have no clue about the Perl. The more frameworks we support, the worse this gets. And woe betide a patch author whose feature requires adjustment of existing tests - if that person actually notices the problem before commit, as opposed to the buildfarm finding it, that person now needs to learn enough of every language in which we support tests to update the existing tests and add any new ones which are needed. I'm starting to get the hang of isolation tests, but I still don't *really* understand TAP tests, and anything new we add is a whole new obstacle.

I think this is sort of an inherent problem with long-running projects. Everybody knows, for example, that the toolchain we're using to build our documentation is pretty crufty. But if we switched to something else that used a different input format, then every back-patchable doc fix would have to be done twice, once for the old toolchain and once for the new, which means that at least some people would have to understand both for a good 5-6 years. We'd also have to get that new toolchain working on every platform we support, and all of our packagers would have to switch to it, and of course we'd want it to support all the output formats we have today. Yet, if we were starting over, I think there's about zero chance we'd pick this approach. It's just that switching now is a job and a half, and we don't want to take that on lightly. Similarly with testing frameworks - except those are even worse, because nobody's saying that any of the stuff we have now will ever go away.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 9, 2015 at 2:02 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> Something that'd really help with that, IMO, would be weakening the
>> requirement that everything be C or core Perl. Instead require that
>> configure detect whether or not facilities for some tests are present,
>> and have them fail with a clean warning indicating they were skipped
>> for lack of pre-requisites at 'make' time.

> I actually kind of hate this sort of thing.

FWIW, I agree with Robert's position. Expanding the infrastructure required for tests may make life simpler for the initial author of a test, but we pay for that over and over again when you consider the long run. I think insisting on having e.g. DBD::Pg is simply exporting the author's work to other people; there isn't anything that it enables that you can't do without it.

We have added test infrastructure when there simply wasn't any other way to test something. Both the isolation framework and the TAP framework represent substantial investments of that sort. I'm fine with both of those. But the bar to adding new requirements ought to be pretty darn high, and not just "I was too lazy to write it without".

regards, tom lane
On 10 November 2015 at 02:26, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'd like to see, say, python and the 'unittest' module added, and
>> to see acceptance of tests using psycopg2 to stream and decode WAL
>> from a slot.
>
> I actually kind of hate this sort of thing.

For what it's worth, I don't actually like it either. However, I suspect that the current restricted tools are a significant impediment to improving our test tooling and infrastructure.

> Now I'm not dead set against changing anything at all here, but I want
> to point out that we just added the TAP framework and already there
> are proposals (like Kevin's snapshot too old patch) to require DBD::Pg
> for some tests, which a lot of people aren't going to have

The alternative seems to be driving psql -qAtz as a coprocess over a pipeline as a poor man's database driver. I really doubt that's any kind of improvement when it comes to the complexity of maintaining tests, fixing them when they break, etc. I'd rather just write tests in C against libpq, despite the verbosity, the annoyance of maintaining them, etc.

Part of why I'm somewhat interested in seeing Python and psycopg2 added is that I'd personally quite like to see widely used client drivers covered by the buildfarm to some extent, specifically to see if we break them. Though really, it'd make sense to add the most important drivers like psqlODBC and PgJDBC first if doing that... and that'd be a whole new barrel of fun.

> and I know
> you and probably a few other people would like python, which is fair
> enough, but other people are going to want other things, and pretty
> soon we end up with a situation - if we're not already there -
> where for a developer or packager to get a machine to the point where
> it runs "all the tests" is a major undertaking.

That's the bit I find odd, I guess. It's *trivial* to add DBD::Pg on any remotely sane host. Similarly Python and psycopg2: one package each, or a very simple compile. It's nothing in comparison to setting up to build our documentation... or getting pretty much any of the basic libraries in place on Windows. (However, the streaming replication support for psycopg2 hasn't landed in mainline yet, so you can't just grab it from the PGDG repos.)

Adding more tools does not make the existing ones easier, of course, so that's not really an argument in favour. I just suspect the cost of adding them is being somewhat overblown. That said... I don't maintain a 1990s-era MIPS box, an old ARM 6, or whatever.

It's a pity in a way that the Java toolchain and build process is so heavy, because otherwise it'd make sense to just use Java and PgJDBC. PgJDBC development happens closer to core Pg than most drivers, within the main Pg community. It's also crucial for PostgreSQL's success, with huge adoption and a lot of people relying on it. JUnit is one of the least awful test frameworks I've worked with, and Java is widely understood and relatively easy to follow even if you don't know it well. Unfortunately I see around zero chance of getting a JVM, maven or ant, and a bajillion libraries onto buildfarm members.

> Accepting more tools also increases the breadth of knowledge expected
> of committers and patch authors.

That's the real problem IMO: the cognitive and knowledge load. It's exactly the same reason why I just explained to someone else (the not-UTF32 network run by the secret gov't conspiracy/aliens dude, as it turns out) why the codebase requires English identifiers, variable names, comments, etc. While I'd argue that the cost of adding Python localized to some tests is lower than the cost of adding (say) code in Russian to a contrib, it's not zero. So I guess I've already argued against myself in another form.

Still, the alternative seems to be writing new bespoke tools on top of libpq. Need something to analyse and decode a WAL stream? Write it in C on top of libpq and add it to the test infrastructure. Probably write new tests as plugins in C, too, or use an embedded Perl runtime, or XS. Or concoct yet another custom language / test description, like isolationtester, for the purpose. Those tools aren't much easier to get rid of, once added, than new languages etc. are.

Eventually it'd be good to be able to actually test things like streaming replication in an automated manner. We're going to hit the limitations of 'do some stuff and diff the output files' pretty quickly there. Adding a new scripting language isn't exactly going to fix that, though.

> Similarly with testing frameworks
> - except those are even worse, because nobody's saying that any of the
> stuff we have now will ever go away.

Yeah. I guess that's the issue, isn't it. When something gets added we're generally stuck with it forever. We still have PL/Tcl! Even if it were desirable (and ignoring the horrific backpatching pain that'd result), nobody's going to leap up and volunteer to rewrite all the Perl, like the Windows build system, so we could remove Perl. It's there essentially forever. The concern that if we added python+psycopg2 we'd be stuck with it too is more than valid.

Drat.

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
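For the record, the "coprocess over a pipeline" pattern under discussion can be sketched in a few lines of Python. To keep the sketch self-contained and runnable, a tiny line-echoing child process stands in for `psql -qAt`; everything here is illustrative, not a proposed implementation:

```python
# Poor man's database driver: a parent drives a long-lived child process,
# writing one request per line and reading one result line back. The child
# here just upper-cases its input; a real version would spawn psql and parse
# its unaligned tuple-only output.
import subprocess
import sys

# Stand-in child: echoes each input line back, upper-cased.
CHILD = ("import sys\n"
         "for line in sys.stdin:\n"
         "    print(line.strip().upper(), flush=True)\n")

class PipelineDriver:
    def __init__(self, argv):
        self.proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE, text=True)

    def query(self, sql):
        # One request line in, one result line out - exactly the fragile
        # framing that makes this a poor man's driver compared to a real one.
        self.proc.stdin.write(sql + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().strip()

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()

drv = PipelineDriver([sys.executable, "-c", CHILD])
result = drv.query("select 1")
drv.close()
```

The framing problems are immediate: multi-line results, errors written to stderr, and prompts all break the one-line-in/one-line-out assumption, which is the complexity-of-maintenance concern raised above.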
On Mon, Nov 9, 2015 at 9:57 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> For what it's worth, I don't actually like it either. However, I
> suspect that the current restricted tools are a significant impediment
> to test tooling and infrastructure improvement.

But that's exactly the argument that was made to justify adding the TAP testing framework. It didn't work?

> The alternative seems to be driving psql -qAtz as a coprocess over a
> pipeline as a poor man's database driver. I really doubt that's any
> kind of improvement when it comes to the complexity of maintaining
> tests, fixing them when they break, etc. I'd rather just write tests
> in C against libpq despite the verbosity, annoyance of maintaining
> them, etc.

I thought about an idea like this: have some sort of dummy program, and then have a module written in Perl that provides an interface to it, so that we don't have to depend on DBD::Pg. I'm not at all convinced that's a bad plan.

> Part of why I'm somewhat interested in seeing python and psycopg2
> added is that I'd personally quite like to see widely used client
> drivers covered by the buildfarm to some extent specifically to see if
> we break them. Though really, it'd make sense to add the most
> important drivers like psqlODBC and PgJDBC first if doing that.... and
> that'd be a whole new barrel of fun.

EnterpriseDB has something like this, and it does catch some bugs, but it is an unbelievable amount of work to keep it working. When something fails, it's unlikely that any individual developer has exactly the right configuration on their machine to reproduce the failure. Now maybe we could come up with something that provides similar test coverage but is less work to maintain - I'm certainly not going to rule out the possibility that it can be done better than EnterpriseDB has done it. However, I bet it will still be a lot of work, and unlike at EnterpriseDB, where we've got entire teams of people who spend major amounts of time on this kind of stuff, PostgreSQL relies on its committers to keep the buildfarm green. And if I commit something and it breaks the Win32 version of some test that is written in Python and relies on psqlODBC, there is no way in heck I'm committing myself to reproducing that environment. Even if it's Linux64 and JDBC, I'm not doing it.

If you want to build such a thing and maintain it in perpetuity, that is fine, you don't need my permission, and I and I'm sure many others will greatly appreciate both the effort and any bugs that get found and fixed that way. But when you talk about putting this kind of thing into core, you are asking the committers to be responsible for it, and I'm still not very confident we've got the TAP tests in a state where maintaining them is going to be a reasonable level of effort. Those tests failed on my machine for months, and the only reason they eventually got fixed is because I knew enough Perl to track the problem down to some screwy coding in one of the Perl modules MacPorts installs. If the same thing happened with Python, it would be extremely difficult for me to find it, because I don't actually know Python.

> Accepting more tools also increases the breadth of knowledge expected
> of committers and patch authors.
>
> That's the real problem IMO. The cognitive and knowledge load.

That's definitely a big part of it, but I think you're selling short the amount of havoc that build dependencies cause. On a Linux system where you can install whatever you need via 'yum install', sure, it's easy. It's not so easy on Windows, and it's less easy on other UNIX systems without modern package managers. But that's just the tip of the iceberg. Installing your operating system's DBD::Pg package is going to install a version of DBD::Pg compiled against the system's libpq, not the latest PostgreSQL libpq that you just built. Now perhaps, you will say, that's OK: that way we test cross-version libpq compatibility, too! And on a certain level I can't argue with that. But now you could have a situation where the tests pass for one user and fail for another even though both are compiling from the same PostgreSQL commit and the same DBD::Pg version, and that kind of stuff is a lot of work to track down. And it may not really be a bug: maybe the person who added the test intended it to exercise some new libpq feature which is indeed supported by DBD::Pg, but your build of DBD::Pg doesn't have it because it was compiled against an older version of libpq, so in your environment, the test fails.

> Similarly with testing frameworks
> - except those are even worse, because nobody's saying that any of the
> stuff we have now will ever go away.
>
> Yeah. I guess that's the issue, isn't it. When something gets added
> we're generally stuck with it forever. We still have PL/Tcl!
>
> Drat.

Yes, it sucks. I think no matter how we try to improve testing, it's likely to be a lot of work. But personally I think it's better to put that work into tools that live inside our tree, and that we have full control over, rather than adding dependencies.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company