Обсуждение: Reduction in WAL for UPDATEs
It seems possible to reduce overall WAL volume by roughly 25% on common
workloads by optimising the way UPDATE statements generate WAL. I've got
a practical proposal that looks like it can be completed in a few more
days work and fits very well with HOT UPDATEs.
This is likely to have beneficial performance effects for many
workloads, as well as having clear benefits for archive/backup by
reducing overall WAL volume.
Background
----------
A recent profile of the sources of WAL on a major OLTP benchmark:
- records is the count of WAL records of that type
- % records is the overall distribution by record type
- avg space is per record
- % space is the overall distribution by WAL volume
         tag           | records | % records | avg space | % space 
------------------------+---------+-----------+-----------+---------HEAP_UPDATE            |  347822 |     25.91 |
298.72|   53.30BTREE_INSERT_LEAF      |  586577 |     43.70 |     68.01 |   20.47HEAP_INSERT            |  184712 |
13.76|    122.91 |   11.65BTREE_SPLIT_R          |    1435 |      0.11 |   7086.18 |    5.22HEAP_LOCK              |
159959|     11.92 |     58.77 |    4.82HEAP_UPDATE+INIT       |   12157 |      0.91 |    414.92 |    2.59XACT_COMMIT
       |   29531 |      2.20 |     40.00 |    0.61BTREE_DELETE           |    2806 |      0.21 |    379.45 |
0.55HEAP_DELETE           |   13276 |      0.99 |     46.16 |    0.31BTREE_SPLIT_L          |      75 |      0.01 |
7234.32|    0.28HEAP_INSERT+INIT       |    2305 |      0.17 |    126.69 |    0.15BTREE_INSERT_UPPER     |    1497 |
 0.11 |     73.96 |    0.06BTREE_SPLIT_L_ROOT     |       1 |      0.00 |   6882.00 |    0.00CLOG_ZEROPAGE          |
   1 |      0.00 |     32.00 |    0.00XLOG_CHECKPOINT_ONLINE |       6 |      0.00 |     72.00 |    0.00BTREE_NEWROOT
      |       1 |      0.00 |     76.00 |    0.00XACT_ABORT             |     155 |      0.01 |     40.00 |    0.00
 
(17 rows)
[Thanks to Greg]
UPDATEs clearly generate most of the WAL. In this benchmark and in many
other cases, around 10-25% of the data on a row changes at each UPDATE,
which would be sufficient to reduce the overall WAL volume by 30-40%.
Not all of that potential is practically realisable, see below.
[Note also that this profile was performed *before* Heikki's recent
optimisation of WAL generated during b-tree splits.]
Concept
-------
Previously, we've discussed the possibility that WAL can be reduced when
performing UPDATEs. Currently, when we UPDATE a row we send the whole
new row to WAL, whether or not the columns have changed.
If we send only the changed data to WAL then we would need to be able to
reconstruct the new tuple during recovery. The changed data is referred
to as a "tuple diff" or just diff.
Reconstruction would read the old tuple and apply the diff to construct
a new tuple, which would then be applied to the data block. This is only
practical when the old and new tuple are on the same block, since
otherwise this would be likely to increase the I/O required to perform
recovery - which is already I/O bound. So this fits nicely with HOT and
builds upon the knowledge that with HOT most UPDATEs occur on the same
block.
Reconstruction is complicated by the absence of a relcache that
accurately describes the TupleDesc at the exact point of the UPDATE. 
However, the tupledesc is available when we *create* the diff, so we can
take advantage of that to make sure we optimise for NULL <-> not NULL
transitions during UPDATE. So we use the tupledesc to construct a
tupledesc-agnostic format which is a set of byte-by-byte instructions
rather than being specified in terms of attributes or custom types.
The diff instructions allow us to express operations very simply.
e.g.
CREATE TABLE foo (col1 integer, col2 integer,     col3 varchar(50), col4 varhcar(50));
INSERT INTO foo (1, 1, repeat('abc',50));
UPDATE foo SET col2 = 100 WHERE col1 = 1;
will generate diff instructions (assuming 4 byte alignment for now)
COPY    4    (bytes from old to new tuple)
IGNORE    4    (bytes on old tuple)
ADD    4    (bytes from new tuple)
COPY    156    (bytes from old to new tuple)
These diff instructions are based upon the concept of delta instructions
from the VCDIFF standard. http://www.faqs.org/rfcs/rfc3284.html
This is tailored to this specific implementation, so although we use
delta instructions we don't use the VCDIFF format since it caters for
longer data which in our case will already have been TOASTed away.
With a terse instruction set the diff format can encode the diff
instructions in a small number of bytes, considerably reducing the WAL
volume yet without requiring complex diffing algorithms. 
The simplicity of the diff algorithm is important because this
introduces additional CPU and potentially contention also, since the
diff is calculated while the block is locked. As a result, it is
proposed that the diff would only be calculated when the new tuple
length is in excess of a hard-coded limit, probably 128 bytes - a very
simple heuristic. That seems high, but the current WAL volume is
disproportionately generated by very wide tables and this would give a
balance between overhead of generating the diff and the overhead from
the WAL volume itself.
Discussion Points:
- diff starts at 128 bytes? - (no additional parameters suggested)
- diff still taken whether or not we do full_page_writes
Implementation
--------------
Implement a new XLOG_HEAP2_UPDATE_DIFF xlog record and appropriate
handling, rather than attempting to augment the existing
XLOG_HEAP_UPDATE code since that also handles MOVE.
I'm expecting there to be about a 100 suggestions for how to optimise
the encoding, so please remember the balance between encoding
time/effect and the WAL compression ratio. The good thing about the use
of the diff format is that it neatly separates creating the diff from
decoding it.
Initially, my suggestion is to do column-by-column byte comparison to
establish which bytes have changed. Its possible to enhance that to do
sub-attribute diffs, but I'm not suggesting that in this initial
proposal.
Alignment is not an issue for creating or using the diff.
When there is no BEFORE trigger, we can optimise that further by
examining only the columns mentioned in the SET clause. That would
further speed up the diff, but its fairly fiddly and I haven't spent
time yet looking into that possibility. (Suggestions welcome!).
No performance testing has been performed as yet since I haven't written
much of the code so far.
Diff Format
-----------
The diff algorithm requirements seem to be:
- accuracy (must be non-lossy)
- speed
- space saving
- cope well with PostgreSQL var length changing values and NULLs
Looking at VCDIFF format and thinking some, I need the following
primitives in the diff storage format:
- COPY - copy the next N bytes from old to new record
- ADD - add the Record byte at the current new record location
- IGNORE - ignore the next N bytes of the old record, nothing copied
The format is similar, but not exactly same as HRL encoded bitmap data
using a header and a body array.
A header has length, number of instructions and a set of indicator bits.
The indicator bit specifies whether the instruction is an ADD or not.
ADD    1 set in header bit
uint8    num bytes to insert into new tupledata bytes
COPY    0 set in header bit
uint8    lo 7 bits number of bytes to COPY from old to new    hi bit = 1 to indicate COPY
IGNORE    0 set in header bituint8    lo 7 bits number of bytes to IGNORE in old tuple    hi bit = 0 to indicate
IGNORE
In the above example shown earlier we would store only 4 instruction
bytes, plus 4 changed data bytes, plus 8 byte diff header = 16 bytes,
rather than the 156 bytes.
Note that we are diffing the tuple body, though storing the tuple header
in full. We can tweak that some more in the future, maybe. Note also hat
the format is very terse and the words ADD, COPY and IGNORE are
conceptual only.
Adjacent columns that have similar instructions can be represented by a
single instruction, i.e. 4 integer columns that will be COPYd to the new
tuple can be represented as COPY 16. As a result of that, I'm proposing
that the header has sufficient instruction bits for at most 32
instructions. That helps keep the format simple, but it also allows us
to introduce a further heuristic: skip calculating complex diffs because
they probably aren't going to save much space anyhow.
Very long columns may need multiple instructions to represent them, but
that wastes very little space and should not present a problem.
Comments, with a view to immediate implementation?
--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com
			
		"Simon Riggs" <simon@2ndquadrant.com> writes:
> It seems possible to reduce overall WAL volume by roughly 25% on common
> workloads by optimising the way UPDATE statements generate WAL.
This seems a huge amount of work to optimize *one* benchmark.  If it
weren't so narrowly focused on the properties of a particular benchmark
(mostly UPDATE, mostly a few columns in wide tuples), I'd get more
excited.  The extra time spent holding exclusive lock on the page
doesn't sound attractive either ...
        regards, tom lane
			
		On Tue, 2007-03-27 at 20:48 -0400, Tom Lane wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > > It seems possible to reduce overall WAL volume by roughly 25% on common > > workloads by optimising the way UPDATE statements generate WAL. > > This seems a huge amount of work to optimize *one* benchmark. Please don't beat me with that. I wouldn't suggest it if I didn't think it would help real users. The analysis of the WAL volume was done using a benchmark, but only as a guide to indicate likely usage patterns. There aren't many real world heavy UPDATE scenarios to analyze right now because people have previously actively avoided such usage. > If it > weren't so narrowly focused on the properties of a particular benchmark > (mostly UPDATE, mostly a few columns in wide tuples), I'd get more > excited. Updating the current balance on a Customer Account is one of the main focus areas for HOT. Those records are typically at least 250 bytes long, so we can save ~200 bytes of WAL per UPDATE for the most frequent types of UPDATE. Sure, not all UPDATEs would be optimised, but then they are much less frequent. As I mentioned, the WAL volume is disproportionately generated by UPDATEs of longer rows, so optimising WAL for just a few tables can make a big difference to the overall volume. > The extra time spent holding exclusive lock on the page > doesn't sound attractive either ... Agreed, thats why I set a fairly high bar for when this would kick in. The fewer rows on a page, the less contention. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Wed, Mar 28, 2007 at 08:07:14AM +0100, Simon Riggs wrote: > On Tue, 2007-03-27 at 20:48 -0400, Tom Lane wrote: > > "Simon Riggs" <simon@2ndquadrant.com> writes: > > > It seems possible to reduce overall WAL volume by roughly 25% on common > > > workloads by optimising the way UPDATE statements generate WAL. > > > > This seems a huge amount of work to optimize *one* benchmark. > > Please don't beat me with that. I wouldn't suggest it if I didn't think > it would help real users. The analysis of the WAL volume was done using > a benchmark, but only as a guide to indicate likely usage patterns. > There aren't many real world heavy UPDATE scenarios to analyze right now > because people have previously actively avoided such usage. > > > If it > > weren't so narrowly focused on the properties of a particular benchmark > > (mostly UPDATE, mostly a few columns in wide tuples), I'd get more > > excited. > > Updating the current balance on a Customer Account is one of the main > focus areas for HOT. Those records are typically at least 250 bytes > long, so we can save ~200 bytes of WAL per UPDATE for the most frequent > types of UPDATE. Sure, not all UPDATEs would be optimised, but then they > are much less frequent. > > As I mentioned, the WAL volume is disproportionately generated by > UPDATEs of longer rows, so optimising WAL for just a few tables can make > a big difference to the overall volume. > > > The extra time spent holding exclusive lock on the page > > doesn't sound attractive either ... > > Agreed, thats why I set a fairly high bar for when this would kick in. > The fewer rows on a page, the less contention. > We use DSPAM as one of our anti-spam options. Its UPDATE pattern is to increment a spam counter or a not-spam counter while keeping the user and token information the same. This would benefit from this optimization. Currently we are forced to use MySQL with MyISM tables to support the update load, although PostgreSQL 8.2 performance is right at the I/O break-even point for switching databases. With HOT and more optimized UPDATE I/O, 8.3 would give us enough I/O headroom to switch to PostgreSQL. Ken Marshall
Kenneth Marshall <ktm@rice.edu> writes:
> We use DSPAM as one of our anti-spam options. Its UPDATE pattern is to
> increment a spam counter or a not-spam counter while keeping the user and
> token information the same. This would benefit from this optimization.
Would it?  How wide is the "user and token" information?
        regards, tom lane
			
		Kenneth Marshall wrote: > We use DSPAM as one of our anti-spam options. Its UPDATE pattern is to > increment a spam counter or a not-spam counter while keeping the user and > token information the same. This would benefit from this optimization. > Currently we are forced to use MySQL with MyISM tables to support the > update load, although PostgreSQL 8.2 performance is right at the I/O > break-even point for switching databases. With HOT and more optimized > UPDATE I/O, 8.3 would give us enough I/O headroom to switch to PostgreSQL. Interesting. I've switched from MySQL to PostgreSQL for dspam, because of concurrency issues with MyISAM which caused bad performance. I am eager to see how much HOT speeds of my setup, though ;-) BTW, the "COMMIT NOWAIT" feature Simon Riggs proposed should provide a huge speedup too, since dspam runs one transaction for each token it has to update. greetings, Florian Pflug
Kenneth Marshall <ktm@rice.edu> writes:
> On Wed, Mar 28, 2007 at 09:46:30AM -0400, Tom Lane wrote:
>> Would it?  How wide is the "user and token" information?
> Sorry about the waste of time. I just noticed that the proposal is
> only for rows over 128 bytes. The token definition is:
> CREATE TABLE dspam_token_data (
>   uid smallint,
>   token bigint,
>   spam_hits int,
>   innocent_hits int,
>   last_hit date,
> );
> which is below the cutoff for the proposal.
Yeah, this illustrates my concern that the proposal is too narrowly
focused on a specific benchmark.
        regards, tom lane
			
		On Wed, 2007-03-28 at 15:51 +0200, Florian G. Pflug wrote: > BTW, the "COMMIT NOWAIT" feature Simon Riggs proposed should provide > a huge speedup too, since dspam runs one transaction for each token > it has to update. I've switched to doing the COMMIT NOWAIT as a priority now, but do plan to do both for 8.3. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Yeah, this illustrates my concern that the proposal is too narrowly > focused on a specific benchmark. A lot of the recently proposed changes don't really fit in the "optimizations" category very well at all. I think of them more as "avoiding pitfalls". Currently Postgres works quite well if your application is designed around its performance profile. But as soon as you do something "strange" you run the risk of running into various pitfalls. If you keep a long-running transaction open you suddenly find your tables bloating. If your table grows too large vacuum takes too long to complete and your tables bloat. If you update the same record many times instead of batching updates and performing a single update your table bloats. This one is similar, if you keep a bunch of static data attached to some small dynamic data your WAL and table bloats. Certainly you could have engineered your system not to fall into this pitfall, but only if you knew about it and only if it was worth the effort and other possible costs of doing so. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
> 
>> Yeah, this illustrates my concern that the proposal is too narrowly
>> focused on a specific benchmark.
> 
> A lot of the recently proposed changes don't really fit in the "optimizations"
> category very well at all. I think of them more as "avoiding pitfalls".
> 
> Currently Postgres works quite well if your application is designed around its
> performance profile. But as soon as you do something "strange" you run the
> risk of running into various pitfalls.
I would go a step further. Once you get into real life scenarios with 
real life work loads, you run into various pitfalls.
> 
> If you keep a long-running transaction open you suddenly find your tables
> bloating. If your table grows too large vacuum takes too long to complete and
> your tables bloat. If you update the same record many times instead of
> batching updates and performing a single update your table bloats. 
Long-running transaction is a big problem. I wish I knew how to solve it.
> 
> This one is similar, if you keep a bunch of static data attached to some small
> dynamic data your WAL and table bloats. Certainly you could have engineered
> your system not to fall into this pitfall, but only if you knew about it and
> only if it was worth the effort and other possible costs of doing so.
> 
It seems to me the solution could be useful. We have lots of tables that 
fall into the category that the test table presented.
Sincerely,
Joshua D. Drake
-- 
      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997             http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
			
		Gregory Stark <stark@enterprisedb.com> writes:
> A lot of the recently proposed changes don't really fit in the
> "optimizations" category very well at all. I think of them more as
> "avoiding pitfalls".
Well, we can't put a major amount of complexity into the system for
each possible "pitfall".
> This one is similar, if you keep a bunch of static data attached to
> some small dynamic data your WAL and table bloats.
Actually, PG does extremely well on that in the situation where the
static data is *really* wide, ie, wide enough to be toasted out-of-line.
Simon's proposal will only help for an intermediate range of situations
where the row is wide but not very wide.
It strikes me that a more useful solution might come from the recent
discussions about offering more user control of per-column toasting
decisions.  Or maybe we just need to revisit the default toast
thresholds --- AFAIR there has never been any significant study of
the particular values that Jan picked originally.
        regards, tom lane
			
		On Wed, 2007-03-28 at 10:51 -0400, Tom Lane wrote: > Kenneth Marshall <ktm@rice.edu> writes: > > On Wed, Mar 28, 2007 at 09:46:30AM -0400, Tom Lane wrote: > >> Would it? How wide is the "user and token" information? > > > Sorry about the waste of time. I just noticed that the proposal is > > only for rows over 128 bytes. The token definition is: > > > CREATE TABLE dspam_token_data ( > > uid smallint, > > token bigint, > > spam_hits int, > > innocent_hits int, > > last_hit date, > > ); > > > which is below the cutoff for the proposal. More to the point this looks like it has already been optimised to reduce the row length on a heavily updated table. > Yeah, this illustrates my concern that the proposal is too narrowly > focused on a specific benchmark. Not really. I specifically labelled that recommendation as a discussion point, so if you don't like the limit, please say so. My reasoning for having a limit at all is that block contention goes up at least as fast as the inverse of row length since the best case is when rows are randomly distributed and updated. What other aspect of the proposal has anything whatsoever to do with this single benchmark you think I'm over-fitting to? You and I discussed this in Toronto, so I'm surprised by your comments. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Tue, 2007-03-27 at 20:48 -0400, Tom Lane wrote: >> "Simon Riggs" <simon@2ndquadrant.com> writes: >>> It seems possible to reduce overall WAL volume by roughly 25% on common >>> workloads by optimising the way UPDATE statements generate WAL. >> This seems a huge amount of work to optimize *one* benchmark. > > Please don't beat me with that. I wouldn't suggest it if I didn't think > it would help real users. The analysis of the WAL volume was done using > a benchmark, but only as a guide to indicate likely usage patterns. > There aren't many real world heavy UPDATE scenarios to analyze right now > because people have previously actively avoided such usage. > >> If it >> weren't so narrowly focused on the properties of a particular benchmark >> (mostly UPDATE, mostly a few columns in wide tuples), I'd get more >> excited. > As a reference unless there is some further restriction I'm not understanding I've seen a lot of scenarios with this profile. An online multiplayer game, around 60,000 active users (out of 250k registered) did 6m page view per day, most dynamic. Most of the interactions between players, or between players and objects in the system resulted in updates to tables perhaps 512-1kb wide. A fair number of strings etc. Session information was tracked for active users, similar description but even more updates. I'd echo the other poster, that at the time we didn't fully know postgresql's performance profile (this was 6 years ago now) to code to, and ended up using MySQL because it worked and deadlines were tight. I can think of a lot of other scenarios as well that would be helped. Echoing another poster, it would be nice to round of a couple of the gotcha edges. This may be one of them. - August
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Actually, PG does extremely well on that in the situation where the > static data is *really* wide, ie, wide enough to be toasted out-of-line. > Simon's proposal will only help for an intermediate range of situations > where the row is wide but not very wide. The reason I think this is idea is exciting is that later I would suggest applying it to HOT updates. Having to keep a spare tuple's worth of space in every page is pretty annoying. But if we could get by with the average half-tuple dead space to do an update-- or even several updates--it would make a huge difference. > It strikes me that a more useful solution might come from the recent > discussions about offering more user control of per-column toasting > decisions. Or maybe we just need to revisit the default toast > thresholds --- AFAIR there has never been any significant study of > the particular values that Jan picked originally. I agree that these values need a second look. I think a TOAST_TUPLE_THRESHOLD well smaller than the current value would still easily pay its way. With a little caution to avoid wasting too much effort on the last few bytes I suspect even as low as 400-500 bytes is probably worthwhile. Also, it may make sense to take into account what percentage of the overall tuple a field is. It doesn't make much sense to start toasting fields in a table that consists of fourty 40-byte varcahars for example. Whereas it probably does make sense to toast a single 500-byte varchar in a table if the rest of the table consists of just ten integers. But considering how large the toast pointer itself is, how expensive it is to fetch it, and that we need one for each attribute, it still won't be able to usefully handle anything under 32 bytes or so. That's still a lot more than a single byte indicating that the field is unchanged. It's not uncommon to have tables that are wide because they have lots of small data in them. In fact looking at this code now, is there a bug here? I don't see anything in there stopping us from trying to toast varlenas that are smaller than a toast pointer. Have I just missed something obvious? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> The reason I think this is idea is exciting is that later I would suggest
> applying it to HOT updates. Having to keep a spare tuple's worth of space in
> every page is pretty annoying. But if we could get by with the average
> half-tuple dead space to do an update-- or even several updates--it would make
> a huge difference.
Uh ... what?  This proposal is about reducing WAL volume, not about
changing the live data storage.
>> Or maybe we just need to revisit the default toast
>> thresholds --- AFAIR there has never been any significant study of
>> the particular values that Jan picked originally.
> I agree that these values need a second look. I think a TOAST_TUPLE_THRESHOLD
> well smaller than the current value would still easily pay its way. With a
> little caution to avoid wasting too much effort on the last few bytes I
> suspect even as low as 400-500 bytes is probably worthwhile.
Maybe.  One thing I was just thinking about is that it's silly to have
the threshold constrained so strongly by a desire that tuples in toast
tables not be toastable.  It would be trivial to tweak the heapam.c
routines so that they simply don't invoke the toaster when relkind is
't', and then we could have independent choices of toast-tuple size and
main-tuple size.  This would be particularly good because in the current
scheme you can't modify toast-tuple size without an initdb, but if that
were decoupled there'd be no reason not to allow changes in the
main-tuple thresholds.
> In fact looking at this code now, is there a bug here? I don't see anything in
> there stopping us from trying to toast varlenas that are smaller than a toast
> pointer. Have I just missed something obvious?
Note the initialization of biggest_size.
        regards, tom lane
			
		> > I agree that these values need a second look. I think a > > TOAST_TUPLE_THRESHOLD well smaller than the current value would still > > easily pay its way. With a little caution to avoid wasting too much > > effort on the last few bytes I suspect even as low as > 400-500 bytes is probably worthwhile. But a seq scan (or non cached access) would suddenly mutate to multiple random accesses, so this is not a win-win situation. Btw: Do we consider the existance of toasted columns in the seq-scan cost estimation ? Andreas
"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes:
> Btw: Do we consider the existance of toasted columns in the seq-scan
> cost estimation ?
Not at present.  There was some discussion of this but it seems like
a fair amount of work --- we don't currently track statistics on how
many of a column's entries are toasted or how big they are.  For that
matter it would be entirely unreasonable to pin the cost on "seq scan";
you'd need to look in close detail at exactly where and how the Vars get
used, and distinguish simply copying a Var from actual use of its value.
The planner is mostly uninterested in the evaluation costs of plan node
targetlists at the moment, and IIRC that's something not so easily
changed...
        regards, tom lane
			
		On Wed, 2007-03-28 at 11:17 -0400, Tom Lane wrote: > > This one is similar, if you keep a bunch of static data attached to > > some small dynamic data your WAL and table bloats. > > Actually, PG does extremely well on that in the situation where the > static data is *really* wide, ie, wide enough to be toasted out-of-line. > Simon's proposal will only help for an intermediate range of situations > where the row is wide but not very wide. Trouble is, thats lots of commonly updated tables. Thin tables generate only small amounts of WAL, while very wide tables are optimised for UPDATE already. But right now most relational tables that represent Major Entities, i.e. objects in the real world, have row lengths in the range 100-2000 bytes. Page hit counters, Customer Accounts, Financials-to-date, Event bookings, Seats sold. > It strikes me that a more useful solution might come from the recent > discussions about offering more user control of per-column toasting > decisions. Or maybe we just need to revisit the default toast > thresholds --- AFAIR there has never been any significant study of > the particular values that Jan picked originally. That's effectively definable vertical partitioning. The user has to know about this and do something about it themselves. By default we compress before we move out, so you'd need to know that also. I've investigated that route briefly but there does seem to be a surprising overhead in splitting off small pieces of data. Currently TOAST seems to be optimised for when we have more than one chunk of data. I agree there's something worth looking at there, but even so I don't see a direct correspondence between seldom updated (=> WAL reduction is beneficial) and seldom used (=> TOAST is beneficial) columns. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com