Frequent Update Project: Design Overview of HOT Updates

Поиск
Список
Период
Сортировка
От Simon Riggs
Тема Frequent Update Project: Design Overview of HOT Updates
Дата
Msg-id 1163092396.3634.461.camel@silverbirch.site
обсуждение исходный текст
Ответы Re: Frequent Update Project: Design Overview of HOT Updates  (Martijn van Oosterhout <kleptog@svana.org>)
Re: Frequent Update Project: Design Overview of HOT Updates  (Josh Berkus <josh@agliodbs.com>)
Re: Frequent Update Project: Design Overview of HOT Updates  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Design Overview of HOT Updates
------------------------------

The objective is to increase the speed of the UPDATE case, while
minimizing the overall negative effects of the UPDATE. We refer to the
general requirement as *Frequent Update Optimization*, though this
design proposal is for Heap Overflow Tuple (HOT) Updates. It is similar
in some ways to the design for SITC already proposed, though has a
number of additional features drawn from other designs to make it a
practical and effective implementation. 

EnterpriseDB have a working, performant prototype of this design. There
are still a number of issues to resolve and the intention is to follow
an open community process to find the best way forward. All required
detail will be provided for the work conducted so far.

Current PGSQL behaviour is for UPDATEs to create a new tuple version
within the heap, so acts from many perspectives as if it were an INSERT.
All of the tuple versions are chained together, so that whichever of the
tuples is visible to your Snapshot, you can walk the chain to find the
most recent tuple version to update.

The HOT proposal splits the heap into two areas: the main heap and an
overflow relation, both of which are regular, transactional relations.
INSERT and DELETE effect only the main heap exactly as they do now.
UPDATE places new tuple versions into the overflow relation in most
cases, maintaining all of the current capabilities of the MVCC model:
all tuple versions remain accessible, plus the chain of updated tuple
versions is maintained. 

UPDATEs do not insert into indexes at all - no index pointers exist at
all into the overflow relation. So during a stream of UPDATEs the main
heap and indexes stay the same size, while the indexes are used
read-only, avoiding additional database writes and the need for eventual
index rebuilding.

The current tuple version within the main heap is referred to as the
"root tuple" from now on. When reading the main heap, if the root tuple
is not visible we "walk the chain" into the overflow relation until we
find a tuple version that is visible - we follow the t_ctid pointer from
tuple to tuple, testing for MVCC visibility as we go.

As more UPDATEs take place these tuple chains would grow, making
locating the latest tuple take progressively longer. Intuitively this
sounds like a bad design, though an additional feature turns that from
bad to good performance:

- when anybody walks the chain into the overflow relation, if we find
that the root tuple is vacuumable we copy back the first visible tuple
version over the now-dead root tuple.

This allows the length of a typical tuple chain to be extremely short in
practice. For a single connection issuing a stream of UPDATEs the chain
length will no more than 1 at any time. Even under heavy concurrent
UPDATEs the modal chain length remains 1, with the chain length varying
in approximately Poisson distribution.

The overflow relation is similar in concept to a TOAST table, so we
might describe this approach as TOAST-for-updated-versions. The code
implementation is similar also, with reasonably similar modularity. HOT
does sound radical, but no more so than TOAST was when first discussed.

We can locate any tuple version and thereby allow MVCC to work
correctly, while at the same time preserving the crucial property of the
Postgres non-overwriting storage manager. This works very well for
indexed tuple access since the index and main heap never grow, thus
maintaining access speeds. HOT supports both Read Committed and
Serializable transaction isolation levels, with transactions of any
length, just as with current MVCC.

Performance results against the previously described test cases are
approximately:

1. pgbench (TPC-B) - Around 200-300% better

2. DBT-2 (TPC-C) - Signficicant improvement - possibly 10% improvement,
somewhat difficult to measure exactly because of the way the test itself
works.

3. truckin - Around 500% improvement

The performance gain remains better than REL8_2 base even in the
presence of longer running transactions.

[There is also considerable synergy with changes to b-tree indexes that
are both useful in their own right, and even better when HOT is added,
more on that on a separate thread.]

We expect that people will want to measure this for themselves, so we
don't publish all of the detailed internal tests here since YMMV.

Using HOT
---------

HOT can only work in cases where a tuple does not modify one of the
columns defined in an index on the table, and when we do not alter the
row length of the tuple. [We'll be able to do that more efficiently when
we have plan invalidation]

If we perform an update that meets the HOT criteria then we put the
new version into the overflow relation; we describe this as a HOT
UPDATE. If we perform an update that does not meet the criteria, then we
carry on with the existing/old MVCC behaviour; we describe this as a
non-HOT UPDATE.

So with HOT we must support both HOT and non-HOT UPDATEs. Each tuple
whether it is in the main heap or the overflow relation can point to
another tuple with the t_ctid, which is extended by using a previously
unused bit to flag whether the destination is a block from the main heap
or the overflow relation.

Just to reiterate, since a HOT update doesn't touch index values that
means that all members of a tuple chain have identical primary keys and
index values. So we only ever need to grovel through the overflow
relation *if* we have already met the index search criteria - so we
don't have to re-check the index criteria for each tuple version we
touch. All normal updates of a table will be read-only on the index,
plus writes to heap/overflow.

Simple Example of Tuple Chaining
--------------------------------

Lets look at a simple case:

1. INSERT INTO table (1,....)
Tuple is inserted into main heap (Version1 or V1)
  Heap Page P1            Overflow Relation  ----------------        -----------------  |              |        |
       |  |     ______   |        |               |  |    |  V1  |  |        |               |  |    |______|  |
|              |  |              |        |               |  |--------------|        |---------------|    
 
2. UPDATE table SET nonindexcol = 6 WHERE pkcol = 1;
Same length tuple, so V2 written using overflow
  Heap Page P1             Overflow Relation  ----------------        -----------------  |              |        |
______     |  |     ______   |   |------>|  V2  |     |  |    |  V1  |------|    |  |______|     |  |    |______|  |
   |               |  |              |        |               |  |--------------|        |---------------|    
 
3. UPDATE table SET nonindexcol = 7 WHERE pkcol = 1;
Same length tuple, so V3 written using overflow
We update v2.t_ctid to maintain the forward tuple chain
  Heap Page P1             Overflow Relation  ----------------        -----------------  |              |        |
______     |  |     ______   |   |------>|  V2  |     |  |    |  V1  |------|    |  |______|--|  |  |    |______|  |
   |   ______   |  |  |              |        |  |  V3  |<-|  |  |              |        |  |______|     |
|--------------|       |---------------|
 


Simple Example of Copying-back
------------------------------

4. UPDATE table SET nonindexcol = 8 WHERE pkcol = 1;
Same length tuple, so V4 will be written using overflow.
While selecting for UPDATE, we look at V1, we see it is now vacuumable,
so we first copy back the first visible tuple over V1 before proceeding.
[Assume V1 is dead, V2 is still visible]
  Heap Page P1             Overflow Relation  ----------------        -----------------  |              |        |
______     |  |     ______   |        |  |  V2* |     |  |    |  V2  |------|    |  |______|     |  |    |______|  |
|   |   ______      |  |              |   |------>|  V3  |     |  |              |        |  |______|     |
|--------------|       |---------------| 

The copy back operation leaves V2* as the now-unwritable old copy of V2.
We update it also to break the chain to V3. Now orphaned, it can be
removed later.

5. Now we continue with the UPDATE as planned
Same length tuple, so V4 will be written using overflow.
We update v3.t_ctid to maintain the forward tuple chain
  Heap Page P1             Overflow Relation  ----------------        -----------------  |              |        |
______     |  |     ______   |        |  |  V2* |     |  |    |  V2  |------|    |  |______|     |  |    |______|  |
|   |   ______      |  |              |   |------>|  V3  |     |  |              |        |  |______|--|  |  |
   |        |   ______   |  |  |              |        |  |  V4  |<-|  |  |              |        |  |______|     |
|--------------|       |---------------|
 
    
Copying Back
------------

The copy back operation is required to keep the tuple chains short and
efficient. When we copy back we leave a tuple version and possibly a
long part of the tuple chain orphaned. These are then marked for removal
by the next VACUUM.

To support copying back, we add an additional header fields onto
every tuple in the overflow relation (only). We generate WAL
appropriately for this action.

VACUUMing becomes significantly easier with HOT than with non-HOT
UPDATEs. Specifically, since there are no indexes on the overflow
relation we should be able to VACUUM it more easily and possibly
piece-by-piece (i.e. "retail vacuum").

Behavioural Characteristics
---------------------------

With HOT, it is easily possible that the chain of prior versions spans
many blocks. The chain always starts with the block of the root tuple
but possibly includes many overflow relation blocks.

A SeqScan of a HOT table will turn into a large number of
random accesses to the overflow relation, which could be considerably
more expensive than sequential I/O. Clearly very large tables would not
be able to be HOT enabled without additional cost, so we make HOT a
table-level option: WITH (freqUPDATE = on)   [discuss...]

This means that a SeqScan of a heap with heavy concurrent update
activity *could* be fairly costly. Most times it won't be, because the
appropriate overflow relation blocks will most likely be in-memory. That
is the price we pay for having this optimization - a suitable warning
would apply, though this would not be a surprise to anybody since Oracle
people know that reading heavily-updated data takes more CPU and
DB2/SQLServer know that it takes ages to get around each of the row
level locks. So this isn't bad *at all* in comparison.

We hope/expect/recommend HOT to be used in conjunction with UPDATE
RETURNING, so that fewer SELECTs and non-index operations will be used.

[There's an extra prize if you can think of a *correct* way of
SeqScanning with HOT more efficiently; we've not focused on that yet.]

In the presence of a long running transaction, the overflow relation
could become very large since VACUUM becomes less effective. (This same
situation occurs with current MVCC). The overflow relation will then
start to page out to disk and could grow large. Oracle has this problem
also and this is why Oracle 10g has moved away from fixed-size Rollback
Segments to the use of an overflow Tablespace. Some size controls are
available to limit the growth there. That would be a useful optimization
for HOT also and one much easier than any such enhancement of existing
MVCC.

Next Steps
----------

Deeper technical details will be published shortly; the design goes much
deeper than the discussion here, but we wanted to present the overall
summary first.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Simon Riggs"
Дата:
Сообщение: Re: Introducing an advanced Frequent Update Optimization
Следующее
От: Martijn van Oosterhout
Дата:
Сообщение: Re: Frequent Update Project: Design Overview of HOT Updates