Heap truncation without AccessExclusiveLock (9.4)

Truncating a heap at the end of vacuum, to release unused space back to
the OS, currently requires taking an AccessExclusiveLock. Although it's 
only held for a short duration, it can be enough to cause a hiccup in 
query processing while it's held. Also, if there is a continuous stream 
of queries on the table, autovacuum never succeeds in acquiring the 
lock, and thus the table never gets truncated.

I'd like to eliminate the need for AccessExclusiveLock while truncating.

Design
------

In shared memory, keep two watermarks: a "soft" truncation watermark, 
and a "hard" truncation watermark. If there is no truncation in 
progress, the values are not set and everything works like today.

The soft watermark is the relation size (ie. number of pages) that 
vacuum wants to truncate the relation to. Backends can read pages above 
the soft watermark normally, but should refrain from inserting new 
tuples there. However, it's OK to update a page above the soft 
watermark, including adding new tuples, if the page is not completely 
empty (vacuum will check and not truncate away non-empty pages). If a 
backend nevertheless has to insert a new tuple to an empty page above 
the soft watermark, for example if there is no more free space in any 
lower-numbered pages, it must grab the extension lock, and update the 
soft watermark while holding it.

The hard watermark is the point above which there are guaranteed to be no 
tuples. A backend must not try to read or write any pages above the hard 
watermark - it should be thought of as the end of file for all practical 
purposes. If a backend needs to write above the hard watermark, ie. to 
extend the relation, it must first grab the extension lock, and raise 
the hard watermark.

The hard watermark is always >= the soft watermark.
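
To illustrate, the relation-extension path might look something like the 
sketch below. RaiseHardWatermark() is a made-up helper for the purposes of 
this sketch; LockRelationForExtension(), UnlockRelationForExtension() and 
RelationGetNumberOfBlocks() are the existing functions.

/*
 * Sketch only: extend a relation while a truncation might be in progress.
 * RaiseHardWatermark() is hypothetical.
 */
static BlockNumber
ExtendRelationWithWatermark(Relation rel)
{
    BlockNumber newblk;

    /* The extension lock serializes changes to the hard watermark. */
    LockRelationForExtension(rel, ExclusiveLock);

    newblk = RelationGetNumberOfBlocks(rel);

    /*
     * Raise the hard watermark past the new block before anyone can see
     * it, so that an in-progress truncation won't cut it off.
     */
    RaiseHardWatermark(rel, newblk + 1);

    /* ... physically extend the relation and initialize the new page ... */

    UnlockRelationForExtension(rel, ExclusiveLock);

    return newblk;
}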

Shared memory space is limited, but we only need the watermarks for any 
in-progress truncations. Let's keep them in shared memory, in a small 
fixed-size array. That limits the number of concurrent truncations that 
can be in-progress, but that should be ok. To not slow down common 
backend operations, the values (or lack thereof) are cached in relcache. 
To sync the relcache when the values change, there will be a new shared 
cache invalidation event to force backends to refresh the cached 
watermark values. A backend (vacuum) can ensure that all backends see 
the new value by first updating the value in shared memory, sending the 
sinval message, and waiting until everyone has received it.
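
Roughly, I'm thinking of something like the following for the shared memory 
state and the vacuum-side update. All the names here are made up for the 
sketch, and the "wait until everyone has processed the sinval message" 
primitive is the piece that doesn't exist yet:

/*
 * Hypothetical shared-memory bookkeeping for in-progress truncations.
 * One slot per concurrent truncation; a fixed-size array keeps the
 * shared memory footprint constant.
 */
#define MAX_CONCURRENT_TRUNCATIONS 8    /* arbitrary, for illustration */

typedef struct TruncationWatermark
{
    Oid         relid;      /* InvalidOid if the slot is free */
    BlockNumber soft;       /* don't insert into empty pages >= this */
    BlockNumber hard;       /* no page >= this may be read or written */
    slock_t     mutex;      /* protects the two watermarks */
} TruncationWatermark;

typedef struct TruncationShmemStruct
{
    TruncationWatermark slots[MAX_CONCURRENT_TRUNCATIONS];
} TruncationShmemStruct;

/*
 * Hypothetical helper used by vacuum: publish a new soft watermark and
 * wait until every backend has refreshed its relcache copy.
 * LookupTruncationSlot() and WaitForSinvalCatchup() are invented names;
 * CacheInvalidateRelcache() is the existing sinval machinery.
 */
static void
SetSoftWatermarkAndWait(Relation rel, BlockNumber new_soft)
{
    TruncationWatermark *wm = LookupTruncationSlot(RelationGetRelid(rel));

    SpinLockAcquire(&wm->mutex);
    wm->soft = new_soft;
    SpinLockRelease(&wm->mutex);

    CacheInvalidateRelcache(rel);   /* existing sinval machinery */
    WaitForSinvalCatchup();         /* hypothetical: new primitive */
}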

With the watermarks, truncation works like this:

1. Set soft watermark to the point where we think we can truncate the 
relation. Wait until everyone sees it (send sinval message, wait).

2. Scan the pages to verify they are still empty.

3. Grab extension lock. Set hard watermark to current soft watermark (a 
backend might have inserted a tuple and raised the soft watermark while 
we were scanning). Release lock.

4. Wait until everyone sees the new hard watermark.

5. Grab extension lock.

6. Check (or wait) that there are no pinned buffers above the current 
hard watermark. (A backend might have a scan in progress that started 
before any of this, still holding a buffer pinned, even though the page is 
empty.)

7. Truncate relation to the current hard watermark.

8. Release extension lock.


If a backend inserts a new tuple before step 2, the vacuum scan will see 
it. If it's inserted after step 2, the backend's cached soft watermark 
is already up-to-date, and thus the backend will update the soft 
watermark before the insert. Thus, once the vacuum scan at step 2 has 
finished, all pages above the current soft watermark must still be empty.
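
Putting the steps together, the vacuum side would look roughly like this. 
Everything except the extension lock calls and RelationTruncate() is a 
made-up helper standing in for the steps above:

/*
 * Sketch of the truncation sequence in vacuum, following steps 1-8 above.
 * The *Watermark*, *Sinval* and scan helpers are hypothetical.
 */
static void
TruncateHeapWithoutAEL(Relation rel, BlockNumber target)
{
    BlockNumber hard;

    /* 1. Publish the soft watermark and wait for all backends to see it. */
    SetSoftWatermarkAndWait(rel, target);

    /* 2. Re-check that the pages above the target are still empty. */
    if (!PagesAboveWatermarkAreEmpty(rel, target))
        return;             /* give up, someone used the space */

    /* 3. Freeze the truncation point under the extension lock. */
    LockRelationForExtension(rel, ExclusiveLock);
    hard = GetSoftWatermark(rel);   /* may have been raised meanwhile */
    SetHardWatermark(rel, hard);
    UnlockRelationForExtension(rel, ExclusiveLock);

    /* 4. Wait until every backend has seen the new hard watermark. */
    WaitForSinvalCatchup();

    /* 5.-8. Truncate while holding the extension lock. */
    LockRelationForExtension(rel, ExclusiveLock);
    WaitForUnpinnedBuffersAbove(rel, hard);     /* 6. wait out pins */
    RelationTruncate(rel, hard);                /* 7. physical truncation */
    UnlockRelationForExtension(rel, ExclusiveLock);

    ClearTruncationSlot(rel);   /* unset the watermarks again */
}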


Implementation details
----------------------

There are three kinds of access to a heap page:

A) As a target for a new tuple.
B) Following an index pointer, ctid or similar.
C) A sequential scan (and bitmap heap scan?)


To refrain from inserting new tuples to empty pages above the soft 
watermark (A), RelationGetBufferForTuple() is modified to check the soft 
watermark (and raise it if necessary).
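
Something along these lines, as a sketch. GetCachedSoftWatermark() and 
RaiseSoftWatermark() are made-up names for the relcache-cached value and 
its update path:

/*
 * Hypothetical helper, called from RelationGetBufferForTuple() before a
 * tuple is placed on "targetBlock".
 */
static void
CheckSoftWatermarkForInsert(Relation relation, Buffer buffer,
                            BlockNumber targetBlock)
{
    Page    page = BufferGetPage(buffer);

    /* Updating a non-empty page above the soft watermark is fine. */
    if (!PageIsEmpty(page))
        return;

    /* Inserting into an empty page above it requires raising it first. */
    if (targetBlock >= GetCachedSoftWatermark(relation))
    {
        LockRelationForExtension(relation, ExclusiveLock);
        RaiseSoftWatermark(relation, targetBlock + 1);
        UnlockRelationForExtension(relation, ExclusiveLock);
    }
}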

An index scan (B) should never try to read beyond the hard watermark, 
because there are no tuples above it, and thus there should be no 
pointers to pages above it either.

A sequential scan (C) must refrain from reading beyond the hard 
watermark. This can be implemented by always checking the (cached) hard 
watermark value before stepping to the next page.
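
For example, the page-advance logic could clamp the scan like this; 
GetCachedHardWatermark() is again a made-up accessor for the cached value:

/*
 * Sketch of a check in the sequential scan's page-advance logic
 * (heapgettup()): treat the cached hard watermark as end-of-file.
 * scan->rs_nblocks is the existing field holding the scan's notion of
 * the relation size; GetCachedHardWatermark() is hypothetical.
 */
static BlockNumber
ScanEndBlock(HeapScanDesc scan)
{
    BlockNumber hard = GetCachedHardWatermark(scan->rs_rd);

    /* Never step onto a page at or above the hard watermark. */
    return Min(scan->rs_nblocks, hard);
}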


Truncation during hot standby is a lot simpler: set soft and hard 
watermarks to the truncation point, wait until everyone sees the new 
values, and truncate the relation.
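
In code, the replay side would be little more than this (same made-up 
helpers as above; RelationTruncate() is the existing function):

/*
 * Sketch of replaying a truncation during hot standby: no separate
 * soft-watermark phase is needed, because the master already guaranteed
 * the pages are empty.  All helpers except RelationTruncate() are
 * hypothetical.
 */
static void
ReplayHeapTruncation(Relation rel, BlockNumber target)
{
    SetSoftWatermark(rel, target);
    SetHardWatermark(rel, target);
    WaitForSinvalCatchup();         /* wait until all backends see them */
    RelationTruncate(rel, target);
    ClearTruncationSlot(rel);
}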


Does anyone see a flaw in this?

- Heikki


