Thread: visibility map


visibility map

From
Robert Haas
Date:
visibilitymap.c begins with a long and useful comment - but this part
seems to have a bit of split personality disorder.
 * Currently, the visibility map is not 100% correct all the time.
 * During updates, the bit in the visibility map is cleared after releasing
 * the lock on the heap page. During the window between releasing the lock
 * and clearing the bit in the visibility map, the bit in the visibility map
 * is set, but the new insertion or deletion is not yet visible to other
 * backends.
 *
 * That might actually be OK for the index scans, though. The newly inserted
 * tuple wouldn't have an index pointer yet, so all tuples reachable from an
 * index would still be visible to all other backends, and deletions wouldn't
 * be visible to other backends yet.  (But HOT breaks that argument, no?)

I believe that the answer to the parenthesized question here is "yes"
(in which case we might want to just delete this paragraph).
 * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
 * vacuum observes that all tuples are visible to all, it sets the flag on
 * the heap page, and also sets the bit in the visibility map. If we then
 * crash, and only the visibility map page was flushed to disk, we'll have
 * a bit set in the visibility map, but the corresponding flag on the heap
 * page is not set. If the heap page is then updated, the updater won't
 * know to clear the bit in the visibility map.  (Isn't that prevented by
 * the LSN interlock?)

I *think* that the answer to this parenthesized question is "no".
When we vacuum a page, we set the LSN on both the heap page and the
visibility map page.  Therefore, neither of them can get written to
disk until the WAL record is flushed, but they could get flushed in
either order.  So the visibility map page could get flushed before the
heap page, as the non-parenthesized portion of the comment indicates.
However, at least in theory, it seems like we could fix this up during
redo.
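
The interlock and its limit can be sketched in a toy model (plain Python for illustration, not PostgreSQL code; the names are made up). The WAL-before-data rule orders each data page after the WAL record, but nothing orders the two data pages relative to each other:

```python
# Toy model of the WAL-before-data rule: both pages get the same LSN,
# so neither may be written before that WAL record is durable -- but
# the two page writes themselves can happen in either order.

class Page:
    def __init__(self):
        self.lsn = 0
        self.all_visible = False

def vacuum_sets_all_visible(heap, vm, wal):
    wal.append("set-all-visible")        # emit one WAL record
    lsn = len(wal)                       # its position stands in for an LSN
    heap.all_visible = True; heap.lsn = lsn   # PD_ALL_VISIBLE
    vm.all_visible = True;   vm.lsn = lsn     # visibility map bit

def can_flush(page, flushed_wal_upto):
    # A dirty page may hit disk only once WAL covering its LSN is durable.
    return page.lsn <= flushed_wal_upto

heap, vm, wal = Page(), Page(), []
vacuum_sets_all_visible(heap, vm, wal)

assert not can_flush(vm, flushed_wal_upto=0)        # WAL not flushed yet
assert can_flush(vm, flushed_wal_upto=len(wal))     # after WAL flush,
assert can_flush(heap, flushed_wal_upto=len(wal))   # ...either page may go first
# A crash after writing only the VM page leaves the bit set on disk while
# PD_ALL_VISIBLE is not -- the interlock doesn't prevent that; only
# replaying the record during redo could restore consistency.
```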

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: visibility map

From
Heikki Linnakangas
Date:
On 14/06/10 06:08, Robert Haas wrote:
> visibilitymap.c begins with a long and useful comment - but this part
> seems to have a bit of split personality disorder.
>
>   * Currently, the visibility map is not 100% correct all the time.
>   * During updates, the bit in the visibility map is cleared after releasing
>   * the lock on the heap page. During the window between releasing the lock
>   * and clearing the bit in the visibility map, the bit in the visibility map
>   * is set, but the new insertion or deletion is not yet visible to other
>   * backends.
>   *
>   * That might actually be OK for the index scans, though. The newly inserted
>   * tuple wouldn't have an index pointer yet, so all tuples reachable from an
>   * index would still be visible to all other backends, and deletions wouldn't
>   * be visible to other backends yet.  (But HOT breaks that argument, no?)
>
> I believe that the answer to the parenthesized question here is "yes"
> (in which case we might want to just delete this paragraph).

A HOT update can only update non-indexed columns, so I think we're still 
OK with HOT. To an index-only scan, it doesn't matter which tuple in a 
HOT update chain you consider as live, because they must all have the 
same values in the indexed columns. Subtle...
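
That invariant is easy to illustrate with a toy model (plain Python, not PostgreSQL code; the column names are made up): every version in a HOT chain agrees on the indexed column, so an index-only scan returns the same answer whichever version it treats as live.

```python
# A HOT chain after one HOT update: only the non-indexed column changed.
# The index has a single pointer to the chain; "id" is the indexed column.
chain = [
    {"id": 42, "payload": "v1"},   # original tuple version
    {"id": 42, "payload": "v2"},   # HOT update: non-indexed column only
]

indexed_values = {t["id"] for t in chain}
assert indexed_values == {42}      # all versions agree on the indexed column
```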

>   * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
>   * vacuum observes that all tuples are visible to all, it sets the flag on
>   * the heap page, and also sets the bit in the visibility map. If we then
>   * crash, and only the visibility map page was flushed to disk, we'll have
>   * a bit set in the visibility map, but the corresponding flag on the heap
>   * page is not set. If the heap page is then updated, the updater won't
>   * know to clear the bit in the visibility map.  (Isn't that prevented by
>   * the LSN interlock?)
>
> I *think* that the answer to this parenthesized question is "no".
> When we vacuum a page, we set the LSN on both the heap page and the
> visibility map page.  Therefore, neither of them can get written to
> disk until the WAL record is flushed, but they could get flushed in
> either order.  So the visibility map page could get flushed before the
> heap page, as the non-parenthesized portion of the comment indicates.

Right.

> However, at least in theory, it seems like we could fix this up during
> redo.

Setting a bit in the visibility map is currently not WAL-logged, but yes, 
once we add WAL-logging, that's straightforward to fix.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: visibility map

From
Robert Haas
Date:
On Mon, Jun 14, 2010 at 1:19 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> I *think* that the answer to this parenthesized question is "no".
>> When we vacuum a page, we set the LSN on both the heap page and the
>> visibility map page.  Therefore, neither of them can get written to
>> disk until the WAL record is flushed, but they could get flushed in
>> either order.  So the visibility map page could get flushed before the
>> heap page, as the non-parenthesized portion of the comment indicates.
>
> Right.
>
>> However, at least in theory, it seems like we could fix this up during
>> redo.
>
> Setting a bit in the visibility map is currently not WAL-logged, but yes
> once we add WAL-logging, that's straightforward to fix.

Eh, so.  Suppose - for the sake of argument - we do the following:

1. Allocate an additional infomask(2) bit that means "xmin is frozen,
no need to call XidInMVCCSnapshot()".  When we freeze a tuple, we set
this bit in lieu of overwriting xmin.  Note that freezing pages is
already WAL-logged, so redo is possible.

2. Modify VACUUM so that, when the page is observed to be all-visible,
it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the
visibility map bit, writing a single XLOG record for the whole
operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same
vacuum already removed tuples; otherwise and/or when no tuples were
removed writing XLOG_HEAP2_FREEZE or some new record type).  This
loses no forensic information because of (1).  (If the page is NOT
observed to be all-visible, we freeze individual tuples only when they
hit the current age thresholds.)

Setting the visibility map bit is now crash-safe.
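
A toy sketch of steps (1) and (2) (plain Python for illustration; `XMIN_FROZEN`, the record name, and the field names are placeholders, not actual PostgreSQL symbols):

```python
XMIN_FROZEN = 0x0200              # hypothetical spare infomask bit (step 1)
VACUUM_FREEZE_MIN_AGE = 50_000_000

class Tuple:
    def __init__(self, xmin_age, visible_to_all):
        self.xmin_age = xmin_age
        self.visible_to_all = visible_to_all
        self.infomask = 0

class HeapPage:
    def __init__(self, blkno, tuples):
        self.blkno = blkno
        self.tuples = tuples
        self.all_visible = False  # PD_ALL_VISIBLE

def vacuum_page(page, vm_bits, wal):
    if all(t.visible_to_all for t in page.tuples):
        # Step 2: freeze everything, set the page flag and the map bit,
        # all covered by a single WAL record so redo replays the whole set.
        for t in page.tuples:
            t.infomask |= XMIN_FROZEN        # freeze without overwriting xmin
        page.all_visible = True
        wal.append(("freeze-and-set-all-visible", page.blkno))
        vm_bits.add(page.blkno)
    else:
        # Not all-visible: freeze only tuples past the age threshold, as today.
        for t in page.tuples:
            if t.xmin_age >= VACUUM_FREEZE_MIN_AGE:
                t.infomask |= XMIN_FROZEN

wal, vm_bits = [], set()
p = HeapPage(0, [Tuple(10, True), Tuple(20, True)])
vacuum_page(p, vm_bits, wal)   # all-visible: one record covers all three changes
```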

Please poke holes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: visibility map

From
高增琦
Date:
Can we just log the change of VM in log_heap_clean() for redo?
Thanks

--
GaoZengqi
pgf00a@gmail.com
zengqigao@gmail.com

On Tue, Nov 23, 2010 at 3:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 14, 2010 at 1:19 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> >> I *think* that the answer to this parenthesized question is "no".
> >> When we vacuum a page, we set the LSN on both the heap page and the
> >> visibility map page.  Therefore, neither of them can get written to
> >> disk until the WAL record is flushed, but they could get flushed in
> >> either order.  So the visibility map page could get flushed before the
> >> heap page, as the non-parenthesized portion of the comment indicates.
> >
> > Right.
> >
> >> However, at least in theory, it seems like we could fix this up during
> >> redo.
> >
> > Setting a bit in the visibility map is currently not WAL-logged, but yes
> > once we add WAL-logging, that's straightforward to fix.
>
> Eh, so.  Suppose - for the sake of argument - we do the following:
>
> 1. Allocate an additional infomask(2) bit that means "xmin is frozen,
> no need to call XidInMVCCSnapshot()".  When we freeze a tuple, we set
> this bit in lieu of overwriting xmin.  Note that freezing pages is
> already WAL-logged, so redo is possible.
>
> 2. Modify VACUUM so that, when the page is observed to be all-visible,
> it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the
> visibility map bit, writing a single XLOG record for the whole
> operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same
> vacuum already removed tuples; otherwise and/or when no tuples were
> removed writing XLOG_HEAP2_FREEZE or some new record type).  This
> loses no forensic information because of (1).  (If the page is NOT
> observed to be all-visible, we freeze individual tuples only when they
> hit the current age thresholds.)
>
> Setting the visibility map bit is now crash-safe.
>
> Please poke holes.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Re: visibility map

From
Heikki Linnakangas
Date:
On 22.11.2010 21:24, Robert Haas wrote:
> Eh, so.  Suppose - for the sake of argument - we do the following:
>
> 1. Allocate an additional infomask(2) bit that means "xmin is frozen,
> no need to call XidInMVCCSnapshot()".  When we freeze a tuple, we set
> this bit in lieu of overwriting xmin.  Note that freezing pages is
> already WAL-logged, so redo is possible.
>
> 2. Modify VACUUM so that, when the page is observed to be all-visible,
> it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the
> visibility map bit, writing a single XLOG record for the whole
> operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same
> vacuum already removed tuples; otherwise and/or when no tuples were
> removed writing XLOG_HEAP2_FREEZE or some new record type).  This
> loses no forensic information because of (1).  (If the page is NOT
> observed to be all-visible, we freeze individual tuples only when they
> hit the current age thresholds.)
>
> Setting the visibility map bit is now crash-safe.

That's an interesting idea. You piggyback setting the vm bit on the 
freeze WAL record, on the assumption that you have to write the freeze 
record anyway. However, if that assumption doesn't hold, because the 
tuples are deleted before they reach vacuum_freeze_min_age, it's no 
better than the naive approach of WAL-logging the vm bit set separately. 
Whether that's acceptable or not, I don't know.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: visibility map

From
Robert Haas
Date:
On Tue, Nov 23, 2010 at 3:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> That's an interesting idea. You piggyback setting the vm bit on the freeze
> WAL record, on the assumption that you have to write the freeze record
> anyway. However, if that assumption doesn't hold, because the tuples are
> deleted before they reach vacuum_freeze_min_age, it's no better than the
> naive approach of WAL-logging the vm bit set separately. Whether that's
> acceptable or not, I don't know.

I don't know, either.  I was trying to think of the cases where this
would generate a net increase in WAL before I sent the email, but
couldn't fully wrap my brain around it at the time.  Thanks for
summarizing.

Here's another design to poke holes in:

1. Imagine that the visibility map is divided into granules.  For the
sake of argument let's suppose there are 8K bits per granule; thus
each granule covers 64M of the underlying heap and 1K of space in the
visibility map itself.

2. In shared memory, create a new array called the visibility vacuum
array (VVA), each element of which has room for a backend ID, a
relfilenode, a granule number, and an LSN.  Before setting bits in the
visibility map, a backend is required to allocate a slot in this
array, XLOG the slot allocation, and fill in its backend ID,
relfilenode number, and the granule number whose bits it will be
manipulating, plus the LSN of the slot allocation XLOG record.  It
then sets as many bits within that granule as it likes.  When done, it
sets the backend ID of the VVA slot to InvalidBackendId but does not
remove it from the array immediately; such a slot is said to have been
"released".

3. When visibility map bits are set, the LSN of the page is set to the
new-VVA-slot XLOG record, so that the visibility map page can't hit
the disk before the new-VVA-slot XLOG record.  Also, the contents of
the VVA, sans backend IDs, are XLOG'd at each checkpoint.  Thus, on
redo, we can compute a list of all VVA slots for which visibility-bit
changes might already be on disk; we go through and clear both the
visibility map bit and the PD_ALL_VISIBLE bits on the underlying
pages.

4. To free a VVA slot that has been released, we must xlogflush as far
as the record that allocated the slot and sync the visibility map and
heap segments containing that granule.  Thus, all slots released
before a checkpoint starts can be freed after it completes.
Alternatively, an individual backend can free a previously-released
slot by performing the xlog flush and syncs itself.  (This might
require a few more bookkeeping details to be stored in the VVA, but it
seems manageable.)
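
Steps 1-4 above can be sketched as a toy model (plain Python for illustration, not a real implementation; the class and field names are made up, and the sizes follow the description above):

```python
BITS_PER_GRANULE = 8192   # 8K bits; one bit per 8KB heap page
assert BITS_PER_GRANULE * 8192 == 64 * 1024 * 1024   # 64MB of heap per granule
assert BITS_PER_GRANULE // 8 == 1024                 # 1KB of map space per granule

class VVASlot:
    def __init__(self, backend_id, relfilenode, granule, lsn):
        self.backend_id = backend_id
        self.relfilenode = relfilenode
        self.granule = granule
        self.lsn = lsn            # LSN of the slot-allocation WAL record

class VVA:
    def __init__(self):
        self.slots = []

    def allocate(self, backend_id, relfilenode, granule, wal):
        # Step 2: XLOG the allocation before any map bits may be set.
        wal.append(("vva-alloc", relfilenode, granule))
        slot = VVASlot(backend_id, relfilenode, granule, lsn=len(wal))
        self.slots.append(slot)
        return slot

    def release(self, slot):
        slot.backend_id = None    # "InvalidBackendId": released, not yet freed

    def free_released(self, flushed_wal_upto):
        # Step 4: a released slot may be freed once WAL up to its allocation
        # record is flushed (and the covered map/heap segments are synced).
        self.slots = [s for s in self.slots
                      if not (s.backend_id is None
                              and s.lsn <= flushed_wal_upto)]

wal, vva = [], VVA()
slot = vva.allocate(backend_id=7, relfilenode=16384, granule=0, wal=wal)
# ...backend sets map bits within granule 0, stamping pages with slot.lsn...
vva.release(slot)
vva.free_released(flushed_wal_upto=0)          # WAL not flushed yet: slot survives,
                                               # so redo still knows to re-clear bits
vva.free_released(flushed_wal_upto=len(wal))   # after the flush+sync it can go away
```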

One problem with this design is that the visibility map bits never get
set on standby servers.  If we don't XLOG setting the bit then I
suppose that doesn't happen now either, but it's more sucky (that's
the technical term) if you're relying on it for index-only scans
(which are also relevant on the standby, either during HS or if
promoted) versus if you're only relying on it for vacuum (which
doesn't happen on the standby anyway unless and until it's promoted).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company