Discussion: visibility map
visibilitymap.c begins with a long and useful comment - but this part seems to have a bit of split personality disorder.

 * Currently, the visibility map is not 100% correct all the time.
 * During updates, the bit in the visibility map is cleared after releasing
 * the lock on the heap page. During the window between releasing the lock
 * and clearing the bit in the visibility map, the bit in the visibility map
 * is set, but the new insertion or deletion is not yet visible to other
 * backends.
 *
 * That might actually be OK for the index scans, though. The newly inserted
 * tuple wouldn't have an index pointer yet, so all tuples reachable from an
 * index would still be visible to all other backends, and deletions wouldn't
 * be visible to other backends yet. (But HOT breaks that argument, no?)

I believe that the answer to the parenthesized question here is "yes" (in which case we might want to just delete this paragraph).

 * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
 * vacuum observes that all tuples are visible to all, it sets the flag on
 * the heap page, and also sets the bit in the visibility map. If we then
 * crash, and only the visibility map page was flushed to disk, we'll have
 * a bit set in the visibility map, but the corresponding flag on the heap
 * page is not set. If the heap page is then updated, the updater won't
 * know to clear the bit in the visibility map. (Isn't that prevented by
 * the LSN interlock?)

I *think* that the answer to this parenthesized question is "no". When we vacuum a page, we set the LSN on both the heap page and the visibility map page. Therefore, neither of them can get written to disk until the WAL record is flushed, but they could get flushed in either order. So the visibility map page could get flushed before the heap page, as the non-parenthesized portion of the comment indicates. However, at least in theory, it seems like we could fix this up during redo. Thoughts?
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 14/06/10 06:08, Robert Haas wrote:
> visibilitymap.c begins with a long and useful comment - but this part
> seems to have a bit of split personality disorder.
>
> * Currently, the visibility map is not 100% correct all the time.
> * During updates, the bit in the visibility map is cleared after releasing
> * the lock on the heap page. During the window between releasing the lock
> * and clearing the bit in the visibility map, the bit in the visibility map
> * is set, but the new insertion or deletion is not yet visible to other
> * backends.
> *
> * That might actually be OK for the index scans, though. The newly inserted
> * tuple wouldn't have an index pointer yet, so all tuples reachable from an
> * index would still be visible to all other backends, and deletions wouldn't
> * be visible to other backends yet. (But HOT breaks that argument, no?)
>
> I believe that the answer to the parenthesized question here is "yes"
> (in which case we might want to just delete this paragraph).

A HOT update can only update non-indexed columns, so I think we're still OK with HOT. To an index-only scan, it doesn't matter which tuple in a HOT update chain you consider as live, because they must all have the same value in the indexed columns. Subtle..

> * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
> * vacuum observes that all tuples are visible to all, it sets the flag on
> * the heap page, and also sets the bit in the visibility map. If we then
> * crash, and only the visibility map page was flushed to disk, we'll have
> * a bit set in the visibility map, but the corresponding flag on the heap
> * page is not set. If the heap page is then updated, the updater won't
> * know to clear the bit in the visibility map. (Isn't that prevented by
> * the LSN interlock?)
>
> I *think* that the answer to this parenthesized question is "no".
> When we vacuum a page, we set the LSN on both the heap page and the
> visibility map page. Therefore, neither of them can get written to
> disk until the WAL record is flushed, but they could get flushed in
> either order. So the visibility map page could get flushed before the
> heap page, as the non-parenthesized portion of the comment indicates.

Right.

> However, at least in theory, it seems like we could fix this up during
> redo.

Setting a bit in the visibility map is currently not WAL-logged, but yes, once we add WAL-logging, that's straightforward to fix.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, Jun 14, 2010 at 1:19 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> I *think* that the answer to this parenthesized question is "no".
>> When we vacuum a page, we set the LSN on both the heap page and the
>> visibility map page. Therefore, neither of them can get written to
>> disk until the WAL record is flushed, but they could get flushed in
>> either order. So the visibility map page could get flushed before the
>> heap page, as the non-parenthesized portion of the comment indicates.
>
> Right.
>
>> However, at least in theory, it seems like we could fix this up during
>> redo.
>
> Setting a bit in the visibility map is currently not WAL-logged, but yes
> once we add WAL-logging, that's straightforward to fix.

Eh, so. Suppose - for the sake of argument - we do the following:

1. Allocate an additional infomask(2) bit that means "xmin is frozen, no need to call XidInMVCCSnapshot()". When we freeze a tuple, we set this bit in lieu of overwriting xmin. Note that freezing pages is already WAL-logged, so redo is possible.

2. Modify VACUUM so that, when the page is observed to be all-visible, it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the visibility map bit, writing a single XLOG record for the whole operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same vacuum already removed tuples; otherwise and/or when no tuples were removed, writing XLOG_HEAP2_FREEZE or some new record type). This loses no forensic information because of (1). (If the page is NOT observed to be all-visible, we freeze individual tuples only when they hit the current age thresholds.)

Setting the visibility map bit is now crash-safe.

Please poke holes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Can we just log the change of VM in log_heap_clean() for redo?

Thanks

--
GaoZengqi
pgf00a@gmail.com
zengqigao@gmail.com

On Tue, Nov 23, 2010 at 3:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Eh, so. Suppose - for the sake of argument - we do the following:
>
> 1. Allocate an additional infomask(2) bit that means "xmin is frozen,
> no need to call XidInMVCCSnapshot()". When we freeze a tuple, we set
> this bit in lieu of overwriting xmin. Note that freezing pages is
> already WAL-logged, so redo is possible.
>
> 2. Modify VACUUM so that, when the page is observed to be all-visible,
> it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the
> visibility map bit, writing a single XLOG record for the whole
> operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same
> vacuum already removed tuples; otherwise and/or when no tuples were
> removed, writing XLOG_HEAP2_FREEZE or some new record type). This
> loses no forensic information because of (1).
>
> Setting the visibility map bit is now crash-safe.
>
> Please poke holes.
On 22.11.2010 21:24, Robert Haas wrote:
> Eh, so. Suppose - for the sake of argument - we do the following:
>
> 1. Allocate an additional infomask(2) bit that means "xmin is frozen,
> no need to call XidInMVCCSnapshot()". When we freeze a tuple, we set
> this bit in lieu of overwriting xmin. Note that freezing pages is
> already WAL-logged, so redo is possible.
>
> 2. Modify VACUUM so that, when the page is observed to be all-visible,
> it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the
> visibility map bit, writing a single XLOG record for the whole
> operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same
> vacuum already removed tuples; otherwise and/or when no tuples were
> removed, writing XLOG_HEAP2_FREEZE or some new record type). This
> loses no forensic information because of (1). (If the page is NOT
> observed to be all-visible, we freeze individual tuples only when they
> hit the current age thresholds.)
>
> Setting the visibility map bit is now crash-safe.

That's an interesting idea. You piggyback setting the vm bit on the freeze WAL record, on the assumption that you have to write the freeze record anyway. However, if that assumption doesn't hold, because the tuples are deleted before they reach vacuum_freeze_min_age, it's no better than the naive approach of WAL-logging the vm bit set separately. Whether that's acceptable or not, I don't know.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Tue, Nov 23, 2010 at 3:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> That's an interesting idea. You piggyback setting the vm bit on the freeze
> WAL record, on the assumption that you have to write the freeze record
> anyway. However, if that assumption doesn't hold, because the tuples are
> deleted before they reach vacuum_freeze_min_age, it's no better than the
> naive approach of WAL-logging the vm bit set separately. Whether that's
> acceptable or not, I don't know.

I don't know, either. I was trying to think of the cases where this would generate a net increase in WAL before I sent the email, but couldn't fully wrap my brain around it at the time. Thanks for summarizing.

Here's another design to poke holes in:

1. Imagine that the visibility map is divided into granules. For the sake of argument let's suppose there are 8K bits per granule; thus each granule covers 64MB of the underlying heap and 1KB of space in the visibility map itself.

2. In shared memory, create a new array called the visibility vacuum array (VVA), each element of which has room for a backend ID, a relfilenode, a granule number, and an LSN. Before setting bits in the visibility map, a backend is required to allocate a slot in this array, XLOG the slot allocation, and fill in its backend ID, relfilenode number, and the granule number whose bits it will be manipulating, plus the LSN of the slot-allocation XLOG record. It then sets as many bits within that granule as it likes. When done, it sets the backend ID of the VVA slot to InvalidBackendId but does not remove it from the array immediately; such a slot is said to have been "released".

3. When visibility map bits are set, the LSN of the page is set to that of the slot-allocation XLOG record, so that the visibility map page can't hit the disk before that record. Also, the contents of the VVA, sans backend IDs, are XLOG'd at each checkpoint. Thus, on redo, we can compute a list of all VVA slots for which visibility-bit changes might already be on disk; we go through and clear both the visibility map bits and the PD_ALL_VISIBLE bits on the underlying pages.

4. To free a VVA slot that has been released, we must flush WAL as far as the record that allocated the slot and sync the visibility map and heap segments containing that granule. Thus, all slots released before a checkpoint starts can be freed after it completes. Alternatively, an individual backend can free a previously-released slot by performing the WAL flush and syncs itself. (This might require a few more bookkeeping details to be stored in the VVA, but it seems manageable.)

One problem with this design is that the visibility map bits never get set on standby servers. If we don't XLOG setting the bit then I suppose that doesn't happen now either, but it's more sucky (that's the technical term) if you're relying on it for index-only scans (which are also relevant on the standby, either during HS or if promoted) versus if you're only relying on it for vacuum (which doesn't happen on the standby anyway unless and until it's promoted).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company