Thread: 64.4.2. Bottom-up Index Deletion

64.4.2. Bottom-up Index Deletion

From: PG Doc comments form
Date:
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/14/btree-implementation.html
Description:

On this URL, 64.4.2. Bottom-up Index Deletion:
https://www.postgresql.org/docs/14/btree-implementation.html#BTREE-DELETION


Would be nice to add a note: old tuple versions in the index referencing the
same logical row cannot be deleted by the bottom-up index deletion process when
older transactions that might require the old state of the row are still
running. That is what the LP_DEAD bit is for, I believe?

Hopefully my understanding is correct.

Re: 64.4.2. Bottom-up Index Deletion

From: Peter Geoghegan
Date:
Hi Hussein,

Apologies for the very delayed response. I'm aware that you've taken
an interest in this subject as part of your YouTube channel. Thanks
for publicizing the work!

On Tue, Jul 12, 2022 at 7:14 PM PG Doc comments form
<noreply@postgresql.org> wrote:
> Would be nice to add a note: old tuple versions in the index referencing the
> same logical row cannot be deleted by the bottom-up index deletion process when
> older transactions that might require the old state of the row are still
> running

It's really hard to write documentation for something like this,
because it's difficult to decide what your audience really needs to
know. I agree that it's important to get this specific point across,
though. In fact I thought that I already conveyed the same idea at
this point:

"All indexes will need a successor physical index tuple that points to
the latest version in the table. Each new tuple within each index will
generally need to coexist with the original “updated” tuple for a
short period of time (typically until shortly after the UPDATE
transaction commits)."

The implication is that we need the old version to coexist until after
the updater transaction commits and is seen by every possible MVCC
snapshot as having committed -- at which point nobody sees the old
version anymore. Maybe we could augment the existing sentences I have
highlighted? Could it be more explicit?
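
To make that rule concrete, here is a toy model in C of when an old row
version (and so the index tuple pointing at it) stops being needed. It is
only a sketch of the idea -- nothing below comes from the actual nbtree or
heapam code, and every name in it is invented:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;

/* A hypothetical summary of the transaction that updated/deleted the row. */
typedef struct OldVersion
{
    xid_t xmax;            /* XID of the updating/deleting transaction */
    bool  xmax_committed;  /* did that transaction commit? */
    bool  xmax_aborted;    /* ...or abort? */
} OldVersion;

/*
 * oldest_xmin stands in for the usual horizon: the oldest XID that some
 * still-active snapshot might not yet see as committed.
 */
bool
old_version_removable(const OldVersion *tup, xid_t oldest_xmin)
{
    if (tup->xmax_aborted)
        return true;       /* updater rolled back: removable immediately */
    if (!tup->xmax_committed)
        return false;      /* updater still in flight: versions must coexist */
    /* Committed, but removable only once every snapshot sees that commit. */
    return tup->xmax < oldest_xmin;
}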

> That is what the LP_DEAD bit is for I believe?

Sort of. You could say that the LP_DEAD bit being set means "this
index tuple is definitely useless to everybody, and so is definitely
eligible for deletion when the page fills up" -- so in that sense
you're right. However, it doesn't work the other way around -- not
every index tuple that is deleted (or that is eligible to be deleted)
will have its LP_DEAD bit set. Bottom-up index deletion only happens
when we have no LP_DEAD bits set on the leaf page (we always prefer to
delete them via simple deletion instead).

An index tuple's LP_DEAD bit is set by index scans that happen to
notice that the tuple is not needed by any possible MVCC snapshot.
This allows it to be ignored by other index scans immediately. It also
allows simple deletion to physically remove the tuple later on (at the
point that the page fills). Of course the LP_DEAD bit can only be set
when such an index scan actually happens because some SQL queries
happen to be executed by the user app.
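
As a rough illustration of that division of labor -- again just a toy model
with made-up names and structures, not the real on-page representation --
think of it roughly like this:

#include <stdbool.h>
#include <stddef.h>

#define LEAF_CAPACITY 8

/* One line pointer on a leaf page, reduced to the parts that matter here. */
typedef struct LeafItem
{
    int  heap_tid;    /* stand-in for the pointer into the table */
    bool lp_dead;     /* the LP_DEAD hint bit */
    bool present;     /* still physically on the page? */
} LeafItem;

typedef struct LeafPage
{
    LeafItem items[LEAF_CAPACITY];
    size_t   nitems;
} LeafPage;

/* Set in passing by an index scan that saw the row is dead to every snapshot. */
void
mark_lp_dead(LeafPage *page, size_t i)
{
    page->items[i].lp_dead = true;    /* cheap: just a hint bit */
}

/* "Simple deletion": runs when the page fills, removes only the hinted items. */
size_t
simple_deletion(LeafPage *page)
{
    size_t removed = 0;

    for (size_t i = 0; i < page->nitems; i++)
    {
        if (page->items[i].present && page->items[i].lp_dead)
        {
            page->items[i].present = false;
            removed++;
        }
    }
    return removed;    /* 0 here is when bottom-up deletion gets its chance */
}

The hint bit is cheap to set in passing, and the physical removal is
deferred until the page actually comes under pressure.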

The nice thing about bottom-up index deletion is that it reliably
places the burden of performing deletion on the specific non-HOT
updaters that are responsible for bloating the index -- they are
required to clean up their own mess, making it much less likely that
index bloat will be very destabilizing. So the main difference between
bottom-up deletion and simple deletion is the information that drives
the whole process. The physical modifications to the index page are
actually identical (the LP_DEAD bit is set by index scans, not the
deletion process itself).

--
Peter Geoghegan



Re: 64.4.2. Bottom-up Index Deletion

From: "David G. Johnston"
Date:
On Mon, Nov 7, 2022 at 5:20 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Hi Hussein,
>
> Apologies for the very delayed response. I'm aware that you've taken
> an interest in this subject as part of your YouTube channel. Thanks
> for publicizing the work!
>
> On Tue, Jul 12, 2022 at 7:14 PM PG Doc comments form
> <noreply@postgresql.org> wrote:
> > Would be nice to add a note: old tuple versions in the index referencing the
> > same logical row cannot be deleted by the bottom-up index deletion process when
> > older transactions that might require the old state of the row are still
> > running
>
> It's really hard to write documentation for something like this,
> because it's difficult to decide what your audience really needs to
> know. I agree that it's important to get this specific point across,
> though. In fact I thought that I already conveyed the same idea at
> this point:
>
> "All indexes will need a successor physical index tuple that points to
> the latest version in the table. Each new tuple within each index will
> generally need to coexist with the original “updated” tuple for a
> short period of time (typically until shortly after the UPDATE
> transaction commits)."
>
> The implication is that we need the old version to coexist until after
> the updater transaction commits and is seen by every possible MVCC
> snapshot as having committed -- at which point nobody sees the old
> version anymore. Maybe we could augment the existing sentences I have
> highlighted? Could it be more explicit?

I'm having trouble finding any major issues with the present wording, though it seems to assume the reader holds sufficient MVCC knowledge to understand the import of "until shortly after the UPDATE transaction commits".  Maybe a bit more explicitness is in order.

On the point of "will generally need to coexist" - I don't see why we are being wishy-washy here, though.

When updating a row where bottom-up deletion is chosen, the most recent tuple cannot be removed to make room for the new tuple; in particular, because the current update may not commit.

I'm also not inherently understanding how the bottom-up pass can know a tuple is safe to remove based upon visibility information when that information is not present in the index AND it doesn't rely upon LP_DEAD.

A bit nit-picky but I think relevant to the above confusion:

"B-Tree indexes incrementally delete" - is it really the index self-modifying or is it an active user session taking some time to perform each pass?  Describing it as, say: 

"The updating session will locate all the logically equivalent tuples (on the same page) via the index and check them for global visibility, removing those that it finds that are both older than the most recent tuple and no longer visible to all other sessions."

David J.

Re: 64.4.2. Bottom-up Index Deletion

From: Peter Geoghegan
Date:
On Wed, Nov 9, 2022 at 2:20 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Maybe a bit more explicitness is in order.

Yeah, maybe.

> On the point of "will generally need to coexist" - I don't see why we are being wishy-washy here, though.

Because sometimes it will take a relatively long time (say in the
presence of a long running xact/snapshot). With an aborted transaction
there will be no wait for the tuple to become eligible to
remove/delete at all (even in the presence of a long running
xact/snapshot).

> When updating a row where bottom-up deletion is chosen, the most recent tuple cannot be removed to make room for the
> new tuple; in particular, because the current update may not commit.

Sort of. Could also be multiple still-needed tuples.

In fact it's possible that it'll be a tuple that became garbage as a
result of a DELETE statement run during a committed transaction -- we
need only visit the same heap page during the processing that takes
place on the heapam side.

> I'm also not inherently understanding how the bottom-up pass can know a tuple is safe to remove based upon visibility
> information when that information is not present in the index AND it doesn't rely upon LP_DEAD.

It goes to the heap to get it. B-Tree does this based on speculative
criteria (mostly we visit heap pages pointed to by the most duplicate
index tuples first). The process starts out without knowing for sure
whether or not it'll work out. This is okay because we give up as soon
as any single heap page is visited without it yielding at least one
deletable tuple. And because it happens at the point of a would-be
page split caused by a non-HOT updater -- a crucial juncture for the
leaf page.

> "B-Tree indexes incrementally delete" - is it really the index self-modifying or is it an active user session taking
sometime to perform each pass?  Describing it as, say: 
>
> "The updating session will locate all the logically equivalent tuples (on the same page) via the index and check them
forglobal visibility, removing those that it finds that are both older than the most recent tuple and no longer visible
toall other sessions." 

But it actually just deletes whatever tuples are determined to be safe
to delete. The B-Tree side of this has some very vague idea about how
updates and MVCC version churn works, but it is quite unsure of
itself.

You can think of the whole thing as a process of ranking heap blocks
pointed to by tuples on the affected B-Tree leaf page. We are not
exactly maximizing the amount of space freed on the leaf page. It's
more a case of minimizing the probability of having to split the page,
while avoiding ever spending more than a small amount of effort on
truly hopeless cases. In many cases we can avoid a page split for a
very long time just by deleting a small number of index tuples (it all
depends on the workload).
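
If it helps, the overall shape of the process (again a sketch with invented
structures and names, not the real code) is roughly:

#include <stddef.h>
#include <stdlib.h>

/* How many index tuples on the leaf page point at a given heap block. */
typedef struct BlockCount
{
    int    heap_block;
    size_t ntuples;
} BlockCount;

static int
cmp_ntuples_desc(const void *a, const void *b)
{
    const BlockCount *ba = a;
    const BlockCount *bb = b;

    if (bb->ntuples > ba->ntuples)
        return 1;
    if (bb->ntuples < ba->ntuples)
        return -1;
    return 0;
}

/*
 * visit() stands in for the heapam-side work: go to the table block, check
 * visibility there, delete whatever index tuples pointing at it turn out to
 * be dead to everybody, and report how many were deleted.
 */
size_t
bottom_up_pass(BlockCount *blocks, size_t nblocks, size_t max_visits,
               size_t (*visit)(int heap_block))
{
    size_t deleted = 0;

    /* Rank heap blocks: those with the most duplicates look most promising. */
    qsort(blocks, nblocks, sizeof(BlockCount), cmp_ntuples_desc);

    for (size_t i = 0; i < nblocks && i < max_visits; i++)
    {
        size_t got = visit(blocks[i].heap_block);

        if (got == 0)
            break;      /* looks hopeless: give up right away */
        deleted += got;
    }
    return deleted;     /* the caller splits the page only if this wasn't enough */
}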

I don't think that these subtleties need to be documented. But it's
difficult to know where to draw the line.

--
Peter Geoghegan



Re: 64.4.2. Bottom-up Index Deletion

From: "David G. Johnston"
Date:
On Wed, Nov 9, 2022 at 4:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Nov 9, 2022 at 2:20 PM David G. Johnston
> <david.g.johnston@gmail.com> wrote:
> > Maybe a bit more explicitness is in order.
>
> Yeah, maybe.
>
> > On the point of "will generally need to coexist" - I don't see why we are being wishy-washy here, though.
>
> Because sometimes it will take a relatively long time (say in the
> presence of a long running xact/snapshot). With an aborted transaction
> there will be no wait for the tuple to become eligible to
> remove/delete at all (even in the presence of a long running
> xact/snapshot).

Ok, so the "most recent live tuple" - in an update, there has to be one.  Can that live tuple that is being updated ever be removed by this process?

I'm not sure what the long-running transaction has to do with this though - we are in an update so we've locked an active tuple and that is the only tuple this coexistence is talking about.  The other sentence tries to cover the remaining ones with a general "best effort" hand-wave which is OK.
 

> > When updating a row where bottom-up deletion is chosen, the most recent tuple cannot be removed to make room for the new tuple; in particular, because the current update may not commit.
>
> Sort of. Could also be multiple still-needed tuples.

That there are others doesn't support or invalidate the sentence, though.  "most recent live tuple" (i.e., the one being updated).
 

> In fact it's possible that it'll be a tuple that became garbage as a
> result of a DELETE statement run during a committed transaction -- we
> need only visit the same heap page during the processing that takes
> place on the heapam side.

You lost me here.  Which tuple is "it" - the one we are updating got deleted under us?


> > I'm also not inherently understanding how the bottom-up pass can know a tuple is safe to remove based upon visibility information when that information is not present in the index AND it doesn't rely upon LP_DEAD.
>
> It goes to the heap to get it.

ok

> "B-Tree indexes incrementally delete" - is it really the index self-modifying or is it an active user session taking some time to perform each pass?  Describing it as, say:
>
> "The updating session will locate all the logically equivalent tuples (on the same page) via the index and check them for global visibility, removing those that it finds that are both older than the most recent tuple and no longer visible to all other sessions."

But it actually just deletes whatever tuples are determined to be safe
to delete. The B-Tree side of this has some very vague idea about how
updates and MVCC version churn works, but it is quite unsure of
itself.

Ah, OK, I somehow got into my head that only duplicates of the row being updated mattered (I think the "deduplication process" drove me to this, and your comment "The nice thing about bottom-up index deletion is that it reliably places the burden of performing deletion on the specific non-HOT updaters that are responsible for bloating the index.")

So, the page that the update has to be placed on gets picked for cleanup but the actual fetching of rows and such doesn't actually care what that tuple is.  Though its ancestors are likely to be picked.


> I don't think that these subtleties need to be documented. But it's
> difficult to know where to draw the line.


Knowing for certain whether the cleanup scope is "just prior versions of my update" or "any old versions on the page my update just happened to be placed on" seems user-facing worthy.

I'm almost thinking the last sentence "Each new tuple within each index....for a short period of time..." is doing more harm than good.  Removing it makes things clearer.

"...latest version in the table".

(next paragraph)

"The no-longer-visible versions of these index entries are removed by three separate processes.  Vacuum (see chapter), simple index tuple deletion (see note), and bottom-up deletion (covered in the rest of this section).  Bottom-up index deletion is an incremental version churn deletion process, triggered by an anticipated "version churn page split".  This only happens.....(the rest can probably remain the same, though I suggest moving the note to the end of the section)."

Adding "no-longer-visible" at the start should hopefully be a sufficient reminder for someone familiar with MVCC; or a clue that more reading of pre-requisite features is needed before understanding this internal part of the system.

David J.

Re: 64.4.2. Bottom-up Index Deletion

From: Peter Geoghegan
Date:
On Wed, Nov 9, 2022 at 4:06 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Ok, so the "most recent live tuple" - in an update, there has to be one.  Can that live tuple that is being
> updated ever be removed by this process?

No.

> I'm not sure what the long-running transaction has to do with this though - we are in an update so we've locked an
> active tuple and that is the only tuple this coexistence is talking about.

We cannot remove "recently dead" tuples, which are definitely dead
(they were rendered garbage by a transaction that has already
committed), but are still required by some snapshots. We know that
they're bound to become garbage that is safe to remove sooner or later
-- but it's not safe just yet.

> You lost me here.  Which tuple is "it" - the one we are updating got deleted under us?

No. We delete any and all tuples that are found to be deletable. We
don't care about the true reasons at that point. We're perfectly
content to be the beneficiary of a certain kind of dumb luck if things
go that way (though we'll typically see only a few garbage tuples that
originate from DELETEs or aborted inserting xacts). In the end, we're
really just ranking heap blocks, and then visiting them in the hopes
of finding that we can avoid the page split. We're choosing among heap
blocks, not index tuples (once we visit a heap block we'll delete any
index tuples that point to the heap block and are seen to be eligible
to be deleted).
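
Put differently -- once more a made-up sketch, with invented names, of the
idea rather than the real routine -- visiting one chosen heap block looks
something like this on the index side:

#include <stdbool.h>
#include <stddef.h>

/* One index tuple on the leaf page, reduced to the pointer it carries. */
typedef struct LeafRef
{
    int  heap_block;    /* table block the index tuple points into */
    int  heap_offset;   /* item within that block */
    bool present;       /* still physically on the leaf page? */
} LeafRef;

/*
 * dead_to_all() is a placeholder for the visibility check done while the
 * chosen heap block is in hand; why the tuple died (UPDATE, DELETE, aborted
 * insert) never enters into it.
 */
size_t
delete_refs_to_block(LeafRef *refs, size_t nrefs, int chosen_block,
                     bool (*dead_to_all)(int heap_block, int heap_offset))
{
    size_t deleted = 0;

    for (size_t i = 0; i < nrefs; i++)
    {
        if (!refs[i].present || refs[i].heap_block != chosen_block)
            continue;
        if (dead_to_all(refs[i].heap_block, refs[i].heap_offset))
        {
            refs[i].present = false;
            deleted++;
        }
    }
    return deleted;
}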

> So, the page that the update has to be placed on gets picked for cleanup but the actual fetching of rows and such
> doesn't actually care what that tuple is.  Though its ancestors are likely to be picked.

Yeah, heap pages pointed to by relatively many duplicate index tuples
tend to be visited first.

So-called "dumb luck" is more likely than you'd think. There often
won't be very many heap pages in total -- in which case we're bound to
stumble upon any garbage tuples that weren't actually rendered garbage
by an updater in passing.

> Knowing for certain whether the cleanup scope is "just prior versions of my update" or "any old versions on the page
> my update just happened to be placed on" seems user-facing worthy.

We don't particularly care about what the incoming updater that
happens to have triggered the bottom-up deletion pass was doing. It's
not special. We don't particularly favor the row being updated over
other similar looking rows on the same leaf page. We'll usually have
to visit 2 or 3 heap pages total (per bottom-up pass) in any case.

> I'm almost thinking the last sentence "Each new tuple within each index....for a short period of time..." is doing
> more harm than good.  Removing it makes things clearer.

I'm unsure. Does anybody else have an opinion on this?

> Bottom-up index deletion is an incremental version churn deletion process, triggered by an anticipated "version churn
> page split".  This only happens.....(the rest can probably remain the same, though I suggest moving the note to the end
> of the section)."

Probably the most important point is that bottom-up deletion
compensates for cases where the HOT optimization wasn't used, though
only for those indexes that weren't logically modified by the UPDATE.
It more or less takes over from HOT. But it doesn't necessarily help
at all with DELETEs (not reliably).
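
To illustrate just that last point with a toy check (invented types and
names, not the executor's actual interface): a non-HOT UPDATE still has to
insert into every index, but only the indexes whose key columns it left
logically unchanged are the ones bottom-up deletion is aimed at -- roughly
the same indexes that HOT would have spared:

#include <stdbool.h>
#include <stddef.h>

/* A hypothetical description of one index: its key columns by attribute number. */
typedef struct IndexDef
{
    const char *name;
    int         keycols[8];
    size_t      nkeycols;
} IndexDef;

static bool
column_modified(const int *modified, size_t nmodified, int attno)
{
    for (size_t i = 0; i < nmodified; i++)
    {
        if (modified[i] == attno)
            return true;
    }
    return false;
}

/* True if the UPDATE did not logically modify this index's key columns. */
bool
bottom_up_candidate(const IndexDef *idx, const int *modified, size_t nmodified)
{
    for (size_t i = 0; i < idx->nkeycols; i++)
    {
        if (column_modified(modified, nmodified, idx->keycols[i]))
            return false;   /* key changed: the new entry is not mere version churn */
    }
    return true;            /* pure version churn here: bottom-up deletion can help */
}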

> Adding "no-longer-visible" at the start should hopefully be a sufficient reminder for someone familiar with MVCC; or
aclue that more reading of pre-requisite features is needed before understanding this internal part of the system. 

Yeah, it's not clear what the prerequisites for reading this material
should be. It is part of the B-Tree internals chapter, though, so it
is very advanced material.

--
Peter Geoghegan