Обсуждение: Rename dead_tuples to dead_items in vacuumlazy.c

Поиск
Список
Период
Сортировка

Rename dead_tuples to dead_items in vacuumlazy.c

От
Peter Geoghegan
Дата:
Attached patch performs polishing within vacuumlazy.c, as follow-up
work to the refactoring work in Postgres 14. This mainly consists of
changing references of dead tuples to dead items, which reflects the
fact that VACUUM no longer deals with TIDs that might point to
remaining heap tuples with storage -- the TIDs in the array must now
strictly point to LP_DEAD stub line pointers that remain in the heap,
following pruning.

I've also simplified header comments, and comments above the main
entry point functions. These comments made much more sense back when
lazy_scan_heap() was simpler, but wasn't yet broken up into smaller,
better-scoped functions.

If there are no objections, I'll move on this soon. It's mostly just
mechanical changes.

-- 
Peter Geoghegan

Вложения

Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Dilip Kumar
Дата:
On Wed, Nov 24, 2021 at 11:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached patch performs polishing within vacuumlazy.c, as follow-up
> work to the refactoring work in Postgres 14. This mainly consists of
> changing references of dead tuples to dead items, which reflects the
> fact that VACUUM no longer deals with TIDs that might point to
> remaining heap tuples with storage -- the TIDs in the array must now
> strictly point to LP_DEAD stub line pointers that remain in the heap,
> following pruning.
>
> I've also simplified header comments, and comments above the main
> entry point functions. These comments made much more sense back when
> lazy_scan_heap() was simpler, but wasn't yet broken up into smaller,
> better-scoped functions.
>
> If there are no objections, I'll move on this soon. It's mostly just
> mechanical changes.

-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_ITEMS 5
+#define PROGRESS_VACUUM_NUM_DEAD_ITEMS 6

Wouldn't this be more logical to change to DEAD_TIDS instead of DEAD_ITEMS?

+ /* Sorted list of TIDs to delete from indexes */
+ ItemPointerData dead[FLEXIBLE_ARRAY_MEMBER];

Instead of just dead, why not deadtid or deaditem?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Masahiko Sawada
Дата:
On Wed, Nov 24, 2021 at 2:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached patch performs polishing within vacuumlazy.c, as follow-up
> work to the refactoring work in Postgres 14. This mainly consists of
> changing references of dead tuples to dead items, which reflects the
> fact that VACUUM no longer deals with TIDs that might point to
> remaining heap tuples with storage -- the TIDs in the array must now
> strictly point to LP_DEAD stub line pointers that remain in the heap,
> following pruning.

+1

> If there are no objections, I'll move on this soon. It's mostly just
> mechanical changes.

The patch renames dead tuples to dead items at some places and  to
dead TIDs at some places. For instance, it renames dead tuples to dead
TIDs here:

- * Return the maximum number of dead tuples we can record.
+ * Computes the number of dead TIDs that VACUUM will have to store in the
+ * worst case, where all line pointers are allocated, and all are LP_DEAD

whereas renames to dead items here:

-         * extra cost of bsearch(), especially if dead tuples on the heap are
+         * extra cost of bsearch(), especially if dead items on the heap are

I think it's more consistent if we change it to one side. I prefer "dead items".

---
There is one more place where we can rename "dead tuples":

    /*
     * Allocate the space for dead tuples.  Note that this handles parallel
     * VACUUM initialization as part of allocating shared memory space used
     * for dead_items.
     */

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Robert Haas
Дата:
On Wed, Nov 24, 2021 at 7:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I think it's more consistent if we change it to one side. I prefer "dead items".

I feel like "items" is quite a generic word, so I think I would prefer
TIDs. But it's probably not a big deal.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Alvaro Herrera
Дата:
On 2021-Nov-24, Robert Haas wrote:

> On Wed, Nov 24, 2021 at 7:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I think it's more consistent if we change it to one side. I prefer
> > "dead items".
> 
> I feel like "items" is quite a generic word, so I think I would prefer
> TIDs. But it's probably not a big deal.

Is there clarity on what each term means? 

Since this patch only changes things that are specific to heap
vacuuming, it seems OK to rely the convention that "item" means "heap
item" (not just any generic item).  However, I'm not sure that we fully
agree exactly what a heap item is.  Maybe if we agree to a single non
ambiguous definition for each of those terms we can agree what
terminology to use.

It seems to me we have the following terms:

- tuple
- line pointer
- [heap] item
- TID

My mental model is that "tuple" (in the narrow context of heap vacuum)
is the variable-size on-disk representation of a row in a page; "line
pointer" is the fixed-size struct at the bottom of each page that
contains location, size and flags of a tuple: struct ItemIdData.  The
TID is the address of a line pointer -- an ItemPointerData.

What is an item?  Is an item the same as a line pointer?  That seems
confusing.  I think "item" means the tuple as a whole.  In that light,
using the term TID for some of the things that the patch renames to
"item" seems more appropriate.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Robert Haas
Дата:
On Wed, Nov 24, 2021 at 9:37 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> My mental model is that "tuple" (in the narrow context of heap vacuum)
> is the variable-size on-disk representation of a row in a page; "line
> pointer" is the fixed-size struct at the bottom of each page that
> contains location, size and flags of a tuple: struct ItemIdData.  The
> TID is the address of a line pointer -- an ItemPointerData.
>
> What is an item?  Is an item the same as a line pointer?  That seems
> confusing.  I think "item" means the tuple as a whole.  In that light,
> using the term TID for some of the things that the patch renames to
> "item" seems more appropriate.

Hmm. I think in my model an item and an item pointer and a line
pointer are all the same thing, but a TID is different. When I talk
about a TID, I mean the location of an item pointer, not its contents.
So a TID is what tells me that I want block 5 and the 4th slot in the
item pointer array. The item pointer tells me that the associate tuple
is at a certain position in the page and has a certain length.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Alvaro Herrera
Дата:
On 2021-Nov-24, Robert Haas wrote:

> Hmm. I think in my model an item and an item pointer and a line
> pointer are all the same thing, but a TID is different. When I talk
> about a TID, I mean the location of an item pointer, not its contents.
> So a TID is what tells me that I want block 5 and the 4th slot in the
> item pointer array. The item pointer tells me that the associate tuple
> is at a certain position in the page and has a certain length.

OK, but you can have item pointers that don't have any item.
LP_REDIRECT, LP_DEAD, LP_UNUSED item pointers don't have items.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Nunca confiaré en un traidor.  Ni siquiera si el traidor lo he creado yo"
(Barón Vladimir Harkonnen)



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Alvaro Herrera
Дата:
On 2021-Nov-24, Alvaro Herrera wrote:

> On 2021-Nov-24, Robert Haas wrote:
> 
> > Hmm. I think in my model an item and an item pointer and a line
> > pointer are all the same thing, but a TID is different. When I talk
> > about a TID, I mean the location of an item pointer, not its contents.
> > So a TID is what tells me that I want block 5 and the 4th slot in the
> > item pointer array. The item pointer tells me that the associate tuple
> > is at a certain position in the page and has a certain length.
> 
> OK, but you can have item pointers that don't have any item.
> LP_REDIRECT, LP_DEAD, LP_UNUSED item pointers don't have items.

Sorry to reply to myself, but I realized that I forgot to return to the
main point of this thread.  If we agree that "an LP_DEAD item pointer
does not point to any item" (an assertion that gives a precise meaning
to both those terms), then a patch that renames "tuples" to "items" is
not doing anything useful IMO, because those two terms are synonyms.

Now maybe Peter doesn't agree with the definitions I suggest, in which 
case I would like to know what his definitions are.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it."         (ncm, http://lwn.net/Articles/174769/)



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Robert Haas
Дата:
On Wed, Nov 24, 2021 at 9:51 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-Nov-24, Robert Haas wrote:
> > Hmm. I think in my model an item and an item pointer and a line
> > pointer are all the same thing, but a TID is different. When I talk
> > about a TID, I mean the location of an item pointer, not its contents.
> > So a TID is what tells me that I want block 5 and the 4th slot in the
> > item pointer array. The item pointer tells me that the associate tuple
> > is at a certain position in the page and has a certain length.
>
> OK, but you can have item pointers that don't have any item.
> LP_REDIRECT, LP_DEAD, LP_UNUSED item pointers don't have items.

I guess so. I said before that I thought an item and an item pointer
were the same, but on reflection, that doesn't entirely make sense.
But I don't know that I like making item and tuple synonymous either.
I think perhaps the term "item" by itself is not very clear.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Peter Geoghegan
Дата:
On Wed, Nov 24, 2021 at 7:16 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Sorry to reply to myself, but I realized that I forgot to return to the
> main point of this thread.  If we agree that "an LP_DEAD item pointer
> does not point to any item" (an assertion that gives a precise meaning
> to both those terms), then a patch that renames "tuples" to "items" is
> not doing anything useful IMO, because those two terms are synonyms.

TIDs (ItemPointerData structs) are of course not the same thing as
line pointers (ItemIdData structs). There is a tendency to refer to
the latter as "item pointers" all the same, which was confusing. I
personally corrected/normalized this in commit ae7291ac in 2019. I
think that it's worth being careful about precisely because they're
closely related (but distinct) concepts. And so FWIW "LP_DEAD item
pointer" is not a thing. I agree that an LP_DEAD item pointer has no
tuple storage, and so you could say that it points to nothing (though
only in heapam). I probably would just say that it has no tuple
storage, though.

> Now maybe Peter doesn't agree with the definitions I suggest, in which
> case I would like to know what his definitions are.

I agree with others that the term "item" is vague, but I don't think
that that's necessarily a bad thing here -- I deliberately changed the
comments to say either "TIDs" or "LP_DEAD items", emphasizing whatever
the important aspect seemed to be in each context (they're LP_DEAD
items to the heap structure, TIDs to index structures).

I'm not attached to the term "item". To me the truly important point
is what these items are *not*: they're not tuples. The renaming is
intended to enforce the concepts that I went into at the end of the
commit message for commit 8523492d. Now the pruning steps in
lazy_scan_prune always avoiding keeping around a DEAD tuple with tuple
storage on return to lazy_scan_heap (only LP_DEAD items can remain),
since (as of that commit) lazy_scan_prune alone is responsible for
things involving the "logical database".

This means that index vacuuming and heap vacuuming can now be thought
of as removing garbage items from physical data structures (they're
purely "physical database" concepts), and nothing else. They don't
need recovery conflicts. How could they? Where are you supposed to get
the XIDs for that from, when you've only got LP_DEAD items?

This is also related to the idea that pruning by VACUUM isn't
necessarily all that special compared to earlier pruning or concurrent
opportunistic pruning. As I go into on the other recent thread on
removing special cases in vacuumlazy.c, ISTM that we ought to do
everything except pruning itself (and freezing tuples, which
effectively depends on pruning) without even acquiring a cleanup lock.
Which is actually quite a lot of things.

-- 
Peter Geoghegan



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Alvaro Herrera
Дата:
On 2021-Nov-24, Peter Geoghegan wrote:

> TIDs (ItemPointerData structs) are of course not the same thing as
> line pointers (ItemIdData structs). There is a tendency to refer to
> the latter as "item pointers" all the same, which was confusing. I
> personally corrected/normalized this in commit ae7291ac in 2019. I
> think that it's worth being careful about precisely because they're
> closely related (but distinct) concepts. And so FWIW "LP_DEAD item
> pointer" is not a thing. I agree that an LP_DEAD item pointer has no
> tuple storage, and so you could say that it points to nothing (though
> only in heapam). I probably would just say that it has no tuple
> storage, though.

OK, this makes a lot more sense.  I wasn't aware of ae7291ac (and I
wasn't aware of the significance of 8523492d either, but that's not
really relevant here.)

> I agree with others that the term "item" is vague, but I don't think
> that that's necessarily a bad thing here -- I deliberately changed the
> comments to say either "TIDs" or "LP_DEAD items", emphasizing whatever
> the important aspect seemed to be in each context (they're LP_DEAD
> items to the heap structure, TIDs to index structures).

I think we could say "LP_DEAD line pointer" and that would be perfectly
clear.  Given how nuanced we have to be if we want to be clear about
this, I would rather not use "LP_DEAD item"; that seems slightly
contradictory, since the item is the storage and such a line pointer
does not have storage.  Perhaps change that define in progress.h to
PROGRESS_VACUUM_NUM_DEAD_LPS, and, in the first comment in vacuumlazy.c,
use wording such as

+ * The major space usage for LAZY VACUUM is storage for the array of TIDs
+ * of dead line pointers that are to be removed from indexes.

or

+ * The major space usage for LAZY VACUUM is storage for the array of TIDs
+ * of LP_DEAD line pointers that are to be removed from indexes.

(The point being that TIDs are not dead themselves, only the line
pointers that they refer to.)

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."                               (Nathaniel Smith)



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Peter Geoghegan
Дата:
On Wed, Nov 24, 2021 at 9:53 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> OK, this makes a lot more sense.  I wasn't aware of ae7291ac (and I
> wasn't aware of the significance of 8523492d either, but that's not
> really relevant here.)

Thanks for hearing me out about the significance of 8523492d.

Having the right formalisms seems to really matter here, because they
enable decoupling, which is generally very useful. This makes it easy
to understand (just for example) that index vacuuming and heap
vacuuming are just additive, optional steps (in principle) -- an idea
that will become even more important once we get Robert's pending TID
conveyor belt design. I believe that that design goes one step further
than what we have today, by making index vacuuming and heap vacuuming
occur in a distinct operation to VACUUM proper (VACUUM would only need
to set up the LP_DEAD item list for index vacuuming and heap
vacuuming, which may or may not happen immediately after).

An interesting question (at least to me) is: within a non-aggressive
VACUUM, what remaining steps are *not* technically optional?

I am pretty sure that they're all optional in principle (or will be
soon), because soon we will be able to independently advance
relfrozenxid without freezing all tuples with XIDs before our original
FreezeLimit (FreezeLimit should only be used to decide which tuples to
freeze, not to decide on a new relfrozenxid). Soon almost everything
will be decoupled, without changing the basic invariants that we've
had for many years. This flexibility seems really important to me.

That just leaves avoiding pruning without necessarily avoiding
ostensibly related processing for indexes. We can already
independently prune without doing index/heap vacuuming (the bypass
indexes optimization). We will also be able to do the opposite thing,
with my new patch: we can perform index/heap vacuuming *without*
pruning ourselves. This makes sense in the case where we cannot
acquire a cleanup lock on a heap page with preexisting LP_DEAD items.

> I think we could say "LP_DEAD line pointer" and that would be perfectly
> clear.  Given how nuanced we have to be if we want to be clear about
> this, I would rather not use "LP_DEAD item"; that seems slightly
> contradictory, since the item is the storage and such a line pointer
> does not have storage.  Perhaps change that define in progress.h to
> PROGRESS_VACUUM_NUM_DEAD_LPS, and, in the first comment in vacuumlazy.c,
> use wording such as

I agree with all that, I think. But it's still not clear what the
variable dead_tuples should be renamed to within the structure that
you lay out (I imagine that you agree with me that dead_tuples is now
a bad name). This one detail affects more individual lines of code
than the restructuring of comments.

-- 
Peter Geoghegan



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Peter Geoghegan
Дата:
On Wed, Nov 24, 2021 at 4:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> The patch renames dead tuples to dead items at some places and  to
> dead TIDs at some places.

> I think it's more consistent if we change it to one side. I prefer "dead items".

I just pushed a version of the patch that still uses both terms when
talking about dead_items. But the final commit actually makes it clear
why, in comments above the LVDeadItems struct itself: LVDeadItems is
used by both index vacuuming and heap vacuuming.

Thanks
-- 
Peter Geoghegan



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Masahiko Sawada
Дата:
On Tue, Nov 30, 2021 at 3:00 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Nov 24, 2021 at 4:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > The patch renames dead tuples to dead items at some places and  to
> > dead TIDs at some places.
>
> > I think it's more consistent if we change it to one side. I prefer "dead items".
>
> I just pushed a version of the patch that still uses both terms when
> talking about dead_items.

Thanks! I'll change my parallel vacuum refactoring patch accordingly.

Regarding the commit, I think that there still is one place in
lazyvacuum.c where we can change "dead tuples” to "dead items”:

    /*
     * Allocate the space for dead tuples.  Note that this handles parallel
     * VACUUM initialization as part of allocating shared memory space used
     * for dead_items.
     */
    dead_items_alloc(vacrel, params->nworkers);
    dead_items = vacrel->dead_items;

Also, the commit doesn't change both PROGRESS_VACUUM_MAX_DEAD_TUPLES
and PROGRESS_VACUUM_NUM_DEAD_TUPLES. Did you leave them on purpose?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Peter Geoghegan
Дата:
On Mon, Nov 29, 2021 at 7:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thanks! I'll change my parallel vacuum refactoring patch accordingly.

Thanks again for working on that.

> Regarding the commit, I think that there still is one place in
> lazyvacuum.c where we can change "dead tuples” to "dead items”:
>
>     /*
>      * Allocate the space for dead tuples.  Note that this handles parallel
>      * VACUUM initialization as part of allocating shared memory space used
>      * for dead_items.
>      */
>     dead_items_alloc(vacrel, params->nworkers);
>     dead_items = vacrel->dead_items;

Oops. Pushed a fixup for that just now.

> Also, the commit doesn't change both PROGRESS_VACUUM_MAX_DEAD_TUPLES
> and PROGRESS_VACUUM_NUM_DEAD_TUPLES. Did you leave them on purpose?

That was deliberate.

It would be a bit strange to alter these constants without also
updating the corresponding column names for the
pg_stat_progress_vacuum system view. But if I kept the definition from
system_views.sql in sync, then I would break user scripts -- for
reasons that users don't care about. That didn't seem like the right
approach.

Also, the system as a whole still assumes "DEAD tuples and LP_DEAD
items are the same, and are just as much of a problem in the table as
they are in each index". As you know, this is not really true, which
is an important problem for us. Fixing it (perhaps as part of adding
something like Robert's conveyor belt design) will likely require
revising this model quite fundamentally (e.g, the vacthresh
calculation in autovacuum.c:relation_needs_vacanalyze() would be
replaced). When this happens, we'll probably need to update system
views that have columns with names like "dead_tuples" -- because maybe
we no longer specifically count dead items/tuples at all. I strongly
suspect that the approach to statistics that we take for pg_statistic
optimizer stats just doesn't work for dead items/tuples -- statistical
sampling only produces useful statistics for the optimizer because
certain delicate assumptions are met (even these assumptions only
really work with a properly normalized database schema).

Maybe revising the model used for autovacuum scheduling wouldn't
include changing pg_stat_progress_vacuum, since that isn't technically
"part of the model" --- I'm not sure. But it's not something that I am
in a hurry to fix.

--
Peter Geoghegan



Re: Rename dead_tuples to dead_items in vacuumlazy.c

От
Masahiko Sawada
Дата:
On Wed, Dec 1, 2021 at 4:42 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Nov 29, 2021 at 7:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Thanks! I'll change my parallel vacuum refactoring patch accordingly.
>
> Thanks again for working on that.
>
> > Regarding the commit, I think that there still is one place in
> > lazyvacuum.c where we can change "dead tuples” to "dead items”:
> >
> >     /*
> >      * Allocate the space for dead tuples.  Note that this handles parallel
> >      * VACUUM initialization as part of allocating shared memory space used
> >      * for dead_items.
> >      */
> >     dead_items_alloc(vacrel, params->nworkers);
> >     dead_items = vacrel->dead_items;
>
> Oops. Pushed a fixup for that just now.

Thanks!

>
> > Also, the commit doesn't change both PROGRESS_VACUUM_MAX_DEAD_TUPLES
> > and PROGRESS_VACUUM_NUM_DEAD_TUPLES. Did you leave them on purpose?
>
> That was deliberate.
>
> It would be a bit strange to alter these constants without also
> updating the corresponding column names for the
> pg_stat_progress_vacuum system view. But if I kept the definition from
> system_views.sql in sync, then I would break user scripts -- for
> reasons that users don't care about. That didn't seem like the right
> approach.

Agreed.

>
> Also, the system as a whole still assumes "DEAD tuples and LP_DEAD
> items are the same, and are just as much of a problem in the table as
> they are in each index". As you know, this is not really true, which
> is an important problem for us. Fixing it (perhaps as part of adding
> something like Robert's conveyor belt design) will likely require
> revising this model quite fundamentally (e.g, the vacthresh
> calculation in autovacuum.c:relation_needs_vacanalyze() would be
> replaced). When this happens, we'll probably need to update system
> views that have columns with names like "dead_tuples" -- because maybe
> we no longer specifically count dead items/tuples at all. I strongly
> suspect that the approach to statistics that we take for pg_statistic
> optimizer stats just doesn't work for dead items/tuples -- statistical
> sampling only produces useful statistics for the optimizer because
> certain delicate assumptions are met (even these assumptions only
> really work with a properly normalized database schema).
>
> Maybe revising the model used for autovacuum scheduling wouldn't
> include changing pg_stat_progress_vacuum, since that isn't technically
> "part of the model" --- I'm not sure. But it's not something that I am
> in a hurry to fix.

Understood.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/