Thread: relfilenode statistics

relfilenode statistics

From

Bertrand Drouvot

Date:

May 25, 2024, 07:52:02

Hi hackers,

Please find attached a POC patch to implement $SUBJECT.

Adding relfilenode statistics has been proposed in [1]. The idea is to allow
tracking dirtied blocks, written blocks, etc. on a per-relation basis.

The attached patch is not in a fully "polished" state yet: there are more places
where we should add relfilenode counters, more APIs to create to retrieve the
relfilenode stats, and so on.

But I think that it is in a state that can be used to discuss the approach it
is implementing (so that we can agree or not on it) before moving forward.

The approach that is implemented in this patch is the following:

- A new PGSTAT_KIND_RELFILENODE is added
- A new attribute (aka relfile) has been added to PgStat_HashKey so that we
can record (dboid, spcOid and relfile) to identify a relfilenode entry
- pgstat_create_transactional() is used in RelationCreateStorage()
- pgstat_drop_transactional() is used in RelationDropStorage()
- RelationPreserveStorage() will remove the entry from the list of dropped stats
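
To make the key layout in the list above concrete, here is a minimal C sketch of a stats hash key extended with a relfilenode component. It is not the patch's code: the Toy* names are hypothetical, the typedefs are stand-ins for PostgreSQL's Oid and RelFileNumber, and storing the tablespace OID in the object-ID slot is an assumption about how (dboid, spcOid, relfile) maps onto the key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;            /* stand-in for PostgreSQL's Oid */
typedef uint32_t RelFileNumber;  /* stand-in for PostgreSQL's RelFileNumber */

/* Hypothetical stats kinds; only the relfilenode kind uses "relfile". */
typedef enum ToyStatsKind
{
    TOY_KIND_RELATION,
    TOY_KIND_RELFILENODE
} ToyStatsKind;

/* Hypothetical hash key: one extra field compared to (kind, dboid, objoid). */
typedef struct ToyHashKey
{
    ToyStatsKind  kind;
    Oid           dboid;
    Oid           objoid;   /* assumed: tablespace OID for relfilenode entries */
    RelFileNumber relfile;  /* 0 for kinds that do not track a relfilenode */
} ToyHashKey;

/* Build the key identifying one relfilenode entry. */
static ToyHashKey
toy_relfilenode_key(Oid dboid, Oid spcoid, RelFileNumber relfile)
{
    ToyHashKey  key;

    assert(relfile != 0);   /* a relfilenode entry needs a valid relfile */
    key.kind = TOY_KIND_RELFILENODE;
    key.dboid = dboid;
    key.objoid = spcoid;
    key.relfile = relfile;
    return key;
}

/* Field-by-field comparison, so struct padding never matters. */
static bool
toy_key_equal(const ToyHashKey *a, const ToyHashKey *b)
{
    return a->kind == b->kind &&
           a->dboid == b->dboid &&
           a->objoid == b->objoid &&
           a->relfile == b->relfile;
}
```

Two keys built from the same (dboid, spcOid, relfile) triple compare equal, while a rewrite that assigns a new relfile yields a different key, which is why the entry has to be created and dropped together with the storage.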

The current approach to deal with table rewrite is to:

- copy the relfilenode stats in table_relation_set_new_filelocator() from
the relfilenode stats entry to the shared table stats entry
- in the pg_statio_all_tables view: add the table stats entry (which, due to the
above, contains the stats of the "previous" relfilenodes that were linked to this
relation) to the stats of the current relfilenode linked to the relation
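
The rewrite handling described above can be modeled in a few lines of C. This is a toy sketch under stated assumptions, not the patch: ToyTableIo and the toy_* functions are hypothetical names, one struct stands in for both the shared table entry and the relfilenode entry, and the figure of 45 blocks per checkpoint is taken from the psql session in this message.

```c
#include <assert.h>
#include <stdint.h>

typedef struct ToyTableIo
{
    uint64_t table_blocks_written;        /* folded in from previous relfilenodes */
    uint64_t relfilenode_blocks_written;  /* current relfilenode only */
} ToyTableIo;

/* A checkpoint (or any buffer write-out) flushes nblocks of this relation. */
static void
toy_write_out(ToyTableIo *io, uint64_t nblocks)
{
    io->relfilenode_blocks_written += nblocks;
}

/* Rewrite (e.g. TRUNCATE): fold the old relfilenode's count into the table entry. */
static void
toy_set_new_filelocator(ToyTableIo *io)
{
    io->table_blocks_written += io->relfilenode_blocks_written;
    io->relfilenode_blocks_written = 0;   /* the new relfilenode starts empty */
}

/* What the view would report: previous relfilenodes plus the current one. */
static uint64_t
toy_heap_blks_written(const ToyTableIo *io)
{
    return io->table_blocks_written + io->relfilenode_blocks_written;
}

/* Replays the psql session from this message in the toy model. */
static void
toy_replay_session(void)
{
    ToyTableIo  io = {0, 0};

    assert(toy_heap_blks_written(&io) == 0);    /* after CREATE TABLE */
    toy_write_out(&io, 45);                     /* INSERT, then CHECKPOINT */
    assert(toy_heap_blks_written(&io) == 45);
    toy_set_new_filelocator(&io);               /* TRUNCATE */
    assert(toy_heap_blks_written(&io) == 45);   /* the count survives */
    toy_write_out(&io, 45);                     /* INSERT, then CHECKPOINT */
    assert(toy_heap_blks_written(&io) == 90);
}
```

The model only counts at write-out time, which matches the session: the counter does not move after an INSERT until a CHECKPOINT flushes the dirty buffers.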

The attached patch does this for the new heap_blks_written field in
pg_statio_all_tables. The outcome is:

"
postgres=# create table bdt (a int);
CREATE TABLE
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                 0
(1 row)

postgres=# insert into bdt select generate_series(1,10000);
INSERT 0 10000
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                 0
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                45
(1 row)

postgres=# truncate table bdt;
TRUNCATE TABLE
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                45
(1 row)

postgres=# insert into bdt select generate_series(1,10000);
INSERT 0 10000
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                45
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                90
(1 row)
"

Some remarks:

- My first attempt was to call pgstat_create_transactional() and
pgstat_drop_transactional() in the same places where this is done for relations,
but that did not work well (mainly due to corner cases during rewrites).

- Please ignore the pgstat_count_buffer_read() and pgstat_count_buffer_hit()
calls in pgstat_report_relfilenode_buffer_read() and
pgstat_report_relfilenode_buffer_hit() for now. Those stats will follow the same
flow as the one explained above for the new heap_blks_written one (should we
agree on it).

Looking forward to your comments and feedback.

Regards,

[1]: https://www.postgresql.org/message-id/20231113204439.r4lmys73tessqmak%40awork3.anarazel.de

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

v1-0001-Provide-relfilenode-statistics.patch

Re: relfilenode statistics

From

Robert Haas

Date:

May 27, 2024, 13:10:13

Hi Bertrand,

It would be helpful to me if the reasons why we're splitting out
relfilenodestats could be more clearly spelled out. I see Andres's
comment in the thread to which you linked, but it's pretty vague about
why we should do this ("it's not nice") and whether we should do this
("I wonder if this is an argument for") and maybe that's all fine if
Andres is going to be the one to review and commit this, but even
then it would be nice if the rest of us could follow along from home,
and right now I can't.

The commit message is often a good place to spell this kind of thing
out, because then it's included with every version of the patch you
post, and may be of some use to the eventual committer in writing
their commit message. The body of the email where you post the patch
set can be fine, too.

...Robert

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

June 3, 2024, 11:11:46

Hi Robert,

On Mon, May 27, 2024 at 09:10:13AM -0400, Robert Haas wrote:
> Hi Bertrand,
> 
> It would be helpful to me if the reasons why we're splitting out
> relfilenodestats could be more clearly spelled out. I see Andres's
> comment in the thread to which you linked, but it's pretty vague about
> why we should do this ("it's not nice") and whether we should do this
> ("I wonder if this is an argument for") and maybe that's all fine if
> Andres is going to be the one to review and commit this, but even
> then it would be nice if the rest of us could follow along from home,
> and right now I can't.

Thanks for the feedback! 

You’re completely right, my previous message is missing a clear explanation as to
why I think that relfilenode stats could be useful. Let me try to fix this.

The main argument is that we currently don’t have writes counters for relations.
The reason is that we don’t have the relation OID when writing buffers out.
Tracking writes per relfilenode would allow us to track/consolidate writes per
relation (example in the v1 patch and in the message up-thread).
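
As an illustration of why the relfilenode is the natural key on the write path, here is a toy C sketch. It is only loosely modeled on the buffer manager: ToyBufferTag, ToyRelfileStats and the toy_* functions are hypothetical, and a small fixed-size array stands in for the shared stats hash table. The point is that the tag carried by a buffer identifies a file, not a relation OID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;            /* stand-in for PostgreSQL's Oid */
typedef uint32_t RelFileNumber;  /* stand-in for PostgreSQL's RelFileNumber */

/* What the write-out path knows about a buffer: a file identity, no relation OID. */
typedef struct ToyBufferTag
{
    Oid           spcoid;
    Oid           dboid;
    RelFileNumber relnumber;
    uint32_t      blocknum;
} ToyBufferTag;

typedef struct ToyRelfileEntry
{
    int           used;
    Oid           spcoid;
    Oid           dboid;
    RelFileNumber relnumber;
    uint64_t      blocks_written;
} ToyRelfileEntry;

#define TOY_MAX_ENTRIES 16

typedef struct ToyRelfileStats
{
    ToyRelfileEntry entries[TOY_MAX_ENTRIES];
} ToyRelfileStats;

/* Linear lookup by (spcoid, dboid, relnumber); optionally claims a free slot. */
static ToyRelfileEntry *
toy_lookup(ToyRelfileStats *stats, Oid spcoid, Oid dboid,
           RelFileNumber relnumber, int create)
{
    ToyRelfileEntry *free_slot = NULL;
    size_t      i;

    for (i = 0; i < TOY_MAX_ENTRIES; i++)
    {
        ToyRelfileEntry *e = &stats->entries[i];

        if (e->used && e->spcoid == spcoid && e->dboid == dboid &&
            e->relnumber == relnumber)
            return e;
        if (!e->used && free_slot == NULL)
            free_slot = e;
    }
    if (!create || free_slot == NULL)
        return NULL;
    free_slot->used = 1;
    free_slot->spcoid = spcoid;
    free_slot->dboid = dboid;
    free_slot->relnumber = relnumber;
    free_slot->blocks_written = 0;
    return free_slot;
}

/* Called when a buffer is written out: only the tag is available. */
static void
toy_count_buffer_write(ToyRelfileStats *stats, const ToyBufferTag *tag)
{
    ToyRelfileEntry *e = toy_lookup(stats, tag->spcoid, tag->dboid,
                                    tag->relnumber, 1);

    assert(e != NULL);          /* the toy table is full otherwise */
    e->blocks_written++;
}

/* Reader side: blocks written for one relfilenode, 0 if never seen. */
static uint64_t
toy_blocks_written(ToyRelfileStats *stats, Oid spcoid, Oid dboid,
                   RelFileNumber relnumber)
{
    ToyRelfileEntry *e = toy_lookup(stats, spcoid, dboid, relnumber, 0);

    return e ? e->blocks_written : 0;
}
```

A reader that knows the relation can then map it to its current (tablespace, database, relfilenode) triple and fetch the count, which is how per-relation write counters can be consolidated from per-relfilenode ones.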

I think that adding instrumentation in this area (writes counters) could be
beneficial (like it is for the ones we currently have for reads).

Second argument is that this is also beneficial for the "Split index and
table statistics into different types of stats" thread (mentioned in the previous
message). It would allow us to avoid additional branches in some situations (like
the one mentioned by Andres in the link I provided up-thread).

If we agree that the main argument makes sense to think about having relfilenode
stats then I think using them as proposed in the second argument makes sense too:

We’d move the current buffer read and buffer hit counters from the relation stats
to the relfilenode stats (while still being able to retrieve them from the 
pg_statio_all_tables/indexes views: see the example for the new heap_blks_written
stat added in the patch). Generally speaking, I think that tracking counters at
a common level (i.e relfilenode level instead of table or index level) is
beneficial (avoid storing/allocating space for the same counters in multiple
structs) and sounds more intuitive to me.

Also, I think this opens the door to new ideas: for example, with relfilenode
statistics in place, we could probably also start thinking about tracking
checksum errors per relfilenode.

> The commit message is often a good place to spell this kind of thing
> out, because then it's included with every version of the patch you
> post, and may be of some use to the eventual committer in writing
> their commit message. The body of the email where you post the patch
> set can be fine, too.
> 

Yeah, I’ll update the commit message in V2 with better explanations once I get
feedback on V1 (should we decide to move on with the relfilenode stats idea).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Robert Haas

Date:

June 4, 2024, 13:26:27

On Mon, Jun 3, 2024 at 7:11 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> The main argument is that we currently don’t have writes counters for relations.
> The reason is that we don’t have the relation OID when writing buffers out.

OK.

> Second argument is that this is also beneficial for the "Split index and
> table statistics into different types of stats" thread (mentioned in the previous
> message). It would allow us to avoid additional branches in some situations (like
> the one mentioned by Andres in the link I provided up-thread).

OK.

> We’d move the current buffer read and buffer hit counters from the relation stats
> to the relfilenode stats (while still being able to retrieve them from the
> pg_statio_all_tables/indexes views: see the example for the new heap_blks_written
> stat added in the patch). Generally speaking, I think that tracking counters at
> a common level (i.e relfilenode level instead of table or index level) is
> beneficial (avoid storing/allocating space for the same counters in multiple
> structs) and sounds more intuitive to me.

Hmm. So if I CLUSTER or VACUUM FULL the relation, the relfilenode
changes. Does that mean I lose all of those stats? Is that a problem?
Or is it good? Or what?

I also thought about the other direction. Suppose I drop a
relation and create a new one that gets a different relation OID but
the same relfilenode. But I don't think that's a problem: dropping the
relation should forcibly remove the old stats, so there won't be any
conflict in this case.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

June 5, 2024, 05:52:33

On Tue, Jun 04, 2024 at 09:26:27AM -0400, Robert Haas wrote:
> On Mon, Jun 3, 2024 at 7:11 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > We’d move the current buffer read and buffer hit counters from the relation stats
> > to the relfilenode stats (while still being able to retrieve them from the
> > pg_statio_all_tables/indexes views: see the example for the new heap_blks_written
> > stat added in the patch). Generally speaking, I think that tracking counters at
> > a common level (i.e relfilenode level instead of table or index level) is
> > beneficial (avoid storing/allocating space for the same counters in multiple
> > structs) and sounds more intuitive to me.
> 
> Hmm. So if I CLUSTER or VACUUM FULL the relation, the relfilenode
> changes. Does that mean I lose all of those stats? Is that a problem?
> Or is it good? Or what?

I think we should keep the stats in the relation during relfilenode changes.
As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in
table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can
see in the example provided up-thread the new heap_blks_written statistic has
been preserved during the TRUNCATE. 

Please note that the v1 POC only takes care of the new heap_blks_written stat and
that the logic used in table_relation_set_new_filelocator() would probably need
to be applied in rebuild_relation() or such to deal with CLUSTER or VACUUM FULL.

For the relation, the new counter "blocks_written" has been added to the
PgStat_StatTabEntry struct (it's not needed in the PgStat_TableCounts one as the
relfilenode stat takes care of it). It's added in PgStat_StatTabEntry only
to copy/preserve the relfilenode stats during rewrite operations and to retrieve
the stats in pg_statio_all_tables.

Then, if later we split the relation stats to index/table stats, we'd have
blocks_written defined in fewer structs (as compared to doing the split without
relfilenode stats in place).

As mentioned up-thread, the new logic has been implemented in v1 only for the
new blocks_written stat (we'd need to do the same for the existing buffer read /
buffer hit if we agree on the approach implemented in v1).

> I also thought about the other direction. Suppose I drop a
> relation and create a new one that gets a different relation OID but
> the same relfilenode. But I don't think that's a problem: dropping the
> relation should forcibly remove the old stats, so there won't be any
> conflict in this case.

Yeah.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

June 6, 2024, 07:02:41

Hi,

On Mon, Jun 03, 2024 at 11:11:46AM +0000, Bertrand Drouvot wrote:
> Yeah, I’ll update the commit message in V2 with better explanations once I get
> feedback on V1 (should we decide to move on with the relfilenode stats idea).
> 

Please find attached v2, a mandatory rebase due to cd312adc56. In passing, it
provides a more detailed commit message (also making clear that the goal of this
patch is to start the discussion and agree on the design before moving forward).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

v2-0001-Provide-relfilenode-statistics.patch

Re: relfilenode statistics

From

Robert Haas

Date:

June 6, 2024, 16:27:49

On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> I think we should keep the stats in the relation during relfilenode changes.
> As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in
> table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can
> see in the example provided up-thread the new heap_blks_written statistic has
> been preserved during the TRUNCATE.

Yeah, I think there's something weird about this design. Somehow we're
ending up with both per-relation and per-relfilenode counters:

+                       pg_stat_get_blocks_written(C.oid) +
pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN
C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END,
C.relfilenode) AS heap_blks_written,

I'll defer to Andres if he thinks that's awesome, but to me it does
not seem right to track some blocks written in a per-relation counter
and others in a per-relfilenode counter.
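
For readers following the CASE expression in the quoted view definition: pg_class.reltablespace is 0 when a relation lives in its database's default tablespace, so the tablespace part of the relfilenode key has to fall back to pg_database.dattablespace. A minimal C sketch of that fallback, with a hypothetical function name and a stand-in Oid typedef:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Oid;   /* stand-in for PostgreSQL's Oid */

/*
 * Mirrors the SQL in the view:
 *   CASE WHEN C.reltablespace <> 0 THEN C.reltablespace
 *        ELSE d.dattablespace END
 */
static Oid
toy_effective_tablespace(Oid reltablespace, Oid dattablespace)
{
    assert(dattablespace != 0);     /* a database always has a default tablespace */
    return reltablespace != 0 ? reltablespace : dattablespace;
}
```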

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: relfilenode statistics

From

Andres Freund

Date:

June 7, 2024, 03:17:36

Hi,

On 2024-06-06 12:27:49 -0400, Robert Haas wrote:
> On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > I think we should keep the stats in the relation during relfilenode changes.
> > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in
> > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can
> > see in the example provided up-thread the new heap_blks_written statistic has
> > been preserved during the TRUNCATE.
>
> Yeah, I think there's something weird about this design. Somehow we're
> ending up with both per-relation and per-relfilenode counters:
>
> +                       pg_stat_get_blocks_written(C.oid) +
> pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN
> C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END,
> C.relfilenode) AS heap_blks_written,
>
> I'll defer to Andres if he thinks that's awesome, but to me it does
> not seem right to track some blocks written in a per-relation counter
> and others in a per-relfilenode counter.

It doesn't immediately sound awesome. Nor really necessary?

If we just want to keep prior stats upon a relation rewrite, we can just copy
the stats from the old relfilenode.  Or we can decide that those stats don't
really make sense anymore, and start from scratch.

I *guess* I could see an occasional benefit in having both counters for "prior
relfilenodes" and "current relfilenode" - except that stats get reset manually
and upon crash anyway, making this less useful than if it were really
"lifetime" stats.

Greetings,

Andres Freund

Re: relfilenode statistics

From

Andres Freund

Date:

June 7, 2024, 03:38:06

Hi,

On 2024-06-03 11:11:46 +0000, Bertrand Drouvot wrote:
> The main argument is that we currently don’t have writes counters for relations.
> The reason is that we don’t have the relation OID when writing buffers out.
> Tracking writes per relfilenode would allow us to track/consolidate writes per
> relation (example in the v1 patch and in the message up-thread).
> 
> I think that adding instrumentation in this area (writes counters) could be
> beneficial (like it is for the ones we currently have for reads).
> 
> Second argument is that this is also beneficial for the "Split index and
> table statistics into different types of stats" thread (mentioned in the previous
> message). It would allow us to avoid additional branches in some situations (like
> the one mentioned by Andres in the link I provided up-thread).

I think there's another *very* significant benefit:

Right now physical replication doesn't populate statistics fields like
n_dead_tup, which can be a huge issue after failovers, because there's little
information about what autovacuum needs to do.

Auto-analyze *partially* can fix it at times, if it's lucky enough to see
enough dead tuples - but that's not a given and even if it works, is often
wildly inaccurate.

Once we put things like n_dead_tup into per-relfilenode stats, we can populate
them during WAL replay. Thus after a promotion autovacuum has much better
data.

This also is important when we crash: We've been talking about storing a
snapshot of the stats alongside each REDO pointer. Combined with updating
stats during crash recovery, we'll have accurate dead-tuple stats once recovery
has finished.

Greetings,

Andres Freund

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

June 7, 2024, 09:00:51

Hi,

On Thu, Jun 06, 2024 at 08:38:06PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-06-03 11:11:46 +0000, Bertrand Drouvot wrote:
> > The main argument is that we currently don’t have writes counters for relations.
> > The reason is that we don’t have the relation OID when writing buffers out.
> > Tracking writes per relfilenode would allow us to track/consolidate writes per
> > relation (example in the v1 patch and in the message up-thread).
> > 
> > I think that adding instrumentation in this area (writes counters) could be
> > beneficial (like it is for the ones we currently have for reads).
> > 
> > Second argument is that this is also beneficial for the "Split index and
> > table statistics into different types of stats" thread (mentioned in the previous
> > message). It would allow us to avoid additional branches in some situations (like
> > the one mentioned by Andres in the link I provided up-thread).
> 
> I think there's another *very* significant benefit:
> 
> Right now physical replication doesn't populate statistics fields like
> n_dead_tup, which can be a huge issue after failovers, because there's little
> information about what autovacuum needs to do.
> 
> Auto-analyze *partially* can fix it at times, if it's lucky enough to see
> enough dead tuples - but that's not a given and even if it works, is often
> wildly inaccurate.
> 
> 
> Once we put things like n_dead_tup into per-relfilenode stats,

Hm - I had in mind to populate relfilenode stats only with stats that are
somehow related to I/O activities. Which ones do you have in mind to put in 
relfilenode stats?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

June 7, 2024, 09:24:33

Hi,

On Thu, Jun 06, 2024 at 08:17:36PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-06-06 12:27:49 -0400, Robert Haas wrote:
> > On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> > > I think we should keep the stats in the relation during relfilenode changes.
> > > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in
> > > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can
> > > see in the example provided up-thread the new heap_blks_written statistic has
> > > been preserved during the TRUNCATE.
> >
> > Yeah, I think there's something weird about this design. Somehow we're
> > ending up with both per-relation and per-relfilenode counters:
> >
> > +                       pg_stat_get_blocks_written(C.oid) +
> > pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN
> > C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END,
> > C.relfilenode) AS heap_blks_written,
> >
> > I'll defer to Andres if he thinks that's awesome, but to me it does
> > not seem right to track some blocks written in a per-relation counter
> > and others in a per-relfilenode counter.
> 
> It doesn't immediately sound awesome. Nor really necessary?
> 
> If we just want to keep prior stats upon a relation rewrite, we can just copy
> the stats from the old relfilenode.

Agree, that's another option. But I think that would be in another field like
"cumulative_XXX" to ensure one could still retrieve stats that are "dedicated"
to this particular "new" relfilenode. Thoughts?

> Or we can decide that those stats don't
> really make sense anymore, and start from scratch.
> 
> 
> I *guess* I could see an occasional benefit in having both counters for "prior
> relfilenodes" and "current relfilenode" - except that stats get reset manually
> and upon crash anyway, making this less useful than if it were really
> "lifetime" stats.

Right but currently they are not lost during a relation rewrite. If we decide to
not keep the relfilenode stats during a rewrite then things like heap_blks_read
would stop surviving a rewrite (if we move it to relfilenode stats) while it
currently does. 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Robert Haas

Date:

June 7, 2024, 13:24:41

On Thu, Jun 6, 2024 at 11:17 PM Andres Freund <andres@anarazel.de> wrote:
> If we just want to keep prior stats upon a relation rewrite, we can just copy
> the stats from the old relfilenode.  Or we can decide that those stats don't
> really make sense anymore, and start from scratch.

I think we need to think carefully about what we want the user
experience to be here. "Per-relfilenode stats" could mean "sometimes I
don't know the relation OID so I want to use the relfilenumber
instead, without changing the user experience" or it could mean "some
of these stats actually properly pertain to the relfilenode rather
than the relation so I want to associate them with the right object
and that will affect how the user sees things." We need to decide
which it is. If it's the former, then we need to examine whether the
goal of hiding the distinction between relfilenode stats and relation
stats from the user is in fact feasible. If it's the latter, then we
need to make sure the whole patch reflects that design, which would
include e.g. NOT copying stats from the old to the new relfilenode,
and which would also include documenting the behavior in a way that
will be understandable to users.

In my experience, the worst thing you can do in cases like this is be
somewhere in the middle. Then you tend to end up with stuff like: the
difference isn't supposed to be something that the user knows or cares
about, except that they do have to know and care because you haven't
thoroughly covered up the deception, and often they have to reverse
engineer the behavior because you didn't document what was really
happening because you imagined that they wouldn't notice.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

June 10, 2024, 08:09:56

Hi,

On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 11:17 PM Andres Freund <andres@anarazel.de> wrote:
> > If we just want to keep prior stats upon a relation rewrite, we can just copy
> > the stats from the old relfilenode.  Or we can decide that those stats don't
> > really make sense anymore, and start from scratch.
> 
> I think we need to think carefully about what we want the user
> experience to be here. "Per-relfilenode stats" could mean "sometimes I
> don't know the relation OID so I want to use the relfilenumber
> instead, without changing the user experience" or it could mean "some
> of these stats actually properly pertain to the relfilenode rather
> than the relation so I want to associate them with the right object
> and that will affect how the user sees things." We need to decide
> which it is. If it's the former, then we need to examine whether the
> goal of hiding the distinction between relfilenode stats and relation
> stats from the user is in fact feasible. If it's the latter, then we
> need to make sure the whole patch reflects that design, which would
> include e.g. NOT copying stats from the old to the new relfilenode,
> and which would also include documenting the behavior in a way that
> will be understandable to users.

Thanks for sharing your thoughts!

Let's take the current heap_blks_read as an example: it currently survives
a relation rewrite and I guess we don't want to change the existing user
experience for it.

Now say we want to add "heap_blks_written" (like in this POC patch) then I think
that it makes sense for the user to 1) query this new stat from the same place
as the existing heap_blks_read: from pg_statio_all_tables and 2) to have the same
experience as far as the relation rewrite is concerned (keep the previous stats).

To achieve the rewrite behavior we could:

1) copy the stats from the OLD relfilenode to the relation (like in the POC patch)
2) copy the stats from the OLD relfilenode to the NEW one (could be in a dedicated
field)

The pros of 1) are that the behavior is consistent with the current heap_blks_read
and that users could still see the current relfilenode stats (through a new API)
if they want to.
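
For comparison, option 2) can be sketched as follows. This is a toy C model, not code from the patch: ToyRelfileIo and the toy_* functions are hypothetical, and the cumulative_blocks_written field is an illustrative instance of the "cumulative_XXX" idea mentioned up-thread. The old relfilenode entry seeds a dedicated field of the new one, so the per-relfilenode counter still starts at zero after a rewrite.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyRelfileIo
{
    uint64_t blocks_written;             /* this relfilenode only */
    uint64_t cumulative_blocks_written;  /* inherited from earlier relfilenodes */
} ToyRelfileIo;

/* Option 2: a rewrite creates the new entry and seeds it from the old one. */
static ToyRelfileIo
toy_rewrite(const ToyRelfileIo *old_entry)
{
    ToyRelfileIo new_entry;

    assert(old_entry != NULL);
    new_entry.blocks_written = 0;
    new_entry.cumulative_blocks_written =
        old_entry->cumulative_blocks_written + old_entry->blocks_written;
    return new_entry;
}

/* Rewrite-surviving total, as a table-level view might expose it. */
static uint64_t
toy_total_blocks_written(const ToyRelfileIo *entry)
{
    return entry->cumulative_blocks_written + entry->blocks_written;
}
```

Both options report the same rewrite-surviving total; they differ in where the carried-over count lives (the relation entry for option 1, the new relfilenode entry for option 2).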

> In my experience, the worst thing you can do in cases like this is be
> somewhere in the middle. Then you tend to end up with stuff like: the
> difference isn't supposed to be something that the user knows or cares
> about, except that they do have to know and care because you haven't
> thoroughly covered up the deception, and often they have to reverse
> engineer the behavior because you didn't document what was really
> happening because you imagined that they wouldn't notice.

My idea was to move all that is in pg_statio_all_tables to relfilenode stats
and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) ensure
the user can still retrieve the stats from pg_statio_all_tables in such a way
that it survives a rewrite, 3) provide dedicated APIs to retrieve
relfilenode stats but only for the current relfilenode, 4) document this
behavior. This is what the POC patch is doing for heap_blks_written (would
need to do the same for heap_blks_read and friends) except for the documentation
part. What do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Kyotaro Horiguchi

Date:

June 11, 2024, 06:35:23

At Mon, 10 Jun 2024 08:09:56 +0000, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote in 
> Hi,
> 
> On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote:
> > On Thu, Jun 6, 2024 at 11:17 PM Andres Freund <andres@anarazel.de> wrote:
> > > If we just want to keep prior stats upon a relation rewrite, we can just copy
> > > the stats from the old relfilenode.  Or we can decide that those stats don't
> > > really make sense anymore, and start from scratch.
> > 
> > I think we need to think carefully about what we want the user
> > experience to be here. "Per-relfilenode stats" could mean "sometimes I
> > don't know the relation OID so I want to use the relfilenumber
> > instead, without changing the user experience" or it could mean "some
> > of these stats actually properly pertain to the relfilenode rather
> > than the relation so I want to associate them with the right object
> > and that will affect how the user sees things." We need to decide
> > which it is. If it's the former, then we need to examine whether the
> > goal of hiding the distinction between relfilenode stats and relation
> > stats from the user is in fact feasible. If it's the latter, then we
> > need to make sure the whole patch reflects that design, which would
> > include e.g. NOT copying stats from the old to the new relfilenode,
> > and which would also include documenting the behavior in a way that
> > will be understandable to users.
> 
> Thanks for sharing your thoughts!
> 
> Let's take the current heap_blks_read as an example: it currently survives
> a relation rewrite and I guess we don't want to change the existing user
> experience for it.
> 
> Now say we want to add "heap_blks_written" (like in this POC patch) then I think
> that it makes sense for the user to 1) query this new stat from the same place
> as the existing heap_blks_read: from pg_statio_all_tables and 2) to have the same
> experience as far as the relation rewrite is concerned (keep the previous stats).
> 
> To achieve the rewrite behavior we could:
> 
> 1) copy the stats from the OLD relfilenode to the relation (like in the POC patch)
> 2) copy the stats from the OLD relfilenode to the NEW one (could be in a dedicated
> field)
> 
> The pros of 1) are that the behavior is consistent with the current heap_blks_read
> and that users could still see the current relfilenode stats (through a new API)
> if they want to.
> 
> > In my experience, the worst thing you can do in cases like this is be
> > somewhere in the middle. Then you tend to end up with stuff like: the
> > difference isn't supposed to be something that the user knows or cares
> > about, except that they do have to know and care because you haven't
> > thoroughly covered up the deception, and often they have to reverse
> > engineer the behavior because you didn't document what was really
> > happening because you imagined that they wouldn't notice.
> 
> My idea was to move all that is in pg_statio_all_tables to relfilenode stats
> and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) ensure
> the user can still retrieve the stats from pg_statio_all_tables in such a way
> that it survives a rewrite, 3) provide dedicated APIs to retrieve
> relfilenode stats but only for the current relfilenode, 4) document this
> behavior. This is what the POC patch is doing for heap_blks_written (would
> need to do the same for heap_blks_read and friends) except for the documentation
> part. What do you think?

In my opinion, it is certainly strange that bufmgr is aware of
relation kinds, but introducing relfilenode stats to avoid this skew
doesn't seem to be the best way, as it invites inconclusive arguments
like the one raised above. The fact that we transfer counters from old
relfilenodes to new ones indicates that we are not really interested
in counts by relfilenode. If that's the case, wouldn't it be simpler
to call pgstat_count_relation_buffer_read() from bufmgr.c and then
branch according to relkind within that function? If you're concerned
about the additional branch, some ingenuity may be needed.
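
The alternative suggested above could look roughly like this. It is a hedged C sketch, not existing PostgreSQL code: pgstat_count_relation_buffer_read() is only a name proposed in this message, the toy_* function and Toy* structs are hypothetical, and the sketch assumes table and index counters live in separate structs, as they would after the "split index and table statistics" work. Only the relkind letters ('r' for an ordinary table, 'i' for an index) are taken from PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RELKIND_RELATION 'r'   /* ordinary table */
#define TOY_RELKIND_INDEX    'i'   /* index */

typedef struct ToyTableCounts
{
    uint64_t blocks_read;
} ToyTableCounts;

typedef struct ToyIndexCounts
{
    uint64_t blocks_read;
} ToyIndexCounts;

typedef struct ToyRelation
{
    char           relkind;
    ToyTableCounts table_counts;   /* used when relkind is a table */
    ToyIndexCounts index_counts;   /* used when relkind is an index */
} ToyRelation;

/* Single entry point for the buffer manager; the relkind branch lives here. */
static void
toy_count_relation_buffer_read(ToyRelation *rel)
{
    switch (rel->relkind)
    {
        case TOY_RELKIND_INDEX:
            rel->index_counts.blocks_read++;
            break;
        case TOY_RELKIND_RELATION:
            rel->table_counts.blocks_read++;
            break;
        default:
            assert(!"unexpected relkind in this sketch");
    }
}
```

With this shape the buffer manager makes one call and stays unaware of relation kinds. The sketch requires a relation handle, so it covers the read path only.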

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

June 12, 2024, 13:29:54

Hi,

On Tue, Jun 11, 2024 at 03:35:23PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 10 Jun 2024 08:09:56 +0000, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote in 
> > 
> > My idea was to move all that is in pg_statio_all_tables to relfilenode stats
> > and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) ensure
> > the user can still retrieve the stats from pg_statio_all_tables in such a way
> > that it survives a rewrite, 3) provide dedicated APIs to retrieve
> > relfilenode stats but only for the current relfilenode, 4) document this
> > behavior. This is what the POC patch is doing for heap_blks_written (would
> > need to do the same for heap_blks_read and friends) except for the documentation
> > part. What do you think?
> 
> In my opinion,

Thanks for looking at it!

> it is certainly strange that bufmgr is aware of
> relation kinds, but introducing relfilenode stats to avoid this skew
> doesn't seem to be the best way, as it invites inconclusive arguments
> like the one raised above. The fact that we transfer counters from old
> relfilenodes to new ones indicates that we are not really interested
> in counts by relfilenode. If that's the case, wouldn't it be simpler
> to call pgstat_count_relation_buffer_read() from bufmgr.c and then
> branch according to relkind within that function? If you're concerned
> about the additional branch, some ingenuity may be needed.

That may be doable for "read" activities but what about write activities?
Do you mean not relying on relfilenode stats for reads but relying on relfilenode
stats for writes?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Michael Paquier

Date:

July 10, 2024, 06:02:34

On Sat, May 25, 2024 at 07:52:02AM +0000, Bertrand Drouvot wrote:
> But I think that it is in a state that can be used to discuss the approach it
> is implementing (so that we can agree or not on it) before moving
> forward.

I have read through the patch to get an idea of how things are done,
and I am troubled by the approach taken (mentioned down by you), but
that's invasive compared to how pgstats wants to be transparent with
its stats kinds.

+   Oid         objoid;         /* object ID, either table or function or tablespace. */
+   RelFileNumber relfile;      /* relfilenumber for RelFileLocator. */
 } PgStat_HashKey;

This adds a relfilenode component to the central hash key used for the
dshash of pgstats, which is something most stats types don't care
about.  That looks like the incorrect thing to do to me, particularly
seeing a couple of lines down that a stats kind is assigned so the
HashKey uniqueness is ensured by the KindInfo:
+   [PGSTAT_KIND_RELFILENODE] = {
+       .name = "relfilenode",

FWIW, I have on my stack of patches something to switch the objoid to
8 bytes, actually, which is something that would be required for
pg_stat_statements as query IDs are wider than that and affect all
databases, FWIW.  Relfilenodes are 4 bytes, okay still Robert has
proposed a couple of years ago a patch set to bump that to 56 bits,
change reverted in a448e49bcbe4.  The objoid is also not something
specific to OIDs, see replication slots with their idx for example.

What you would be looking instead is to use the relfilenode as an
objoid and keep track of the OID of the original relation in each
PgStat_StatRelFileNodeEntry so as it is possible to know where a past
relfilenode was used?  That makes looking back at the past relation's
relfilenodes stats more complicated as it would be necessary to keep a
list of the past relfilenodes for a relation, as well.  Perhaps with
some kind of cache that maintains a mapping between the relation and
its relfilenode history?
--
Michael

Attachments

signature.asc

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

July 10, 2024, 13:38:06

Hi,

On Wed, Jul 10, 2024 at 03:02:34PM +0900, Michael Paquier wrote:
> On Sat, May 25, 2024 at 07:52:02AM +0000, Bertrand Drouvot wrote:
> > But I think that it is in a state that can be used to discuss the approach it
> > is implementing (so that we can agree or not on it) before moving
> > forward.
> 
> I have read through the patch to get an idea of how things are done,

Thanks!

> and I am troubled by the approach taken (mentioned down by you), but
> that's invasive compared to how pgstats wants to be transparent with
> its stats kinds.
> 
> +   Oid         objoid;         /* object ID, either table or function or tablespace. */
> +   RelFileNumber relfile;      /* relfilenumber for RelFileLocator. */
>  } PgStat_HashKey;
> 
> This adds a relfilenode component to the central hash key used for the
> dshash of pgstats, which is something most stats types don't care
> about.

That's right but that's an existing behavior without the patch as:

PGSTAT_KIND_DATABASE does not care about the objoid
PGSTAT_KIND_REPLSLOT does not care about the dboid
PGSTAT_KIND_SUBSCRIPTION does not care about the dboid

That's 3 kinds out of the 5 non-fixed stats kinds.

Not saying it's good, just saying that's an existing behavior.

> That looks like the incorrect thing to do to me, particularly
> seeing a couple of lines down that a stats kind is assigned so the
> HashKey uniqueness is ensured by the KindInfo:
> +   [PGSTAT_KIND_RELFILENODE] = {
> +       .name = "relfilenode",

You mean, just rely on kind, dboid and relfile to ensure uniqueness?

I'm not sure that would work as there is this comment in relfilelocator.h:

"
 * Notice that relNumber is only unique within a database in a particular
 * tablespace.
"

So, I think it makes sense to link the hashkey to all the RelFileLocator
fields, meaning:

dboid (linked to RelFileLocator's dbOid)
objoid (linked to RelFileLocator's spcOid)
relfile (linked to RelFileLocator's relNumber)
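
As a minimal sketch of the mapping above: the typedefs are stand-ins so that it
compiles outside the PostgreSQL tree, and SketchHashKey, sketch_key_from_locator()
and sketch_key_equal() are hypothetical names used only for illustration (they are
not the patch's code). The point is that two locators differing only by tablespace
produce distinct keys.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in typedefs so the sketch compiles outside the PostgreSQL tree. */
typedef uint32_t Oid;
typedef Oid RelFileNumber;

/* Same three fields as RelFileLocator in relfilelocator.h. */
typedef struct RelFileLocator
{
    Oid           spcOid;       /* tablespace */
    Oid           dbOid;        /* database */
    RelFileNumber relNumber;    /* relation file number */
} RelFileLocator;

/* Simplified, hypothetical version of the extended hash key. */
typedef struct SketchHashKey
{
    Oid           dboid;
    Oid           objoid;
    RelFileNumber relfile;
} SketchHashKey;

/* Map every RelFileLocator field onto the key, as listed above. */
static SketchHashKey
sketch_key_from_locator(RelFileLocator rlocator)
{
    SketchHashKey key;

    key.dboid = rlocator.dbOid;
    key.objoid = rlocator.spcOid;
    key.relfile = rlocator.relNumber;
    return key;
}

/* Two keys match only if all three components match. */
static bool
sketch_key_equal(SketchHashKey a, SketchHashKey b)
{
    return a.dboid == b.dboid &&
           a.objoid == b.objoid &&
           a.relfile == b.relfile;
}
```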

> FWIW, I have on my stack of patches something to switch the objoid to
> 8 bytes, actually, which is something that would be required for
> pg_stat_statements as query IDs are wider than that and affect all
> databases, FWIW.  Relfilenodes are 4 bytes, okay still Robert has
> proposed a couple of years ago a patch set to bump that to 56 bits,
> change reverted in a448e49bcbe4.

Right, but it really looks like this extra field is needed to ensure
uniqueness (see above).

> What you would be looking instead is to use the relfilenode as an
> objoid

Not sure that works, as it looks like uniqueness won't be ensured (see above).

> and keep track of the OID of the original relation in each
> PgStat_StatRelFileNodeEntry so as it is possible to know where a past
> relfilenode was used?  That makes looking back at the past relation's
> relfilenodes stats more complicated as it would be necessary to keep a
> list of the past relfilenodes for a relation, as well.  Perhaps with
> some kind of cache that maintains a mapping between the relation and
> its relfilenode history?

Yeah, I also thought about keeping a list of "previous" relfilenodes stats for a
relation but that would lead to:

1. Keeping previous relfilenode stats
2. A more complicated way to look at relation stats (as you said)
3. Extra memory usage

I think the only reason "previous" relfilenode stats are needed is to provide
accurate stats for the relation. Outside of this need, I don't think we would
want to retrieve "individual" previous relfilenode stats in the past.

That's why the POC patch "simply" copies the stats to the relation during a
rewrite (before getting rid of the "previous" relfilenode stats).
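
A rough sketch of that "copy on rewrite" idea, with a single counter. The struct
and function names are hypothetical and the real entries carry many more fields;
it only shows that the value reported for the relation stays monotonic across a
rewrite when the view adds the table entry to the current relfilenode entry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stats entries with a single counter each. */
typedef struct SketchRelFileNodeStats
{
    uint64_t blks_written;      /* writes seen by the current relfilenode */
} SketchRelFileNodeStats;

typedef struct SketchTableStats
{
    uint64_t blks_written;      /* writes inherited from past relfilenodes */
} SketchTableStats;

/* On rewrite: fold the old relfilenode counter into the table entry. */
static void
sketch_on_rewrite(SketchTableStats *tab, SketchRelFileNodeStats *rfn)
{
    tab->blks_written += rfn->blks_written;
    rfn->blks_written = 0;      /* the new relfilenode starts from zero */
}

/* What the view would report: past plus current. */
static uint64_t
sketch_heap_blks_written(const SketchTableStats *tab,
                         const SketchRelFileNodeStats *rfn)
{
    return tab->blks_written + rfn->blks_written;
}
```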

What do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Michael Paquier

Date:

July 11, 2024, 04:58:19

On Wed, Jul 10, 2024 at 01:38:06PM +0000, Bertrand Drouvot wrote:
> On Wed, Jul 10, 2024 at 03:02:34PM +0900, Michael Paquier wrote:
>> and I am troubled by the approach taken (mentioned down by you), but
>> that's invasive compared to how pgstats wants to be transparent with
>> its stats kinds.
>>
>> +   Oid         objoid;         /* object ID, either table or function or tablespace. */
>> +   RelFileNumber relfile;      /* relfilenumber for RelFileLocator. */
>>  } PgStat_HashKey;
>>
>> This adds a relfilenode component to the central hash key used for the
>> dshash of pgstats, which is something most stats types don't care
>> about.
>
> That's right but that's an existing behavior without the patch as:
>
> PGSTAT_KIND_DATABASE does not care about the objoid
> PGSTAT_KIND_REPLSLOT does not care about the dboid
> PGSTAT_KIND_SUBSCRIPTION does not care about the dboid
>
> That's 3 kinds out of the 5 non-fixed stats kinds.

I'd like to think that this is just going to increase across time.

>> That looks like the incorrect thing to do to me, particularly
>> seeing a couple of lines down that a stats kind is assigned so the
>> HashKey uniqueness is ensured by the KindInfo:
>> +   [PGSTAT_KIND_RELFILENODE] = {
>> +       .name = "relfilenode",
>
> You mean, just rely on kind, dboid and relfile to ensure uniqueness?

Or table OID for the objid, with a hardcoded number of past
relfilenodes stats stored, to limit bloating the dshash with too much
past stats.  See below.

> So, I think it makes sense to link the hashkey to all the RelFileLocator
> fields, means:
>
> dboid (linked to RelFileLocator's dbOid)
> objoid (linked to RelFileLocator's spcOid)
> relfile (linked to RelFileLocator's relNumber)

Hmm.  How about using the table OID as objoid, but store in the stats
of the new KindInfo an array of entries with the relfilenodes (current
and past, perhaps with more data than the relfilenode to ensure the
uniqueness tracking) and each of its stats?  The number of past
relfilenodes would be fixed, meaning that there would be a strict
control with the retention of the past stats.  When a table is
dropped, removing its relfilenode stats would be as cheap as when its
PGSTAT_KIND_RELATION is dropped.
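
To illustrate the fixed-size retention, here is a small sketch. The cap of 4, the
single counter and all Sketch* names are assumptions made for the example, not
something taken from the patch. Slot 0 is the current relfilenode, and the oldest
entry falls off once the array is full.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef Oid RelFileNumber;

#define SKETCH_MAX_RELFILENODES 4   /* hypothetical hardcoded cap */

typedef struct SketchRelFileStats
{
    RelFileNumber relNumber;
    uint64_t      blks_written;
} SketchRelFileStats;

/* One entry per table OID; slots[0] is the current relfilenode. */
typedef struct SketchTableEntry
{
    int                nused;
    SketchRelFileStats slots[SKETCH_MAX_RELFILENODES];
} SketchTableEntry;

/* Register a new relfilenode, evicting the oldest one when full. */
static void
sketch_new_relfilenode(SketchTableEntry *entry, RelFileNumber relNumber)
{
    int last = entry->nused < SKETCH_MAX_RELFILENODES
        ? entry->nused
        : SKETCH_MAX_RELFILENODES - 1;

    /* Shift older entries down by one position. */
    for (int i = last; i > 0; i--)
        entry->slots[i] = entry->slots[i - 1];

    entry->slots[0].relNumber = relNumber;
    entry->slots[0].blks_written = 0;
    if (entry->nused < SKETCH_MAX_RELFILENODES)
        entry->nused++;
}

/* Sum the counter over the retained relfilenodes only. */
static uint64_t
sketch_total_written(const SketchTableEntry *entry)
{
    uint64_t total = 0;

    for (int i = 0; i < entry->nused; i++)
        total += entry->slots[i].blks_written;
    return total;
}
```

One consequence visible in the sketch: whatever was counted on an evicted
relfilenode is lost from the total, which is the price of the strict retention.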

> Yeah, I also thought about keeping a list of "previous" relfilenodes stats for a
> relation but that would lead to:
>
> 1. Keeping previous relfilenode stats
> 2. A more complicated way to look at relation stats (as you said)
> 3. Extra memory usage
>
> I think the only reason "previous" relfilenode stats are needed is to provide
> accurate stats for the relation. Outside of this need, I don't think we would
> want to retrieve "individual" previous relfilenode stats in the past.
>
> That's why the POC patch "simply" copies the stats to the relation during a
> rewrite (before getting rid of the "previous" relfilenode stats).

Hmm.  Okay.
--
Michael

Attachments

signature.asc

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

July 11, 2024, 06:10:23

Hi,

On Thu, Jul 11, 2024 at 01:58:19PM +0900, Michael Paquier wrote:
> On Wed, Jul 10, 2024 at 01:38:06PM +0000, Bertrand Drouvot wrote:
> > So, I think it makes sense to link the hashkey to all the RelFileLocator
> > fields, means:
> > 
> > dboid (linked to RelFileLocator's dbOid)
> > objoid (linked to RelFileLocator's spcOid)
> > relfile (linked to RelFileLocator's relNumber)
> 
> Hmm.  How about using the table OID as objoid,

The issue is that we don't have the relation OID when writing buffers out (that's
one of the reasons explained in [1]).

[1]: https://www.postgresql.org/message-id/Zl2k8u4HDTUW6QlC%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

August 5, 2024, 05:28:22

Hi,

On Thu, Jul 11, 2024 at 06:10:23AM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Jul 11, 2024 at 01:58:19PM +0900, Michael Paquier wrote:
> > On Wed, Jul 10, 2024 at 01:38:06PM +0000, Bertrand Drouvot wrote:
> > > So, I think it makes sense to link the hashkey to all the RelFileLocator
> > > fields, means:
> > > 
> > > dboid (linked to RelFileLocator's dbOid)
> > > objoid (linked to RelFileLocator's spcOid)
> > > relfile (linked to RelFileLocator's relNumber)
> > 
> > Hmm.  How about using the table OID as objoid,
> 
> The issue is that we don't have the relation OID when writing buffers out (that's
> one of the reasons explained in [1]).
> 
> [1]: https://www.postgresql.org/message-id/Zl2k8u4HDTUW6QlC%40ip-10-97-1-34.eu-west-3.compute.internal
> 
> Regards,
> 

Please find attached a mandatory rebase due to the recent changes around
statistics.

As mentioned up-thread:

The attached patch is not in a fully "polished" state yet: there are more places
where we should add relfilenode counters, more APIs to create to retrieve the
relfilenode stats, and so on.

It is in a state that can be used to discuss the approach it is implementing (as
we have done so far) before moving forward.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

v3-0001-Provide-relfilenode-statistics.patch

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

November 4, 2024, 12:27:38

Hi,

On Tue, Sep 10, 2024 at 05:30:32AM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Sep 05, 2024 at 04:48:36AM +0000, Bertrand Drouvot wrote:
> > Please find attached a mandatory rebase.
> > 
> > In passing, checking if based on the previous discussion (and given that we
> > don't have the relation OID when writing buffers out) you see another approach
> > than the one this patch is implementing?
> 
> Attached v5, mandatory rebase due to recent changes in the stats area.

Attached v6, mandatory rebase due to b14e9ce7d5.

Note that 0001 is the same as the one proposed in [0] and needs to be applied
here to make the stats machinery work as expected with the relfile added in
the stats hash key (though it deserves its own dedicated thread as explained in [0]).

Don't look at 0001 and 0002 as I think we need more design discussion.

=== Sum up the feedback received up-thread

I re-read this thread and it appears that there are 3 main remarks:

R1: Andres did propose to add stuff like "n_dead_tup" (see [1]), to provide
even more benefits.

R2: Robert mentioned ([2]) that we need to decide between "sometimes I
don't know the relation OID so I want to use the relfilenumber
instead, without changing the user experience" and "some
of these stats actually properly pertain to the relfilenode rather
than the relation so I want to associate them with the right object
and that will affect how the user sees things".

R3: Michael had concerns about adding a new field (the relfile) in the hash key,
see [3].

=== My thoughts:

While my initial idea was that the relfilenode stats would deal only with I/O
activities, it also looks like it would be beneficial to add stuff like
"n_dead_tup".

Then I think we should go with the "sometimes I don't know the relation OID
so I want to use the relfilenumber instead, without changing the user experience"
way.

Regarding the concern about adding a new field in the hash key, I think we can't
avoid that as we don't have the relation OID when writing buffers out.

=== Moving forward

I would go for trying to store everything that is "relation" related into the
relfilenode stats (that will then include n_dead_tup among other things) and
try to hide the distinction between relfilenode stats and relation stats from
the user.

Thoughts on moving forward that way?

[0]: https://www.postgresql.org/message-id/Zyb7RW1y9dVfO0UH%40ip-10-97-1-34.eu-west-3.compute.internal
[1]: https://www.postgresql.org/message-id/20240607033806.6gwgolihss72cj6r%40awork3.anarazel.de
[2]: https://www.postgresql.org/message-id/CA%2BTgmoZtwT6h%3DnyuQ1J9GNSrRyhf0fv7Ai6FzO%3DbH0C9Bf6tew%40mail.gmail.com
[3]: https://www.postgresql.org/message-id/Zo9j69GhexDpeV4k%40paquier.xyz

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Robert Haas

Date:

November 4, 2024, 22:51:10

On Mon, Nov 4, 2024 at 4:27 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Then I think we should go with the "sometimes I don't know the relation OID
> so I want to use the relfilenumber instead, without changing the user experience"
> way.

So does the latest version of the patch implement that principle
uniformly throughout?

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

November 5, 2024, 09:05:58

Hi,

On Mon, Nov 04, 2024 at 02:51:10PM -0500, Robert Haas wrote:
> On Mon, Nov 4, 2024 at 4:27 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > Then I think we should go with the "sometimes I don't know the relation OID
> > so I want to use the relfilenumber instead, without changing the user experience"
> > way.
> 
> So does the latest version of the patch implement that principle
> uniformly throughout?

No, please don't look at v6-0001 and 0002 (as mentioned up-thread). The purpose
here is mainly to get an agreement on the design before moving forward.

Does it sound ok to you to move with the above principle? (I'm +1 on it).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Robert Haas

Date:

November 5, 2024, 17:44:36

On Tue, Nov 5, 2024 at 1:06 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Does it sound ok to you to move with the above principle? (I'm +1 on it).

Yes, provided we can get a clean implementation of it.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: relfilenode statistics

From

Kirill Reshke

Date:

November 29, 2024, 09:23:12

On Tue, 5 Nov 2024 at 11:06, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
>
> Does it sound ok to you to move with the above principle? (I'm +1 on it).
>

Hi! I looked through this thread.
Looks like we are still awaiting a patch which stores more counters
(n_dead_tup, ... etc) into relfilenode stats. So, I assume this should
be moved to the next CF.

I also have a very stupid question:
If we don’t have the relation OID when writing buffers out, can we
just store oid to buffertag mapping somewhere and use it?
I suspect that this is a horrible idea, but what's the exact reason?
Is it that we will break too many abstraction layers for such a minor
matter?

--
Best regards,
Kirill Reshke

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

November 29, 2024, 18:20:02

Hi,

On Fri, Nov 29, 2024 at 11:23:12AM +0500, Kirill Reshke wrote:
> On Tue, 5 Nov 2024 at 11:06, Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> >
> > Does it sound ok to you to move with the above principle? (I'm +1 on it).
> >
> 
> Hi! I looked through this thread.

Thanks for looking at it!

> Looks like we are still awaiting a patch which stores more counters
> (n_dead_tup, ... etc) into relfilenode stats.

Yes.

> If we don’t have the relation OID when writing buffers out, can we
> just store oid to buffertag mapping somewhere and use it?

Do you mean adding the relation OID to the BufferTag? While that could probably
be done from a technical point of view (with a probably non-negligible amount
of refactoring), I can see these cons:

1. We'd increase the BufferDesc size and approach the 64-byte limit (cache
line size) that we don't want to exceed (see comment above BufferDesc definition)
2. Probably a lot of refactoring
3. This new member would be there "only" for stats and reporting purpose as
it is not needed at all for buffer related operations
4. 3. seems to indicate that's not the right place

Then I think 1. and 2. are not worth it given 3. and 4.

There are probably other cons too, though.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Kirill Reshke

Date:

November 29, 2024, 18:52:13

On Fri, 29 Nov 2024 at 20:20, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> On Fri, Nov 29, 2024 at 11:23:12AM +0500, Kirill Reshke wrote:
> > If we don’t have the relation OID when writing buffers out, can we
> > just store oid to buffertag mapping somewhere and use it?
>
> Do you mean add the relation OID into the BufferTag? While that could probably
> be done from a technical point of view (with probably non negligible amount
> of refactoring), I can see those cons:

Not exactly, what I had in mind was a separate hashmap in shared
memory, mapping buffertag<>oid.

> 2. Probably lot of refactoring
> 3. This new member would be there "only" for stats and reporting purpose as
> it is not needed at all for buffer related operations

To this design, your points 2&3 apply.


--
Best regards,
Kirill Reshke

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

December 3, 2024, 13:31:15

Hi,

On Fri, Nov 29, 2024 at 08:52:13PM +0500, Kirill Reshke wrote:
> On Fri, 29 Nov 2024 at 20:20, Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > On Fri, Nov 29, 2024 at 11:23:12AM +0500, Kirill Reshke wrote:
> > > If we don’t have the relation OID when writing buffers out, can we
> > > just store oid to buffertag mapping somewhere and use it?
> >
> > Do you mean add the relation OID into the BufferTag? While that could probably
> > be done from a technical point of view (with probably non negligible amount
> > of refactoring), I can see those cons:
> 
> Not exactly, what i had in mind was a separate hashmap into shared
> memory, mapping buffertag<>oid.

I see.

> > 2. Probably lot of refactoring
> > 3. This new member would be there "only" for stats and reporting purpose as
> > it is not needed at all for buffer related operations
> 
> To this design, your points 2&3 apply.

That said, it might also help for DropRelationBuffers() where we need to scan
the entire buffer pool (there is an optimization in place though). We could
imagine buffertag as key and the value could be the relation OID and each entry
would have next/prev pointers linking to other BufferTags with same OID.

That's probably much more refactoring (and more invasive) than the initial idea
in this thread but could lead to multiple pros though. I'm not very familiar with
the "buffer" area of the code and would also need to study the performance impact
of maintaining this new hash map.

Do you and/or others have any thoughts/ideas about it?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

January 3, 19:18:37

Hi,

On Tue, Dec 03, 2024 at 10:31:15AM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Fri, Nov 29, 2024 at 08:52:13PM +0500, Kirill Reshke wrote:
> > On Fri, 29 Nov 2024 at 20:20, Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> > > On Fri, Nov 29, 2024 at 11:23:12AM +0500, Kirill Reshke wrote:
> > > > If we don’t have the relation OID when writing buffers out, can we
> > > > just store oid to buffertag mapping somewhere and use it?
> > >
> > > Do you mean add the relation OID into the BufferTag? While that could probably
> > > be done from a technical point of view (with probably non negligible amount
> > > of refactoring), I can see those cons:
> > 
> > Not exactly, what i had in mind was a separate hashmap into shared
> > memory, mapping buffertag<>oid.
> 
> I see.
> 
> > > 2. Probably lot of refactoring
> > > 3. This new member would be there "only" for stats and reporting purpose as
> > > it is not needed at all for buffer related operations
> > 
> > To this design, your points 2&3 apply.
> 
> That said, it might also help for DropRelationBuffers() where we need to scan
> the entire buffer pool (there is an optimization in place though). We could
> imagine buffertag as key and the value could be the relation OID and each entry
> would have next/prev pointers linking to other BufferTags with same OID.
> 
> That's probably much more refactoring (and more invasive) than the initial idea
> in this thread but could lead to multiple pros though. I'm not very familiar with
> the "buffer" area of the code and would also need to study the performance impact
> of maintaining this new hash map.
> 
> Do you and/or others have any thoughts/ideas about it?

As mentioned by Andres in [1], relying on the relation OID would not work to
"recover" the stats because we don't have access to the relation oid during crash
recovery. So, I'm going to resume working on the "initial" idea (i.e having the
stats keyed by relfilenode).

[1]: https://www.postgresql.org/message-id/xvetwjsnkhx2gp6np225g2h64f4mfmg6oopkuaiivrpzd2futj%40pflk55su36ho

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Kirill Reshke

Date:

March 13, 12:00:52

On Fri, 3 Jan 2025 at 21:18, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> As mentioned by Andres in [1], relying on the relation OID would not work to
> "recover" the stats because we don't have access to the relation oid during crash
> recovery. So, I'm going to resume working on the "initial" idea (i.e having the
> stats keyed by relfilenode).
>
> [1]: https://www.postgresql.org/message-id/xvetwjsnkhx2gp6np225g2h64f4mfmg6oopkuaiivrpzd2futj%40pflk55su36ho
>

Hmm. While it is true that catalog lookups cannot be performed during
crash recovery, is it really necessary to save and retrieve statistics
after a crash? Given that statistics are permitted to be outdated and
server crashes are anticipated to be infrequent, it looks like losing
a few analysis runs due to server crashes is acceptable.
In any case, I am totally OK with the relfilenode-based method because
it is generally less restricted (to other postgresql parts, e.g.
wal-replay) and simpler.

Also, this patch needs a rebase;)

-- 
Best regards,
Kirill Reshke

Re: relfilenode statistics

From

Michael Paquier

Date:

September 16, 09:44:25

On Thu, Mar 13, 2025 at 02:00:52PM +0500, Kirill Reshke wrote:
> Hmm. While it is true that catalog lookups cannot be performed during
> crash recovery, is it really necessary to save and retrieve statistics
> after a crash?

Yes, losing stats on crash is a *very* annoying thing.  Having no
stats for a relation means that autovacuum gives up entirely on
relations it has no stats of, skipping it entirely until they have
rebuilt and bloat would accumulate.  Being able to recover these stats
from crash recovery is a cheap design, that would improve reliability
by a large degree.

> Given that statistics are permitted to be outdated and
> server crashes are anticipated to be infrequent, it looks like losing
> a few analysis runs due to server crashes is acceptable.
> In any case, I am totally OK with the relfilenode-based method because
> it is generally less restricted (to other postgresql parts, e.g.
> wal-replay) and simpler.

The startup process is not connected to a database and has no access
to pg_class: the only thing we can know about are the on-disk files,
not their in-catalog OIDs.  FWIW, I think that this patch would be a
huge step toward a more reliable stats system.

True that the patch needs a rebase.  Bertrand has also mentioned that
some points needed more work.
--
Michael

Attachments

signature.asc

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

September 30, 13:13:57

Hi,

On Tue, Sep 16, 2025 at 03:44:25PM +0900, Michael Paquier wrote:
> On Thu, Mar 13, 2025 at 02:00:52PM +0500, Kirill Reshke wrote:
> > Hmm. While it is true that catalog lookups cannot be performed during
> > crash recovery, is it really necessary to save and retrieve statistics
> > after a crash?
> 
> Yes, losing stats on crash is a *very* annoying thing.  Having no
> stats for a relation means that autovacuum gives up entirely on
> relations it has no stats of, skipping it entirely until they have
> rebuilt and bloat would accumulate.  Being able to recover these stats 
> from crash recovery is a cheap design, that would improve reliability
> by a large degree.

+1.

> The startup process is not connected to a database and has no access
> to pg_class: the only thing we can know about are the on-disk files,
> not their in-catalog OIDs.  FWIW, I think that this patch would be a
> huge step toward a more reliable stats system.
> 
> True that the patch needs a rebase.  Bertrand has also mentioned that
> some points needed more work.

Right. I'll come back with a rebase, and a POC proposal on some stats so that
we could agree on the design. Also, it looks like we have a consensus on
"sometimes I don't know the relation OID so I want to use the relfilenumber instead,
without changing the user experience" (see [1]).

As for Michael's concern about adding a new field in the hash key: as 8 bytes
are allocated for the object ID, we can go with:

dboid (linked to RelFileLocator's dbOid)
objoid (linked to RelFileLocator's spcOid and to the RelFileLocator's relNumber)

and avoid adding a new field in the key.
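
A minimal sketch of how the two 4-byte values could share the 8-byte object ID.
The bit layout (tablespace OID in the high half, relNumber in the low half) and
the helper names are assumptions made for illustration; nothing here is taken
from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in typedefs so the sketch compiles outside the PostgreSQL tree. */
typedef uint32_t Oid;
typedef Oid RelFileNumber;

/*
 * Hypothetical layout: spcOid in the high 32 bits, relNumber in the low
 * 32 bits of the 64-bit object ID.
 */
static uint64_t
sketch_objid_pack(Oid spcOid, RelFileNumber relNumber)
{
    return ((uint64_t) spcOid << 32) | (uint64_t) relNumber;
}

/* Recover the tablespace OID from a packed object ID. */
static Oid
sketch_objid_spcoid(uint64_t objid)
{
    return (Oid) (objid >> 32);
}

/* Recover the relfilenumber from a packed object ID. */
static RelFileNumber
sketch_objid_relnumber(uint64_t objid)
{
    return (RelFileNumber) (objid & UINT64_C(0xFFFFFFFF));
}
```

Since both halves are fixed-width, the packing is reversible, so (dboid, objid)
still identifies the full RelFileLocator.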

[1]: https://www.postgresql.org/message-id/CA%2BTgmoZ0u6ek_xxYJaGVBk0uEvH5txoYsCwbvxKWe-2xn_G_qg%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Michael Paquier

Date:

October 1, 02:05:16

On Tue, Sep 30, 2025 at 10:13:57AM +0000, Bertrand Drouvot wrote:
> As for Michael's concern about adding a new field in the hash key: as 8 bytes
> are allocated for the object ID, we can go with:
>
> dboid (linked to RelFileLocator's dbOid)
> objoid (linked to RelFileLocator's spcOid and to the RelFileLocator's relNumber)
>
> and avoid adding a new field in the key.

RelFileNumber is a 4-byte Oid, so this mapping should be able to work.

Is there any reason why you would want an efficient filtering of the
contents of the shared hashtable based only on a relnumber or a
tablespace OID?  Perhaps yes, like when a relfilenode is dropped into
a bin for an efficient removal from the shared hashtable so as we
don't need to do a seqscan, I just don't remember all the details of
the patch and if it could act as a bottleneck in some scenarios.
--
Michael

Attachments

signature.asc

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

October 1, 17:33:11

Hi,

On Wed, Oct 01, 2025 at 08:05:16AM +0900, Michael Paquier wrote:
> On Tue, Sep 30, 2025 at 10:13:57AM +0000, Bertrand Drouvot wrote:
> > As for Michael's concern about adding a new field in the hash key: as 8 bytes
> > are allocated for the object ID, we can go with:
> > 
> > dboid (linked to RelFileLocator's dbOid)
> > objoid (linked to RelFileLocator's spcOid and to the RelFileLocator's relNumber)
> > 
> > and avoid adding a new field in the key.
> 
> RelFileNumber is a 4-byte Oid, so this mapping should be able to work.

Right.

> Is there any reason why you would want an efficient filtering of the
> contents of the shared hashtable based only on a relnumber or a
> tablespace OID?

Not that I can think of currently.

> Perhaps yes, like when a relfilenode is dropped into
> a bin for an efficient removal from the shared hashtable so as we
> don't need to do a seqscan, I just don't remember all the details of
> the patch and if it could act as a bottleneck in some scenarios.

I think the first step is to replace (i.e., get rid of) PGSTAT_KIND_RELATION with a
brand new PGSTAT_KIND_RELFILENODE and move all the existing stats that are currently
under the PGSTAT_KIND_RELATION to this new PGSTAT_KIND_RELFILENODE.

Let's do this by keeping the pg_stat_all_tables|indexes and pg_statio_all_tables|indexes
on top of the PGSTAT_KIND_RELFILENODE and ensure that a relation rewrite keeps 
those stats. Once done, we could work from there to add new stats (add writes
counters and ensure that some counters (n_dead_tup and friends) are replicated).

Does that make sense to you?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Michael Paquier

Date:

October 2, 04:23:11

On Wed, Oct 01, 2025 at 02:33:11PM +0000, Bertrand Drouvot wrote:
> I think the first step is to replace (i.e., get rid of) PGSTAT_KIND_RELATION with a
> brand new PGSTAT_KIND_RELFILENODE and move all the existing stats that are currently
> under the PGSTAT_KIND_RELATION to this new PGSTAT_KIND_RELFILENODE.

Likely so, yes.

> Let's do this by keeping the pg_stat_all_tables|indexes and pg_statio_all_tables|indexes
> on top of the PGSTAT_KIND_RELFILENODE and ensure that a relation rewrite keeps
> those stats. Once done, we could work from there to add new stats (add writes
> counters and ensure that some counters (n_dead_tup and friends) are replicated).

Do you think it is OK to define non-transactional pending stats as
being always a subset of the transactional stats?  I don't quite see
if there would be a case to have stats that are only flushed in a
non-transactional path, while being discarded at the stats report done
at transaction commit time.  This means that it may be possible to
structure things so as the pending non-transaction stats structure are
always part of the transactional bits, and that the other way around
is not possible.  Perhaps that influences the design choices, at least
a bit.
--
Michael

Attachments

signature.asc

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

07 November, 14:28:27

Hi,

On Thu, Oct 02, 2025 at 10:23:11AM +0900, Michael Paquier wrote:
> On Wed, Oct 01, 2025 at 02:33:11PM +0000, Bertrand Drouvot wrote:
> > I think the first step is to replace (i.e get rid) PGSTAT_KIND_RELATION by a brand
> > new PGSTAT_KIND_RELFILENODE and move all the existing stats that are currently
> > under the PGSTAT_KIND_RELATION to this new PGSTAT_KIND_RELFILENODE.
> 
> Likely so, yes.

PFA the new implementation. It does not introduce a new PGSTAT_KIND_RELFILENODE,
instead it keys the PGSTAT_KIND_RELATION by relfile locator. We may want to
rename PGSTAT_KIND_RELATION to PGSTAT_KIND_RELFILENODE as a next step.

The patch set is structured as follows:

==== 0001

Add stats tests related to rewrite

While there are existing rewrite tests, the stats behavior during rewrites
doesn't have good coverage. This patch adds some tests to record some stats
after different rewrite scenarios.

That way, we'll be able to test that the stats are still the ones we
expect after rewrites. Note that it generates a new stats_1.out (which is quite
large), so we may want to move those new tests to "isolation" instead.

==== 0002

Key PGSTAT_KIND_RELATION by relfile locator

This patch changes the key used for the PGSTAT_KIND_RELATION statistic kind.
Instead of the relation oid, it now relies on:

- dboid (linked to RelFileLocator's dbOid)
- objoid which is the result of a new macro (namely RelFileLocatorToPgStatObjid())
that computes an objoid based on the RelFileLocator's spcOid and the RelFileLocator's
relNumber.
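A minimal sketch of how such a key could be derived, assuming the tablespace OID goes in the high 32 bits and the relfilenumber in the low 32 bits of a 64-bit object ID. The names pack_objid(), objid_spcoid() and objid_relnumber() are hypothetical, and the actual RelFileLocatorToPgStatObjid() macro in the patch may order the halves differently:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL's Oid and RelFileNumber, both 32-bit unsigned. */
typedef uint32_t Oid;
typedef uint32_t RelFileNumber;

/*
 * Hypothetical packing (an assumption, not the patch's macro): tablespace
 * OID in the high half, relfilenumber in the low half. Two 32-bit inputs
 * fit in a 64-bit object ID without loss, so the mapping is reversible.
 */
static inline uint64_t
pack_objid(Oid spcOid, RelFileNumber relNumber)
{
    return ((uint64_t) spcOid << 32) | (uint64_t) relNumber;
}

/* Recover the tablespace OID from a packed object ID. */
static inline Oid
objid_spcoid(uint64_t objid)
{
    return (Oid) (objid >> 32);
}

/* Recover the relfilenumber from a packed object ID. */
static inline RelFileNumber
objid_relnumber(uint64_t objid)
{
    return (RelFileNumber) (objid & 0xFFFFFFFFu);
}

/* Round-trip check on one sample pair (1663 is pg_default's OID). */
void
objid_selfcheck(void)
{
    uint64_t objid = pack_objid(1663, 16384);

    assert(objid_spcoid(objid) == 1663);
    assert(objid_relnumber(objid) == 16384);
}
```

Because each input occupies its own half of the value, two locators in the same database map to the same object ID only if both spcOid and relNumber match.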

That will allow us to add new stats (add writes counters) and ensure that some
counters (n_dead_tup and friends) are replicated.

The patch introduces pgstat_reloid_to_relfilelocator() to 1) avoid calling
RelationIdGetRelation() to get the relfilelocator based on the relation oid
and 2) handle the partitioned table case.

Please note that:

- when running pg_stat_have_stats('relation',...) we now need to be connected
to the database that hosts the relation. As pg_stat_have_stats() is not
publicly documented, the changes made in 029_stats_restart.pl look
sufficient.

- this patch does not handle rewrites so some tests are failing. Its only
intent is to ease the review and it should not be pushed without being
merged with the following patch that handles the rewrites.

- it can be used to test that stats are incremented correctly and that we're
able to retrieve them as long as rewrites are not involved.

==== 0003

handle relation statistics correctly during rewrites

Now that PGSTAT_KIND_RELATION is keyed by relfilenode, we need to handle rewrites.

To do so, this patch:

- Adds PgStat_PendingRewrite, a new struct to track rewrite operations within
a transaction, storing the old locator, new locator, and original locator (for
rewrite chains). This allows stats to be copied from the original location to
the final location at commit time.

- Adds a new function, pgstat_mark_rewrite(), called when a table rewrite begins.
It records the rewrite operation in a local list and detects rewrite chains by
checking if the old_locator matches any existing new_locator, preserving the
chain's original_locator.

- Modifies pgstat_copy_relation_stats() to accept RelFileLocators instead of
Relations, with a new increment parameter to accumulate stats (needed for rewrite
chains with DML between rewrites).

- Ensures that AtEOXact_PgStat_Relations(), AtPrepare_PgStat_Relations(),
pgstat_twophase_postcommit()/postabort() and pgstat_drop_relation() handle the
PgStat_PendingRewrite list correctly.
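The chain detection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patch's code: Locator, PendingRewriteList, MAX_PENDING and mark_rewrite() are hypothetical stand-ins, and a fixed-size array replaces whatever list structure the patch uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for RelFileLocator. */
typedef struct Locator
{
    uint32_t spcOid;
    uint32_t dbOid;
    uint32_t relNumber;
} Locator;

/* One rewrite recorded within the current transaction. */
typedef struct PendingRewrite
{
    Locator old_locator;      /* storage before this rewrite */
    Locator new_locator;      /* storage after this rewrite */
    Locator original_locator; /* storage at the start of the chain */
} PendingRewrite;

#define MAX_PENDING 16        /* arbitrary bound for this sketch */

typedef struct PendingRewriteList
{
    PendingRewrite items[MAX_PENDING];
    int            n;
} PendingRewriteList;

static bool
locator_equals(Locator a, Locator b)
{
    return a.spcOid == b.spcOid && a.dbOid == b.dbOid &&
           a.relNumber == b.relNumber;
}

/*
 * Record a rewrite from old_loc to new_loc. If old_loc is the new_locator
 * of an earlier entry, this rewrite extends a chain, so the earlier entry's
 * original_locator is preserved. Returns false if the list is full.
 */
bool
mark_rewrite(PendingRewriteList *list, Locator old_loc, Locator new_loc)
{
    Locator original = old_loc;

    assert(list != NULL);
    if (list->n >= MAX_PENDING)
        return false;

    for (int i = 0; i < list->n; i++)
    {
        if (locator_equals(list->items[i].new_locator, old_loc))
        {
            original = list->items[i].original_locator;
            break;
        }
    }

    list->items[list->n].old_locator = old_loc;
    list->items[list->n].new_locator = new_loc;
    list->items[list->n].original_locator = original;
    list->n++;
    return true;
}
```

With two rewrites A -> B and B -> C in one transaction, the second entry keeps A as its original_locator, so at commit time the stats can be copied from A to C.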

Note that due to the new flush call in pgstat_twophase_postcommit() we cannot
call GetCurrentTransactionStopTimestamp() in pgstat_relation_flush_cb(). So
the patch adds a check to handle this special case and calls GetCurrentTimestamp()
instead. We'd call GetCurrentTimestamp() only if there is a rewrite, so
its extra cost should be negligible. Another solution
could be to trigger the flush from FinishPreparedTransaction(), but that's not
worth the extra complexity.

The new pending_rewrites list is traversed in multiple places. The overhead
should be negligible in comparison to a rewrite and the list should not contain
a lot of rewrites in practice.

Another design that I tried was to copy the stats in pgstat_mark_rewrite() but
that led to difficulties with aborts and subtransactions. It looks to me that
the list approach proposed here makes more sense.

We could also imagine adding a function similar to pg_stat_have_stats() that
would take relfile locator as arguments. That could help validate that after
a rewrite the old stats are gone.

> Do you think it is OK to define non-transactional pending stats as
> being always a subset of the transactional stats?  I don't quite see 
> if there would be a case to have stats that are only flushed in a
> non-transactional path, while being discarded at the stats report done
> at transaction commit time.  This means that it may be possible to
> structure things so as the pending non-transaction stats structure are
> always part of the transactional bits, and that the other way around
> is not possible.  Perhaps that influences the design choices, at least
> a bit.

The proposed patch does not change anything in that regard.
It keeps the relation's behavior as it is.

This patch just ensures that a relation rewrite keeps its stats.

Adding new stats (write counters) and ensuring that some counters
(n_dead_tup and friends) are replicated will be done once this one gets in.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: relfilenode statistics

From

Michael Paquier

Date:

09 November, 02:33:54

On Fri, Nov 07, 2025 at 11:28:27AM +0000, Bertrand Drouvot wrote:
> While there are existing rewrite tests, the stats behavior during rewrites
> doesn't have a good coverage. This patch adds some tests to record some stats
> after different rewrite scenarios.
>
> That way, we'll be able to test that the stats are still the ones we
> expect after rewrites. Note that it generates a new stats_1.out (which is quite
> large), so we may want to move those new tests to "isolation" instead.

Looking at this part of the patch set for now, not looked at the rest
yet.  This new stats_1.out is 2k lines long, introduced for the tests
related to rewrites as an effect of 2PC.  It seems to me that a split
into a new stats_rewrite would be justified for this case, to reduce
the output duplication.
--
Michael

Attachments

signature.asc

Re: relfilenode statistics

From

Michael Paquier

Date:

10 November, 11:53:45

On Sun, Nov 09, 2025 at 08:33:54AM +0900, Michael Paquier wrote:
> Looking at this part of the patch set for now, not looked at the rest
> yet.  This new stats_1.out is 2k lines long, introduced for the tests
> related to rewrites as an effect of 2PC.  It seems to me that a split
> into a new stats_rewrite would be justified for this case, to reduce
> the output duplication.

The first patch had an issue with some of the tests checking for dead
tuples: if an autovacuum kicks in before querying the stats, we would
get a dead tuple number of 0.  So I have expanded the tests a bit to
avoid autovacuum interactions, which should be enough to avoid noise,
did a split into a new file, which should also be fine because we
don't rely on a system-wide stats reset, then applied the result.

The patch is spending a great deal of effort on three fronts:
- making sure that the statistics are copied over after a relation
rewrite.
- making sure that we assign a "correct" object ID, assigning
the fields of RelFileLocator based on a relation ID.  Mapped and
shared relations make the exercise a bit more difficult.  It would be
nice to avoid this kind of duplication with other code paths that
assign a RelFileLocator.
- Partitioned tables, where we don't have a relfilenode but we need to
track statistics.  The patch relies on the relation oid to assign a
key, as far as I've read.

Among the three points, the first one is the most invasive in the
patch, it seems, and do we actually want to keep the stats across
rewrites at all?  The main reason of doing the relfilenode move
would be to rebuild these stats on a WAL-record basis because the
relfile locator is the only thing we know in the startup process, and
once rewritten the state of the data is different.
relation_needs_vacanalyze() then cares about three fields:
- Number of dead tuples, which would be 0 after a rewrite.
- ins_since_vacuum, which would be 0 after a rewrite.
- mod_since_analyze, for analyze, again 0.
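To make the role of these three counters concrete, here is a simplified sketch of a threshold check of the "base + scale factor * reltuples" form. It is not the actual relation_needs_vacanalyze() code: the struct and function names are hypothetical, and wraparound-forced vacuums, disabled insert thresholds and any newer adjustments are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three per-table counters discussed above. */
typedef struct TableCounters
{
    double dead_tuples;
    double ins_since_vacuum;
    double mod_since_analyze;
} TableCounters;

/* Hypothetical bundle of base thresholds and scale factors. */
typedef struct AvThresholds
{
    double vac_base, vac_scale;   /* dead-tuple trigger */
    double ins_base, ins_scale;   /* insert trigger */
    double anl_base, anl_scale;   /* analyze trigger */
} AvThresholds;

/* Vacuum if dead tuples or inserts since last vacuum exceed their threshold. */
bool
needs_vacuum(const TableCounters *c, const AvThresholds *t, double reltuples)
{
    assert(c != NULL && t != NULL);

    double vacthresh = t->vac_base + t->vac_scale * reltuples;
    double insthresh = t->ins_base + t->ins_scale * reltuples;

    return c->dead_tuples > vacthresh || c->ins_since_vacuum > insthresh;
}

/* Analyze if modifications since last analyze exceed their threshold. */
bool
needs_analyze(const TableCounters *c, const AvThresholds *t, double reltuples)
{
    assert(c != NULL && t != NULL);

    double anlthresh = t->anl_base + t->anl_scale * reltuples;

    return c->mod_since_analyze > anlthresh;
}
```

In this model, a table whose three counters are all zero never crosses a threshold, so whether the counters survive a rewrite directly changes when the next autovacuum or autoanalyze is triggered.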

I have not checked the recent autovacuum scheduling thread to see if
this set changes there.

Are these numbers worth the effort of copying over at the end?  Was
this particular point discussed?  I've seen this mentioned once here,
but I am wondering what are the arguments in favor of copying the
stats data versus not copying it across rewrites:
https://www.postgresql.org/message-id/20240607031736.7izmr2yirznvidka%40awork3.anarazel.de
--
Michael

Attachments

signature.asc

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

12 November, 20:03:55

Hi,

On Mon, Nov 10, 2025 at 05:53:45PM +0900, Michael Paquier wrote:
> On Sun, Nov 09, 2025 at 08:33:54AM +0900, Michael Paquier wrote:
> > Looking at this part of the patch set for now, not looked at the rest
> > yet.  This new stats_1.out is 2k lines long, introduced for the tests
> > related to rewrites as an effect of 2PC.  It seems to me that a split 
> > into a new stats_rewrite would be justified for this case, to reduce
> > the output duplication.
> 
> did a split into a new file, which should also be fine because we
> don't rely on a system-wide stats reset, then applied the result.

Thanks!

> The patch is spending a great deal of effort on three fronts:
> - making sure that the statistics are copied over after a relation
> rewrite.

Right, in 0003.

> - making sure that we assign a "correct" object ID, assigning
> the fields of RelFileLocator based on a relation ID.  Mapped and
> shared relations make the exercise a bit more difficult.  It would be
> nice to avoid this kind of duplication with other code paths that
> assign a RelFileLocator.

Are you referring to the new pgstat_reloid_to_relfilelocator() function?
If so, I'll try to avoid code duplication with other code paths as suggested.

> - Partitioned tables, where we don't have a relfilenode but we need to
> track statistics.  The patch relies on the relation oid to assign a
> key, as far as I've read.

Right. It's not doing that much in this area. It's needed so that things like
"last_analyze" on a partitioned table are populated (see "Ensure only the
partitioned table is analyzed" in vacuum.sql).

> Among the three points, the first one is the most invasive in the
> patch, it seems, and do we actually want to keep the stats across
> rewrites at all?

Not doing so would mean that all stats related to a relation will be lost after
a rewrite. I think that would be a major regression as compared to the current
behavior.

> The main reason of doing the relfilenode move 
> would be to rebuild these stats on a WAL-record basis because the
> relfile locator is the only thing we know in the startup process, and
> once rewritten the state of the data is different.

> relation_needs_vacanalyze() then cares about three fields:
> - Number of dead tuples, which would be 0 after a rewrite.
> - ins_since_vacuum, which would be 0 after a rewrite.
> - mod_since_analyze, for analyze, again 0.

> 
> I have not checked the recent autovacuum scheduling thread to see if
> this set changes there.
> 
> Are these numbers worth the effort of copying over at the end?

I think so, because not copying would impact all of the relation's other stats
(not only the ones linked to relation_needs_vacanalyze()).

> Was
> this particular point discussed?  I've seen this mentioned once here,
> but I am wondering what are the arguments in favor of copying the
> stats data versus not copying it across rewrites:
> https://www.postgresql.org/message-id/20240607031736.7izmr2yirznvidka%40awork3.anarazel.de

In favor of copying, I would say:

- no regression as compared to the current behavior. That means, for example,
not breaking DBA's activities/decisions based on the pg_stat_all_tables fields
after a rewrite.

- a rewrite does not change the number of dead tuples, ins_since_vacuum or
mod_since_analyze. So, if we don't copy those, we'd change the
relation_needs_vacanalyze() decision(s) as compared to the current one(s) for no
reason (as a rewrite has no impact on those).

In favor of not copying, I would say: it makes the code simpler.

I'm in favor of copying but open to different points of view.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

15 December, 19:29:18

Hi,

On Wed, Nov 12, 2025 at 05:03:55PM +0000, Bertrand Drouvot wrote:
> In favor of not copying, I would say make the code simpler.
> 
> I'm in favor of copying but open to different point of views.

PFA a mandatory rebase.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: relfilenode statistics

From

Andres Freund

Date:

15 December, 20:48:25

Hi,

On 2025-12-15 16:29:18 +0000, Bertrand Drouvot wrote:
> From 7908ba56cb8b6255b869af6be13077aa0315d5f1 Mon Sep 17 00:00:00 2001
> From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
> Date: Wed, 1 Oct 2025 09:45:26 +0000
> Subject: [PATCH v8 1/2] Key PGSTAT_KIND_RELATION by relfile locator
> 
> This patch changes the key used for the PGSTAT_KIND_RELATION statistic kind.
> Instead of the relation oid, it now relies on:
> 
> - dboid (linked to RelFileLocator's dbOid)
> - objoid which is the result of a new macro (namely RelFileLocatorToPgStatObjid())
> that computes an objoid based on the RelFileLocator's spcOid and the
> RelFileLocator's relNumber.

I think this needs to make more explicit that this works because the object ID
now is a uint64, and that both the inputs are 32 bits.


> That will allow us to add new stats (add writes counters) and ensure that some
> counters (n_dead_tup and friends) are replicated.

Yay.


> The patch introduces pgstat_reloid_to_relfilelocator() to 1) avoid calling
> RelationIdGetRelation() to get the relfilelocator based on the relation oid
> and 2) handle the partitioned table case.
> 
> Please note that:
> 
> - when running pg_stat_have_stats('relation',...) we now need to be connected
> to the database that hosts the relation. As pg_stat_have_stats() is not
> documented publicly, then the changes done in 029_stats_restart.pl look
> enough.

That seems fine.


> - this patch does not handle rewrites so some tests are failing. It's only
> intent is to ease the review and should not be pushed without being
> merged with the following patch that handles the rewrites.

Makes sense.


> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 62035b7f9c3..a9b2b4e1033 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -961,8 +961,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
>       * soon in cases where the failsafe prevented significant amounts of heap
>       * vacuuming.
>       */
> -    pgstat_report_vacuum(RelationGetRelid(rel),
> -                         rel->rd_rel->relisshared,
> +    pgstat_report_vacuum(rel->rd_locator,
>                           Max(vacrel->new_live_tuples, 0),
>                           vacrel->recently_dead_tuples +
>                           vacrel->missed_dead_tuples,

Why not pass in the Relation itself? Given that we do that already for
pgstat_report_analyze(), it seems like that'd be an improvement even
independent of this change?


> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 1bd3924e35e..563a3697690 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -2048,8 +2048,7 @@ do_autovacuum(void)
>  
>          /* Fetch reloptions and the pgstat entry for this table */
>          relopts = extract_autovac_opts(tuple, pg_class_desc);
> -        tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared,
> -                                                  relid);
> +        tabentry = pgstat_fetch_stat_tabentry_ext(relid);
>  
>          /* Check if it needs vacuum or analyze */
>          relation_needs_vacanalyze(relid, relopts, classForm, tabentry,

I don't think this is good - now do_autovacuum() will do a separate syscache
lookup to fetch information the caller already has (due to the
pgstat_reloid_to_relfilelocator() in pgstat_fetch_stat_tabentry_ext()). That's
not too bad for things like viewing stats, but do_autovacuum() does this for
every table in a database...


> @@ -326,9 +363,26 @@ pgstat_report_analyze(Relation rel,
>      ts = GetCurrentTimestamp();
>      elapsedtime = TimestampDifferenceMilliseconds(starttime, ts);
>  
> +    if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
> +        locator = rel->rd_locator;
> +    else
> +    {
> +        /*
> +         * Partitioned tables don't have storage, so construct a synthetic
> +         * locator for statistics tracking. Use the relation OID as relNumber.
> +         * No collision with regular relations is possible because relNumbers
> +         * are also assigned from the pg_class OID space (see
> +         * GetNewRelFileNumber()), making each value unique across the
> +         * database regardless of spcOid.
> +         */

I don't think this is true as stated. Two reasons:

1) This afaict guarantees that the relfilenode will not clash with oids, but
   it does *NOT* guarantee that it does not clash with other relfilenodes

2) Note that GetNewRelFileNumber() does *NOT* check for conflicts when
   creating a new relfilenode for an existing relation:
 * If the relfilenumber will also be used as the relation's OID, pass the
 * opened pg_class catalog, and this routine will guarantee that the result
 * is also an unused OID within pg_class.  If the result is to be used only
 * as a relfilenumber for an existing relation, pass NULL for pg_class.

Greetings,

Andres Freund

Re: relfilenode statistics

From

Michael Paquier

Date:

16 December, 10:33:17

On Mon, Dec 15, 2025 at 12:48:25PM -0500, Andres Freund wrote:
> I don't think this is true as stated. Two reasons:
>
> 1) This afaict guarantees that the relfilenode will not clash with oids, but
>    it does *NOT* guarantee that it does not clash with other relfilenodes
>
> 2) Note that GetNewRelFileNumber() does *NOT* check for conflicts when
>    creating a new relfilenode for an existing relation:
>  * If the relfilenumber will also be used as the relation's OID, pass the
>  * opened pg_class catalog, and this routine will guarantee that the result
>  * is also an unused OID within pg_class.  If the result is to be used only
>  * as a relfilenumber for an existing relation, pass NULL for pg_class.

FWIW, I am also still troubled by the part of the proposed patch set
where we are trying to hide the idea of a partitioned table has a
relfilenode set by using its relid instead in the key for the data.
This leads to a huge amount of complexity in the patch, mainly to
store data for autovacuum that we do not need at the end:
- autovacuum discards partitioned tables in do_autovacuum(), so the
stats related to partitioned tables that we need to select the
relations do not matter.
- manual vacuums may include partitioned tables to extract its
partitions, vacuum_rel() at the end discarding them.  Well, stats
don't matter anyway.

We only need to attach three fields to let autovacuum know if a
relation needs to run or not: dead_tuples, ins_since_vacuum,
mod_since_analyze.  Most of the fields of PgStat_StatTabEntry make sense
only for tables, few are required by indexes for pg_stat_all_indexes.
Some fields actually make sense because they refer to on-disk files,
mostly for pg_statio_all_tables (blocks_fetched, blocks_hit).

Hence, why don't we split PgStat_StatTabEntry into three things from
the start, even if it means to duplicate some of them?  Say:
- Table fields: includes [auto]vacuum/analyze data, block fields,
fields of pg_stat_all_tables.
- Index fields: no need for the [auto]vacuum/analyze time and counts,
block fields, pg_stat_all_indexes fields.
- Relfilenode fields: dead_tuples, ins_since_vacuum and
mod_since_analyze.  Does not apply to partitioned tables and indexes,
only applies to tables.  Provides a clean split, embrace the fact that
these are the only three fields we need to worry about during
recovery.
--
Michael

Attachments

signature.asc

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

16 December, 13:22:06

Hi,

On Mon, Dec 15, 2025 at 12:48:25PM -0500, Andres Freund wrote:
> On 2025-12-15 16:29:18 +0000, Bertrand Drouvot wrote:
> > From 7908ba56cb8b6255b869af6be13077aa0315d5f1 Mon Sep 17 00:00:00 2001
> 
> I think this needs to make more explicit that this works because the object ID
> now is a uint64, and that both the inputs are 32 bits.

Yeah, it's now added in the commit message (mentioning b14e9ce7d55c).

> > diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> > index 62035b7f9c3..a9b2b4e1033 100644
> > --- a/src/backend/access/heap/vacuumlazy.c
> > +++ b/src/backend/access/heap/vacuumlazy.c
> > @@ -961,8 +961,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
> >       * soon in cases where the failsafe prevented significant amounts of heap
> >       * vacuuming.
> >       */
> > -    pgstat_report_vacuum(RelationGetRelid(rel),
> > -                         rel->rd_rel->relisshared,
> > +    pgstat_report_vacuum(rel->rd_locator,
> >                           Max(vacrel->new_live_tuples, 0),
> >                           vacrel->recently_dead_tuples +
> >                           vacrel->missed_dead_tuples,
> 
> Why not pass in the Relation itself? Given that we do that already for
> pgstat_report_analyze(), it seems like that'd be an improvement even
> independent of this change?

Makes sense, done in [1].

> > diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> > index 1bd3924e35e..563a3697690 100644
> > --- a/src/backend/postmaster/autovacuum.c
> > +++ b/src/backend/postmaster/autovacuum.c
> > @@ -2048,8 +2048,7 @@ do_autovacuum(void)
> >  
> >          /* Fetch reloptions and the pgstat entry for this table */
> >          relopts = extract_autovac_opts(tuple, pg_class_desc);
> > -        tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared,
> > -                                                  relid);
> > +        tabentry = pgstat_fetch_stat_tabentry_ext(relid);
> >  
> >          /* Check if it needs vacuum or analyze */
> >          relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
> 
> I don't think this is good - now do_autovacuum() will do a separate syscache
> lookup to fetch information the caller already has (due to the
> pgstat_reloid_to_relfilelocator() in pgstat_fetch_stat_tabentry_ext()). That's
> not too bad for things like viewing stats, but do_autovacuum() does this for
> every table in a database...

Good point. In the attached patch I added pgstat_fetch_stat_tabentry_by_locator().
It's called directly in do_autovacuum() and also in pgstat_fetch_stat_tabentry_ext().

I did not check whether there are other places where we can call
pgstat_fetch_stat_tabentry_by_locator() directly. I first want to validate that
this idea makes sense. Does it?

> I don't think this is true as stated. Two reasons:
> 
> 1) This afaict guarantees that the relfilenode will not clash with oids, but
>    it does *NOT* guarantee that it does not clash with other relfilenodes

> 2) Note that GetNewRelFileNumber() does *NOT* check for conflicts when
>    creating a new relfilenode for an existing relation:
>  * If the relfilenumber will also be used as the relation's OID, pass the
>  * opened pg_class catalog, and this routine will guarantee that the result
>  * is also an unused OID within pg_class.  If the result is to be used only
>  * as a relfilenumber for an existing relation, pass NULL for pg_class.

Oh right, in case of OID wraparound. In the attached I added a new 

"
#define PSEUDO_PARTITION_TABLE_SPCOID 1665
"

to ensure uniqueness then.
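A minimal sketch of the idea, assuming the synthetic locator pairs the reserved tablespace OID with the partitioned table's relation OID as relNumber. StatsLocator and both functions are hypothetical names for illustration; only the value 1665 comes from the message above, and the sketch assumes that no real tablespace is ever assigned that OID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;

/* Value quoted in the message above; assumed never used by a real tablespace. */
#define PSEUDO_PARTITION_TABLE_SPCOID 1665

/* Simplified stand-in for the locator used as a stats key. */
typedef struct StatsLocator
{
    Oid spcOid;
    Oid dbOid;
    Oid relNumber;
} StatsLocator;

/*
 * Build a stats key for a partitioned table, which has no storage.
 * The relation OID stands in for relNumber and the reserved spcOid keeps
 * the key apart from every locator that points at real storage.
 */
StatsLocator
synthetic_partitioned_locator(Oid dbOid, Oid relid)
{
    StatsLocator loc;

    assert(relid != 0);     /* 0 is InvalidOid */
    loc.spcOid = PSEUDO_PARTITION_TABLE_SPCOID;
    loc.dbOid = dbOid;
    loc.relNumber = relid;
    return loc;
}

bool
stats_locator_equals(StatsLocator a, StatsLocator b)
{
    return a.spcOid == b.spcOid && a.dbOid == b.dbOid &&
           a.relNumber == b.relNumber;
}
```

Even if a regular relation ends up with relNumber 16384 in pg_default (OID 1663) while a partitioned table has relation OID 16384, the two keys differ in spcOid and cannot collide.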

[1]: https://www.postgresql.org/message-id/flat/aUEA6UZZkDCQFgSA%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

16 December, 13:24:28

Hi,

On Tue, Dec 16, 2025 at 04:33:17PM +0900, Michael Paquier wrote:
> 
> Hence, why don't we split PgStat_StatTabEntry into three things from
> the start, even if it means to duplicate some of them?  Say:
> - Table fields: includes [auto]vacuum/analyze data, block fields,
> fields of pg_stat_all_tables.
> - Index fields: no need for the [auto]vacuum/analyze time and counts,
> block fields, pg_stat_all_indexes fields.
> - Relfilenode fields: dead_tuples, ins_since_vacuum and
> mod_since_analyze.  Does not apply to partitioned tables and indexes,
> only applies to tables.  Provides a clean split, embrace the fact that
> these are the only three fields we need to worry about during
> recovery.

I think that the PSEUDO_PARTITION_TABLE_SPCOID approach just proposed in [1]
is simple enough and solves the collision issue raised by Andres.

I think I prefer the unified structure as proposed in the patch (though we
may want to split tables and indexes later on). The reason is that it's
easier to expose publicly.

Indeed, at the very beginning of this thread, in v1, I created a new
PGSTAT_KIND_RELFILENODE and had to make it coexist with PGSTAT_KIND_RELATION and
that led to discussion on how we should expose them ([2]).

[1]: https://www.postgresql.org/message-id/aUEyzoOJtrCLAEeT%40ip-10-97-1-34.eu-west-3.compute.internal
[2]: https://www.postgresql.org/message-id/CA%2BTgmoZtwT6h%3DnyuQ1J9GNSrRyhf0fv7Ai6FzO%3DbH0C9Bf6tew%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: relfilenode statistics

From

Andres Freund

Date:

16 December, 18:39:15

Hi,

On 2025-12-16 16:33:17 +0900, Michael Paquier wrote:
> On Mon, Dec 15, 2025 at 12:48:25PM -0500, Andres Freund wrote:
> > I don't think this is true as stated. Two reasons:
> > 
> > 1) This afaict guarantees that the relfilenode will not clash with oids, but
> >    it does *NOT* guarantee that it does not clash with other relfilenodes
> > 
> > 2) Note that GetNewRelFileNumber() does *NOT* check for conflicts when
> >    creating a new relfilenode for an existing relation:
> >  * If the relfilenumber will also be used as the relation's OID, pass the
> >  * opened pg_class catalog, and this routine will guarantee that the result
> >  * is also an unused OID within pg_class.  If the result is to be used only
> >  * as a relfilenumber for an existing relation, pass NULL for pg_class.
> 
> FWIW, I am also still troubled by the part of the proposed patch set
> where we are trying to hide the idea of a partitioned table has a
> relfilenode set by using its relid instead in the key for the data.
> This leads to a huge amount of complexity in the patch, mainly to
> store data for autovacuum that we do not need at the end:
> - autovacuum discards partitioned tables in do_autovacuum(), so the
> stats related to partitioned tables that we need to select the
> relations does not matter.

I feel like that's an implementation wart that we ought to fix. It's not
infrequently a problem that we don't automatically analyze partitioned
tables. Weren't there even a couple threads on that on the list in the last
weeks?

> - manual vacuums may include partitioned tables to extract its
> partitions, vacuum_rel() at the end discarding them.  Well, stats
> don't matter anyway.
> 
> We only need to attach three fields to let autovacuum know if a
> relation needs to run or not: dead_tuples, ins_since_vacuum,
> mod_since_analyze.

That may be true for autovacuum today, but I don't see any reason for
live_tuples, tuples_inserted etc to be inaccurate after a failover.

> Most the fields of PgStat_StatTabEntry make sense
> only for tables, few are required by indexes for pg_stat_all_indexes.
> Some fields actually make sense because they refer to on-disk files,
> mostly for pg_statio_all_tables (blocks_fetched, blocks_hit).
> 
> Hence, why don't we split PgStat_StatTabEntry into three things from
> the start, even if it means to duplicate some of them?  Say:
> - Table fields: includes [auto]vacuum/analyze data, block fields,
> fields of pg_stat_all_tables.

What do you mean by "block fields"? pg_statio_all_tables? If so, what's the
point of including them here, rather than in the relfilenode fields?

> - Index fields: no need for the [auto]vacuum/analyze time and counts,
> block fields, pg_stat_all_indexes fields.

I think we actually should populate the [auto]vac fields for indexes, right
now it's impossible to figure out from stats whether indexes are frequently
scanned as part of vacuum or not.

> - Relfilenode fields: dead_tuples, ins_since_vacuum and
> mod_since_analyze.  Does not apply to partitioned tables and indexes,
> only applies to tables.  Provides a clean split, embrace the fact that
> these are the only three fields we need to worry about during
> recovery.

I think we really ought to populate not just these during recovery, but also
at least n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd, n_live_tup.

I don't understand why we would want to only populate these three fields?

I'm not against splitting the index fields off, but it seems pretty orthogonal
to what we're discussing here.  If we were to split off index stats into a
separate stat, why wouldn't we keep the statio fields in the relfilenode
stats, since they're obviously intimately tied to that?

Greetings,

Andres Freund

Re: relfilenode statistics

From

Bertrand Drouvot

Date:

17 December, 10:30:35

Hi,

On Tue, Dec 16, 2025 at 10:22:06AM +0000, Bertrand Drouvot wrote:
> In the attached

PFA a mandatory rebase due to f4e797171ea.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
