Discussion: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Andres Freund
Date: January 24, 2020, 22:52:26

Hi,

Currently pg_stat_bgwriter.buffers_backend is pretty useless to gauge
whether backends are doing writes they shouldn't do. That's because it
counts things that are either unavoidable or unlikely to be doable by other
parts of the system (checkpointer, bgwriter).

In particular, extending the file cannot currently be done by any
other type of process, yet is counted. When using a buffer access
strategy it is also very likely that writes have to be done by the
'dirtying' backend itself, as the buffer will be reused soon after (when
not previously in s_b that is).

Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
autovacuum et al.

I think it'd make sense to at least split buffers_backend into
buffers_backend_extend,
buffers_backend_write,
buffers_backend_write_strat

but it could also be worthwhile to expand it into
buffers_backend_extend,
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
buffers_{backend,autovacuum}_write_strat

Possibly by internally, in contrast to SQL level, having just counter
arrays indexed by backend types.
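
A minimal sketch of the "counter arrays indexed by backend types" idea,
assuming a small hypothetical set of process types and I/O kinds. ProcType,
IoKind, BufferIoCounters and both functions are illustrative names, not
PostgreSQL's actual types or API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical process categories; not PostgreSQL's BackendType enum. */
typedef enum
{
    PROC_BACKEND,
    PROC_CHECKPOINTER,
    PROC_BGWRITER,
    PROC_AUTOVACUUM,
    NUM_PROC_TYPES
} ProcType;

/* Hypothetical I/O kinds, mirroring the column split proposed above. */
typedef enum
{
    IO_EXTEND,
    IO_WRITE,
    IO_WRITE_STRAT,
    NUM_IO_KINDS
} IoKind;

/* One counter per (process type, I/O kind) pair. */
typedef struct
{
    uint64_t counts[NUM_PROC_TYPES][NUM_IO_KINDS];
} BufferIoCounters;

/* Record one I/O of the given kind done by the given process type. */
static void
count_io(BufferIoCounters *c, ProcType proc, IoKind kind)
{
    assert(proc < NUM_PROC_TYPES && kind < NUM_IO_KINDS);
    c->counts[proc][kind]++;
}

/*
 * Approximation of what the single buffers_backend number lumps together
 * according to the message above: extends, plain writes and strategy
 * writes, by regular backends and by autovacuum.
 */
static uint64_t
legacy_buffers_backend(const BufferIoCounters *c)
{
    uint64_t total = 0;

    for (int k = 0; k < NUM_IO_KINDS; k++)
        total += c->counts[PROC_BACKEND][k] + c->counts[PROC_AUTOVACUUM][k];
    return total;
}
```

With this layout the SQL-level columns could be derived from individual
cells, and the old aggregate remains available as a sum if anyone still
wants it.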

It's also noteworthy that buffers_backend is accounted in an absurd
manner. One might think that writes are accounted from backend -> shared
memory or such. But instead it works like this:

1) backend flushes buffer in bufmgr.c, accounts for backend *write time*
2) mdwrite writes and registers a sync request, which forwards the sync request to checkpointer
3) ForwardSyncRequest(), when not called by bgwriter, increments CheckpointerShmem->num_backend_writes
4) checkpointer, whenever doing AbsorbSyncRequests(), moves
CheckpointerShmem->num_backend_writes to
BgWriterStats.m_buf_written_backend (local memory!)
5) Occasionally it calls pgstat_send_bgwriter(), which sends the data to
pgstat (which bgwriter also does)
6) Which then updates the shared memory used by the display functions
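
The hops listed above can be modelled with a toy struct, purely to show
that a backend write only becomes visible after the checkpointer has both
absorbed and sent it. ToyStatsPath and the toy_* functions are
hypothetical stand-ins; only the names in the comments come from the steps
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    /* stands in for CheckpointerShmem->num_backend_writes */
    uint64_t shmem_backend_writes;
    /* stands in for BgWriterStats.m_buf_written_backend (checkpointer-local) */
    uint64_t ckpt_local_pending;
    /* stands in for the value the display functions report */
    uint64_t visible_buffers_backend;
} ToyStatsPath;

/* Steps 2-3: a backend write forwards a sync request and bumps the shared count. */
static void
toy_backend_write(ToyStatsPath *p)
{
    assert(p != NULL);
    p->shmem_backend_writes++;
}

/* Step 4: the checkpointer absorbs requests and moves the count to local memory. */
static void
toy_absorb_sync_requests(ToyStatsPath *p)
{
    assert(p != NULL);
    p->ckpt_local_pending += p->shmem_backend_writes;
    p->shmem_backend_writes = 0;
}

/* Steps 5-6: the checkpointer sends its local count on; only now is it visible. */
static void
toy_send_stats(ToyStatsPath *p)
{
    assert(p != NULL);
    p->visible_buffers_backend += p->ckpt_local_pending;
    p->ckpt_local_pending = 0;
}
```

In this model the reported value stays at zero until both
toy_absorb_sync_requests() and toy_send_stats() have run, which is the
dependency on checkpointer wakeups criticized in the message.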

Worthwhile to note that backend buffer read/write *time* is accounted
differently. That's done via pgstat_send_tabstat().

I think there's very little excuse for the indirection via checkpointer.
Besides being architecturally weird, it actually requires that we
continue to wake up checkpointer over and over instead of optimizing how
and when we submit fsync requests.

As far as I can tell we're also simply not accounting at all for writes
done outside of shared buffers. All writes done directly through
smgrwrite()/extend() aren't accounted anywhere as far as I can tell.

I think we also count things as writes that aren't writes: mdtruncate()
is AFAICT counted as one backend write for each segment. Which seems
weird to me.

Lastly, I don't understand the point of sending fixed size stats,
like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
I don't like its architecture, we obviously need something like pgstat
to handle variable amounts of stats (database, table level etc
stats). But that doesn't at all apply to these types of global stats.

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Magnus Hagander
Date: January 25, 2020, 17:43:41

On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Currently pg_stat_bgwriter.buffers_backend is pretty useless to gauge
> whether backends are doing writes they shouldn't do. That's because it
> counts things that are either unavoidable or unlikely to be doable by other
> parts of the system (checkpointer, bgwriter).
> In particular, extending the file cannot currently be done by any
> other type of process, yet is counted. When using a buffer access
> strategy it is also very likely that writes have to be done by the
> 'dirtying' backend itself, as the buffer will be reused soon after (when
> not previously in s_b that is).

Yeah. That's quite annoying.


> Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
> autovacuum et al.
>
>
> I think it'd make sense to at least split buffers_backend into
> buffers_backend_extend,
> buffers_backend_write,
> buffers_backend_write_strat
>
> but it could also be worthwhile to expand it into
> buffers_backend_extend,
> buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> buffers_{backend,autovacuum}_write_strat

Given that these are individual global counters, I don't really see
any reason not to expand it to the bigger set of counters. It's easy
enough to add them up together later if needed.


> Possibly by internally, in contrast to SQL level, having just counter
> arrays indexed by backend types.
>
>
> It's also noteworthy that buffers_backend is accounted in an absurd
> manner. One might think that writes are accounted from backend -> shared
> memory or such. But instead it works like this:
>
> 1) backend flushes buffer in bufmgr.c, accounts for backend *write time*
> 2) mdwrite writes and  registers a sync request, which forwards the sync request to checkpointer
> 3) ForwardSyncRequest(), when not called by bgwriter, increments CheckpointerShmem->num_backend_writes
> 4) checkpointer, whenever doing AbsorbSyncRequests(), moves
>    CheckpointerShmem->num_backend_writes to
>    BgWriterStats.m_buf_written_backend (local memory!)
> 5) Occasionally it calls pgstat_send_bgwriter(), which sends the data to
>    pgstat (which bgwriter also does)
> 6) Which then updates the shared memory used by the display functions
>
> Worthwhile to note that backend buffer read/write *time* is accounted
> differently. That's done via pgstat_send_tabstat().
>
>
> I think there's very little excuse for the indirection via checkpointer.
> Besides being architecturally weird, it actually requires that we
> continue to wake up checkpointer over and over instead of optimizing how
> and when we submit fsync requests.
>
> As far as I can tell we're also simply not accounting at all for writes
> done outside of shared buffers. All writes done directly through
> smgrwrite()/extend() aren't accounted anywhere as far as I can tell.
>
>
> I think we also count things as writes that aren't writes: mdtruncate()
> is AFAICT counted as one backend write for each segment. Which seems
> weird to me.

It's at least slightly weird :) Might it be worth counting truncate
events separately?


> Lastly, I don't understand the point of sending fixed size stats,
> like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> I don't like its architecture, we obviously need something like pgstat
> to handle variable amounts of stats (database, table level etc
> stats). But that doesn't at all apply to these types of global stats.

That part has annoyed me as well a few times. +1 for just moving that
into a global shared memory. Given that we don't really care about
things being in sync between those different counters *or* if we lose
a bit of data (which the stats collector is designed to do), we could
even do that without a lock?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Andres Freund
Date: January 26, 2020, 03:44:01

Hi,

On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
> > autovacuum et al.

> > I think it'd make sense to at least split buffers_backend into
> > buffers_backend_extend,
> > buffers_backend_write,
> > buffers_backend_write_strat
> >
> > but it could also be worthwhile to expand it into
> > buffers_backend_extend,
> > buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> > buffers_{backend,autovacuum}_write_strat
> 
> Given that these are individual global counters, I don't really see
> any reason not to expand it to the bigger set of counters. It's easy
> enough to add them up together later if needed.

Are you agreeing to
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
or are you suggesting further ones?

> > I think we also count things as writes that aren't writes: mdtruncate()
> > is AFAICT counted as one backend write for each segment. Which seems
> > weird to me.
> 
> It's at least slightly weird :) Might it be worth counting truncate
> events separately?

Is that really something interesting? Feels like it'd have to be done at
a higher level to be useful. E.g. the truncate done by TRUNCATE (when in
same xact as creation) and VACUUM are quite different. I think it'd be
better to just not include it.

> > Lastly, I don't understand the point of sending fixed size stats,
> > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > I don't like its architecture, we obviously need something like pgstat
> > to handle variable amounts of stats (database, table level etc
> > stats). But that doesn't at all apply to these types of global stats.
> 
> That part has annoyed me as well a few times. +1 for just moving that
> into a global shared memory. Given that we don't really care about
> things being in sync between those different counters *or* if we lose
> a bit of data (which the stats collector is designed to do), we could
> even do that without a lock?

I don't think we'd quite want to do it without any (single counter)
synchronization - high concurrency setups would be pretty likely to
lose values that way. I suspect the best would be to have a struct in
shared memory that contains the potential counters for each potential
process. And then sum them up when actually wanting the concrete
value. That way we avoid unnecessary contention, in contrast to having a
single shared memory value for each (which would just pingpong between
different sockets and store buffers).  There's a few details like how
exactly to implement resetting the counters, but ...
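
A minimal sketch of the per-process layout described above, assuming a
fixed number of process slots and a 64-byte cache line. TOY_MAX_PROCS,
TOY_CACHE_LINE and all toy_* names are hypothetical; real code would size
the array from the server's process limits and place it in shared memory.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PROCS  8    /* hypothetical number of process slots */
#define TOY_CACHE_LINE 64   /* assumed cache line size in bytes */

/* One slot per process, padded so two slots never share a cache line. */
typedef struct
{
    uint64_t buffers_written;
    char     pad[TOY_CACHE_LINE - sizeof(uint64_t)];
} ToyProcSlot;

typedef struct
{
    ToyProcSlot slots[TOY_MAX_PROCS];
} ToySharedCounters;

/* Zero every slot; stands in for shared memory initialization. */
static void
toy_counters_init(ToySharedCounters *s)
{
    for (int i = 0; i < TOY_MAX_PROCS; i++)
        s->slots[i].buffers_written = 0;
}

/* Each process increments only its own slot, so writers never contend. */
static void
toy_count_write(ToySharedCounters *s, int procno)
{
    assert(procno >= 0 && procno < TOY_MAX_PROCS);
    s->slots[procno].buffers_written++;
}

/* Readers sum all slots when the concrete value is wanted. */
static uint64_t
toy_total_writes(const ToySharedCounters *s)
{
    uint64_t total = 0;

    for (int i = 0; i < TOY_MAX_PROCS; i++)
        total += s->slots[i].buffers_written;
    return total;
}
```

The padding is what avoids the cache line ping-pong mentioned above: a
single shared counter would bounce between CPUs on every increment, while
here each line has one writer.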

Thanks,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Magnus Hagander
Date: January 26, 2020, 18:20:03

On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
> > > autovacuum et al.
>
> > > I think it'd make sense to at least split buffers_backend into
> > > buffers_backend_extend,
> > > buffers_backend_write,
> > > buffers_backend_write_strat
> > >
> > > but it could also be worthwhile to expand it into
> > > buffers_backend_extend,
> > > buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> > > buffers_{backend,autovacuum}_write_strat
> >
> > Given that these are individual global counters, I don't really see
> > any reason not to expand it to the bigger set of counters. It's easy
> > enough to add them up together later if needed.
>
> Are you agreeing to
> buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> or are you suggesting further ones?

The former.


> > > I think we also count things as writes that aren't writes: mdtruncate()
> > > is AFAICT counted as one backend write for each segment. Which seems
> > > weird to me.
> >
> > It's at least slightly weird :) Might it be worth counting truncate
> > events separately?
>
> Is that really something interesting? Feels like it'd have to be done at
> a higher level to be useful. E.g. the truncate done by TRUNCATE (when in
> same xact as creation) and VACUUM are quite different. I think it'd be
> better to just not include it.

Yeah, you're probably right. It certainly makes very little sense
where it is now.


> > > Lastly, I don't understand the point of sending fixed size stats,
> > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > > I don't like its architecture, we obviously need something like pgstat
> > > to handle variable amounts of stats (database, table level etc
> > > stats). But that doesn't at all apply to these types of global stats.
> >
> > That part has annoyed me as well a few times. +1 for just moving that
> > into a global shared memory. Given that we don't really care about
> > things being in sync between those different counters *or* if we lose
> > a bit of data (which the stats collector is designed to do), we could
> > even do that without a lock?
>
> I don't think we'd quite want to do it without any (single counter)
> synchronization - high concurrency setups would be pretty likely to
> lose values that way. I suspect the best would be to have a struct in
> shared memory that contains the potential counters for each potential
> process. And then sum them up when actually wanting the concrete
> value. That way we avoid unnecessary contention, in contrast to having a
> single shared memory value for each (which would just pingpong between
> different sockets and store buffers).  There's a few details like how
> exactly to implement resetting the counters, but ...

Right. Each process gets to do their own write, but still in shared
memory. But do you need to lock them when reading them (for the
summary)? That's the part where I figured you could just read and
summarize them, and accept the possible loss.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Andres Freund
Date: January 26, 2020, 23:22:03

Hi,

On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > > Lastly, I don't understand the point of sending fixed size stats,
> > > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > > > I don't like its architecture, we obviously need something like pgstat
> > > > to handle variable amounts of stats (database, table level etc
> > > > stats). But that doesn't at all apply to these types of global stats.
> > >
> > > That part has annoyed me as well a few times. +1 for just moving that
> > > into a global shared memory. Given that we don't really care about
> > > things being in sync between those different counters *or* if we lose
> > > a bit of data (which the stats collector is designed to do), we could
> > > even do that without a lock?
> >
> > I don't think we'd quite want to do it without any (single counter)
> > synchronization - high concurrency setups would be pretty likely to
> > lose values that way. I suspect the best would be to have a struct in
> > shared memory that contains the potential counters for each potential
> > process. And then sum them up when actually wanting the concrete
> > value. That way we avoid unnecessary contention, in contrast to having a
> > single shared memory value for each (which would just pingpong between
> > different sockets and store buffers).  There's a few details like how
> > exactly to implement resetting the counters, but ...
> 
> Right. Each process gets to do their own write, but still in shared
> memory. But do you need to lock them when reading them (for the
> summary)? That's the part where I figured you could just read and
> summarize them, and accept the possible loss.

Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
integers can be read / written without a danger of torn values, and I
don't think we need perfect cross counter accuracy. To deal with the few
platforms without 64bit "single copy atomicity", we can just use
pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
fall back to using locked operations for those platforms.  So I don't
think there's actually a danger of loss.
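
A sketch of the single-writer pattern described above, using C11
<stdatomic.h> as a stand-in for pg_atomic_read_u64/pg_atomic_write_u64.
ToyCounter and the toy_* functions are hypothetical. The point is that the
owning process uses a plain atomic load and store rather than an atomic
read-modify-write, while any reader can still load an untorn value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    _Atomic uint64_t value;
} ToyCounter;

/*
 * Owner only. This is a separate load and store, not an atomic
 * fetch-and-add; it is correct only because a single process ever
 * writes this counter.
 */
static void
toy_counter_inc(ToyCounter *c)
{
    uint64_t v;

    assert(c != NULL);
    v = atomic_load_explicit(&c->value, memory_order_relaxed);
    atomic_store_explicit(&c->value, v + 1, memory_order_relaxed);
}

/* Any process: the load is atomic, so a torn 64-bit value cannot be seen. */
static uint64_t
toy_counter_read(ToyCounter *c)
{
    assert(c != NULL);
    return atomic_load_explicit(&c->value, memory_order_relaxed);
}
```

Relaxed ordering is used because, as the message says, perfect
cross-counter accuracy is not required; only each individual value has to
be read whole.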

Obviously we could also use atomic ops to increment the value, but I'd
rather not add all those atomic operations, even if it's on uncontended
cachelines. It'd allow us to reset the backend values more easily by
just swapping in a 0, which we can't do if the backend increments
non-atomically. But I think we could instead just have one global "bias"
value to implement resets (by subtracting that from the summarized
value, and storing the current sum when resetting). Or use the new
global barrier to trigger a reset. Or something similar.
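
The "bias" variant of resetting can be sketched as follows, assuming a
small fixed number of per-process counters. ToyBiasedCounters, TOY_NPROCS
and the toy_* functions are hypothetical names. A reset never touches the
per-process values; it only remembers the sum at reset time and subtracts
it from later reads.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NPROCS 4    /* hypothetical number of processes */

typedef struct
{
    uint64_t per_proc[TOY_NPROCS];  /* only ever incremented by their owners */
    uint64_t bias;                  /* raw sum at the time of the last reset */
} ToyBiasedCounters;

/* Sum of all per-process counters, ignoring any reset. */
static uint64_t
toy_raw_sum(const ToyBiasedCounters *c)
{
    uint64_t sum = 0;

    for (int i = 0; i < TOY_NPROCS; i++)
        sum += c->per_proc[i];
    return sum;
}

/* Value to report: everything counted since the last reset. */
static uint64_t
toy_reported(const ToyBiasedCounters *c)
{
    uint64_t sum = toy_raw_sum(c);

    /* counters only grow, so the sum cannot fall below the stored bias */
    assert(sum >= c->bias);
    return sum - c->bias;
}

/* Reset without writing to any process's counter. */
static void
toy_reset(ToyBiasedCounters *c)
{
    c->bias = toy_raw_sum(c);
}
```

This sketch ignores concurrency between a reset and in-flight increments;
that is one of the details the message leaves open.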

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Kyotaro Horiguchi
Date: January 27, 2020, 07:20:09

Hello.

At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund <andres@anarazel.de> wrote in 
> Hi,

I feel the same on the specific issues brought in upthread.

> On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> > On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > Lastly, I don't understand the point of sending fixed size stats,
> > > > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > > > > I don't like its architecture, we obviously need something like pgstat
> > > > > to handle variable amounts of stats (database, table level etc
> > > > > stats). But that doesn't at all apply to these types of global stats.
> > > >
> > > > That part has annoyed me as well a few times. +1 for just moving that
> > > > into a global shared memory. Given that we don't really care about
> > > > things being in sync between those different counters *or* if we lose
> > > > a bit of data (which the stats collector is designed to do), we could
> > > > even do that without a lock?
> > >
> > > I don't think we'd quite want to do it without any (single counter)
> > > synchronization - high concurrency setups would be pretty likely to
> > > lose values that way. I suspect the best would be to have a struct in
> > > shared memory that contains the potential counters for each potential
> > > process. And then sum them up when actually wanting the concrete
> > > value. That way we avoid unnecessary contention, in contrast to having a
> > > single shared memory value for each (which would just pingpong between
> > > different sockets and store buffers).  There's a few details like how
> > > exactly to implement resetting the counters, but ...
> > 
> > Right. Each process gets to do their own write, but still in shared
> > memory. But do you need to lock them when reading them (for the
> > summary)? That's the part where I figured you could just read and
> > summarize them, and accept the possible loss.
> 
> Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
> integers can be read / written without a danger of torn values, and I
> don't think we need perfect cross counter accuracy. To deal with the few
> platforms without 64bit "single copy atomicity", we can just use
> pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
> fall back to using locked operations for those platforms.  So I don't
> think there's actually a danger of loss.
> 
> Obviously we could also use atomic ops to increment the value, but I'd
> rather not add all those atomic operations, even if it's on uncontended
> cachelines. It'd allow us to reset the backend values more easily by
> just swapping in a 0, which we can't do if the backend increments
> non-atomically. But I think we could instead just have one global "bias"
> value to implement resets (by subtracting that from the summarized
> value, and storing the current sum when resetting). Or use the new
> global barrier to trigger a reset. Or something similar.

Fixed or global stats are a suitable starting point for a shared-memory
stats collector. In the case of buffers_*_write, the global stats
entry for each process needs just 8 bytes plus maybe an extra 8 bytes for
the bias value.  I'm not sure how many counters like this there are,
but is a footprint of that size acceptable?  (Each backend already uses
the same amount of local memory for pgstat use, though.)

Anyway I will do something like that as a trial, maybe by adding a
member in PgBackendStatus and one global shared value for the bias.

      int64       st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+     PgBackendStatsCounters counters;
  } PgBackendStatus;
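
To make the footprint question concrete, here is a sketch under stated
assumptions. The message only names the type PgBackendStatsCounters; the
three fields below, the Toy prefix and toy_counters_footprint() are
hypothetical, chosen to match the 8-bytes-per-counter estimate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical contents: three 8-byte counters per process. */
typedef struct
{
    int64_t buf_written;
    int64_t buf_extended;
    int64_t buf_written_strat;
} ToyBackendStatsCounters;

/*
 * Shared memory needed for nprocs per-process entries plus one shared
 * set of bias values (one bias per counter).
 */
static size_t
toy_counters_footprint(size_t nprocs)
{
    assert(nprocs > 0);
    return nprocs * sizeof(ToyBackendStatsCounters)
        + sizeof(ToyBackendStatsCounters);
}
```

Under these assumptions each additional counter costs 8 bytes per process
slot plus 8 bytes once for its bias.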

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Melanie Plageman
Date: April 13, 2021, 05:49:36

On Sun, Jan 26, 2020 at 11:21 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund <andres@anarazel.de> wrote in
> > On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> > > On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > > > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > Lastly, I don't understand the point of sending fixed size stats,
> > > > > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > > > > > I don't like its architecture, we obviously need something like pgstat
> > > > > > to handle variable amounts of stats (database, table level etc
> > > > > > stats). But that doesn't at all apply to these types of global stats.
> > > > >
> > > > > That part has annoyed me as well a few times. +1 for just moving that
> > > > > into a global shared memory. Given that we don't really care about
> > > > > things being in sync between those different counters *or* if we lose
> > > > > a bit of data (which the stats collector is designed to do), we could
> > > > > even do that without a lock?
> > > >
> > > > I don't think we'd quite want to do it without any (single counter)
> > > > synchronization - high concurrency setups would be pretty likely to
> > > > lose values that way. I suspect the best would be to have a struct in
> > > > shared memory that contains the potential counters for each potential
> > > > process. And then sum them up when actually wanting the concrete
> > > > value. That way we avoid unnecessary contention, in contrast to having a
> > > > single shared memory value for each (which would just pingpong between
> > > > different sockets and store buffers).  There's a few details like how
> > > > exactly to implement resetting the counters, but ...
> > >
> > > Right. Each process gets to do their own write, but still in shared
> > > memory. But do you need to lock them when reading them (for the
> > > summary)? That's the part where I figured you could just read and
> > > summarize them, and accept the possible loss.
> >
> > Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
> > integers can be read / written without a danger of torn values, and I
> > don't think we need perfect cross counter accuracy. To deal with the few
> > platforms without 64bit "single copy atomicity", we can just use
> > pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
> > fall back to using locked operations for those platforms.  So I don't
> > think there's actually a danger of loss.
> >
> > Obviously we could also use atomic ops to increment the value, but I'd
> > rather not add all those atomic operations, even if it's on uncontended
> > cachelines. It'd allow us to reset the backend values more easily by
> > just swapping in a 0, which we can't do if the backend increments
> > non-atomically. But I think we could instead just have one global "bias"
> > value to implement resets (by subtracting that from the summarized
> > value, and storing the current sum when resetting). Or use the new
> > global barrier to trigger a reset. Or something similar.
>
> Fixed or global stats are a suitable starting point for a shared-memory
> stats collector. In the case of buffers_*_write, the global stats
> entry for each process needs just 8 bytes plus maybe an extra 8 bytes for
> the bias value.  I'm not sure how many counters like this there are,
> but is a footprint of that size acceptable?  (Each backend already uses
> the same amount of local memory for pgstat use, though.)
>
> Anyway I will do something like that as a trial, maybe by adding a
> member in PgBackendStatus and one global shared value for the bias.
>
>       int64       st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
> +     PgBackendStatsCounters counters;
>   } PgBackendStatus;
>

So, I took a stab at implementing this in PgBackendStatus. The attached
patch is not quite on top of current master, so, alas, don't try and
apply it. I went to rebase today and realized I needed to make some
changes in light of e1025044cd4, however, I wanted to share this WIP so
that I could pose a few questions that I imagine will still be relevant
after I rewrite the patch.

I removed buffers_backend and buffers_backend_fsync from
pg_stat_bgwriter and have created a new view which tracks
  - number of shared buffers the checkpointer and bgwriter write out
  - number of shared buffers a regular backend is forced to flush
  - number of extends done by a regular backend through shared buffers
  - number of buffers flushed by a backend or autovacuum using a
    BufferAccessStrategy which, were they not to use this strategy,
    could perhaps have been avoided if a clean shared buffer was
    available
  - number of fsyncs done by a backend which could have been done by
    checkpointer if sync queue had not been full

This view currently tracks only writes and extends that go through
shared buffers and fsyncs of shared buffers (which, AFAIK, are the only
things fsync'd through the SyncRequest machinery currently).

BufferAlloc() and SyncOneBuffer() are the main points at which the
tracking is done. I can definitely expand this, but, I want to make sure
that we are tracking the right kind of information.

num_backend_writes and num_backend_fsync were intended (though they were
not accurate) to count buffers that backends had to end up writing
themselves and fsyncs that backends had to end up doing themselves which
could have been avoided with a different configuration (or, I suppose, a
different workload/different data, etc). That is, they were meant to
tell you if checkpointer and bgwriter were keeping up and/or if the
size of shared buffers was adequate.

In implementing this counting per backend, it is easy for all types of
backends to keep track of the number of writes, extends, fsyncs, and
strategy writes they are doing. So, as recommended upthread, I have
added columns in the view for the number of writes for checkpointer and
bgwriter and others. Thus, this view becomes more than just stats on
"avoidable I/O done by backends".

So, my question is, does it make sense to track all extends -- those to
extend the fsm and visimap and when making a new relation or index? Is
that information useful? If so, is it different than the extends done
through shared buffers? Should it be tracked separately?

Also, if we care about all of the extends, then it seems a bit annoying
to pepper the counting all over the place when it really just needs to
be done in smgrextend() -- even though maybe a stats function doesn't
belong in that API.

Another question I have is, should the number of extends be for every
single block extended or should we try to track the initiation of a set
of extends (all of those added in RelationAddExtraBlocks(), in this
case)?

When it comes to fsync counting, I only count the fsyncs counted by the
previous code -- that is, fsyncs done by backends themselves when the
checkpointer sync request queue was full.
I did the counting in the same place in checkpointer code -- in
ForwardSyncRequest() -- partially because there did not seem to be
another good place to do it. register_dirty_segment() returns void (I
thought about having it return a bool to indicate whether it fsync'd or
registered the fsync, because that seemed alright, but mdextend(),
mdwrite() etc. also return void), so there is no way to propagate the
information back up to the bufmgr that the process had to do its own
fsync. That means that I would have to muck with the md.c API. And,
since the checkpointer is the one processing these sync requests anyway,
it actually seems okay to do it in the checkpointer code.
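
The alternative considered above, having the register step report whether
the caller had to fsync itself, can be sketched with a toy bounded queue.
ToySyncQueue, TOY_QUEUE_SIZE and the toy_* functions are hypothetical and
are not md.c or checkpointer code. Unlike the approach described in this
message, the sketch counts in the caller rather than in
ForwardSyncRequest().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_QUEUE_SIZE 2    /* hypothetical capacity of the sync request queue */

typedef struct
{
    int      queued;            /* requests waiting for the checkpointer */
    uint64_t backend_fsyncs;    /* fsyncs that backends had to do themselves */
} ToySyncQueue;

/* Returns true if the request was queued, false if the queue was full. */
static bool
toy_forward_sync_request(ToySyncQueue *q)
{
    assert(q->queued >= 0 && q->queued <= TOY_QUEUE_SIZE);
    if (q->queued == TOY_QUEUE_SIZE)
        return false;
    q->queued++;
    return true;
}

/*
 * Returns true when the caller had to fsync on its own because the
 * request could not be handed to the checkpointer.
 */
static bool
toy_register_dirty_segment(ToySyncQueue *q)
{
    if (toy_forward_sync_request(q))
        return false;
    /* a real implementation would fsync the segment file here */
    q->backend_fsyncs++;
    return true;
}
```

Returning the bool would let the buffer manager do its own accounting,
at the cost of changing the md.c function signatures mentioned above.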

I'm not counting fsyncs that are "unavoidable" in the sense that they
couldn't be avoided by changing settings/workload etc -- like those done
when building an index, creating a table/rewriting a table/copying a
table -- is it useful to count these? It seems like it makes the number
of "avoidable fsyncs by backends" less useful if we count the others.
Also, should we count how many fsyncs checkpointer has done (have to
check if there is already a stat for that)? Is that useful in this
context?

Of course, this view, when grown, will begin to overlap with pg_statio,
which is another consideration. What is its identity? I would find
"avoidable I/O" either avoidable entirely or avoidable for that
particular type of process, to be useful.

Or maybe, it should have a more expansive mandate. Maybe it would be
useful to aggregate some of the info from pg_stat_statements at a higher
level -- like maybe shared_blks_read counted across many statements for
a period of time/context in which we expected the relation to be in
shared buffers becomes potentially interesting.

As for the way I have recorded strategy writes -- it is quite inelegant,
but, I wanted to make sure that I only counted a strategy write as one
in which the backend wrote out the dirty buffer from its strategy ring
but did not check if there was any clean buffer in shared buffers more
generally (so, it is *potentially* an avoidable write). I'm not sure if
this distinction is useful to anyone. I haven't done enough with
BufferAccessStrategies to know what I'd want to know about them when
developing or using Postgres. However, if I don't need to be so careful,
it will make the code much simpler (though, I'm sure I can improve the
code regardless).

As for the implementation of the counters themselves, I appreciate that
it isn't very nice to have a bunch of random members in PgBackendStatus
to count all of these writes, extends, fsyncs. I considered if I could
add params that were used for all command types to st_progress_param but
I haven't looked into it yet. Alternatively, I could create an array
just for these kinds of stats in PgBackendStatus. Though, I imagine that
I should take a look at the changes that have been made recently to this
area and at the shared memory stats patch.

Oh, also, there should be a way to reset the stats, especially if we add
more extends and fsyncs that happen at the time of relation/index
creation. I, at least, would find it useful to see these numbers once
the database is at some kind of steady state.

Oh and src/test/regress/sql/stats.sql will fail and, of course, I don't
intend to add that SELECT from the view to regress, it was just for
testing purposes to make sure the view was working.

-- Melanie

Attachments

v1-0001-Add-system-view-tracking-shared-buffers-written.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Andres Freund
Date: April 16, 2021, 02:59:54

Hi,

On 2021-04-12 19:49:36 -0700, Melanie Plageman wrote:
> So, I took a stab at implementing this in PgBackendStatus.

Cool!


> The attached patch is not quite on top of current master, so, alas,
> don't try and apply it. I went to rebase today and realized I needed
> to make some changes in light of e1025044cd4, however, I wanted to
> share this WIP so that I could pose a few questions that I imagine
> will still be relevant after I rewrite the patch.
>
> I removed buffers_backend and buffers_backend_fsync from
> pg_stat_bgwriter and have created a new view which tracks
>   - number of shared buffers the checkpointer and bgwriter write out
>   - number of shared buffers a regular backend is forced to flush
>   - number of extends done by a regular backend through shared buffers
>   - number of buffers flushed by a backend or autovacuum using a
>     BufferAccessStrategy which, were they not to use this strategy,
>     could perhaps have been avoided if a clean shared buffer was
>     available
>   - number of fsyncs done by a backend which could have been done by
>     checkpointer if sync queue had not been full

I wonder if leaving buffers_alloc in pg_stat_bgwriter makes sense after
this? I'm tempted to move that to pg_stat_buffers or such...

I'm not quite convinced by having separate columns for checkpointer,
bgwriter, etc. That doesn't seem to scale all that well. What if we
instead made it a view that has one row for each BackendType?


> In implementing this counting per backend, it is easy for all types of
> backends to keep track of the number of writes, extends, fsyncs, and
> strategy writes they are doing. So, as recommended upthread, I have
> added columns in the view for the number of writes for checkpointer and
> bgwriter and others. Thus, this view becomes more than just stats on
> "avoidable I/O done by backends".
>
> So, my question is, does it make sense to track all extends -- those to
> extend the fsm and visimap and when making a new relation or index? Is
> that information useful? If so, is it different than the extends done
> through shared buffers? Should it be tracked separately?

I don't fully understand what you mean with "extends done through shared
buffers"?


> Another question I have is, should the number of extends be for every
> single block extended or should we try to track the initiation of a set
> of extends (all of those added in RelationAddExtraBlocks(), in this
> case)?

I think it should be 8k blocks, i.e. RelationAddExtraBlocks() should be
tracked as many individual extends. It's implemented that way, but more
importantly, it should be in BLCKSZ units. If we later add some actually
batched operations, we can have separate stats for that.
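
A minimal sketch of the "one extend per BLCKSZ block" accounting suggested above. The names `demo_add_extra_blocks`, `DemoExtendStats` and `DEMO_BLCKSZ` are hypothetical stand-ins for illustration, not PostgreSQL code; the point is only that a bulk extension of N blocks bumps the counter N times.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLCKSZ 8192        /* assumed block size for this sketch */

/* Hypothetical per-process counter of relation extends. */
typedef struct DemoExtendStats
{
    uint64_t    extends;
} DemoExtendStats;

/*
 * Hypothetical bulk-extension helper: adds nblocks blocks and counts each
 * one as an individual extend. Returns the number of bytes added.
 */
uint64_t
demo_add_extra_blocks(DemoExtendStats *stats, uint32_t nblocks)
{
    assert(stats != 0);

    for (uint32_t i = 0; i < nblocks; i++)
    {
        /* real code would write one zeroed block to the file here */
        stats->extends++;
    }
    return (uint64_t) nblocks * DEMO_BLCKSZ;
}
```

If truly batched extension is added later, a separate counter for batches could sit next to `extends` without changing its meaning.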


> Of course, this view, when grown, will begin to overlap with pg_statio,
> which is another consideration. What is its identity? I would find
> "avoidable I/O" either avoidable entirely or avoidable for that
> particular type of process, to be useful.

I think it's fine to overlap with pg_statio_* - those are for individual
objects, so it seems to be expected to overlap with coarser stats.


> Or maybe, it should have a more expansive mandate. Maybe it would be
> useful to aggregate some of the info from pg_stat_statements at a higher
> level -- like maybe shared_blks_read counted across many statements for
> a period of time/context in which we expected the relation in shared
> buffers becomes potentially interesting.

Let's do something more basic first...


Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

05 June 2021, 00:12:43

On Thu, Apr 15, 2021 at 7:59 PM Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2021-04-12 19:49:36 -0700, Melanie Plageman wrote:
> > So, I took a stab at implementing this in PgBackendStatus.
>
> Cool!

Just a note on v2 of the patch -- the diff for the changes I made to
pgstatfuncs.c is pretty atrocious and hard to read. I tried using a
different diff algorithm, to no avail.

> > The attached patch is not quite on top of current master, so, alas,
> > don't try and apply it. I went to rebase today and realized I needed
> > to make some changes in light of e1025044cd4, however, I wanted to
> > share this WIP so that I could pose a few questions that I imagine
> > will still be relevant after I rewrite the patch.

Regarding the refactor done in e1025044cd4:
Most of the functions I've added access variables in PgBackendStatus, so
I put most of them in backend_status.h/c. However, technically, these
are stats which are aggregated over time, which e1025044cd4 says should
go in pgstat.c/h. I could move some of it, but I hadn't tried to do so,
as it made a few things inconvenient, and, I wasn't sure if it was the
right thing to do anyway.

> >
> > I removed buffers_backend and buffers_backend_fsync from
> > pg_stat_bgwriter and have created a new view which tracks
> >   - number of shared buffers the checkpointer and bgwriter write out
> >   - number of shared buffers a regular backend is forced to flush
> >   - number of extends done by a regular backend through shared buffers
> >   - number of buffers flushed by a backend or autovacuum using a
> >     BufferAccessStrategy which, were they not to use this strategy,
> >     could perhaps have been avoided if a clean shared buffer was
> >     available
> >   - number of fsyncs done by a backend which could have been done by
> >     checkpointer if sync queue had not been full
>
> I wonder if leaving buffers_alloc in pg_stat_bgwriter makes sense after
> this? I'm tempted to move that to pg_stat_buffers or such...

I've gone ahead and moved buffers_alloc out of pg_stat_bgwriter and into
pg_stat_buffer_actions (I've renamed it from pg_stat_buffers_written).

> I'm not quite convinced by having separate columns for checkpointer,
> bgwriter, etc. That doesn't seem to scale all that well. What if we
> instead made it a view that has one row for each BackendType?

I've changed the view to have one row for each backend type for which we
would like to report stats and one column for each buffer action type.
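
To make the "one row per backend type, one column per buffer action type" layout concrete, here is a small sketch of a counter table indexed by both. All `Demo*` and `demo_*` names are hypothetical and only illustrate the shape; they are not the structures used in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for BackendType and the buffer action types. */
typedef enum
{
    DEMO_B_BACKEND,
    DEMO_B_BGWRITER,
    DEMO_B_CHECKPOINTER,
    DEMO_B_AUTOVAC_WORKER
} DemoBackendType;
#define DEMO_NUM_BACKEND_TYPES 4

typedef enum
{
    DEMO_BA_ALLOC,
    DEMO_BA_EXTEND,
    DEMO_BA_FSYNC,
    DEMO_BA_WRITE,
    DEMO_BA_WRITE_STRAT
} DemoBufferAction;
#define DEMO_NUM_BUFFER_ACTIONS 5

/* One row per backend type, one column per action. */
typedef struct DemoBufferActionTable
{
    uint64_t    counts[DEMO_NUM_BACKEND_TYPES][DEMO_NUM_BUFFER_ACTIONS];
} DemoBufferActionTable;

void
demo_table_init(DemoBufferActionTable *t)
{
    memset(t, 0, sizeof(*t));
}

/* Record one action for the given backend type. */
void
demo_table_count(DemoBufferActionTable *t, DemoBackendType bt,
                 DemoBufferAction ba)
{
    assert((int) bt >= 0 && (int) bt < DEMO_NUM_BACKEND_TYPES);
    assert((int) ba >= 0 && (int) ba < DEMO_NUM_BUFFER_ACTIONS);
    t->counts[bt][ba]++;
}

/* Copy the row for one backend type, as a view function might emit it. */
void
demo_table_row(const DemoBufferActionTable *t, DemoBackendType bt,
               uint64_t *row)
{
    assert((int) bt >= 0 && (int) bt < DEMO_NUM_BACKEND_TYPES);
    memcpy(row, t->counts[bt], sizeof(t->counts[bt]));
}
```

With this shape, adding a backend type adds a row and adding an action adds a column, so the SQL-level view does not need a new column per process type.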

To make the code easier to write, I record buffer actions for all
backend types -- even if we don't have any buffer actions we care about
for that backend type. I thought it was okay because when I actually
aggregate the counters across backends, I only do so for the backend
types we care about -- thus there shouldn't be much accessing of shared
memory by multiple different processes.

Also, I copy-pasted most of the code in pg_stat_get_buffer_actions() to
set up the result tuplestore from pg_stat_get_activity() without totally
understanding all the parts of it, so I'm not sure if all of it is
required here.

> > In implementing this counting per backend, it is easy for all types of
> > backends to keep track of the number of writes, extends, fsyncs, and
> > strategy writes they are doing. So, as recommended upthread, I have
> > added columns in the view for the number of writes for checkpointer and
> > bgwriter and others. Thus, this view becomes more than just stats on
> > "avoidable I/O done by backends".
> >
> > So, my question is, does it make sense to track all extends -- those to
> > extend the fsm and visimap and when making a new relation or index? Is
> > that information useful? If so, is it different than the extends done
> > through shared buffers? Should it be tracked separately?
>
> I don't fully understand what you mean with "extends done through shared
> buffers"?

By "extends done through shared buffers", I just mean when an extend of
a relation is done and the data that will be written to the new block is
written into a shared buffer (as opposed to a local one or local memory
or a strategy buffer).

Random note:

I added a length member to the BackendType enum (BACKEND_NUM_TYPES),
which led to this compiler warning:

miscinit.c: In function ‘GetBackendTypeDesc’:
miscinit.c:236:2: warning: enumeration value ‘BACKEND_NUM_TYPES’ not handled in switch [-Wswitch]
236 | switch (backendType)
| ^~~~~~

I tried using pg_attribute_unused() for BACKEND_NUM_TYPES, but, it
didn't seem to have the desired effect. As such, I just threw a case
into GetBackendTypeDesc() which does nothing (as opposed to erroring
out), since the backendDesc already is initialized to "unknown process
type", erroring out doesn't seem to be expected.
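
For what it's worth, one alternative that avoids the -Wswitch warning altogether is to keep the count out of the enum and define it as a macro derived from the last real member. The sketch below uses made-up names (`DemoBackendType`, `DEMO_BACKEND_NUM_TYPES`, `demo_backend_type_desc`); it is not what the attached patch does, just an illustration of the idea.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum; the count is NOT a member, so switches stay complete. */
typedef enum DemoBackendType
{
    DEMO_B_INVALID = 0,
    DEMO_B_BACKEND,
    DEMO_B_BGWRITER,
    DEMO_B_CHECKPOINTER
} DemoBackendType;

/* Must be updated if a member is added after DEMO_B_CHECKPOINTER. */
#define DEMO_BACKEND_NUM_TYPES (DEMO_B_CHECKPOINTER + 1)

const char *
demo_backend_type_desc(DemoBackendType backend_type)
{
    const char *desc = "unknown process type";

    /* No default case: the compiler can still flag a forgotten member. */
    switch (backend_type)
    {
        case DEMO_B_INVALID:
            desc = "not initialized";
            break;
        case DEMO_B_BACKEND:
            desc = "client backend";
            break;
        case DEMO_B_BGWRITER:
            desc = "background writer";
            break;
        case DEMO_B_CHECKPOINTER:
            desc = "checkpointer";
            break;
    }
    return desc;
}
```

The trade-off is that the macro has to name the last member explicitly, so whoever appends a new backend type needs to remember to adjust it.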

- Melanie

Attachments

v2-0001-Add-system-view-tracking-shared-buffer-actions.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Alvaro Herrera

Date:

05 June 2021, 00:52:17

On 2021-Apr-12, Melanie Plageman wrote:

> As for the way I have recorded strategy writes -- it is quite inelegant,
> but, I wanted to make sure that I only counted a strategy write as one
> in which the backend wrote out the dirty buffer from its strategy ring
> but did not check if there was any clean buffer in shared buffers more
> generally (so, it is *potentially* an avoidable write). I'm not sure if
> this distinction is useful to anyone. I haven't done enough with
> BufferAccessStrategies to know what I'd want to know about them when
> developing or using Postgres. However, if I don't need to be so careful,
> it will make the code much simpler (though, I'm sure I can improve the
> code regardless).

I was bitten last year by REFRESH MATERIALIZED VIEW counting its writes
via buffers_backend, and I was very surprised/confused about it.  So it
seems definitely worthwhile to count writes via strategy separately.
For a DBA tuning the server configuration it is very useful.

The main thing is to *not* let these writes end up regular
buffers_backend (or whatever you call these now).  I didn't read your
patch, but the way you have described it seems okay to me.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

03 August 2021, 01:25:56

On Fri, Jun 4, 2021 at 5:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Apr-12, Melanie Plageman wrote:
>
> > As for the way I have recorded strategy writes -- it is quite inelegant,
> > but, I wanted to make sure that I only counted a strategy write as one
> > in which the backend wrote out the dirty buffer from its strategy ring
> > but did not check if there was any clean buffer in shared buffers more
> > generally (so, it is *potentially* an avoidable write). I'm not sure if
> > this distinction is useful to anyone. I haven't done enough with
> > BufferAccessStrategies to know what I'd want to know about them when
> > developing or using Postgres. However, if I don't need to be so careful,
> > it will make the code much simpler (though, I'm sure I can improve the
> > code regardless).
>
> I was bitten last year by REFRESH MATERIALIZED VIEW counting its writes
> via buffers_backend, and I was very surprised/confused about it.  So it
> seems definitely worthwhile to count writes via strategy separately.
> For a DBA tuning the server configuration it is very useful.
>
> The main thing is to *not* let these writes end up regular
> buffers_backend (or whatever you call these now).  I didn't read your
> patch, but the way you have described it seems okay to me.
>

Thanks for the feedback!

I agree it makes sense to count strategy writes separately.

I thought about this some more, and I don't know if it makes sense to
only count "avoidable" strategy writes.

This would mean that a backend writing out a buffer from the strategy
ring when no clean shared buffers (as well as no clean strategy buffers)
are available would not count that write as a strategy write (even
though it is writing out a buffer from its strategy ring). But, it
obviously doesn't make sense to count it as a regular buffer being
written out. So, I plan to change this code.

On another note, I've updated the patch with more correct concurrency
control mechanisms (had some data races and other problems
before). Now, I am using atomics for the buffer action counters, though
the code includes several #TODO questions around the correctness of what
I have now too.

I also wrapped the buffer action types in a struct to make them easier
to work with.

The most substantial missing piece of the patch right now is persisting
the data across reboots.

The two places in the code I can see to persist the buffer action stats
data are:
1) using the stats collector code (like in
pgstat_read/write_statsfiles())
2) using a before_shmem_exit() hook which writes the data structure to a
file and then read from it when making the shared memory array initially

It feels a bit weird to me to wedge the buffer actions stats into the
stats collector code--since the stats collector isn't receiving and
aggregating the buffer action stats.

Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.

And, I don't think I can use pgstat_read_statsfiles() since the
BufferActionStatsArray should have the data from the file as soon as the
view containing the buffer action stats can be queried. Thus, it seems
like I would need to read the file while initializing the array in
CreateBufferActionStatsCounters().

I am registering the patch for September commitfest but plan to update
the stats persistence before then (and docs, etc).

-- Melanie

Attachments

v3-0001-Add-system-view-tracking-shared-buffer-actions.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

03 August 2021, 21:12:58

Hi,

On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:
> Thanks for the feedback!
> 
> I agree it makes sense to count strategy writes separately.
> 
> I thought about this some more, and I don't know if it makes sense to
> only count "avoidable" strategy writes.
> 
> This would mean that a backend writing out a buffer from the strategy
> ring when no clean shared buffers (as well as no clean strategy buffers)
> are available would not count that write as a strategy write (even
> though it is writing out a buffer from its strategy ring). But, it
> obviously doesn't make sense to count it as a regular buffer being
> written out. So, I plan to change this code.

What do you mean with "no clean shared buffers ... are available"?



> The most substantial missing piece of the patch right now is persisting
> the data across reboots.
> 
> The two places in the code I can see to persist the buffer action stats
> data are:
> 1) using the stats collector code (like in
> pgstat_read/write_statsfiles())
> 2) using a before_shmem_exit() hook which writes the data structure to a
> file and then read from it when making the shared memory array initially

I think it's pretty clear that we should go for 1. Having two mechanisms for
persisting stats data is a bad idea.


> Also, I'm unsure how writing the buffer action stats out in
> pgstat_write_statsfiles() will work, since I think that backends can
> update their buffer action stats after we would have already persisted
> the data from the BufferActionStatsArray -- causing us to lose those
> updates.

I was thinking it'd work differently. Whenever a connection ends, it reports
its data up to pgstats.c (otherwise we'd lose those stats). By the time
shutdown happens, they all need to have already reported their stats - so
we don't need to do anything to get the data to pgstats.c during shutdown
time.
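
A rough sketch of that flow, under the assumption that each process keeps private counters and hands them to a single accumulator when it exits. The names (`DemoCounters`, `demo_collector_absorb`, `demo_backend_report_on_exit`) are invented for illustration and the real stats collector protocol is message-based, which is not modeled here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical set of buffer action counters. */
typedef struct DemoCounters
{
    uint64_t    allocs;
    uint64_t    extends;
    uint64_t    fsyncs;
    uint64_t    writes;
    uint64_t    writes_strat;
} DemoCounters;

/* Collector side: fold one report into the running totals. */
void
demo_collector_absorb(DemoCounters *totals, const DemoCounters *report)
{
    totals->allocs += report->allocs;
    totals->extends += report->extends;
    totals->fsyncs += report->fsyncs;
    totals->writes += report->writes;
    totals->writes_strat += report->writes_strat;
}

/*
 * Backend side: called once when the process exits. The local counters are
 * zeroed afterwards so that an accidental second call adds nothing.
 */
void
demo_backend_report_on_exit(DemoCounters *mine, DemoCounters *collector_totals)
{
    assert(mine != 0 && collector_totals != 0);
    demo_collector_absorb(collector_totals, mine);
    memset(mine, 0, sizeof(*mine));
}
```

Because every exiting process has already reported, whatever writes the totals to disk at shutdown only needs to look at the accumulator, not at per-process state.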


> And, I don't think I can use pgstat_read_statsfiles() since the
> BufferActionStatsArray should have the data from the file as soon as the
> view containing the buffer action stats can be queried. Thus, it seems
> like I would need to read the file while initializing the array in
> CreateBufferActionStatsCounters().

Why would backends need to read that data back?


> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> index 55f6e3711d..96cac0a74e 100644
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
>          pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
>          pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
>          pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> -        pg_stat_get_buf_written_backend() AS buffers_backend,
> -        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> -        pg_stat_get_buf_alloc() AS buffers_alloc,
>          pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.



> @@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
>  
>      LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
>  
> -    /* Count all backend writes regardless of if they fit in the queue */
> -    if (!AmBackgroundWriterProcess())
> -        CheckpointerShmem->num_backend_writes++;
> -
>      /*
>       * If the checkpointer isn't running or the request queue is full, the
>       * backend will have to perform its own fsync request.  But before forcing
> @@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
>           * Count the subset of writes where backends have to do their own
>           * fsync
>           */
> +        /* TODO: should we count fsyncs for all types of procs? */
>          if (!AmBackgroundWriterProcess())
> -            CheckpointerShmem->num_backend_fsync++;
> +            pgstat_increment_buffer_action(BA_Fsync);
> +

Yes, I think that'd make sense. Now that we can disambiguate the different
types of syncs between procs, I don't see a point of having a process-type
filter here. We just lose data...



>          /* don't set checksum for all-zero page */
> @@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>                      if (XLogNeedsFlush(lsn) &&
>                          StrategyRejectBuffer(strategy, buf))
>                      {
> +                        /*
> +                         * Unset the strat write flag, as we will not be writing
> +                         * this particular buffer from our ring out and may end
> +                         * up having to find a buffer from main shared buffers,
> +                         * which, if it is dirty, we may have to write out, which
> +                         * could have been prevented by checkpointing and background
> +                         * writing
> +                         */
> +                        StrategyUnChooseBufferFromRing(strategy);
> +
>                          /* Drop lock/pin and loop around for another buffer */
>                          LWLockRelease(BufferDescriptorGetContentLock(buf));
>                          UnpinBuffer(buf, true);
>                          continue;
>                      }

Could we combine this with StrategyRejectBuffer()? It seems a bit wasteful to
have two function calls into freelist.c when the second happens exactly when
the first returns true?


> +
> +                    /*
> +                     * TODO: there is certainly a better way to write this
> +                     * logic
> +                     */
> +
> +                    /*
> +                     * The dirty buffer that will be written out was selected
> +                     * from the ring and we did not bother checking the
> +                     * freelist or doing a clock sweep to look for a clean
> +                     * buffer to use, thus, this write will be counted as a
> +                     * strategy write -- one that may be unnecessary without a
> +                     * strategy
> +                     */
> +                    if (StrategyIsBufferFromRing(strategy))
> +                    {
> +                        pgstat_increment_buffer_action(BA_Write_Strat);
> +                    }
> +
> +                        /*
> +                         * If the dirty buffer was one we grabbed from the
> +                         * freelist or through a clock sweep, it could have been
> +                         * written out by bgwriter or checkpointer, thus, we will
> +                         * count it as a regular write
> +                         */
> +                    else
> +                        pgstat_increment_buffer_action(BA_Write);

It seems this would be better solved by having a "bool *from_ring" or
GetBufferSource* parameter to StrategyGetBuffer().
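
A toy version of the out-parameter idea, to show the shape of the interface. `DemoStrategy` and `demo_strategy_get_buffer` are hypothetical and much simpler than the real ring logic in freelist.c; the assumption is only that the function which picks the victim buffer already knows whether it came from the ring and can say so directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_RING_SIZE 4
#define DEMO_INVALID_BUFFER (-1)

/* Hypothetical, heavily simplified strategy ring. */
typedef struct DemoStrategy
{
    int         ring[DEMO_RING_SIZE];   /* buffer ids, or DEMO_INVALID_BUFFER */
    int         current;                /* last slot handed out */
} DemoStrategy;

void
demo_strategy_init(DemoStrategy *strategy)
{
    for (int i = 0; i < DEMO_RING_SIZE; i++)
        strategy->ring[i] = DEMO_INVALID_BUFFER;
    strategy->current = DEMO_RING_SIZE - 1;
}

/*
 * Pick a victim buffer. next_shared_victim stands in for whatever the
 * freelist or clock sweep would return. *from_ring tells the caller whether
 * the result is a reused ring buffer.
 */
int
demo_strategy_get_buffer(DemoStrategy *strategy, int next_shared_victim,
                         bool *from_ring)
{
    *from_ring = false;

    if (strategy != NULL)
    {
        int         candidate;

        strategy->current = (strategy->current + 1) % DEMO_RING_SIZE;
        candidate = strategy->ring[strategy->current];
        if (candidate != DEMO_INVALID_BUFFER)
        {
            *from_ring = true;
            return candidate;
        }
        /* Empty slot: remember the shared buffer we are about to hand out. */
        strategy->ring[strategy->current] = next_shared_victim;
    }
    return next_shared_victim;
}
```

The caller can then choose between a strategy write and a regular write from the flag, without a separate query function or hidden state in the strategy object.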


> @@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
>      /*
>       * bufToWrite is either the shared buffer or a copy, as appropriate.
>       */
> +
> +    /*
> +     * TODO: consider that if we did not need to distinguish between a buffer
> +     * flushed that was grabbed from the ring buffer and written out as part
> +     * of a strategy which was not from main Shared Buffers (and thus
> +     * preventable by bgwriter or checkpointer), then we could move all calls
> +     * to pgstat_increment_buffer_action() here except for the one for
> +     * extends, which would remain in ReadBuffer_common() before smgrextend()
> +     * (unless we decide to start counting other extends). That includes the
> +     * call to count buffers written by bgwriter and checkpointer which go
> +     * through FlushBuffer() but not BufferAlloc(). That would make it
> +     * simpler. Perhaps instead we can find somewhere else to indicate that
> +     * the buffer is from the ring of buffers to reuse.
> +     */
>      smgrwrite(reln,
>                buf->tag.forkNum,
>                buf->tag.blockNum,

Can we just add a parameter to FlushBuffer indicating what the source of the
write is?
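
Something along these lines, as a sketch only: the caller passes where the buffer came from and the flush routine does all the counting in one place. `DemoWriteSource`, `DemoWriteStats` and `demo_flush_buffer` are hypothetical names, and the actual write is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: where the buffer being flushed was obtained. */
typedef enum DemoWriteSource
{
    DEMO_WRITE_SHARED,          /* freelist or clock sweep */
    DEMO_WRITE_STRATEGY         /* reused from a strategy ring */
} DemoWriteSource;

typedef struct DemoWriteStats
{
    uint64_t    writes;
    uint64_t    writes_strat;
} DemoWriteStats;

/* Flush one buffer and account for it according to its source. */
void
demo_flush_buffer(int buffer_id, DemoWriteSource source, DemoWriteStats *stats)
{
    assert(stats != 0);
    (void) buffer_id;           /* real code would write the page here */

    if (source == DEMO_WRITE_STRATEGY)
        stats->writes_strat++;
    else
        stats->writes++;
}
```

Callers that never use a strategy, such as a checkpointer-like process, would simply always pass the shared source.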


> @@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
>       * the rate of buffer consumption.  Note that buffers recycled by a
>       * strategy object are intentionally not counted here.
>       */
> -    pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
> +    pgstat_increment_buffer_action(BA_Alloc);
>  
>      /*
>       * First check, without acquiring the lock, whether there's buffers in the

> @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
>           */
>          *complete_passes += nextVictimBuffer / NBuffers;
>      }
> -
> -    if (num_buf_alloc)
> -    {
> -        *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
> -    }
>      SpinLockRelease(&StrategyControl->buffer_strategy_lock);
>      return result;
>  }

Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
suspect this patch shouldn't get rid of numBufferAllocs at the same time as
overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
that that's the case / how we can make that work.
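
One possible way to keep both, sketched with C11 atomics rather than the pg_atomic_* API: a shared counter that the pacing code resets on every cycle, plus a cumulative per-process statistic that is never reset by pacing. All names here (`DemoStrategyControl`, `demo_count_alloc`, `demo_sync_start`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared state: allocations since the last pacing cycle. */
typedef struct DemoStrategyControl
{
    _Atomic uint32_t num_buffer_allocs;
} DemoStrategyControl;

/* Hypothetical per-process statistics, only ever written by their owner. */
typedef struct DemoBackendStats
{
    uint64_t    allocs;
} DemoBackendStats;

/* Called on every buffer allocation: feed both pacing and statistics. */
void
demo_count_alloc(DemoStrategyControl *ctl, DemoBackendStats *mine)
{
    atomic_fetch_add_explicit(&ctl->num_buffer_allocs, 1,
                              memory_order_relaxed);
    mine->allocs++;
}

/* Called by the pacing code: read and reset the recent allocation count. */
uint32_t
demo_sync_start(DemoStrategyControl *ctl)
{
    return atomic_exchange_explicit(&ctl->num_buffer_allocs, 0,
                                    memory_order_relaxed);
}
```

This keeps the shared atomic increment on the allocation path, so it does not by itself answer whether one of the two counters could be dropped.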




> +void
> +pgstat_increment_buffer_action(BufferActionType ba_type)
> +{
> +    volatile PgBackendStatus *beentry   = MyBEEntry;
> +
> +    if (!beentry || !pgstat_track_activities)
> +        return;
> +
> +    if (ba_type == BA_Alloc)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
> +    else if (ba_type == BA_Extend)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
> +    else if (ba_type == BA_Fsync)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
> +    else if (ba_type == BA_Write)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
> +    else if (ba_type == BA_Write_Strat)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
> +}

I don't think we want to use atomic increments here - they're *slow*. And
there only ever can be a single writer to a backend's stats. So just doing
something like
    pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1)
should do the trick.
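
The same single-writer pattern, written with standard C11 atomics as an analogy (this is not the pg_atomic_* API, and `DemoCounter` is a made-up type). The load and store are each atomic, so readers never see a torn value, but the increment as a whole is not a read-modify-write and is only safe because exactly one writer exists.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter with one writer and any number of readers. */
typedef struct DemoCounter
{
    _Atomic uint64_t value;
} DemoCounter;

/* Must only be called by the single owner of the counter. */
void
demo_counter_inc_single_writer(DemoCounter *c)
{
    uint64_t    cur = atomic_load_explicit(&c->value, memory_order_relaxed);

    atomic_store_explicit(&c->value, cur + 1, memory_order_relaxed);
}

/* Safe from any process or thread: returns some recent, untorn value. */
uint64_t
demo_counter_read(DemoCounter *c)
{
    return atomic_load_explicit(&c->value, memory_order_relaxed);
}
```

If a second writer were ever added, increments could be lost, so the single-writer rule is part of the contract rather than an optimization detail.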


> +/*
> + * Called for a single backend at the time of death to persist its I/O stats
> + */
> +void
> +pgstat_record_dead_backend_buffer_actions(void)
> +{
> +    volatile PgBackendBufferActionStats *ba_stats;
> +    volatile    PgBackendStatus *beentry = MyBEEntry;
> +
> +    if (beentry->st_procpid != 0)
> +        return;
> +
> +    // TODO: is this correct? could there be a data race? do I need a lock?
> +    ba_stats = &BufferActionStatsArray[beentry->st_backendType];
> +    pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs));
> +    pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends));
> +    pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs));
> +    pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes));
> +    pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat));
> +}

I don't see a race, FWIW.

This is where I propose that we instead report the values up to the stats
collector, instead of having a separate array that we need to persist


> +/*
> + * Fill the provided values array with the accumulated counts of buffer actions
> + * taken by all backends of type backend_type (input parameter), both alive and
> + * dead. This is currently only used by pg_stat_get_buffer_actions() to create
> + * the rows in the pg_stat_buffer_actions system view.
> + */
> +void
> +pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values)
> +{
> +    int            i;
> +    volatile PgBackendStatus *beentry;
> +
> +    /*
> +     * Add stats from all exited backends
> +     */
> +    values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs);
> +    values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends);
> +    values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs);
> +    values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes);
> +    values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat);
> +
> +    /*
> +     * Loop through all live backends and count their buffer actions
> +     */
> +    // TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method
> +
> +    beentry = BackendStatusArray;
> +    for (i = 1; i <= MaxBackends; i++)
> +    {
> +        /* Don't count dead backends. They should already be counted */
> +        if (beentry->st_procpid == 0)
> +            continue;
> +        if (beentry->st_backendType != backend_type)
> +            continue;
> +
> +        values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
> +        values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
> +        values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
> +        values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
> +        values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
> +
> +        beentry++;
> +    }
> +}

It seems to make a bit more sense to have this sum up the stats for all
backend types at once.
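
A sketch of what a single pass over the live entries could look like, filling totals for every backend type at once. `DemoBackendEntry` and `demo_sum_live_writes` are hypothetical and track only one counter to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_BACKEND_TYPES 3

/* Hypothetical, trimmed-down status entry. */
typedef struct DemoBackendEntry
{
    int         procpid;        /* 0 means the slot is unused */
    int         backend_type;   /* 0 .. DEMO_NUM_BACKEND_TYPES - 1 */
    uint64_t    writes;
} DemoBackendEntry;

/*
 * Add the writes of every live entry to totals[], indexed by backend type.
 * The caller pre-loads totals[] with the counts of already exited processes.
 */
void
demo_sum_live_writes(const DemoBackendEntry *entries, size_t nentries,
                     uint64_t totals[DEMO_NUM_BACKEND_TYPES])
{
    for (size_t i = 0; i < nentries; i++)
    {
        const DemoBackendEntry *e = &entries[i];

        if (e->procpid == 0)
            continue;           /* unused slot */
        if (e->backend_type < 0 || e->backend_type >= DEMO_NUM_BACKEND_TYPES)
            continue;           /* not a type we report on */
        totals[e->backend_type] += e->writes;
    }
}
```

Indexing by `i` in the loop header also means a `continue` still moves on to the next entry; in the loop as quoted above, the `continue` paths appear to skip `beentry++`.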

> +        /*
> +         * Currently, the only supported backend types for stats are the following.
> +         * If this were to change, pg_proc.dat would need to be changed as well
> +         * to reflect the new expected number of rows.
> +         */
> +        Datum values[BUFFER_ACTION_NUM_TYPES];
> +        bool nulls[BUFFER_ACTION_NUM_TYPES];

Ah ;)

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

11 August 2021, 23:11:34

On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:
> > Thanks for the feedback!
> >
> > I agree it makes sense to count strategy writes separately.
> >
> > I thought about this some more, and I don't know if it makes sense to
> > only count "avoidable" strategy writes.
> >
> > This would mean that a backend writing out a buffer from the strategy
> > ring when no clean shared buffers (as well as no clean strategy buffers)
> > are available would not count that write as a strategy write (even
> > though it is writing out a buffer from its strategy ring). But, it
> > obviously doesn't make sense to count it as a regular buffer being
> > written out. So, I plan to change this code.
>
> What do you mean with "no clean shared buffers ... are available"?
>

I think I was talking about the scenario in which a backend using a
strategy does not find a clean buffer in the strategy ring and goes to
look in the freelist for a clean shared buffer and doesn't find one.

I was probably talking in circles up there. I think the current
patch counts the right writes in the right way, though.

>
>
> > The most substantial missing piece of the patch right now is persisting
> > the data across reboots.
> >
> > The two places in the code I can see to persist the buffer action stats
> > data are:
> > 1) using the stats collector code (like in
> > pgstat_read/write_statsfiles())
> > 2) using a before_shmem_exit() hook which writes the data structure to a
> > file and then read from it when making the shared memory array initially
>
> I think it's pretty clear that we should go for 1. Having two mechanisms for
> persisting stats data is a bad idea.

New version uses the stats collector.

>
>
> > Also, I'm unsure how writing the buffer action stats out in
> > pgstat_write_statsfiles() will work, since I think that backends can
> > update their buffer action stats after we would have already persisted
> > the data from the BufferActionStatsArray -- causing us to lose those
> > updates.
>
> I was thinking it'd work differently. Whenever a connection ends, it reports
> its data up to pgstats.c (otherwise we'd lose those stats). By the time
> shutdown happens, they all need to have already reported their stats - so
> we don't need to do anything to get the data to pgstats.c during shutdown
> time.
>

When you say "whenever a connection ends", what part of the code are you
referring to specifically?

Also, when you say "shutdown", do you mean a backend shutting down or
all backends shutting down (including postmaster) -- like pg_ctl stop?

>
> > And, I don't think I can use pgstat_read_statsfiles() since the
> > BufferActionStatsArray should have the data from the file as soon as the
> > view containing the buffer action stats can be queried. Thus, it seems
> > like I would need to read the file while initializing the array in
> > CreateBufferActionStatsCounters().
>
> Why would backends need to read that data back?
>

To get totals across restarts, but, doesn't matter now that I am using
stats collector.

>
> > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> > index 55f6e3711d..96cac0a74e 100644
> > --- a/src/backend/catalog/system_views.sql
> > +++ b/src/backend/catalog/system_views.sql
> > @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
> >          pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
> >          pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
> >          pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> > -        pg_stat_get_buf_written_backend() AS buffers_backend,
> > -        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> > -        pg_stat_get_buf_alloc() AS buffers_alloc,
> >          pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
>
> Material for a separate patch, not this. But if we're going to break
> monitoring queries anyway, I think we should consider also renaming
> maxwritten_clean (and perhaps a few others), because nobody understands what
> that is supposed to mean.
>
>

Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?

>
> > @@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
> >
> >       LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
> >
> > -     /* Count all backend writes regardless of if they fit in the queue */
> > -     if (!AmBackgroundWriterProcess())
> > -             CheckpointerShmem->num_backend_writes++;
> > -
> >       /*
> >        * If the checkpointer isn't running or the request queue is full, the
> >        * backend will have to perform its own fsync request.  But before forcing
> > @@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
> >                * Count the subset of writes where backends have to do their own
> >                * fsync
> >                */
> > +             /* TODO: should we count fsyncs for all types of procs? */
> >               if (!AmBackgroundWriterProcess())
> > -                     CheckpointerShmem->num_backend_fsync++;
> > +                     pgstat_increment_buffer_action(BA_Fsync);
> > +
>
> Yes, I think that'd make sense. Now that we can disambiguate the different
> types of syncs between procs, I don't see a point of having a process-type
> > filter here. We just lose data...
>
>

Done

>
> >               /* don't set checksum for all-zero page */
> > @@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >                                       if (XLogNeedsFlush(lsn) &&
> >                                               StrategyRejectBuffer(strategy, buf))
> >                                       {
> > +                                             /*
> > +                                              * Unset the strat write flag, as we will not be writing
> > +                                              * this particular buffer from our ring out and may end
> > +                                              * up having to find a buffer from main shared buffers,
> > +                                              * which, if it is dirty, we may have to write out, which
> > +                                              * could have been prevented by checkpointing and background
> > +                                              * writing
> > +                                              */
> > +                                             StrategyUnChooseBufferFromRing(strategy);
> > +
> >                                               /* Drop lock/pin and loop around for another buffer */
> >                                               LWLockRelease(BufferDescriptorGetContentLock(buf));
> >                                               UnpinBuffer(buf, true);
> >                                               continue;
> >                                       }
>
> Could we combine this with StrategyRejectBuffer()? It seems a bit wasteful to
> have two function calls into freelist.c when the second happens exactly when
> the first returns true?
>
>
> > +
> > +                                     /*
> > +                                      * TODO: there is certainly a better way to write this
> > +                                      * logic
> > +                                      */
> > +
> > +                                     /*
> > +                                      * The dirty buffer that will be written out was selected
> > +                                      * from the ring and we did not bother checking the
> > +                                      * freelist or doing a clock sweep to look for a clean
> > +                                      * buffer to use, thus, this write will be counted as a
> > +                                      * strategy write -- one that may be unnecessary without a
> > +                                      * strategy
> > +                                      */
> > +                                     if (StrategyIsBufferFromRing(strategy))
> > +                                     {
> > +                                             pgstat_increment_buffer_action(BA_Write_Strat);
> > +                                     }
> > +
> > +                                             /*
> > +                                              * If the dirty buffer was one we grabbed from the
> > +                                              * freelist or through a clock sweep, it could have been
> > +                                              * written out by bgwriter or checkpointer, thus, we will
> > +                                              * count it as a regular write
> > +                                              */
> > +                                     else
> > +                                             pgstat_increment_buffer_action(BA_Write);
>
> > It seems this would be better solved by having a "bool *from_ring" or
> > GetBufferSource* parameter to StrategyGetBuffer().
>

I've addressed both of these in the new version.
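
For readers following along, the "bool *from_ring" idea can be sketched roughly as below. This is only an illustration, not the code in the patch: RingStrategy, get_victim_buffer(), classify_write() and the BufferAction enum are hypothetical stand-ins for the real freelist.c and bufmgr.c structures.

```c
#include <stdbool.h>
#include <stddef.h>

#define INVALID_BUFFER (-1)

/* Hypothetical action types, loosely modelled on BA_Write / BA_Write_Strat. */
typedef enum BufferAction
{
    BA_WRITE,
    BA_WRITE_STRAT
} BufferAction;

/* Hypothetical, heavily simplified stand-in for a buffer access strategy. */
typedef struct RingStrategy
{
    int current_slot_buffer;    /* buffer in the current ring slot, or INVALID_BUFFER */
} RingStrategy;

/* Stand-in for the clock sweep: just hands out increasing buffer ids. */
static int next_victim = 0;

/*
 * Pick a victim buffer and report, through the out-parameter, whether it
 * came from the strategy ring or from the shared pool.
 */
static int
get_victim_buffer(const RingStrategy *strategy, bool *from_ring)
{
    if (strategy != NULL && strategy->current_slot_buffer != INVALID_BUFFER)
    {
        *from_ring = true;
        return strategy->current_slot_buffer;
    }

    *from_ring = false;
    return next_victim++;
}

/* The caller decides which counter to bump from the reported source. */
static BufferAction
classify_write(bool from_ring)
{
    return from_ring ? BA_WRITE_STRAT : BA_WRITE;
}
```

With the source reported by the same call that picks the buffer, the caller needs no second trip into the strategy code to find out which kind of write it is about to do.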

>
> > @@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
> >       /*
> >        * bufToWrite is either the shared buffer or a copy, as appropriate.
> >        */
> > +
> > +     /*
> > +      * TODO: consider that if we did not need to distinguish between a buffer
> > +      * flushed that was grabbed from the ring buffer and written out as part
> > +      * of a strategy which was not from main Shared Buffers (and thus
> > +      * preventable by bgwriter or checkpointer), then we could move all calls
> > +      * to pgstat_increment_buffer_action() here except for the one for
> > +      * extends, which would remain in ReadBuffer_common() before smgrextend()
> > +      * (unless we decide to start counting other extends). That includes the
> > +      * call to count buffers written by bgwriter and checkpointer which go
> > +      * through FlushBuffer() but not BufferAlloc(). That would make it
> > +      * simpler. Perhaps instead we can find somewhere else to indicate that
> > +      * the buffer is from the ring of buffers to reuse.
> > +      */
> >       smgrwrite(reln,
> >                         buf->tag.forkNum,
> >                         buf->tag.blockNum,
>
> Can we just add a parameter to FlushBuffer indicating what the source of the
> write is?
>

I just noticed this comment now, so I'll address that in the next
version. I rebased today and noticed merge conflicts, so, it looks like
v5 will be on its way soon anyway.

>
> > @@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
> >        * the rate of buffer consumption.  Note that buffers recycled by a
> >        * strategy object are intentionally not counted here.
> >        */
> > -     pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
> > +     pgstat_increment_buffer_action(BA_Alloc);
> >
> >       /*
> >        * First check, without acquiring the lock, whether there's buffers in the
>
> > @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
> >                */
> >               *complete_passes += nextVictimBuffer / NBuffers;
> >       }
> > -
> > -     if (num_buf_alloc)
> > -     {
> > -             *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
> > -     }
> >       SpinLockRelease(&StrategyControl->buffer_strategy_lock);
> >       return result;
> >  }
>
> Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
> suspect this patch shouldn't get rid of numBufferAllocs at the same time as
> overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
> that that's the case / how we can make that work.
>
>

I initially meant to add a function to the patch like
pg_stat_get_buffer_actions() but which took a BufferActionType and
BackendType as parameters and returned a single value: the number of
buffer actions of that type for that type of backend.

let's say I defined it like this:
uint64
  pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
                                          BufferActionType ba_type)

Then, I intended to use that in StrategySyncStart() to set num_buf_alloc
by subtracting the value of StrategyControl->numBufferAllocs from the
value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
BA_Alloc), val, then adding that value, val, to
StrategyControl->numBufferAllocs.

I think that would have the same behavior as current, though I'm not
sure if the performance would end up being better or worse. It wouldn't
be atomically incrementing StrategyControl->numBufferAllocs, but it
would do a few more atomic operations in StrategySyncStart() than
before. Also, we would do all the work done by
pg_stat_get_buffer_actions() in StrategySyncStart().

But that is called comparatively infrequently, right?

>
>
> > +void
> > +pgstat_increment_buffer_action(BufferActionType ba_type)
> > +{
> > +     volatile PgBackendStatus *beentry   = MyBEEntry;
> > +
> > +     if (!beentry || !pgstat_track_activities)
> > +             return;
> > +
> > +     if (ba_type == BA_Alloc)
> > +             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
> > +     else if (ba_type == BA_Extend)
> > +             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
> > +     else if (ba_type == BA_Fsync)
> > +             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
> > +     else if (ba_type == BA_Write)
> > +             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
> > +     else if (ba_type == BA_Write_Strat)
> > +             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
> > +}
>
> I don't think we want to use atomic increments here - they're *slow*. And
> there only ever can be a single writer to a backend's stats. So just doing
> something like
>     pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1)
> should do the trick.
>

Done
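
A minimal sketch of the single-writer pattern suggested above, written with C11 atomics rather than the pg_atomic_* API so that it compiles on its own. BackendCounter, counter_increment() and counter_read() are hypothetical names; the assumption, as stated in the review, is that only the owning backend ever writes the counter.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-backend counter; only the owning backend writes it. */
typedef struct BackendCounter
{
    _Atomic uint64_t value;
} BackendCounter;

/*
 * Owner-only increment: a separate load and store instead of one atomic
 * read-modify-write. This is only safe because there is a single writer.
 */
static inline void
counter_increment(BackendCounter *c)
{
    uint64_t cur = atomic_load_explicit(&c->value, memory_order_relaxed);

    atomic_store_explicit(&c->value, cur + 1, memory_order_relaxed);
}

/* Any process may read; the atomic load never returns a torn value. */
static inline uint64_t
counter_read(BackendCounter *c)
{
    return atomic_load_explicit(&c->value, memory_order_relaxed);
}
```

If a second writer were ever added, this would silently lose increments, so the single-writer assumption is worth a comment wherever the pattern is used.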

>
> > +/*
> > + * Called for a single backend at the time of death to persist its I/O stats
> > + */
> > +void
> > +pgstat_record_dead_backend_buffer_actions(void)
> > +{
> > +     volatile PgBackendBufferActionStats *ba_stats;
> > +     volatile        PgBackendStatus *beentry = MyBEEntry;
> > +
> > +     if (beentry->st_procpid != 0)
> > +             return;
> > +
> > +     // TODO: is this correct? could there be a data race? do I need a lock?
> > +     ba_stats = &BufferActionStatsArray[beentry->st_backendType];
> > +     pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs));
> > +     pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends));
> > +     pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs));
> > +     pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes));
> > +     pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat));
> > +}
>
> I don't see a race, FWIW.
>
> This is where I propose that we instead report the values up to the stats
> collector, instead of having a separate array that we need to persist
>

Changed

>
> > +/*
> > + * Fill the provided values array with the accumulated counts of buffer actions
> > + * taken by all backends of type backend_type (input parameter), both alive and
> > + * dead. This is currently only used by pg_stat_get_buffer_actions() to create
> > + * the rows in the pg_stat_buffer_actions system view.
> > + */
> > +void
> > +pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values)
> > +{
> > +     int                     i;
> > +     volatile PgBackendStatus *beentry;
> > +
> > +     /*
> > +      * Add stats from all exited backends
> > +      */
> > +     values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs);
> > +     values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends);
> > +     values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs);
> > +     values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes);
> > +     values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat);
> > +
> > +     /*
> > +      * Loop through all live backends and count their buffer actions
> > +      */
> > +     // TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method
> > +
> > +     beentry = BackendStatusArray;
> > +     for (i = 1; i <= MaxBackends; i++)
> > +     {
> > +             /* Don't count dead backends. They should already be counted */
> > +             if (beentry->st_procpid == 0)
> > +                     continue;
> > +             if (beentry->st_backendType != backend_type)
> > +                     continue;
> > +
> > +             values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
> > +             values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
> > +             values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
> > +             values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
> > +             values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
> > +
> > +             beentry++;
> > +     }
> > +}
>
> It seems to make a bit more sense to have this sum up the stats for all
> backend types at once.

Changed.
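
Roughly, summing for all backend types in one pass could look like the sketch below. It is an illustration under simplifying assumptions, not the code in the new version: BackendSlot, NUM_BACKEND_TYPES, NUM_ACTIONS and sum_live_buffer_actions() are hypothetical, and plain integers stand in for the atomic counters. Indexing the slot array by i also avoids having to advance a separate pointer on every path through the loop, which the quoted version does not do when it hits "continue".

```c
#include <stdint.h>
#include <string.h>

#define NUM_BACKEND_TYPES 3     /* hypothetical; the real BackendType enum is larger */
#define NUM_ACTIONS       5     /* alloc, extend, fsync, write, write_strat */

/* Hypothetical, simplified stand-in for a backend status entry. */
typedef struct BackendSlot
{
    int      pid;               /* 0 means the slot is unused */
    int      backend_type;      /* 0 .. NUM_BACKEND_TYPES - 1 */
    uint64_t actions[NUM_ACTIONS];
} BackendSlot;

/*
 * Walk the slot array once and accumulate the counters of every live
 * backend into the row for its backend type.
 */
static void
sum_live_buffer_actions(const BackendSlot *slots, int nslots,
                        uint64_t totals[NUM_BACKEND_TYPES][NUM_ACTIONS])
{
    memset(totals, 0, sizeof(uint64_t) * NUM_BACKEND_TYPES * NUM_ACTIONS);

    for (int i = 0; i < nslots; i++)
    {
        const BackendSlot *slot = &slots[i];

        /* Skip unused slots; exited backends are accounted for elsewhere. */
        if (slot->pid == 0)
            continue;

        for (int a = 0; a < NUM_ACTIONS; a++)
            totals[slot->backend_type][a] += slot->actions[a];
    }
}
```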

>
> > +             /*
> > +              * Currently, the only supported backend types for stats are the following.
> > +              * If this were to change, pg_proc.dat would need to be changed as well
> > +              * to reflect the new expected number of rows.
> > +              */
> > +             Datum values[BUFFER_ACTION_NUM_TYPES];
> > +             bool nulls[BUFFER_ACTION_NUM_TYPES];
>
> Ah ;)
>

I just went ahead and made a row for each backend type.

- Melanie

Attachments

v4-0001-Add-system-view-tracking-shared-buffer-actions.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

August 12, 2021, 01:00:40

On Wed, Aug 11, 2021 at 4:11 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > > @@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
> > >       /*
> > >        * bufToWrite is either the shared buffer or a copy, as appropriate.
> > >        */
> > > +
> > > +     /*
> > > +      * TODO: consider that if we did not need to distinguish between a buffer
> > > +      * flushed that was grabbed from the ring buffer and written out as part
> > > +      * of a strategy which was not from main Shared Buffers (and thus
> > > +      * preventable by bgwriter or checkpointer), then we could move all calls
> > > +      * to pgstat_increment_buffer_action() here except for the one for
> > > +      * extends, which would remain in ReadBuffer_common() before smgrextend()
> > > +      * (unless we decide to start counting other extends). That includes the
> > > +      * call to count buffers written by bgwriter and checkpointer which go
> > > +      * through FlushBuffer() but not BufferAlloc(). That would make it
> > > +      * simpler. Perhaps instead we can find somewhere else to indicate that
> > > +      * the buffer is from the ring of buffers to reuse.
> > > +      */
> > >       smgrwrite(reln,
> > >                         buf->tag.forkNum,
> > >                         buf->tag.blockNum,
> >
> > Can we just add a parameter to FlushBuffer indicating what the source of the
> > write is?
> >
>
> I just noticed this comment now, so I'll address that in the next
> version. I rebased today and noticed merge conflicts, so, it looks like
> v5 will be on its way soon anyway.
>

Actually, after moving the code around like you suggested, calling
pgstat_increment_buffer_action() before smgrwrite() in FlushBuffer() and
using a parameter to indicate if it is a strategy write or not would
only save us one other call to pgstat_increment_buffer_action() -- the
one in SyncOneBuffer(). We would end up moving the one in BufferAlloc()
to FlushBuffer() and removing the one in SyncOneBuffer().
Do you think it is still worth it?

Rebased v5 attached.

Attachments

v5-0001-Add-system-view-tracking-shared-buffer-actions.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

August 13, 2021, 13:08:11

Hi,

On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:
> On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
> > > Also, I'm unsure how writing the buffer action stats out in
> > > pgstat_write_statsfiles() will work, since I think that backends can
> > > update their buffer action stats after we would have already persisted
> > > the data from the BufferActionStatsArray -- causing us to lose those
> > > updates.
> >
> > I was thinking it'd work differently. Whenever a connection ends, it reports
> > its data up to pgstats.c (otherwise we'd lose those stats). By the time
> > shutdown happens, they all need to have already reported their stats - so
> > we don't need to do anything to get the data to pgstats.c during shutdown
> > time.
> >
> 
> When you say "whenever a connection ends", what part of the code are you
> referring to specifically?

pgstat_beshutdown_hook()


> Also, when you say "shutdown", do you mean a backend shutting down or
> all backends shutting down (including postmaster) -- like pg_ctl stop?

Admittedly our language is very imprecise around this :(. What I meant
is that backends would report their own stats up to the stats collector
when the connection ends (in pgstat_beshutdown_hook()). That means that
when the whole server (pgstat and then postmaster, potentially via
pg_ctl stop) shuts down, all the per-connection stats have already been
reported up to pgstat.


> > > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> > > index 55f6e3711d..96cac0a74e 100644
> > > --- a/src/backend/catalog/system_views.sql
> > > +++ b/src/backend/catalog/system_views.sql
> > > @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
> > >          pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
> > >          pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
> > >          pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> > > -        pg_stat_get_buf_written_backend() AS buffers_backend,
> > > -        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> > > -        pg_stat_get_buf_alloc() AS buffers_alloc,
> > >          pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
> >
> > Material for a separate patch, not this. But if we're going to break
> > monitoring queries anyway, I think we should consider also renaming
> > maxwritten_clean (and perhaps a few others), because nobody understands what
> > that is supposed to mean.

> Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?

No - I just meant that now that we're breaking pg_stat_bgwriter queries,
we should also rename the columns to be easier to understand. But that
it should be a separate patch / commit...


> > > @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
> > >                */
> > >               *complete_passes += nextVictimBuffer / NBuffers;
> > >       }
> > > -
> > > -     if (num_buf_alloc)
> > > -     {
> > > -             *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
> > > -     }
> > >       SpinLockRelease(&StrategyControl->buffer_strategy_lock);
> > >       return result;
> > >  }
> >
> > Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
> > suspect this patch shouldn't get rid of numBufferAllocs at the same time as
> > overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
> > that that's the case / how we can make that work.
> >
> >
> 
> I initially meant to add a function to the patch like
> pg_stat_get_buffer_actions() but which took a BufferActionType and
> BackendType as parameters and returned a single value: the number of
> buffer actions of that type for that type of backend.
> 
> let's say I defined it like this:
> uint64
>   pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
>                                           BufferActionType ba_type)
> 
> Then, I intended to use that in StrategySyncStart() to set num_buf_alloc
> by subtracting the value of StrategyControl->numBufferAllocs from the
> value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
> BA_Alloc), val, then adding that value, val, to
> StrategyControl->numBufferAllocs.

I don't think you could restrict this to B_BG_WRITER? The whole point of
this logic is that bgwriter uses the stats for *all* backends to get the
"usage rate" for buffers, which it then uses to control how many buffers
to clean.
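
For context, the pacing counter works as a shared "allocations since last check" tally: every backend bumps it, and bgwriter reads and zeroes it in one step. A minimal sketch of that pattern follows, using C11 atomics in place of the pg_atomic_fetch_add_u32() and pg_atomic_exchange_u32() calls visible in the quoted diff. AllocControl, note_buffer_alloc() and consume_buffer_allocs() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared strategy control structure. */
typedef struct AllocControl
{
    _Atomic uint32_t num_buffer_allocs;     /* allocations since the last drain */
} AllocControl;

/* Called by any backend on each buffer allocation: many concurrent writers. */
static inline void
note_buffer_alloc(AllocControl *ctl)
{
    atomic_fetch_add(&ctl->num_buffer_allocs, 1);
}

/*
 * Called by the background writer: return the count accumulated by all
 * backends and reset it to zero in a single atomic step, so that no
 * allocation is counted twice or lost.
 */
static inline uint32_t
consume_buffer_allocs(AllocControl *ctl)
{
    return atomic_exchange(&ctl->num_buffer_allocs, 0);
}
```

Unlike the per-backend stats counters, this one has many writers, so a real atomic read-modify-write is needed here.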


> I think that would have the same behavior as current, though I'm not
> sure if the performance would end up being better or worse. It wouldn't
> be atomically incrementing StrategyControl->numBufferAllocs, but it
> would do a few more atomic operations in StrategySyncStart() than
> before. Also, we would do all the work done by
> pg_stat_get_buffer_actions() in StrategySyncStart().

I think it'd be better to separate changing the bgwriter pacing logic
(and thus numBufferAllocs) from changing the stats reporting.


> But that is called comparatively infrequently, right?

Depending on the workload not that rarely. I'm afraid this might be a
bit too expensive. It's possible we can work around that however.

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

September 7, 2021, 23:16:28

On Fri, Aug 13, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:
> > On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
> > > > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> > > > index 55f6e3711d..96cac0a74e 100644
> > > > --- a/src/backend/catalog/system_views.sql
> > > > +++ b/src/backend/catalog/system_views.sql
> > > > @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
> > > >          pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
> > > >          pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
> > > >          pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> > > > -        pg_stat_get_buf_written_backend() AS buffers_backend,
> > > > -        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> > > > -        pg_stat_get_buf_alloc() AS buffers_alloc,
> > > >          pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
> > >
> > > Material for a separate patch, not this. But if we're going to break
> > > monitoring queries anyway, I think we should consider also renaming
> > > maxwritten_clean (and perhaps a few others), because nobody understands what
> > > that is supposed to mean.
>
> > Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?
>
> No - I just meant that now that we're breaking pg_stat_bgwriter queries,
> we should also rename the columns to be easier to understand. But that
> it should be a separate patch / commit...
>

I separated the removal of some redundant stats from pg_stat_bgwriter
into a different commit but haven't removed or clarified any additional
columns in pg_stat_bgwriter.

>
>
> > > > @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
> > > >                */
> > > >               *complete_passes += nextVictimBuffer / NBuffers;
> > > >       }
> > > > -
> > > > -     if (num_buf_alloc)
> > > > -     {
> > > > -             *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
> > > > -     }
> > > >       SpinLockRelease(&StrategyControl->buffer_strategy_lock);
> > > >       return result;
> > > >  }
> > >
> > > Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
> > > suspect this patch shouldn't get rid of numBufferAllocs at the same time as
> > > overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
> > > that that's the case / how we can make that work.
> > >
> > >
> >
> > I initially meant to add a function to the patch like
> > pg_stat_get_buffer_actions() but which took a BufferActionType and
> > BackendType as parameters and returned a single value: the number of
> > buffer actions of that type for that type of backend.
> >
> > let's say I defined it like this:
> > uint64
> >   pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
> >                                           BufferActionType ba_type)
> >
> > Then, I intended to use that in StrategySyncStart() to set num_buf_alloc
> > by subtracting the value of StrategyControl->numBufferAllocs from the
> > value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
> > BA_Alloc), val, then adding that value, val, to
> > StrategyControl->numBufferAllocs.
>
> I don't think you could restrict this to B_BG_WRITER? The whole point of
> this logic is that bgwriter uses the stats for *all* backends to get the
> "usage rate" for buffers, which it then uses to control how many buffers
> to clean.
>
>
> > I think that would have the same behavior as current, though I'm not
> > sure if the performance would end up being better or worse. It wouldn't
> > be atomically incrementing StrategyControl->numBufferAllocs, but it
> > would do a few more atomic operations in StrategySyncStart() than
> > before. Also, we would do all the work done by
> > pg_stat_get_buffer_actions() in StrategySyncStart().
>
> I think it'd be better to separate changing the bgwriter pacing logic
> (and thus numBufferAllocs) from changing the stats reporting.
>
>
> > But that is called comparatively infrequently, right?
>
> Depending on the workload not that rarely. I'm afraid this might be a
> bit too expensive. It's possible we can work around that however.
>

I've restored StrategyControl->numBufferAllocs.

Attached is v6 of the patchset.

I have made several small updates to the patch, including user docs
updates, comment clarifications, various changes related to how
structures are initialized, code simplifications, small details like
alphabetizing of #includes, etc.

Below are details on the remaining TODOs and open questions for this
patch and why I haven't done them yet:

1) performance testing (initial tests done, but need to do some further
investigation before sharing)

2) stats_reset
Because pg_stat_buffer_actions fields were added to the globalStats
structure, they get reset when the target RESET_BGWRITER is reset.
Depending on whether or not these commits remove columns from the
pg_stat_bgwriter view, I would approach adding stats_reset to
pg_stat_buffer_actions differently. If removing all of pg_stat_bgwriter,
I would just rename the target to apply to pg_stat_buffer_actions. If
not removing all of pg_stat_bgwriter, I would add a new target for
pg_stat_buffer_actions to reset those stats and then either remove them
from globalStats or MemSet() only the relevant parts of the struct in
pgstat_recv_resetsharedcounter().
I haven't done this yet because I want to get input on what should
happen to pg_stat_bgwriter first (all of it goes, all of it stays, some
goes, etc).

3) what to count
Currently, the patch counts allocs, extends, fsyncs and writes of shared
buffers and writes done when using a buffer access strategy. So, it is a
mix of mostly shared buffers and a few non-shared buffers. I am
wondering if it makes sense to also count extends with smgrextend()
other than those using shared buffers--for example when building an
index or when extending the free space map or visibility map. For
fsyncs, the patch does not count checkpointer fsyncs or fsyncs done from
XLogWrite().
On a related note, depending on what the view counts, the name
buffer_actions may or may not be too general.

I also feel like the BackendType B_BACKEND is a bit confusing when we
are tracking buffer actions for different backend types -- this name
makes it seem like other types of backends are not backends.

I'm not sure what the view should track and can see arguments for
excluding certain extends or separating them into another stat. I
haven't made the changes because I am looking for other people's
opinions.

4) Adding some sort of protection against regressions when code is added
that adds additional buffer actions but doesn't count them -- more
likely if we are counting all users of smgrextend() but not doing the
counter incrementing there.

I'm not sure how I would even do this, so, that's why I haven't done it.

5) It seems like the code to create a tuplestore used by various stats
functions like pg_stat_get_progress_info(), pg_stat_get_activity, and
pg_stat_get_slru could be refactored into a helper function since it is
quite redundant (maybe returning a ReturnSetInfo).

I haven't done this because I wasn't sure if it was a good idea, and, if
it is, if I should do it in a separate commit.

6) Cleaning up of commit message, running pgindent, and, eventually,
catalog bump (waiting until the patch is done to do this).

7) Additional testing to ensure all codepaths added are hit (one-off
testing, not added to regression test suite). I am waiting to do this
until all of the types of buffer actions that will be done are
finalized.
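
To make the second option in item 2 concrete (resetting only the buffer action part of the global stats struct), here is a rough sketch. The struct layout is an assumption made for illustration: GlobalStats, BufferActionCounts and reset_buffer_actions() are hypothetical and much smaller than the real globalStats.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical group of buffer action counters. */
typedef struct BufferActionCounts
{
    uint64_t allocs;
    uint64_t extends;
    uint64_t fsyncs;
    uint64_t writes;
    uint64_t writes_strat;
} BufferActionCounts;

/* Hypothetical, trimmed-down global stats struct with two kinds of data. */
typedef struct GlobalStats
{
    uint64_t           buf_written_checkpoints;     /* stays with the bgwriter target */
    uint64_t           buf_written_clean;           /* stays with the bgwriter target */
    BufferActionCounts buffer_actions;              /* would get its own reset target */
} GlobalStats;

/* Zero only the buffer action counters; leave everything else untouched. */
static void
reset_buffer_actions(GlobalStats *stats)
{
    memset(&stats->buffer_actions, 0, sizeof(stats->buffer_actions));
}
```

Keeping the counters in one nested struct is what makes the partial reset a single memset; if they were interleaved with other fields, each one would have to be cleared by name.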

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

September 9, 2021, 04:28:38

On Fri, Aug 13, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:
> > On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
> > > > Also, I'm unsure how writing the buffer action stats out in
> > > > pgstat_write_statsfiles() will work, since I think that backends can
> > > > update their buffer action stats after we would have already persisted
> > > > the data from the BufferActionStatsArray -- causing us to lose those
> > > > updates.
> > >
> > > I was thinking it'd work differently. Whenever a connection ends, it reports
> > > its data up to pgstats.c (otherwise we'd lose those stats). By the time
> > > shutdown happens, they all need to have already reported their stats - so
> > > we don't need to do anything to get the data to pgstats.c during shutdown
> > > time.
> > >
> >
> > When you say "whenever a connection ends", what part of the code are you
> > referring to specifically?
>
> pgstat_beshutdown_hook()
>
>
> > Also, when you say "shutdown", do you mean a backend shutting down or
> > all backends shutting down (including postmaster) -- like pg_ctl stop?
>
> Admittedly our language is very imprecise around this :(. What I meant
> is that backends would report their own stats up to the stats collector
> when the connection ends (in pgstat_beshutdown_hook()). That means that
> when the whole server (pgstat and then postmaster, potentially via
> pg_ctl stop) shuts down, all the per-connection stats have already been
> reported up to pgstat.
>

So, I realized that the patch has a problem. I added the code to send
buffer actions stats to the stats collector
(pgstat_send_buffer_actions()) to pgstat_report_stat(), and that isn't
called on exit by every type of backend.

I originally thought to add pgstat_send_buffer_actions() to
pgstat_beshutdown_hook() (as suggested), but, this is called after
pgstat_shutdown_hook(), so, we aren't able to send stats to the stats
collector at that time. (pgstat_shutdown_hook() sets pgstat_is_shutdown
to true and then in pgstat_beshutdown_hook() (called after), if we call
pgstat_send_buffer_actions(), it calls pgstat_send() which calls
pgstat_assert_is_up() which trips when pgstat_is_shutdown is true.)

After calling pgstat_send_buffer_actions() from pgstat_report_stat(), it
seems to miss checkpointer stats entirely. I did find that if I
sprinkled pgstat_send_buffer_actions() around in the various places that
pgstat_send_checkpointer() is called, I could get checkpointer stats
(see attached patch, capture_checkpointer_buffer_actions.patch), but,
that seems a little bit haphazard since pgstat_send_buffer_actions() is
supposed to capture stats for all backend types. Is there somewhere else
I can call it that is exercised by all backend types before
pgstat_shutdown_hook() is called but after they would have finished any
relevant buffer actions?

- Melanie

Attachments

capture_checkpointer_buffer_actions.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

September 11, 2021, 01:16:28

On Wed, Sep 8, 2021 at 9:28 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Fri, Aug 13, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:
> > > On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > Also, I'm unsure how writing the buffer action stats out in
> > > > > pgstat_write_statsfiles() will work, since I think that backends can
> > > > > update their buffer action stats after we would have already persisted
> > > > > the data from the BufferActionStatsArray -- causing us to lose those
> > > > > updates.
> > > >
> > > > I was thinking it'd work differently. Whenever a connection ends, it reports
> > > > > its data up to pgstats.c (otherwise we'd lose those stats). By the time
> > > > > shutdown happens, they all need to have already reported their stats - so
> > > > we don't need to do anything to get the data to pgstats.c during shutdown
> > > > time.
> > > >
> > >
> > > When you say "whenever a connection ends", what part of the code are you
> > > referring to specifically?
> >
> > pgstat_beshutdown_hook()
> >
> >
> > > Also, when you say "shutdown", do you mean a backend shutting down or
> > > all backends shutting down (including postmaster) -- like pg_ctl stop?
> >
> > Admittedly our language is very imprecise around this :(. What I meant
> > is that backends would report their own stats up to the stats collector
> > when the connection ends (in pgstat_beshutdown_hook()). That means that
> > when the whole server (pgstat and then postmaster, potentially via
> > pg_ctl stop) shuts down, all the per-connection stats have already been
> > reported up to pgstat.
> >
>
> So, I realized that the patch has a problem. I added the code to send
> buffer actions stats to the stats collector
> (pgstat_send_buffer_actions()) to pgstat_report_stat() and this isn't
> getting called when all types of backends exit.
>
> I originally thought to add pgstat_send_buffer_actions() to
> pgstat_beshutdown_hook() (as suggested), but, this is called after
> pgstat_shutdown_hook(), so, we aren't able to send stats to the stats
> collector at that time. (pgstat_shutdown_hook() sets pgstat_is_shutdown
> to true and then in pgstat_beshutdown_hook() (called after), if we call
> pgstat_send_buffer_actions(), it calls pgstat_send() which calls
> pgstat_assert_is_up() which trips when pgstat_is_shutdown is true.)
>
> After calling pgstat_send_buffer_actions() from pgstat_report_stat(), it
> seems to miss checkpointer stats entirely. I did find that if I
> sprinkled pgstat_send_buffer_actions() around in the various places that
> pgstat_send_checkpointer() is called, I could get checkpointer stats
> (see attached patch, capture_checkpointer_buffer_actions.patch), but,
> that seems a little bit haphazard since pgstat_send_buffer_actions() is
> supposed to capture stats for all backend types. Is there somewhere else
> I can call it that is exercised by all backend types before
> pgstat_shutdown_hook() is called but after they would have finished any
> relevant buffer actions?
>

I realized that putting these additional calls in checkpointer code and
not clearing out PgBackendStatus counters for buffer actions results in
a lot of duplicate stats. I was wondering if
pgstat_send_buffer_actions() is needed, however, in
HandleCheckpointerInterrupts() before the proc_exit().

It does seem like additional calls to pgstat_send_buffer_actions()
shouldn't be needed since most processes register
pgstat_shutdown_hook(). However, since MyDatabaseId isn't valid for the
auxiliary processes, even though the pgstat_shutdown_hook() is
registered from BaseInit(), pgstat_report_stat() never gets called for
them, so their stats aren't persisted using the current method.

It seems like the best solution to persisting all processes' stats would
be to have all processes register pgstat_shutdown_hook() and to still
call pgstat_report_stat() even if MyDatabaseId is not valid if the
process is not a regular backend (I assume that it is only a problem
that MyDatabaseId is InvalidOid for backends that have had it set to a
valid oid at some point). For the stats that rely on database OID,
perhaps those can be reported based on whether or not MyDatabaseId is
valid from within pgstat_report_stat().

I also realized that I am not collecting stats from live auxiliary
processes in pg_stat_get_buffer_actions(). I need to change the loop to
for (i = 0; i <= MaxBackends + NUM_AUXPROCTYPES; i++) to actually get
stats from live auxiliary processes when querying the view.

On an unrelated note, I am planning to remove buffers_clean and
buffers_checkpoint from the pg_stat_bgwriter view since those are also
redundant. When I was removing them, I noticed that buffers_checkpoint
and buffers_clean count buffers as having been written even when
FlushBuffer() "does nothing" because someone else wrote out the dirty
buffer before the bgwriter or checkpointer had a chance to do it. This
seems like it would result in an incorrect count. Am I missing
something?
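
A toy model of the concern above, assuming the flush routine could tell its caller whether it actually performed the write. ToyBuffer, toy_flush_buffer(), toy_sync_one_buffer() and the counter are all hypothetical names; this only illustrates the counting rule and is not the bufmgr.c code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a shared buffer; only the dirty flag matters. */
typedef struct ToyBuffer
{
    bool        dirty;
} ToyBuffer;

/* Hypothetical counter in the role of buffers_checkpoint/buffers_clean. */
static int  toy_buffers_written = 0;

/*
 * Returns true only if this call did the write. If someone else already
 * cleaned the buffer there is nothing to do, and we report false.
 */
static bool
toy_flush_buffer(ToyBuffer *buf)
{
    if (!buf->dirty)
        return false;
    buf->dirty = false;
    return true;
}

/* Count a write only when one actually happened. */
static void
toy_sync_one_buffer(ToyBuffer *buf)
{
    if (toy_flush_buffer(buf))
        toy_buffers_written++;
}
```

With this rule, flushing the same buffer twice in a row counts one write, not two.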

- Melanie

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

September 14, 2021, 00:46:02

Hi,

I've attached the v7 patch set.

Changes from v6:
- removed unnecessary global variable BufferActionsStats
- fixed the loop condition in pg_stat_get_buffer_actions()
- updated some comments
- removed buffers_checkpoint and buffers_clean from pg_stat_bgwriter
view (now pg_stat_bgwriter view is mainly checkpointer statistics,
which isn't great)
- instead of calling pgstat_send_buffer_actions() in
pgstat_report_stat(), I renamed pgstat_send_buffer_actions() to
pgstat_report_buffers() and call it directly from
pgstat_shutdown_hook() for all types of processes (including processes
with invalid MyDatabaseId [like auxiliary processes])
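
The ordering issue behind the last item can be pictured with a toy model. All names below are hypothetical. The only behavior taken from the thread is that the shutdown hook marks pgstat as shut down and that nothing may be sent afterwards (the real code trips an assertion, while this sketch simply refuses to send).

```c
#include <stdbool.h>

/* Plays the role of pgstat_is_shutdown. */
static bool toy_is_shutdown = false;
static int  toy_messages_sent = 0;

/* Returns true if the message was sent, false if it came too late. */
static bool
toy_report_buffers(void)
{
    if (toy_is_shutdown)
        return false;
    toy_messages_sent++;
    return true;
}

/* Runs first at process exit: flush the counters, then close the door. */
static void
toy_shutdown_hook(void)
{
    (void) toy_report_buffers();
    toy_is_shutdown = true;
}

/* Runs later at process exit: a send attempted here is already too late. */
static bool
toy_beshutdown_hook(void)
{
    return toy_report_buffers();
}
```

Reporting from inside the first hook, before the flag is set, is what lets every process type flush its counters exactly once.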

I began changing the code to add the stats reset timestamp to the
pg_stat_buffer_actions view, but, I realized that it will be kind of
distracting to have every row for every backend type have a stats reset
timestamp (since it will be the same timestamp over and over). If,
however, you could reset buffer stats for each backend type
individually, then, I could see having it. Otherwise, we could add a
function like pg_stat_get_stats_reset_time(viewname) where viewname
would be pg_stat_buffer_actions in our case. Though, maybe that is
annoying and not very usable--I'm not sure.

I also think it makes sense to rename the pg_stat_buffer_actions view to
pg_stat_buffers and to name the columns using both the buffer action
type and buffer type -- e.g. shared, strategy, local. This leaves open
the possibility of counting buffer actions done on other non-shared
buffers -- like those done while building indexes or those using local
buffers. The third patch in the set does this (I wanted to see if it
made sense before fixing it up into the first patch in the set).

This naming convention (BufferType_BufferActionType) made me think that
it might make sense to have two enumerations: one being the current
BufferActionType (which could also be called BufferAccessType though
that might get confusing with BufferAccessStrategyType and buffer access
strategies in general) and the other being BufferType (which would be
one of shared, local, index, etc).

I attached a patch with the outline of this idea
(buffer_type_enum_addition.patch). It doesn't work because
pg_stat_get_buffer_actions() uses the BufferActionType as an index into
the values array returned. If I wanted to use a combination of the two
enums as an indexing mechanism (BufferActionType and BufferType), we
would end up with a tuple having every combination of the two
enums--some of which aren't valid. It might not make sense to implement
this. I do think it is useful to think of these stats as a combination
of a buffer action and a type of buffer.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Alvaro Herrera

Date:

September 15, 2021, 04:30:02

Hello Melanie

On 2021-Sep-13, Melanie Plageman wrote:

> I also think it makes sense to rename the pg_stat_buffer_actions view to
> pg_stat_buffers and to name the columns using both the buffer action
> type and buffer type -- e.g. shared, strategy, local. This leaves open
> the possibility of counting buffer actions done on other non-shared
> buffers -- like those done while building indexes or those using local
> buffers. The third patch in the set does this (I wanted to see if it
> made sense before fixing it up into the first patch in the set).

What do you think of the idea of having the "shared/strategy/local"
attribute be a column?  So you'd have up to three rows per buffer action
type.  Users wishing to see an aggregate can just aggregate them, just
like they'd do with pg_buffercache.  I think that leads to an easy
decision with regards to this point:

> I attached a patch with the outline of this idea
> (buffer_type_enum_addition.patch). It doesn't work because
> pg_stat_get_buffer_actions() uses the BufferActionType as an index into
> the values array returned. If I wanted to use a combination of the two
> enums as an indexing mechanism (BufferActionType and BufferType), we
> would end up with a tuple having every combination of the two
> enums--some of which aren't valid. It might not make sense to implement
> this. I do think it is useful to think of these stats as a combination
> of a buffer action and a type of buffer.

Does that seem sensible?

(It's weird to have enum values that are there just to indicate what's
the maximum value.  I think that sort of thing is better done by having
a "#define LAST_THING" that takes the last valid value from the enum.
That would free you from having to handle the last value in switch
blocks, for example.  LAST_OCLASS in dependency.h is a precedent on this.)
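
A minimal sketch of that convention, using made-up names (ToyBufferType, TOY_LAST_BUFFER_TYPE, TOY_NUM_BUFFER_TYPES) rather than anything from the patch:

```c
/* Hypothetical enum; note there is no trailing "number of values" member. */
typedef enum ToyBufferType
{
    TOY_BUFFER_SHARED = 0,
    TOY_BUFFER_STRATEGY,
    TOY_BUFFER_LOCAL
} ToyBufferType;

/* The last valid value is named outside the enum. */
#define TOY_LAST_BUFFER_TYPE    TOY_BUFFER_LOCAL
#define TOY_NUM_BUFFER_TYPES    (TOY_LAST_BUFFER_TYPE + 1)

/*
 * Every enum member is handled and there is no sentinel case, so a compiler
 * that warns about unhandled enum values in a switch stays useful here.
 */
static const char *
toy_buffer_type_name(ToyBufferType type)
{
    switch (type)
    {
        case TOY_BUFFER_SHARED:
            return "shared";
        case TOY_BUFFER_STRATEGY:
            return "strategy";
        case TOY_BUFFER_LOCAL:
            return "local";
    }
    return "unknown";           /* not reached for valid input */
}
```

Arrays can still be sized with TOY_NUM_BUFFER_TYPES, but the count never shows up as a value a switch has to handle.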

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense."                        (Paul Thomas)

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

September 24, 2021, 00:05:07

On Tue, Sep 14, 2021 at 9:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Sep-13, Melanie Plageman wrote:
>
> > I also think it makes sense to rename the pg_stat_buffer_actions view to
> > pg_stat_buffers and to name the columns using both the buffer action
> > type and buffer type -- e.g. shared, strategy, local. This leaves open
> > the possibility of counting buffer actions done on other non-shared
> > buffers -- like those done while building indexes or those using local
> > buffers. The third patch in the set does this (I wanted to see if it
> > made sense before fixing it up into the first patch in the set).
>
> What do you think of the idea of having the "shared/strategy/local"
> attribute be a column?  So you'd have up to three rows per buffer action
> type.  Users wishing to see an aggregate can just aggregate them, just
> like they'd do with pg_buffercache.  I think that leads to an easy
> decision with regards to this point:

I have rewritten the code to implement this.

>
>
> (It's weird to have enum values that are there just to indicate what's
> the maximum value.  I think that sort of thing is better done by having
> a "#define LAST_THING" that takes the last valid value from the enum.
> That would free you from having to handle the last value in switch
> blocks, for example.  LAST_OCLASS in dependency.h is a precedent on this.)
>

I have made this change.

The attached v8 patchset is rewritten to add in an additional dimension
-- buffer type. Now, a backend keeps track of how many buffers of a
particular type (e.g. shared, local) it has accessed in a particular way
(e.g. alloc, write). It also changes the naming of various structures
and the view members.
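
As a rough picture of the two-dimensional bookkeeping described above, here is a sketch with hypothetical names (ToyBufKind, ToyAccessKind, toy_counters). It is not the patch's data structure, only the indexing idea, plus the kind of aggregation a user could do across the per-buffer-type rows.

```c
#include <stdint.h>

/* Hypothetical buffer types (one view row each, per backend type). */
typedef enum ToyBufKind
{
    TOY_KIND_SHARED = 0,
    TOY_KIND_LOCAL,
    TOY_KIND_STRATEGY
} ToyBufKind;
#define TOY_NUM_BUF_KINDS       (TOY_KIND_STRATEGY + 1)

/* Hypothetical access types (one view column each). */
typedef enum ToyAccessKind
{
    TOY_ACCESS_ALLOC = 0,
    TOY_ACCESS_EXTEND,
    TOY_ACCESS_FSYNC,
    TOY_ACCESS_WRITE
} ToyAccessKind;
#define TOY_NUM_ACCESS_KINDS    (TOY_ACCESS_WRITE + 1)

/* One backend's counters: how often it did <access> on a <kind> buffer. */
static uint64_t toy_counters[TOY_NUM_BUF_KINDS][TOY_NUM_ACCESS_KINDS];

static void
toy_count_access(ToyBufKind kind, ToyAccessKind access)
{
    toy_counters[kind][access]++;
}

/* Aggregate one access type across all buffer kinds. */
static uint64_t
toy_total_for_access(ToyAccessKind access)
{
    uint64_t    total = 0;

    for (int kind = 0; kind < TOY_NUM_BUF_KINDS; kind++)
        total += toy_counters[kind][access];
    return total;
}
```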

Previously, stats reset did not work since it did not consider live
backends' counters. Now, the reset message includes the current live
backends' counters to be tracked by the stats collector and used when
the view is queried.

The reset message is one of the areas in which I still need to do some
work -- I shoved the array of PgBufferAccesses into the existing reset
message used for checkpointer, bgwriter, etc. Before making a new type
of message, I would like feedback from a reviewer about the approach.

There are various TODOs in the code which are actually questions for the
reviewer. Once I have some feedback, it will be easier to address these
items.

There are a few other items which may be material for other commits that
I would also like to do:
1) write wrapper functions for smgr* functions which count buffer
accesses of the appropriate type. I wasn't sure if these should
literally just take all the parameters that the smgr* functions take +
buffer type. Once these exist, there will be less possibility for
regressions in which new code is added using smgr* functions without
counting this buffer activity. Once I add these, I was going to go
through and replace existing calls to smgr* functions and thereby start
counting currently uncounted buffer type accesses (direct, local, etc).

2) Separate checkpointer and bgwriter into two views and add additional
stats to the bgwriter view.

3) Consider adding a helper function to pgstatfuncs.c to help create the
tuplestore. These functions all have quite a few lines which are exactly
the same, and I thought it might be nice to do something about that:
  pg_stat_get_progress_info(PG_FUNCTION_ARGS)
  pg_stat_get_activity(PG_FUNCTION_ARGS)
  pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
  pg_stat_get_slru(PG_FUNCTION_ARGS)
  pg_stat_get_progress_info(PG_FUNCTION_ARGS)
I can imagine a function that takes a Datums array, a nulls array, and a
ReturnSetInfo and then makes the tuplestore -- though I think that will
use more memory. Perhaps we could make a macro which does the initial
error checking (checking if caller supports returning a tuplestore)? I'm
not sure if there is something meaningful here, but I thought I would
ask.

Finally, I haven't removed the test in pg_stats and haven't done a final
pass for comment clarity, alphabetization, etc on this version.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

September 25, 2021, 00:58:48

On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> The attached v8 patchset is rewritten to add in an additional dimension
> -- buffer type. Now, a backend keeps track of how many buffers of a
> particular type (e.g. shared, local) it has accessed in a particular way
> (e.g. alloc, write). It also changes the naming of various structures
> and the view members.
>
> Previously, stats reset did not work since it did not consider live
> backends' counters. Now, the reset message includes the current live
> backends' counters to be tracked by the stats collector and used when
> the view is queried.
>
> The reset message is one of the areas in which I still need to do some
> work -- I shoved the array of PgBufferAccesses into the existing reset
> message used for checkpointer, bgwriter, etc. Before making a new type
> of message, I would like feedback from a reviewer about the approach.
>
> There are various TODOs in the code which are actually questions for the
> reviewer. Once I have some feedback, it will be easier to address these
> items.
>
> There are a few other items which may be material for other commits that
> I would also like to do:
> 1) write wrapper functions for smgr* functions which count buffer
> accesses of the appropriate type. I wasn't sure if these should
> literally just take all the parameters that the smgr* functions take +
> buffer type. Once these exist, there will be less possibility for
> regressions in which new code is added using smgr* functions without
> counting this buffer activity. Once I add these, I was going to go
> through and replace existing calls to smgr* functions and thereby start
> counting currently uncounted buffer type accesses (direct, local, etc).
>
> 2) Separate checkpointer and bgwriter into two views and add additional
> stats to the bgwriter view.
>
> 3) Consider adding a helper function to pgstatfuncs.c to help create the
> tuplestore. These functions all have quite a few lines which are exactly
> the same, and I thought it might be nice to do something about that:
>   pg_stat_get_progress_info(PG_FUNCTION_ARGS)
>   pg_stat_get_activity(PG_FUNCTION_ARGS)
>   pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
>   pg_stat_get_slru(PG_FUNCTION_ARGS)
>   pg_stat_get_progress_info(PG_FUNCTION_ARGS)
> I can imagine a function that takes a Datums array, a nulls array, and a
> ReturnSetInfo and then makes the tuplestore -- though I think that will
> use more memory. Perhaps we could make a macro which does the initial
> error checking (checking if caller supports returning a tuplestore)? I'm
> not sure if there is something meaningful here, but I thought I would
> ask.
>
> Finally, I haven't removed the test in pg_stats and haven't done a final
> pass for comment clarity, alphabetization, etc on this version.
>

I have addressed almost all of the issues mentioned above in v9.
The only remaining TODOs are described in the commit message.
The most critical one is that the reset message doesn't work.

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

September 27, 2021, 21:58:53

On Fri, Sep 24, 2021 at 5:58 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> The only remaining TODOs are described in the commit message.
> The most critical one is that the reset message doesn't work.

v10 is attached with updated comments and some limited code refactoring

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

September 29, 2021, 23:46:07

On Mon, Sep 27, 2021 at 2:58 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Fri, Sep 24, 2021 at 5:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > The only remaining TODOs are described in the commit message.
> > The most critical one is that the reset message doesn't work.
>
> v10 is attached with updated comments and some limited code refactoring

v11 has fixed the oversize message issue by sending a reset message for
each backend type. Now, we will call GetCurrentTimestamp
BACKEND_NUM_TYPES times, so maybe I should add some kind of flag to the
reset message that indicates the first message so that all the "do once"
things can be done at that point.
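
One way to picture the "first message" flag, as a sketch only: ToyResetMsg, its is_first field, and the toy clock are hypothetical, and the sender calls the receiver directly here instead of going through the stats collector. The loop starts at 1 on the assumption that backend type 0 is the invalid type and needs no reset.

```c
#include <stdbool.h>

#define TOY_BACKEND_NUM_TYPES   5

/* Hypothetical per-backend-type reset message. */
typedef struct ToyResetMsg
{
    int         backend_type;
    bool        is_first;       /* true only on the first message of a reset */
} ToyResetMsg;

static int  toy_clock_reads = 0;
static long toy_reset_timestamp = 0;
static int  toy_resets_applied = 0;

/* Stand-in for GetCurrentTimestamp(); counts how often it is called. */
static long
toy_get_current_timestamp(void)
{
    toy_clock_reads++;
    return 1000 + toy_clock_reads;
}

/* Receiver: the "do once" work happens only for the flagged message. */
static void
toy_recv_reset(const ToyResetMsg *msg)
{
    if (msg->is_first)
        toy_reset_timestamp = toy_get_current_timestamp();
    toy_resets_applied++;
}

/* Sender: one message per backend type, the first one flagged. */
static void
toy_send_resets(void)
{
    for (int bt = 1; bt < TOY_BACKEND_NUM_TYPES; bt++)
    {
        ToyResetMsg msg = {.backend_type = bt, .is_first = (bt == 1)};

        toy_recv_reset(&msg);
    }
}
```

A real design would also have to decide what happens if the flagged message never arrives, which this sketch ignores.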

I've also fixed a few style/cosmetic issues and updated the commit
message with a link to the thread [1] where I proposed smgrwrite() and
smgrextend() wrappers (which is where I propose to call
pgstat_incremement_buffer_access_type() for unbuffered writes and
extends).

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

October 1, 2021, 00:16:34

On Wed, Sep 29, 2021 at 4:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Sep 27, 2021 at 2:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Fri, Sep 24, 2021 at 5:58 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > >
> > > On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman
> > > <melanieplageman@gmail.com> wrote:
> > > The only remaining TODOs are described in the commit message.
> > > The most critical one is that the reset message doesn't work.
> >
> > v10 is attached with updated comments and some limited code refactoring
>
> v11 has fixed the oversize message issue by sending a reset message for
> each backend type. Now, we will call GetCurrentTimestamp
> BACKEND_NUM_TYPES times, so maybe I should add some kind of flag to the
> reset message that indicates the first message so that all the "do once"
> things can be done at that point.
>
> I've also fixed a few style/cosmetic issues and updated the commit
> message with a link to the thread [1] where I proposed smgrwrite() and
> smgrextend() wrappers (which is where I propose to call
> pgstat_incremement_buffer_access_type() for unbuffered writes and
> extends).
>
> - Melanie
>
> [1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com


v12 (attached) has various style and code clarity updates (it is
pgindented as well). I also added a new commit which creates a utility
function to make a tuplestore for views that need one in pgstatfuncs.c.

Having received some offlist feedback about the names BufferAccessType
and BufferType being confusing, I am planning to rename these variables
and all of the associated functions. I agree that BufferType and
BufferAccessType are confusing for the following reasons:
  - They sound similar.
  - They aren't very precise.
  - One of the types of buffers is not using a Postgres buffer.

So far, the proposed alternative is IO_Op or IOOp for BufferAccessType
and IOPath for BufferType.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Alvaro Herrera

Date:

October 1, 2021, 02:15:13

Can you say more about 0001?

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

October 1, 2021, 23:05:31

v13 (attached) contains several cosmetic updates and the full rename
(comments included) of BufferAccessType and BufferType.

On Thu, Sep 30, 2021 at 7:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Can you say more about 0001?
>

The rationale for this patch was that it doesn't save much to avoid
initializing backend activity state in the bootstrap process and by
doing so, I don't have to do the check if (beentry) in pgstat_inc_ioop()
--which happens on most buffer accesses.

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

October 8, 2021, 20:56:20

Hi,

On 2021-10-01 16:05:31 -0400, Melanie Plageman wrote:
> From 40c809ad1127322f3462e85be080c10534485f0d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 24 Sep 2021 17:39:12 -0400
> Subject: [PATCH v13 1/4] Allow bootstrap process to beinit
>
> ---
>  src/backend/utils/init/postinit.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
> index 78bc64671e..fba5864172 100644
> --- a/src/backend/utils/init/postinit.c
> +++ b/src/backend/utils/init/postinit.c
> @@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
>      EnablePortalManager();
>
>      /* Initialize status reporting */
> -    if (!bootstrap)
> -        pgstat_beinit();
> +    pgstat_beinit();
>
>      /*
>       * Load relcache entries for the shared system catalogs.  This must create
> --
> 2.27.0
>

I think it's good to remove more and more of these !bootstrap cases - they
really make it harder to understand the state of the system at various
points. Optimizing for the rarely executed bootstrap mode at the cost of
checks in very common codepaths...


> From a709ddb30b2b747beb214f0b13cd1e1816094e6b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Thu, 30 Sep 2021 16:16:22 -0400
> Subject: [PATCH v13 2/4] Add utility to make tuplestores for pg stat views
>
> Most of the steps to make a tuplestore for those pg_stat views requiring
> one are the same. Consolidate them into a single helper function for
> clarity and to avoid bugs.
> ---
>  src/backend/utils/adt/pgstatfuncs.c | 129 ++++++++++------------------
>  1 file changed, 44 insertions(+), 85 deletions(-)
>
> diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
> index ff5aedc99c..513f5aecf6 100644
> --- a/src/backend/utils/adt/pgstatfuncs.c
> +++ b/src/backend/utils/adt/pgstatfuncs.c
> @@ -36,6 +36,42 @@
>
>  #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
>
> +/*
> + * Helper function for views with multiple rows constructed from a tuplestore
> + */
> +static Tuplestorestate *
> +pg_stat_make_tuplestore(FunctionCallInfo fcinfo, TupleDesc *tupdesc)
> +{
> +    Tuplestorestate *tupstore;
> +    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> +    MemoryContext per_query_ctx;
> +    MemoryContext oldcontext;
> +
> +    /* check to see if caller supports us returning a tuplestore */
> +    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("set-valued function called in context that cannot accept a set")));
> +    if (!(rsinfo->allowedModes & SFRM_Materialize))
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("materialize mode required, but it is not allowed in this context")));
> +
> +    /* Build a tuple descriptor for our result type */
> +    if (get_call_result_type(fcinfo, NULL, tupdesc) != TYPEFUNC_COMPOSITE)
> +        elog(ERROR, "return type must be a row type");
> +
> +    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
> +    oldcontext = MemoryContextSwitchTo(per_query_ctx);
> +
> +    tupstore = tuplestore_begin_heap(true, false, work_mem);
> +    rsinfo->returnMode = SFRM_Materialize;
> +    rsinfo->setResult = tupstore;
> +    rsinfo->setDesc = *tupdesc;
> +    MemoryContextSwitchTo(oldcontext);
> +    return tupstore;
> +}

Is pgstatfuncs.c the best place for this helper? It's not really pgstatfuncs
specific...

It also looks vaguely familiar - I wonder if we have a helper roughly like
this somewhere else already...




> From e9a5d2a021d429fdbb2daa58ab9d75a069f334d4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 29 Sep 2021 15:39:45 -0400
> Subject: [PATCH v13 3/4] Add system view tracking IO ops per backend type
>

> diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> index be7366379d..0d18e7f71a 100644
> --- a/src/backend/postmaster/checkpointer.c
> +++ b/src/backend/postmaster/checkpointer.c
> @@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
>           */
>          if (!AmBackgroundWriterProcess())
>              CheckpointerShmem->num_backend_fsync++;
> +        pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
>          LWLockRelease(CheckpointerCommLock);
>          return false;
>      }

ISTM this doesn't need to happen while holding CheckpointerCommLock?




> @@ -1461,7 +1467,25 @@ pgstat_reset_shared_counters(const char *target)
>                   errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
>
>      pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
> -    pgstat_send(&msg, sizeof(msg));
> +
> +    if (msg.m_resettarget == RESET_BUFFERS)
> +    {
> +        int            backend_type;
> +        PgStatIOPathOps ops[BACKEND_NUM_TYPES];
> +
> +        memset(ops, 0, sizeof(ops));
> +        pgstat_report_live_backend_io_path_ops(ops);
> +
> +        for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
> +        {
> +            msg.m_backend_resets.backend_type = backend_type;
> +            memcpy(&msg.m_backend_resets.iop, &ops[backend_type], sizeof(msg.m_backend_resets.iop));
> +            pgstat_send(&msg, sizeof(msg));
> +        }
> +    }
> +    else
> +        pgstat_send(&msg, sizeof(msg));
> +
>  }

I'd perhaps put this in a small helper function.


>  /* ----------
>   * pgstat_fetch_stat_dbentry() -
> @@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg)
>  {
>      Assert(!pgstat_is_shutdown);
>
> +    /*
> +     * Only need to send stats on IO Ops for IO Paths when a process exits, as
> +     * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
> +     * then sum this with totals from exited backends persisted by the stats
> +     * collector.
> +     */
> +    pgstat_send_buffers();
> +
>      /*
>       * If we got as far as discovering our own database ID, we can report what
>       * we did to the collector.  Otherwise, we'd be sending an invalid
> @@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len)
>  #endif
>  }

I think it might be nicer to move pgstat_beshutdown_hook() to be a
before_shmem_exit(), and do this in there.


> +/*
> + * Add live IO Op stats for all IO Paths (e.g. shared, local) to those in the
> + * equivalent stats structure for exited backends. Note that this adds and
> + * doesn't set, so the destination stats structure should be zeroed out by the
> + * caller initially. This would commonly be used to transfer all IO Op stats
> + * for all IO Paths for a particular backend type to the pgstats structure.
> + */

This seems a bit odd. Why not zero it in here? Perhaps it also should be
called something like _sum_ instead of _add_?


> +void
> +pgstat_add_io_path_ops(PgStatIOOps *dest, IOOps *src, int io_path_num_types)
> +{

Why is io_path_num_types a parameter?


> +static void
> +pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
> +{
> +    int            io_path;
> +    PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops;
> +    PgStatIOOps *dest_io_path_ops =
> +    globalStats.buffers.ops[msg->backend_type].io_path_ops;
> +
> +    for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
> +    {
> +        PgStatIOOps *src = &src_io_path_ops[io_path];
> +        PgStatIOOps *dest = &dest_io_path_ops[io_path];
> +
> +        dest->allocs += src->allocs;
> +        dest->extends += src->extends;
> +        dest->fsyncs += src->fsyncs;
> +        dest->writes += src->writes;
> +    }
> +}

Could this, with a bit of finessing, use pgstat_add_io_path_ops()?
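
A sketch of how the last few comments might fit together, under assumptions: the Toy* types only mirror the field names visible in the quoted patch, the helper names are made up, and the number of IO paths is a compile-time constant instead of a parameter. One "add" helper does the per-path accumulation, a "sum" helper zeroes the destination itself, and the receive routine reuses the same helper.

```c
#include <stdint.h>
#include <string.h>

#define TOY_IOPATH_NUM_TYPES    3

typedef struct ToyIOOps
{
    uint64_t    allocs;
    uint64_t    extends;
    uint64_t    fsyncs;
    uint64_t    writes;
} ToyIOOps;

/* All IO paths of one backend (or of one backend type). */
typedef struct ToyIOPathOps
{
    ToyIOOps    io_path_ops[TOY_IOPATH_NUM_TYPES];
} ToyIOPathOps;

/* dest += src, path by path. */
static void
toy_add_io_path_ops(ToyIOPathOps *dest, const ToyIOPathOps *src)
{
    for (int io_path = 0; io_path < TOY_IOPATH_NUM_TYPES; io_path++)
    {
        dest->io_path_ops[io_path].allocs += src->io_path_ops[io_path].allocs;
        dest->io_path_ops[io_path].extends += src->io_path_ops[io_path].extends;
        dest->io_path_ops[io_path].fsyncs += src->io_path_ops[io_path].fsyncs;
        dest->io_path_ops[io_path].writes += src->io_path_ops[io_path].writes;
    }
}

/* Zero dest here, so callers need not, then sum nsrcs entries into it. */
static void
toy_sum_io_path_ops(ToyIOPathOps *dest, const ToyIOPathOps *srcs, int nsrcs)
{
    memset(dest, 0, sizeof(*dest));
    for (int i = 0; i < nsrcs; i++)
        toy_add_io_path_ops(dest, &srcs[i]);
}

/* The receive side reduces to one call of the shared helper. */
static void
toy_recv_io_path_ops(ToyIOPathOps *global, const ToyIOPathOps *msg)
{
    toy_add_io_path_ops(global, msg);
}
```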


> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c

What about writes originating in like FlushRelationBuffers()?


>  bool
> -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
> +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
>  {
> +    /*
> +     * If we decide to use the dirty buffer selected by StrategyGetBuffer(),
> +     * then ensure that we count it as such in pg_stat_buffers view.
> +     */
> +    *from_ring = true;
> +

Absolutely minor nitpick: Somehow it feels off to talk about the view here.


> +PgBackendStatus *
> +pgstat_fetch_backend_statuses(void)
> +{
> +    return BackendStatusArray;
> +}

Hm, not sure this adds much?


> +            /*
> +             * Subtract 1 from backend_type to avoid having rows for B_INVALID
> +             * BackendType
> +             */
> +            int            rownum = (beentry->st_backendType - 1) * IOPATH_NUM_TYPES + io_path;


Perhaps worth wrapping this in a macro or inline function? It's repeated and nontrivial.


> +    /* Add stats from all exited backends */
> +    backend_io_path_ops = pgstat_fetch_exited_backend_buffers();

It's probably *not* worth it, but I do wonder if we should do the addition on the SQL
level, and actually have two functions, one returning data for exited
backends, and one for currently connected ones.


> +static inline void
> +pgstat_inc_ioop(IOOp io_op, IOPath io_path)
> +{
> +    IOOps       *io_ops;
> +    PgBackendStatus *beentry = MyBEEntry;
> +
> +    Assert(beentry);
> +
> +    io_ops = &beentry->io_path_stats[io_path];
> +    switch (io_op)
> +    {
> +        case IOOP_ALLOC:
> +            pg_atomic_write_u64(&io_ops->allocs,
> +                                pg_atomic_read_u64(&io_ops->allocs) + 1);
> +            break;
> +        case IOOP_EXTEND:
> +            pg_atomic_write_u64(&io_ops->extends,
> +                                pg_atomic_read_u64(&io_ops->extends) + 1);
> +            break;
> +        case IOOP_FSYNC:
> +            pg_atomic_write_u64(&io_ops->fsyncs,
> +                                pg_atomic_read_u64(&io_ops->fsyncs) + 1);
> +            break;
> +        case IOOP_WRITE:
> +            pg_atomic_write_u64(&io_ops->writes,
> +                                pg_atomic_read_u64(&io_ops->writes) + 1);
> +            break;
> +    }
> +}

IIRC Thomas Munro had a patch adding a nonatomic_add or such
somewhere. Perhaps in the recovery readahead thread? Might be worth using
instead?


Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

October 11, 2021, 23:48:01

On Fri, Oct 8, 2021 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-10-01 16:05:31 -0400, Melanie Plageman wrote:
> > From 40c809ad1127322f3462e85be080c10534485f0d Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Fri, 24 Sep 2021 17:39:12 -0400
> > Subject: [PATCH v13 1/4] Allow bootstrap process to beinit
> >
> > ---
> >  src/backend/utils/init/postinit.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
> > index 78bc64671e..fba5864172 100644
> > --- a/src/backend/utils/init/postinit.c
> > +++ b/src/backend/utils/init/postinit.c
> > @@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
> >       EnablePortalManager();
> >
> >       /* Initialize status reporting */
> > -     if (!bootstrap)
> > -             pgstat_beinit();
> > +     pgstat_beinit();
> >
> >       /*
> >        * Load relcache entries for the shared system catalogs.  This must create
> > --
> > 2.27.0
> >
>
> I think it's good to remove more and more of these !bootstrap cases - they
> really make it harder to understand the state of the system at various
> points. Optimizing for the rarely executed bootstrap mode at the cost of
> checks in very common codepaths...

What scope do you suggest for this patch set? A single patch which does
this in more locations (remove !bootstrap) or should I remove this patch
from the patchset?

>
>
>
> > From a709ddb30b2b747beb214f0b13cd1e1816094e6b Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Thu, 30 Sep 2021 16:16:22 -0400
> > Subject: [PATCH v13 2/4] Add utility to make tuplestores for pg stat views
> >
> > Most of the steps to make a tuplestore for those pg_stat views requiring
> > one are the same. Consolidate them into a single helper function for
> > clarity and to avoid bugs.
> > ---
> >  src/backend/utils/adt/pgstatfuncs.c | 129 ++++++++++------------------
> >  1 file changed, 44 insertions(+), 85 deletions(-)
> >
> > diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
> > index ff5aedc99c..513f5aecf6 100644
> > --- a/src/backend/utils/adt/pgstatfuncs.c
> > +++ b/src/backend/utils/adt/pgstatfuncs.c
> > @@ -36,6 +36,42 @@
> >
> >  #define HAS_PGSTAT_PERMISSIONS(role)  (is_member_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
> >
> > +/*
> > + * Helper function for views with multiple rows constructed from a tuplestore
> > + */
> > +static Tuplestorestate *
> > +pg_stat_make_tuplestore(FunctionCallInfo fcinfo, TupleDesc *tupdesc)
> > +{
> > +     Tuplestorestate *tupstore;
> > +     ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> > +     MemoryContext per_query_ctx;
> > +     MemoryContext oldcontext;
> > +
> > +     /* check to see if caller supports us returning a tuplestore */
> > +     if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
> > +             ereport(ERROR,
> > +                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > +                              errmsg("set-valued function called in context that cannot accept a set")));
> > +     if (!(rsinfo->allowedModes & SFRM_Materialize))
> > +             ereport(ERROR,
> > +                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > +                              errmsg("materialize mode required, but it is not allowed in this context")));
> > +
> > +     /* Build a tuple descriptor for our result type */
> > +     if (get_call_result_type(fcinfo, NULL, tupdesc) != TYPEFUNC_COMPOSITE)
> > +             elog(ERROR, "return type must be a row type");
> > +
> > +     per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
> > +     oldcontext = MemoryContextSwitchTo(per_query_ctx);
> > +
> > +     tupstore = tuplestore_begin_heap(true, false, work_mem);
> > +     rsinfo->returnMode = SFRM_Materialize;
> > +     rsinfo->setResult = tupstore;
> > +     rsinfo->setDesc = *tupdesc;
> > +     MemoryContextSwitchTo(oldcontext);
> > +     return tupstore;
> > +}
>
> Is pgstattuple the best place for this helper? It's not really pgstatfuncs
> specific...
>
> It also looks vaguely familiar - I wonder if we have a helper roughly like
> this somewhere else already...
>

I don't see a function which is specifically a utility to make a
tuplestore. Looking at the callers of tuplestore_begin_heap(), I notice
very similar code to the function I added in pg_tablespace_databases()
in utils/adt/misc.c, pg_stop_backup_v2() in xlogfuncs.c,
pg_event_trigger_dropped_objects() and pg_event_trigger_ddl_commands in
event_trigger.c, pg_available_extensions in extension.c, etc.

Do you think it makes sense to refactor this code out of all of these
places? If so, where would such a utility function belong?

>
>
> > From e9a5d2a021d429fdbb2daa58ab9d75a069f334d4 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 29 Sep 2021 15:39:45 -0400
> > Subject: [PATCH v13 3/4] Add system view tracking IO ops per backend type
> >
>
> > diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> > index be7366379d..0d18e7f71a 100644
> > --- a/src/backend/postmaster/checkpointer.c
> > +++ b/src/backend/postmaster/checkpointer.c
> > @@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
> >                */
> >               if (!AmBackgroundWriterProcess())
> >                       CheckpointerShmem->num_backend_fsync++;
> > +             pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
> >               LWLockRelease(CheckpointerCommLock);
> >               return false;
> >       }
>
> ISTM this doesn't need to happen while holding CheckpointerCommLock?
>

Fixed in attached updates. I only attached the diff from my previous version.

>
>
> > @@ -1461,7 +1467,25 @@ pgstat_reset_shared_counters(const char *target)
> >                                errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
> >
> >       pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
> > -     pgstat_send(&msg, sizeof(msg));
> > +
> > +     if (msg.m_resettarget == RESET_BUFFERS)
> > +     {
> > +             int                     backend_type;
> > +             PgStatIOPathOps ops[BACKEND_NUM_TYPES];
> > +
> > +             memset(ops, 0, sizeof(ops));
> > +             pgstat_report_live_backend_io_path_ops(ops);
> > +
> > +             for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
> > +             {
> > +                     msg.m_backend_resets.backend_type = backend_type;
> > +                     memcpy(&msg.m_backend_resets.iop, &ops[backend_type], sizeof(msg.m_backend_resets.iop));
> > +                     pgstat_send(&msg, sizeof(msg));
> > +             }
> > +     }
> > +     else
> > +             pgstat_send(&msg, sizeof(msg));
> > +
> >  }
>
> I'd perhaps put this in a small helper function.
>

Done.

>
> >  /* ----------
> >   * pgstat_fetch_stat_dbentry() -
> > @@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg)
> >  {
> >       Assert(!pgstat_is_shutdown);
> >
> > +     /*
> > +      * Only need to send stats on IO Ops for IO Paths when a process exits, as
> > +      * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
> > +      * then sum this with totals from exited backends persisted by the stats
> > +      * collector.
> > +      */
> > +     pgstat_send_buffers();
> > +
> >       /*
> >        * If we got as far as discovering our own database ID, we can report what
> >        * we did to the collector.  Otherwise, we'd be sending an invalid
> > @@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len)
> >  #endif
> >  }
>
> I think it might be nicer to move pgstat_beshutdown_hook() to be a
> before_shmem_exit(), and do this in there.
>

Not really sure the correct way to do this. A cursory attempt to do so
failed because ShutdownXLOG() is also registered as a
before_shmem_exit() and ends up being called after
pgstat_beshutdown_hook(). pgstat_beshutdown_hook() zeroes out
PgBackendStatus, ShutdownXLOG() initiates a checkpoint, and during a
checkpoint, the checkpointer increments IO op counter for writes in its
PgBackendStatus.
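
To make the ordering problem concrete, here is a toy model, not PostgreSQL code. It assumes exit callbacks run in reverse order of registration, which is my understanding of how the before_shmem_exit() list is processed. All names below (register_exit_callback, stats_shutdown, final_checkpoint, lost_updates) are hypothetical stand-ins.

```c
#include <assert.h>

#define MAX_CALLBACKS 8

typedef void (*exit_callback) (void);

static exit_callback callbacks[MAX_CALLBACKS];
static int	ncallbacks = 0;

/* Push a callback; the most recently registered one runs first at exit. */
static void
register_exit_callback(exit_callback cb)
{
	assert(ncallbacks < MAX_CALLBACKS);
	callbacks[ncallbacks++] = cb;
}

/* Run callbacks in reverse order of registration (LIFO). */
static void
run_exit_callbacks(void)
{
	while (ncallbacks > 0)
		callbacks[--ncallbacks] ();
}

static long io_writes = 0;		/* stand-in for a per-backend IO counter */
static int	status_valid = 1;	/* cleared once the status entry is wiped */
static int	lost_updates = 0;	/* increments made after the wipe */

/* Stand-in for the status shutdown hook: wipes the per-backend entry. */
static void
stats_shutdown(void)
{
	io_writes = 0;
	status_valid = 0;
}

/* Stand-in for the final checkpoint: still wants to count its writes. */
static void
final_checkpoint(void)
{
	if (!status_valid)
		lost_updates++;
	io_writes++;
}
```

If final_checkpoint is registered before stats_shutdown, it runs after it, and its increment lands in an entry that has already been zeroed. That is the shape of the failure described above.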

>
> > +/*
> > + * Add live IO Op stats for all IO Paths (e.g. shared, local) to those in the
> > + * equivalent stats structure for exited backends. Note that this adds and
> > + * doesn't set, so the destination stats structure should be zeroed out by the
> > + * caller initially. This would commonly be used to transfer all IO Op stats
> > + * for all IO Paths for a particular backend type to the pgstats structure.
> > + */
>
> This seems a bit odd. Why not zero it in here? Perhaps it also should be
> called something like _sum_ instead of _add_?
>

I wanted to be able to use the function both when it was setting the
values and when it needed to add to the values (which are the two
current callers). I have changed the name from add -> sum.
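
A rough sketch of the two usages, with plain integers instead of the patch's atomic counters. The struct and function names (IOOpCounts, sum_io_path_ops, set_io_path_ops) and the value of IOPATH_NUM_TYPES are made up for illustration and are not what the patch uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IOPATH_NUM_TYPES 3		/* illustrative value only */

typedef struct IOOpCounts
{
	uint64_t	allocs;
	uint64_t	extends;
	uint64_t	fsyncs;
	uint64_t	writes;
} IOOpCounts;

/*
 * Add src into dest for every IO path. This never clears dest, so it can be
 * used to accumulate several sources into one destination.
 */
static void
sum_io_path_ops(IOOpCounts *dest, const IOOpCounts *src)
{
	assert(dest != NULL && src != NULL);

	for (int i = 0; i < IOPATH_NUM_TYPES; i++)
	{
		dest[i].allocs += src[i].allocs;
		dest[i].extends += src[i].extends;
		dest[i].fsyncs += src[i].fsyncs;
		dest[i].writes += src[i].writes;
	}
}

/* The "set" usage: the caller zeroes dest first, then sums into it. */
static void
set_io_path_ops(IOOpCounts *dest, const IOOpCounts *src)
{
	memset(dest, 0, sizeof(IOOpCounts) * IOPATH_NUM_TYPES);
	sum_io_path_ops(dest, src);
}
```

Zeroing inside the summing function would make the second usage (adding on top of existing totals) impossible, which is why the zeroing stays with the caller.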

>
> > +void
> > +pgstat_add_io_path_ops(PgStatIOOps *dest, IOOps *src, int io_path_num_types)
> > +{
>
> Why is io_path_num_types a parameter?
>

I imagined that maybe another caller would want to only add some IO path
types and still use the function, but I think it is more confusing than
anything else so I've changed it.

>
> > +static void
> > +pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
> > +{
> > +     int                     io_path;
> > +     PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops;
> > +     PgStatIOOps *dest_io_path_ops =
> > +     globalStats.buffers.ops[msg->backend_type].io_path_ops;
> > +
> > +     for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
> > +     {
> > +             PgStatIOOps *src = &src_io_path_ops[io_path];
> > +             PgStatIOOps *dest = &dest_io_path_ops[io_path];
> > +
> > +             dest->allocs += src->allocs;
> > +             dest->extends += src->extends;
> > +             dest->fsyncs += src->fsyncs;
> > +             dest->writes += src->writes;
> > +     }
> > +}
>
> Could this, with a bit of finessing, use pgstat_add_io_path_ops()?
>

I didn't really see a good way to do this -- given that
pgstat_add_io_path_ops() adds IOOps members to PgStatIOOps members --
which requires a pg_atomic_read_u64() and pgstat_recv_io_path_ops adds
PgStatIOOps to PgStatIOOps which doesn't require pg_atomic_read_u64().
Maybe I could pass a flag which, based on the type, either does or
doesn't use pg_atomic_read_u64 to access the value? But that seems worse
to me.

>
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
>
> What about writes originating in like FlushRelationBuffers()?
>

Yes, I have made IOPath a parameter to FlushBuffer() so that it can
distinguish between strategy buffer writes and shared buffer writes and
then pushed pgstat_inc_ioop() into FlushBuffer().

>
> >  bool
> > -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
> > +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
> >  {
> > +     /*
> > +      * If we decide to use the dirty buffer selected by StrategyGetBuffer(),
> > +      * then ensure that we count it as such in pg_stat_buffers view.
> > +      */
> > +     *from_ring = true;
> > +
>
> Absolutely minor nitpick: Somehow it feels off to talk about the view here.

Fixed.

>
>
> > +PgBackendStatus *
> > +pgstat_fetch_backend_statuses(void)
> > +{
> > +     return BackendStatusArray;
> > +}
>
> Hm, not sure this adds much?

Is there a better way to access the whole BackendStatusArray from within
pgstatfuncs.c?

>
>
> > +                     /*
> > +                      * Subtract 1 from backend_type to avoid having rows for B_INVALID
> > +                      * BackendType
> > +                      */
> > +                     int                     rownum = (beentry->st_backendType - 1) * IOPATH_NUM_TYPES + io_path;
>
>
> Perhaps worth wrapping this in a macro or inline function? It's repeated and nontrivial.
>

Done.
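
For readers without the attachment, a helper along these lines would do it. This is only a sketch: the enum members and their values are stand-ins (IOPATH_LOCAL and IOPATH_STRATEGY in particular are assumed names), and io_path_rownum is a hypothetical name, not necessarily the one in the patch.

```c
#include <assert.h>

/* Stand-ins for the PostgreSQL enums; members and values are illustrative. */
typedef enum BackendType
{
	B_INVALID = 0,
	B_AUTOVAC_WORKER,
	B_BACKEND,
	B_CHECKPOINTER,
	BACKEND_NUM_TYPES
} BackendType;

typedef enum IOPath
{
	IOPATH_LOCAL = 0,
	IOPATH_SHARED,
	IOPATH_STRATEGY,
	IOPATH_NUM_TYPES
} IOPath;

/*
 * Map (backend type, IO path) to a row number in the view's output.
 * B_INVALID gets no rows, so backend types are shifted down by one.
 */
static inline int
io_path_rownum(BackendType backend_type, IOPath io_path)
{
	assert(backend_type > B_INVALID && backend_type < BACKEND_NUM_TYPES);
	assert((int) io_path >= 0 && io_path < IOPATH_NUM_TYPES);

	return ((int) backend_type - 1) * IOPATH_NUM_TYPES + (int) io_path;
}
```

Keeping the "- 1" in one place means the B_INVALID special case cannot drift between call sites.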

>
> > +     /* Add stats from all exited backends */
> > +     backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
>
> It's probably *not* worth it, but I do wonder if we should do the addition on the SQL
> level, and actually have two functions, one returning data for exited
> backends, and one for currently connected ones.
>

It would be easy enough to implement. I would defer to others on whether
or not this would be useful. My use case for pg_stat_buffers() is to see
what IO backends do during a benchmark or test workload. For that, I reset
the stats before and then query pg_stat_buffers after running the
benchmark. I don't know if I would use exited and live stats
individually. In a real workload, I could see using
pg_stat_buffers live and exited to see if the workload causing lots of
backends to do their own writes is ongoing. Though a given workload may
be composed of lots of different queries, with backends exiting
throughout.

>
> > +static inline void
> > +pgstat_inc_ioop(IOOp io_op, IOPath io_path)
> > +{
> > +     IOOps      *io_ops;
> > +     PgBackendStatus *beentry = MyBEEntry;
> > +
> > +     Assert(beentry);
> > +
> > +     io_ops = &beentry->io_path_stats[io_path];
> > +     switch (io_op)
> > +     {
> > +             case IOOP_ALLOC:
> > +                     pg_atomic_write_u64(&io_ops->allocs,
> > +                                                             pg_atomic_read_u64(&io_ops->allocs) + 1);
> > +                     break;
> > +             case IOOP_EXTEND:
> > +                     pg_atomic_write_u64(&io_ops->extends,
> > +                                                             pg_atomic_read_u64(&io_ops->extends) + 1);
> > +                     break;
> > +             case IOOP_FSYNC:
> > +                     pg_atomic_write_u64(&io_ops->fsyncs,
> > +                                                             pg_atomic_read_u64(&io_ops->fsyncs) + 1);
> > +                     break;
> > +             case IOOP_WRITE:
> > +                     pg_atomic_write_u64(&io_ops->writes,
> > +                                                             pg_atomic_read_u64(&io_ops->writes) + 1);
> > +                     break;
> > +     }
> > +}
>
> IIRC Thomas Munro had a patch adding a nonatomic_add or such
> somewhere. Perhaps in the recovery readahead thread? Might be worth using
> instead?
>

I've added Thomas' function in a separate commit. I looked for a better
place to add it (I was thinking somewhere in src/backend/utils/misc) but
couldn't find anywhere that made sense.

I also added a call to pgstat_inc_ioop() in ProcessSyncRequests() to capture
when the checkpointer does fsyncs.

I also added pgstat_inc_ioop() calls to callers of smgrwrite() flushing local
buffers. I don't know if that is desirable or not in this patch. They could be
removed if wrappers for smgrwrite() go in and pgstat_inc_ioop() can be called
from within those wrappers.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

October 19, 2021, 22:29:31

Hi,

On 2021-10-11 16:48:01 -0400, Melanie Plageman wrote:
> On Fri, Oct 8, 2021 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-10-01 16:05:31 -0400, Melanie Plageman wrote:
> > > diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
> > > index 78bc64671e..fba5864172 100644
> > > --- a/src/backend/utils/init/postinit.c
> > > +++ b/src/backend/utils/init/postinit.c
> > > @@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
> > >       EnablePortalManager();
> > >
> > >       /* Initialize status reporting */
> > > -     if (!bootstrap)
> > > -             pgstat_beinit();
> > > +     pgstat_beinit();
> > >
> > >       /*
> > >        * Load relcache entries for the shared system catalogs.  This must create
> > > --
> > > 2.27.0
> > >
> >
> > I think it's good to remove more and more of these !bootstrap cases - they
> > really make it harder to understand the state of the system at various
> > points. Optimizing for the rarely executed bootstrap mode at the cost of
> > checks in very common codepaths...
>
> What scope do you suggest for this patch set? A single patch which does
> this in more locations (remove !bootstrap) or should I remove this patch
> from the patchset?

I think the scope is fine as-is.


> > Is pgstattuple the best place for this helper? It's not really pgstatfuncs
> > specific...
> >
> > It also looks vaguely familiar - I wonder if we have a helper roughly like
> > this somewhere else already...
> >
>
> I don't see a function which is specifically a utility to make a
> tuplestore. Looking at the callers of tuplestore_begin_heap(), I notice
> very similar code to the function I added in pg_tablespace_databases()
> in utils/adt/misc.c, pg_stop_backup_v2() in xlogfuncs.c,
> pg_event_trigger_dropped_objects() and pg_event_trigger_ddl_commands in
> event_trigger.c, pg_available_extensions in extension.c, etc.
>
> Do you think it makes sense to refactor this code out of all of these
> places?

Yes, I think it'd make sense. We have about 40 copies of this stuff, which is
fairly ridiculous.


> If so, where would such a utility function belong?

Not quite sure. src/backend/utils/fmgr/funcapi.c maybe? I suggest starting a
separate thread about that...


> > > @@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg)
> > >  {
> > >       Assert(!pgstat_is_shutdown);
> > >
> > > +     /*
> > > +      * Only need to send stats on IO Ops for IO Paths when a process exits, as
> > > +      * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
> > > +      * then sum this with totals from exited backends persisted by the stats
> > > +      * collector.
> > > +      */
> > > +     pgstat_send_buffers();
> > > +
> > >       /*
> > >        * If we got as far as discovering our own database ID, we can report what
> > >        * we did to the collector.  Otherwise, we'd be sending an invalid
> > > @@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len)
> > >  #endif
> > >  }
> >
> > I think it might be nicer to move pgstat_beshutdown_hook() to be a
> > before_shmem_exit(), and do this in there.
> >
>
> Not really sure the correct way to do this. A cursory attempt to do so
> failed because ShutdownXLOG() is also registered as a
> before_shmem_exit() and ends up being called after
> pgstat_beshutdown_hook(). pgstat_beshutdown_hook() zeroes out
> PgBackendStatus, ShutdownXLOG() initiates a checkpoint, and during a
> checkpoint, the checkpointer increments IO op counter for writes in its
> PgBackendStatus.

I think we'll really need to do a proper redesign of the shutdown callback
mechanism :(.



> > > +static void
> > > +pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
> > > +{
> > > +     int                     io_path;
> > > +     PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops;
> > > +     PgStatIOOps *dest_io_path_ops =
> > > +     globalStats.buffers.ops[msg->backend_type].io_path_ops;
> > > +
> > > +     for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
> > > +     {
> > > +             PgStatIOOps *src = &src_io_path_ops[io_path];
> > > +             PgStatIOOps *dest = &dest_io_path_ops[io_path];
> > > +
> > > +             dest->allocs += src->allocs;
> > > +             dest->extends += src->extends;
> > > +             dest->fsyncs += src->fsyncs;
> > > +             dest->writes += src->writes;
> > > +     }
> > > +}
> >
> > Could this, with a bit of finessing, use pgstat_add_io_path_ops()?
> >
>
> I didn't really see a good way to do this -- given that
> pgstat_add_io_path_ops() adds IOOps members to PgStatIOOps members --
> which requires a pg_atomic_read_u64() and pgstat_recv_io_path_ops adds
> PgStatIOOps to PgStatIOOps which doesn't require pg_atomic_read_u64().
> Maybe I could pass a flag which, based on the type, either does or
> doesn't use pg_atomic_read_u64 to access the value? But that seems worse
> to me.

Yea, you're probably right, that's worse.


> > > +PgBackendStatus *
> > > +pgstat_fetch_backend_statuses(void)
> > > +{
> > > +     return BackendStatusArray;
> > > +}
> >
> > Hm, not sure this adds much?
>
> Is there a better way to access the whole BackendStatusArray from within
> pgstatfuncs.c?

Export the variable itself?


> > IIRC Thomas Munro had a patch adding a nonatomic_add or such
> > somewhere. Perhaps in the recovery readahead thread? Might be worth using
> > instead?
> >
>
> I've added Thomas' function in a separate commit. I looked for a better
> place to add it (I was thinking somewhere in src/backend/utils/misc) but
> couldn't find anywhere that made sense.

I think it should just live in atomics.h?


> I also added pgstat_inc_ioop() calls to callers of smgrwrite() flushing local
> buffers. I don't know if that is desirable or not in this patch. They could be
> removed if wrappers for smgrwrite() go in and pgstat_inc_ioop() can be called
> from within those wrappers.

Makes sense to me to have it here.

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

November 2, 2021, 22:26:52

v14 attached.

On Tue, Oct 19, 2021 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
>
>
> > > Is pgstattuple the best place for this helper? It's not really pgstatfuncs
> > > specific...
> > >
> > > It also looks vaguely familiar - I wonder if we have a helper roughly like
> > > this somewhere else already...
> > >
> >
> > I don't see a function which is specifically a utility to make a
> > tuplestore. Looking at the callers of tuplestore_begin_heap(), I notice
> > very similar code to the function I added in pg_tablespace_databases()
> > in utils/adt/misc.c, pg_stop_backup_v2() in xlogfuncs.c,
> > pg_event_trigger_dropped_objects() and pg_event_trigger_ddl_commands in
> > > event_trigger.c, pg_available_extensions in extension.c, etc.
> >
> > Do you think it makes sense to refactor this code out of all of these
> > places?
>
> Yes, I think it'd make sense. We have about 40 copies of this stuff, which is
> fairly ridiculous.
>
>
> > If so, where would such a utility function belong?
>
> Not quite sure. src/backend/utils/fmgr/funcapi.c maybe? I suggest starting a
> separate thread about that...
>

done [1]. also, I dropped that commit from this patchset.

>
> > > > @@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg)
> > > >  {
> > > >       Assert(!pgstat_is_shutdown);
> > > >
> > > > +     /*
> > > > +      * Only need to send stats on IO Ops for IO Paths when a process exits, as
> > > > +      * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
> > > > +      * then sum this with totals from exited backends persisted by the stats
> > > > +      * collector.
> > > > +      */
> > > > +     pgstat_send_buffers();
> > > > +
> > > >       /*
> > > >        * If we got as far as discovering our own database ID, we can report what
> > > >        * we did to the collector.  Otherwise, we'd be sending an invalid
> > > > @@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len)
> > > >  #endif
> > > >  }
> > >
> > > I think it might be nicer to move pgstat_beshutdown_hook() to be a
> > > before_shmem_exit(), and do this in there.
> > >
> >
> > Not really sure the correct way to do this. A cursory attempt to do so
> > failed because ShutdownXLOG() is also registered as a
> > before_shmem_exit() and ends up being called after
> > pgstat_beshutdown_hook(). pgstat_beshutdown_hook() zeroes out
> > PgBackendStatus, ShutdownXLOG() initiates a checkpoint, and during a
> > checkpoint, the checkpointer increments IO op counter for writes in its
> > PgBackendStatus.
>
> I think we'll really need to do a proper redesign of the shutdown callback
> mechanism :(.
>

I've left what I originally had, then.

>
>
> > > > +PgBackendStatus *
> > > > +pgstat_fetch_backend_statuses(void)
> > > > +{
> > > > +     return BackendStatusArray;
> > > > +}
> > >
> > > Hm, not sure this adds much?
> >
> > Is there a better way to access the whole BackendStatusArray from within
> > pgstatfuncs.c?
>
> Export the variable itself?
>

done but wasn't sure about PGDLLIMPORT

>
> > > IIRC Thomas Munro had a patch adding a nonatomic_add or such
> > > somewhere. Perhaps in the recovery readahead thread? Might be worth using
> > > instead?
> > >
> >
> > I've added Thomas' function in a separate commit. I looked for a better
> > place to add it (I was thinking somewhere in src/backend/utils/misc) but
> > couldn't find anywhere that made sense.
>
> I think it should just live in atomics.h?
>

done

-- melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_azyd1Z3W_r7Ou4sorTjRCs%2BPxeHw1CWJeXKofkE6TuZg%40mail.gmail.com

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

November 19, 2021, 19:49:52

Hi,

On 2021-11-02 15:26:52 -0400, Melanie Plageman wrote:
> Subject: [PATCH v14 1/4] Allow bootstrap process to beinit

Pushed.


> +/*
> + * On modern systems this is really just *counter++.  On some older systems
> + * there might be more to it, due to inability to read and write 64 bit values
> + * atomically.
> + */
> +static inline void inc_counter(pg_atomic_uint64 *counter)
> +{
> +    pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
> +}
> +
>  #undef INSIDE_ATOMICS_H

Why is this using a completely different naming scheme from the rest of the
file?



>  doc/src/sgml/monitoring.sgml                | 116 +++++++++++++-
>  src/backend/catalog/system_views.sql        |  11 ++
>  src/backend/postmaster/checkpointer.c       |   3 +-
>  src/backend/postmaster/pgstat.c             | 161 +++++++++++++++++++-
>  src/backend/storage/buffer/bufmgr.c         |  46 ++++--
>  src/backend/storage/buffer/freelist.c       |  23 ++-
>  src/backend/storage/buffer/localbuf.c       |   3 +
>  src/backend/storage/sync/sync.c             |   1 +
>  src/backend/utils/activity/backend_status.c |  60 +++++++-
>  src/backend/utils/adt/pgstatfuncs.c         | 152 ++++++++++++++++++
>  src/include/catalog/pg_proc.dat             |   9 ++
>  src/include/miscadmin.h                     |   2 +
>  src/include/pgstat.h                        |  53 +++++++
>  src/include/storage/buf_internals.h         |   4 +-
>  src/include/utils/backend_status.h          |  80 ++++++++++
>  src/test/regress/expected/rules.out         |   8 +
>  16 files changed, 701 insertions(+), 31 deletions(-)

This is a pretty large change, I wonder if there's a way to make it a bit more
granular.



Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

November 25, 2021, 00:19:20

On Fri, Nov 19, 2021 at 11:49 AM Andres Freund <andres@anarazel.de> wrote:
> > +/*
> > + * On modern systems this is really just *counter++.  On some older systems
> > + * there might be more to it, due to inability to read and write 64 bit values
> > + * atomically.
> > + */
> > +static inline void inc_counter(pg_atomic_uint64 *counter)
> > +{
> > +     pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
> > +}
> > +
> >  #undef INSIDE_ATOMICS_H
>
> Why is this using a completely different naming scheme from the rest of the
> file?

It was what Thomas originally named it. Also, I noticed all the other
pg_atomic* in this file were wrappers around the same impl function, so
I thought maybe naming it this way would be confusing. I renamed it to
pg_atomic_inc_counter(), though maybe pg_atomic_readonly_write() would
be better?
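
For anyone following along without the tree handy, the pattern can be sketched in plain C11 atomics. This is an analogue, not the pg_atomic API: single_writer_counter, counter_inc and counter_read are hypothetical names. It assumes exactly one writer per counter (the owning process), with other processes only reading.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct single_writer_counter
{
	_Atomic uint64_t value;
} single_writer_counter;

/*
 * Increment without an atomic read-modify-write. This is only safe because
 * a single writer owns the counter; the separate atomic load and store just
 * ensure that concurrent readers never observe a torn 64-bit value.
 */
static inline void
counter_inc(single_writer_counter *c)
{
	uint64_t	cur = atomic_load_explicit(&c->value, memory_order_relaxed);

	atomic_store_explicit(&c->value, cur + 1, memory_order_relaxed);
}

/* Readers in other threads or processes get a whole, untorn value. */
static inline uint64_t
counter_read(single_writer_counter *c)
{
	return atomic_load_explicit(&c->value, memory_order_relaxed);
}
```

With two concurrent writers this would lose increments, so whatever name is chosen should probably hint at the single-writer restriction.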

>
> >  doc/src/sgml/monitoring.sgml                | 116 +++++++++++++-
> >  src/backend/catalog/system_views.sql        |  11 ++
> >  src/backend/postmaster/checkpointer.c       |   3 +-
> >  src/backend/postmaster/pgstat.c             | 161 +++++++++++++++++++-
> >  src/backend/storage/buffer/bufmgr.c         |  46 ++++--
> >  src/backend/storage/buffer/freelist.c       |  23 ++-
> >  src/backend/storage/buffer/localbuf.c       |   3 +
> >  src/backend/storage/sync/sync.c             |   1 +
> >  src/backend/utils/activity/backend_status.c |  60 +++++++-
> >  src/backend/utils/adt/pgstatfuncs.c         | 152 ++++++++++++++++++
> >  src/include/catalog/pg_proc.dat             |   9 ++
> >  src/include/miscadmin.h                     |   2 +
> >  src/include/pgstat.h                        |  53 +++++++
> >  src/include/storage/buf_internals.h         |   4 +-
> >  src/include/utils/backend_status.h          |  80 ++++++++++
> >  src/test/regress/expected/rules.out         |   8 +
> >  16 files changed, 701 insertions(+), 31 deletions(-)
>
> This is a pretty large change, I wonder if there's a way to make it a bit more
> granular.
>

I have done this. See v15 patch set attached.

- Melanie

v16 (also rebased) attached

On Fri, Nov 26, 2021 at 4:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Wed, Nov 24, 2021 at 07:15:59PM -0600, Justin Pryzby wrote:
> > There's extraneous blank lines in these functions:
> >
> > +pgstat_sum_io_path_ops
> > +pgstat_report_live_backend_io_path_ops
> > +pgstat_recv_resetsharedcounter
> > +GetIOPathDesc
> > +StrategyRejectBuffer
>
> + an extra blank line pgstat_reset_shared_counters.

Fixed

>
> In 0005:
>
> monitoring.sgml says that the columns in pg_stat_buffers are integers, but
> they're actually bigint.

Fixed

>
> +       tupstore = tuplestore_begin_heap(true, false, work_mem);
>
> You're passing a constant randomAccess=true to tuplestore_begin_heap ;)

Fixed

>
> +Datum all_values[NROWS][COLUMN_LENGTH];
>
> If you were to allocate this as an array, I think it could actually be 3-D:
> Datum all_values[BACKEND_NUM_TYPES-1][IOPATH_NUM_TYPES][COLUMN_LENGTH];

I've changed this to a 3D array as you suggested and removed the NROWS
macro.

> But I don't know if this is portable across postgres' supported platforms; I
> haven't seen any place which allocates a multidimensional array on the stack,
> nor passes one to a function:
>
> +static inline Datum *
> +get_pg_stat_buffers_row(Datum all_values[NROWS][COLUMN_LENGTH], BackendType backend_type, IOPath io_path)
>
> Maybe the allocation half is okay (I think it's ~3kB), but it seems easier to
> palloc the required amount than to research compiler behavior.

I think passing it to the function is okay. The parameter type would be
adjusted from an array to a pointer.
I am not sure if the allocation on the stack in the body of
pg_stat_get_buffers is too large. (left as is for now)
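
To illustrate the point about parameter adjustment, here is a minimal standalone sketch (not code from the patch). The names `N_BACKENDS`, `N_IOPATHS`, `N_COLUMNS`, `get_row` and `all_values_size` are hypothetical stand-ins, and the sizes are assumptions that mirror the figures mentioned in this thread (13 valid backend types, four IO paths, seven columns). A parameter declared as a 3-D array is adjusted by the compiler to a pointer to a 2-D array, so passing it copies nothing:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes, for illustration only. */
#define N_BACKENDS 13
#define N_IOPATHS  4
#define N_COLUMNS  7

typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

/*
 * The parameter type is adjusted to Datum (*)[N_IOPATHS][N_COLUMNS], so only
 * a pointer is passed. The returned row aliases the caller's array.
 */
static inline Datum *
get_row(Datum all_values[N_BACKENDS][N_IOPATHS][N_COLUMNS],
        int backend_type, int io_path)
{
    /* Backend types are 1-based here; the array is 0-based. */
    return all_values[backend_type - 1][io_path];
}

/* Size in bytes of the array a caller would place on its stack. */
static inline size_t
all_values_size(void)
{
    return sizeof(Datum[N_BACKENDS][N_IOPATHS][N_COLUMNS]);
}
```

With an 8-byte Datum these assumed sizes give 13 * 4 * 7 * 8 = 2912 bytes, which matches the "~3kB" estimate above.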

> That function is only used as a one-line helper, and doesn't use
> multidimensional array access anyway:
>
> +       return all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path];

With your suggested change to a 3D array, it now does use multidimensional
array access.

> I think it'd be better as a macro, like (I think)
> #define ROW(backend_type, io_path) all_values[NROWS*(backend_type-1)+io_path]

If I am understanding the idea of the macro, it would change the call
site from this:

+Datum *values = get_pg_stat_buffers_row(all_values,
beentry->st_backendType, io_path);

+values[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+values[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);

to this:

+Datum *row =  ROW(beentry->st_backendType, io_path);

+row[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+row[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);

I usually prefer functions to macros, but I am fine with changing it.
(I did not change it in this version)
I have changed all the local variables from "values" to "row" which
I think is a bit clearer.

> Maybe it should take the column type as a 3rd arg.

If I am understanding this idea, the call site would look like this now:
+CELL(beentry->st_backendType, io_path, COLUMN_FSYNCS) +=
pg_atomic_read_u64(&io_ops->fsyncs);
+CELL(beentry->st_backendType, io_path, COLUMN_ALLOCS) +=
pg_atomic_read_u64(&io_ops->allocs);

I don't like this as much. Since this code is inside of a loop, it kind
of makes sense to me that you get a row at the top of the loop and then
fill in all the cells in the row using that "row" variable.
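
The two call-site styles discussed above can be compared side by side in a small standalone sketch. Everything here (`Table`, `CELL`, `add_row_style`, `add_cell_style`, the sizes and the column enum) is hypothetical and only mirrors the shape of the code in the thread:

```c
#include <stdint.h>

#define N_BACKENDS 13           /* assumed number of valid backend types */
#define N_IOPATHS  4            /* assumed number of IO paths */

enum { COL_ALLOCS, COL_FSYNCS, N_COLUMNS };

typedef uint64_t Cell;
typedef Cell Table[N_BACKENDS][N_IOPATHS][N_COLUMNS];

/* CELL style: every access names the backend type, IO path and column. */
#define CELL(t, backend_type, io_path, col) \
    ((t)[(backend_type) - 1][(io_path)][(col)])

/* Row style: look the row up once, then fill in its cells. */
static void
add_row_style(Table t, int backend_type, int io_path,
              uint64_t allocs, uint64_t fsyncs)
{
    Cell *row = t[backend_type - 1][io_path];

    row[COL_ALLOCS] += allocs;
    row[COL_FSYNCS] += fsyncs;
}

/* Same effect, written with the three-argument macro. */
static void
add_cell_style(Table t, int backend_type, int io_path,
               uint64_t allocs, uint64_t fsyncs)
{
    CELL(t, backend_type, io_path, COL_ALLOCS) += allocs;
    CELL(t, backend_type, io_path, COL_FSYNCS) += fsyncs;
}
```

Both functions update the same cells. The difference is only whether the index arithmetic is repeated on each line or done once per loop iteration.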

> The enum with COLUMN_LENGTH should be named.

I only use the values in it, so it didn't need a name.

> Or maybe it should be removed, and the enum names moved to comments, like:
>
> +                       /* backend_type */
> +                       values[val++] = backend_type_desc;
>
> +                       /* io_path */
> +                       values[val++] = CStringGetTextDatum(GetIOPathDesc(io_path));
>
> +                       /* allocs */
> +                       values[val++] += io_ops->allocs - resets->allocs;
> ...

I find it easier to understand with it in code instead of as a comment.

> *Note the use of += and not =.

Thanks for seeing this. I have changed this (to use +=).

> Also:
> src/include/miscadmin.h:#define BACKEND_NUM_TYPES (B_LOGGER + 1)
>
> I think it's wrong to say NUM_TYPES = B_LOGGER + 1 (which would suggest using
> lessthan-or-equal instead of lessthan as you are).
>
> Since the valid backend types start at 1 , the "count" of backend types is
> currently B_LOGGER (13) - not 14.  I think you should remove the "+1" here.
> Then NROWS (if it continued to exist at all) wouldn't need to subtract one.

I think what I currently have is technically correct because I start at
1 when I am using it as a loop condition. I do waste a spot in the
arrays I allocate with BACKEND_NUM_TYPES size.

I was hesitant to make the value of BACKEND_NUM_TYPES == B_LOGGER
because it seems kind of weird to have it have the same value as the
B_LOGGER enum.

I am open to changing it. (I didn't change it in this v16).

- Melanie

Thanks again! I really appreciate the thorough review.

I have combined responses to all three of your emails below.
Let me know if it is more confusing to do it this way.

On Wed, Dec 1, 2021 at 6:59 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Wed, Dec 01, 2021 at 05:00:14PM -0500, Melanie Plageman wrote:
> > > Also:
> > > src/include/miscadmin.h:#define BACKEND_NUM_TYPES (B_LOGGER + 1)
> > >
> > > I think it's wrong to say NUM_TYPES = B_LOGGER + 1 (which would suggest using
> > > lessthan-or-equal instead of lessthan as you are).
> > >
> > > Since the valid backend types start at 1 , the "count" of backend types is
> > > currently B_LOGGER (13) - not 14.  I think you should remove the "+1" here.
> > > Then NROWS (if it continued to exist at all) wouldn't need to subtract one.
> >
> > I think what I currently have is technically correct because I start at
> > 1 when I am using it as a loop condition. I do waste a spot in the
> > arrays I allocate with BACKEND_NUM_TYPES size.
> >
> > I was hesitant to make the value of BACKEND_NUM_TYPES == B_LOGGER
> > because it seems kind of weird to have it have the same value as the
> > B_LOGGER enum.
>
> I don't mean to say that the code is misbehaving - I mean "num_x" means "the
> number of x's" - how many there are.  Since the first, valid backend type is 1,
> and they're numbered consecutively and without duplicates, then "the number of
> backend types" is the same as the value of the last one (B_LOGGER).  It's
> confusing if there's a macro called BACKEND_NUM_TYPES which is greater than the
> number of backend types.
>
> Most loops say for (int i=0; i<NUM; ++i)
> If it's 1-based, they say for (int i=1; i<=NUM; ++i)
> You have two different loops like:
>
> +       for (int i = 0; i < BACKEND_NUM_TYPES - 1 ; i++)
> +       for (int backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
>
> Both of these iterate over the correct number of backend types, but they both
> *look* wrong, which isn't desirable.

I've changed this and added comments wherever I could to make it clear.
Whenever the parameter was of type BackendType, I tried to use the
correct (not adjusted by subtracting 1) number and wherever the type was
int and being used as an index into the array, I used the adjusted value
and added the idx suffix to make it clear that the number does not
reflect the actual BackendType:
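
A sketch of what that naming convention might look like, using a shortened, hypothetical enum rather than the real BackendType, and assuming the "number of types" macro now equals the count of valid types:

```c
/* Hypothetical, shortened stand-in for the BackendType enum. */
typedef enum SketchBackendType
{
    SK_INVALID = 0,
    SK_AUTOVAC,
    SK_BACKEND,
    SK_LOGGER
} SketchBackendType;

/* Valid types are numbered from 1, so the count equals the last value. */
#define SK_NUM_TYPES SK_LOGGER

/* Loop over real backend types: 1-based, inclusive upper bound. */
static int
sum_by_backend_type(void)
{
    int sum = 0;

    for (int backend_type = 1; backend_type <= SK_NUM_TYPES; backend_type++)
        sum += backend_type;
    return sum;
}

/*
 * Loop over array slots: 0-based. The _idx suffix marks the value as an
 * array index, which is one less than the backend type it refers to.
 */
static int
sum_by_backend_type_idx(void)
{
    int sum = 0;

    for (int backend_type_idx = 0; backend_type_idx < SK_NUM_TYPES; backend_type_idx++)
        sum += backend_type_idx + 1;
    return sum;
}
```

Both loops visit the same set of backend types, and each reads as a conventional 1-based or 0-based loop.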

On Wed, Dec 1, 2021 at 10:31 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Wed, Dec 01, 2021 at 04:59:44PM -0500, Melanie Plageman wrote:
> > Thanks for the review!
> >
> > On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > There's extraneous blank lines in these functions:
> > > +pgstat_recv_resetsharedcounter
> > I didn't see one here
>
> => The extra blank line is after the RESET_BUFFERS memset.

Fixed.

> > +        * Reset the global, bgwriter and checkpointer statistics for the
> > +        * cluster.
>
> The first comma in this comment was introduced in 1bc8e7b09, and seems to be
> extraneous, since bgwriter and checkpointer are both global.  With the comma,
> it looks like it should be memsetting 3 things.

Fixed.

> > +               /* Don't count dead backends. They should already be counted */
>
> Maybe this comment should say ".. they'll be added below"

Fixed.

> > +                       row[COLUMN_BACKEND_TYPE] = backend_type_desc;
> > +                       row[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path));
> > +                       row[COLUMN_ALLOCS] += io_ops->allocs - resets->allocs;
> > +                       row[COLUMN_EXTENDS] += io_ops->extends - resets->extends;
> > +                       row[COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs;
> > +                       row[COLUMN_WRITES] += io_ops->writes - resets->writes;
> > +                       row[COLUMN_RESET_TIME] = reset_time;
>
> It'd be clearer if RESET_TIME were set adjacent to BACKEND_TYPE and IO_PATH.

If you mean just in the order here (not in the column order in the
view), then I have changed it as you recommended.

> This message needs to be updated:
>         errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")))

Done.

> When I query the view, I see reset times as: 1999-12-31 18:00:00-06.
> I guess it should be initialized like this one:
>         globalStats.bgwriter.stat_reset_timestamp = ts

Done.
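
For context, the reset time quoted above is what a zero timestamp looks like: PostgreSQL timestamps count microseconds from 2000-01-01 00:00:00 UTC, which is 1999-12-31 18:00:00 in a UTC-6 time zone. The following standalone sketch shows the conversion. `format_pg_timestamp_utc` is a hypothetical helper, not a PostgreSQL function, and it handles only non-negative values:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Seconds from the Unix epoch (1970-01-01) to 2000-01-01, both UTC. */
#define PG_EPOCH_OFFSET_SECS INT64_C(946684800)

/*
 * Format a PostgreSQL-style timestamp (microseconds since 2000-01-01 UTC)
 * as UTC text. Hypothetical helper for illustration.
 */
static void
format_pg_timestamp_utc(int64_t pg_usecs, char *buf, size_t buflen)
{
    time_t secs = (time_t) (pg_usecs / 1000000 + PG_EPOCH_OFFSET_SECS);
    struct tm *tm = gmtime(&secs);

    if (tm == NULL || strftime(buf, buflen, "%Y-%m-%d %H:%M:%S", tm) == 0)
        snprintf(buf, buflen, "(invalid)");
}
```

Formatting the value 0 this way yields 2000-01-01 00:00:00 in UTC, so a never-initialized reset timestamp shows up as the epoch rather than as a real reset time.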

> The cfbot shows failures now (I thought it was passing with the previous patch,
> but I suppose I'm wrong.)
>
> It looks like running recovery during single user mode hits this assertion.
> TRAP: FailedAssertion("beentry", File: "../../../../src/include/utils/backend_status.h", Line: 359, PID: 3499)
>

Yes, thank you for catching this.
I have moved up pgstat_beinit and pgstat_bestart so that the single-user
mode process will also have a PgBackendStatus. I also have to guard
against sending these stats to the collector, since there is no room for
the B_INVALID backend type in the array of IO Op values.

With this change `make check-world` passes on my machine.

On Wed, Dec 1, 2021 at 11:06 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Wed, Dec 01, 2021 at 04:59:44PM -0500, Melanie Plageman wrote:
> > Thanks for the review!
> >
> > On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > You wrote beentry++ at the start of two loops, but I think that's wrong; it
> > > should be at the end, as in the rest of the file (or as a loop increment).
> > > BackendStatusArray[0] is actually used (even though its backend has
> > > backendId==1, not 0).  "MyBEEntry = &BackendStatusArray[MyBackendId - 1];"
> >
> > I've fixed this in v16 which I will attach to the next email in the thread.
>
> I just noticed that since beentry++ is now at the end of the loop, it's being
> missed when you "continue":
>
> +               if (beentry->st_procpid == 0)
> +                       continue;

Fixed.
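
A minimal standalone sketch of the problem and of one possible fix (it is an assumption that the patch was fixed in a similar way; `SketchEntry` and both functions are hypothetical):

```c
typedef struct SketchEntry
{
    int procpid;                /* 0 means the slot is unused */
    int writes;
} SketchEntry;

/* Problematic shape: `continue` skips the increment at the end of the body. */
static int
sum_writes_increment_at_end(SketchEntry *entries, int n)
{
    SketchEntry *entry = entries;
    int sum = 0;

    for (int i = 0; i < n; i++)
    {
        if (entry->procpid == 0)
            continue;           /* entry is not advanced: stuck on this slot */
        sum += entry->writes;
        entry++;
    }
    return sum;
}

/* Increment in the loop header: it also runs after `continue`. */
static int
sum_writes_increment_in_header(SketchEntry *entries, int n)
{
    SketchEntry *entry = entries;
    int sum = 0;

    for (int i = 0; i < n; i++, entry++)
    {
        if (entry->procpid == 0)
            continue;
        sum += entry->writes;
    }
    return sum;
}
```

With an unused first slot, the first version never leaves that slot and counts nothing, while the second version skips it and sums the remaining entries.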

> Also, I saw that pgindent messed up and added spaces after pointers in function
> declarations, due to new typedefs not in typedefs.list:
>
> -pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
> +pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter * msg)
>
> -static inline void pg_atomic_inc_counter(pg_atomic_uint64 *counter)
> +static inline void
> +pg_atomic_inc_counter(pg_atomic_uint64 * counter)

Fixed.

-- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

December 9, 2021, 22:17:52

Hi,

On 2021-12-03 15:02:24 -0500, Melanie Plageman wrote:
> From e0f7f3dfd60a68fa01f3c023bcdb69305ade3738 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 11 Oct 2021 16:15:06 -0400
> Subject: [PATCH v17 1/7] Read-only atomic backend write function
> 
> For counters in shared memory which can be read by any backend but only
> written to by one backend, an atomic is still needed to protect against
> torn values, however, pg_atomic_fetch_add_u64() is overkill for
> incrementing the counter. pg_atomic_inc_counter() is a helper function
> which can be used to increment these values safely but without
> unnecessary overhead.
>
> Author: Thomas Munro
> ---
>  src/include/port/atomics.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
> index 856338f161..39ffff24dd 100644
> --- a/src/include/port/atomics.h
> +++ b/src/include/port/atomics.h
> @@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
>      return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
>  }
>  
> +/*
> + * On modern systems this is really just *counter++.  On some older systems
> + * there might be more to it, due to inability to read and write 64 bit values
> + * atomically.
> + */
> +static inline void
> +pg_atomic_inc_counter(pg_atomic_uint64 *counter)
> +{
> +    pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
> +}

I wonder if it's worth putting something in the name indicating that this is
not an actual atomic RMW operation. Perhaps adding _unlocked?
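
The distinction can be sketched with C11 atomics. This is an analogy only: PostgreSQL has its own atomics layer, and `counter_u64`, `counter_inc_unlocked`, `counter_inc_rmw` and `counter_read` are hypothetical names. The "unlocked" variant is safe only when a single process ever writes the counter; readers still never see a torn value.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct counter_u64
{
    _Atomic uint64_t value;
} counter_u64;

/*
 * Single-writer increment: an atomic load followed by an atomic store.
 * This is not an atomic read-modify-write, so concurrent writers could
 * lose updates.
 */
static inline void
counter_inc_unlocked(counter_u64 *c)
{
    uint64_t cur = atomic_load_explicit(&c->value, memory_order_relaxed);

    atomic_store_explicit(&c->value, cur + 1, memory_order_relaxed);
}

/* A real atomic read-modify-write, shown for comparison. */
static inline void
counter_inc_rmw(counter_u64 *c)
{
    atomic_fetch_add_explicit(&c->value, 1, memory_order_relaxed);
}

/* Any reader may call this concurrently with the single writer. */
static inline uint64_t
counter_read(counter_u64 *c)
{
    return atomic_load_explicit(&c->value, memory_order_relaxed);
}
```

A suffix such as _unlocked in the name makes it visible at the call site that the first variant relies on the single-writer assumption.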



> From b0e193cfa08f0b8cf1be929f26fe38f06a39aeae Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 24 Nov 2021 10:32:56 -0500
> Subject: [PATCH v17 2/7] Add IO operation counters to PgBackendStatus
> 
> Add an array of counters in PgBackendStatus which count the buffers
> allocated, extended, fsynced, and written by a given backend. Each "IO
> Op" (alloc, fsync, extend, write) is counted per "IO Path" (direct,
> local, shared, or strategy). "local" and "shared" IO Path counters count
> operations on local and shared buffers. The "strategy" IO Path counts
> buffers alloc'd/written/read/fsync'd as part of a BufferAccessStrategy.
> The "direct" IO Path counts blocks of IO which are read, written, or
> fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
> [Local]BufferAlloc()).
> 
> With this commit, all backends increment a counter in their
> PgBackendStatus when performing an IO operation. This is in preparation
> for future commits which will persist these stats upon backend exit and
> use the counters to provide observability of database IO operations.
> 
> Note that this commit does not add code to increment the "direct" path.
> A separate proposed patch [1] which would add wrappers for smgrwrite(),
> smgrextend(), and smgrimmedsync() would provide a good location to call
> pgstat_inc_ioop() for unbuffered IO and avoid regressions for future
> users of these functions.
>
> [1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com

On longer threads it's nice for committers to already have Reviewed-By: in the
commit message.

> diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
> index 7229598822..413cc605f8 100644
> --- a/src/backend/utils/activity/backend_status.c
> +++ b/src/backend/utils/activity/backend_status.c
> @@ -399,6 +399,15 @@ pgstat_bestart(void)
>      lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
>      lbeentry.st_progress_command_target = InvalidOid;
>      lbeentry.st_query_id = UINT64CONST(0);
> +    for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
> +    {
> +        IOOps       *io_ops = &lbeentry.io_path_stats[io_path];
> +
> +        pg_atomic_init_u64(&io_ops->allocs, 0);
> +        pg_atomic_init_u64(&io_ops->extends, 0);
> +        pg_atomic_init_u64(&io_ops->fsyncs, 0);
> +        pg_atomic_init_u64(&io_ops->writes, 0);
> +    }
>  
>      /*
>       * we don't zero st_progress_param here to save cycles; nobody should

nit: I think we nearly always have a blank line before loops


> diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
> index 646126edee..93f1b4bcfc 100644
> --- a/src/backend/utils/init/postinit.c
> +++ b/src/backend/utils/init/postinit.c
> @@ -623,6 +623,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
>          RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
>      }
>  
> +    pgstat_beinit();
>      /*
>       * Initialize local process's access to XLOG.
>       */

nit: same with multi-line comments.


> @@ -649,6 +650,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
>           */
>          CreateAuxProcessResourceOwner();
>  
> +        pgstat_bestart();
>          StartupXLOG();
>          /* Release (and warn about) any buffer pins leaked in StartupXLOG */
>          ReleaseAuxProcessResources(true);
> @@ -676,7 +678,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
>      EnablePortalManager();
>  
>      /* Initialize status reporting */
> -    pgstat_beinit();

I'd like to see changes like moving this kind of thing around broken out
and committed separately. It's much easier to pinpoint breakage if the CF
breaks after moving just pgstat_beinit() around, rather than when committing
this considerably larger patch. And reordering subsystem initialization has
the habit of causing problems...


> +/* ----------
> + * IO Stats reporting utility types
> + * ----------
> + */
> +
> +typedef enum IOOp
> +{
> +    IOOP_ALLOC,
> +    IOOP_EXTEND,
> +    IOOP_FSYNC,
> +    IOOP_WRITE,
> +} IOOp;
> [...]
> +/*
> + * Structure for counting all types of IOOps for a live backend.
> + */
> +typedef struct IOOps
> +{
> +    pg_atomic_uint64 allocs;
> +    pg_atomic_uint64 extends;
> +    pg_atomic_uint64 fsyncs;
> +    pg_atomic_uint64 writes;
> +} IOOps;

To me IOOp and IOOps sound too much alike - even though they're really kind of
separate things. s/IOOps/IOOpCounters/ maybe?


> @@ -3152,6 +3156,14 @@ pgstat_shutdown_hook(int code, Datum arg)
>  {
>      Assert(!pgstat_is_shutdown);
>  
> +    /*
> +     * Only need to send stats on IO Ops for IO Paths when a process exits.
> +     * Users requiring IO Ops for both live and exited backends can read from
> +     * live backends' PgBackendStatus and sum this with totals from exited
> +     * backends persisted by the stats collector.
> +     */
> +    pgstat_send_buffers();

Perhaps something like this comment belongs somewhere at the top of the file,
or in the header, or ...? It's a fairly central design piece, and it's not
obvious one would need to look in the shutdown hook for it?
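
The design described in that comment can be sketched as follows. This is one reading of the accounting, not the patch's code: `SketchIOCounts` and both functions are hypothetical, and the sketch assumes the persisted totals from exited backends are themselves zeroed at reset time, while live backends' counters are compensated by subtracting the values captured at the last reset.

```c
#include <stdint.h>

typedef struct SketchIOCounts
{
    uint64_t allocs;
    uint64_t extends;
    uint64_t fsyncs;
    uint64_t writes;
} SketchIOCounts;

/* Add src into dest; the caller is expected to zero dest first. */
static void
sketch_io_counts_add(SketchIOCounts *dest, const SketchIOCounts *src)
{
    dest->allocs += src->allocs;
    dest->extends += src->extends;
    dest->fsyncs += src->fsyncs;
    dest->writes += src->writes;
}

/*
 * Totals a view might report: counts persisted from exited backends, plus
 * counts read from live backends, minus the live counts captured at the
 * last reset.
 */
static SketchIOCounts
sketch_io_counts_total(const SketchIOCounts *exited,
                       const SketchIOCounts *live, int nlive,
                       const SketchIOCounts *resets)
{
    SketchIOCounts total = {0};

    sketch_io_counts_add(&total, exited);
    for (int i = 0; i < nlive; i++)
        sketch_io_counts_add(&total, &live[i]);

    total.allocs -= resets->allocs;
    total.extends -= resets->extends;
    total.fsyncs -= resets->fsyncs;
    total.writes -= resets->writes;
    return total;
}
```

Because a live backend only reports to the collector at exit, a reader has to combine both sources to get a complete picture.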


> +/*
> + * Before exiting, a backend sends its IO op statistics to the collector so
> + * that they may be persisted.
> + */
> +void
> +pgstat_send_buffers(void)
> +{
> +    PgStat_MsgIOPathOps msg;
> +
> +    PgBackendStatus *beentry = MyBEEntry;
> +
> +    /*
> +     * Though some backends with type B_INVALID (such as the single-user mode
> +     * process) do initialize and increment IO operations stats, there is no
> +     * spot in the array of IO operations for backends of type B_INVALID. As
> +     * such, do not send these to the stats collector.
> +     */
> +    if (!beentry || beentry->st_backendType == B_INVALID)
> +        return;

Why does single user mode use B_INVALID? That doesn't seem quite right.


> +    memset(&msg, 0, sizeof(msg));
> +    msg.backend_type = beentry->st_backendType;
> +
> +    pgstat_sum_io_path_ops(msg.iop.io_path_ops,
> +                           (IOOps *) &beentry->io_path_stats);
> +
> +    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
> +    pgstat_send(&msg, sizeof(msg));
> +}

It seems worth having a path skipping sending the message if there was no IO?
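
A sketch of such a check, under the assumption that the counters have already been read into a flat array of plain integers (`sketch_have_io` is a hypothetical helper, not part of the patch):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if any counter is nonzero, i.e. there is something to report. */
static bool
sketch_have_io(const uint64_t *counters, size_t ncounters)
{
    for (size_t i = 0; i < ncounters; i++)
    {
        if (counters[i] != 0)
            return true;
    }
    return false;
}
```

A send function could call this right after summing the backend's counters and return early when it yields false, before building and sending the message.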



> +/*
> + * Helper function to sum all live IO Op stats for all IO Paths (e.g. shared,
> + * local) to those in the equivalent stats structure for exited backends. Note
> + * that this adds and doesn't set, so the destination stats structure should be
> + * zeroed out by the caller initially. This would commonly be used to transfer
> + * all IO Op stats for all IO Paths for a particular backend type to the
> + * pgstats structure.
> + */
> +void
> +pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
> +{
> +    for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
> +    {

Sacrilegious, but I find io_path a harder-to-understand variable name for the
counter than i (or io_path_off or ...) ;)


> +static void
> +pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
> +{
> +    PgStatIOOps *src_io_path_ops;
> +    PgStatIOOps *dest_io_path_ops;
> +
> +    /*
> +     * Subtract 1 from message's BackendType to get a valid index into the
> +     * array of IO Ops which does not include an entry for B_INVALID
> +     * BackendType.
> +     */
> +    Assert(msg->backend_type > B_INVALID);

Probably worth also asserting the upper boundary?



> From f972ea87270feaed464a74fb6541ac04b4fc7d98 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 24 Nov 2021 11:39:48 -0500
> Subject: [PATCH v17 4/7] Add "buffers" to pgstat_reset_shared_counters
> 
> Backends count IO operations for various IO paths in their PgBackendStatus.
> Upon exit, they send these counts to the stats collector. Prior to this commit,
> these IO Ops stats would have been reset when the target was "bgwriter".
> 
> With this commit, target "bgwriter" no longer will cause the IO operations
> stats to be reset, and the IO operations stats can be reset with new target,
> "buffers".
> ---
>  doc/src/sgml/monitoring.sgml                |  2 +-
>  src/backend/postmaster/pgstat.c             | 83 +++++++++++++++++++--
>  src/backend/utils/activity/backend_status.c | 29 +++++++
>  src/include/pgstat.h                        |  8 +-
>  src/include/utils/backend_status.h          |  2 +
>  5 files changed, 117 insertions(+), 7 deletions(-)
> 
> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 62f2a3332b..bda3eef309 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -3604,7 +3604,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
>         <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
>        </para>
>        <para>
> -       Time at which these statistics were last reset
> +       Time at which these statistics were last reset.
>        </para></entry>
>       </row>
>      </tbody>

Hm?

Shouldn't this new reset target be documented?


> +/*
> + * Helper function to collect and send live backends' current IO operations
> + * stats counters when a stats reset is initiated so that they may be deducted
> + * from future totals.
> + */
> +static void
> +pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
> +{
> +    PgStatIOPathOps ops[BACKEND_NUM_TYPES];
> +
> +    memset(ops, 0, sizeof(ops));
> +    pgstat_report_live_backend_io_path_ops(ops);
> +
> +    /*
> +     * Iterate through the array of IO Ops for all IO Paths for each
> +     * BackendType. Because the array does not include a spot for BackendType
> +     * B_INVALID, add 1 to the index when setting backend_type so that there is
> +     * no confusion as to the BackendType with which this reset message
> +     * corresponds.
> +     */
> +    for (int backend_type_idx = 0; backend_type_idx < BACKEND_NUM_TYPES; backend_type_idx++)
> +    {
> +        msg->m_backend_resets.backend_type = backend_type_idx + 1;
> +        memcpy(&msg->m_backend_resets.iop, &ops[backend_type_idx],
> +                sizeof(msg->m_backend_resets.iop));
> +        pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
> +    }
> +}

Probably worth explaining why multiple messages are sent?


> @@ -5583,10 +5621,45 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
>  {
>      if (msg->m_resettarget == RESET_BGWRITER)
>      {
> -        /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
> -        memset(&globalStats, 0, sizeof(globalStats));
> +        /*
> +         * Reset the global bgwriter and checkpointer statistics for the
> +         * cluster.
> +         */
> +        memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
> +        memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
>          globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
>      }

Oh, is this a live bug?
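
The difference between the two reset styles in that hunk can be shown with a standalone sketch. `SketchGlobalStats` and its `buffers` member are hypothetical; whether the old code was actually a bug depends on what else lives in globalStats, which this excerpt does not show.

```c
#include <stdint.h>
#include <string.h>

typedef struct SketchGlobalStats
{
    struct { uint64_t timed_checkpoints; } checkpointer;
    struct { uint64_t buf_written_clean; } bgwriter;
    struct { uint64_t allocs; } buffers;    /* hypothetical extra member */
} SketchGlobalStats;

/* Zeroes everything, including members unrelated to the reset target. */
static void
sketch_reset_whole(SketchGlobalStats *stats)
{
    memset(stats, 0, sizeof(*stats));
}

/* Zeroes only the members that belong to the reset target. */
static void
sketch_reset_members(SketchGlobalStats *stats)
{
    memset(&stats->checkpointer, 0, sizeof(stats->checkpointer));
    memset(&stats->bgwriter, 0, sizeof(stats->bgwriter));
}
```

With the whole-struct memset, any member added later is silently cleared by the "bgwriter" target as well, which is the kind of side effect the per-member version avoids.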


> +        /*
> +         * Subtract 1 from the BackendType to arrive at a valid index in the
> +         * array, as it does not contain a spot for B_INVALID BackendType.
> +         */

Instead of repeating a comment about +- 1 in a bunch of places, would it look
better to have two helper inline functions for this purpose?
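
Such helpers might look like the following sketch. The function names and `SKETCH_NUM_BACKEND_TYPES` are hypothetical, and the value 13 is taken from the count of valid backend types mentioned earlier in this thread. Putting the conversion in one place also gives a natural home for asserting both the lower and the upper bound:

```c
#include <assert.h>

/* Assumed number of valid backend types (numbered 1..13, 0 is invalid). */
#define SKETCH_NUM_BACKEND_TYPES 13

/* Map a valid backend type to its slot in an array with no invalid entry. */
static inline int
sketch_backend_type_to_idx(int backend_type)
{
    assert(backend_type >= 1 && backend_type <= SKETCH_NUM_BACKEND_TYPES);
    return backend_type - 1;
}

/* Map an array slot back to the backend type it represents. */
static inline int
sketch_idx_to_backend_type(int idx)
{
    assert(idx >= 0 && idx < SKETCH_NUM_BACKEND_TYPES);
    return idx + 1;
}
```

Call sites then read as a named conversion instead of a bare +1 or -1 with a repeated explanatory comment.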



> +/*
> +* When adding a new column to the pg_stat_buffers view, add a new enum
> +* value here above COLUMN_LENGTH.
> +*/
> +enum
> +{
> +    COLUMN_BACKEND_TYPE,
> +    COLUMN_IO_PATH,
> +    COLUMN_ALLOCS,
> +    COLUMN_EXTENDS,
> +    COLUMN_FSYNCS,
> +    COLUMN_WRITES,
> +    COLUMN_RESET_TIME,
> +    COLUMN_LENGTH,
> +};

COLUMN_LENGTH seems like a fairly generic name...



> From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 24 Nov 2021 12:20:10 -0500
> Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats
> 
> Remove stats from pg_stat_bgwriter which are now more clearly expressed
> in pg_stat_buffers.
> 
> TODO:
> - make pg_stat_checkpointer view and move relevant stats into it
> - add additional stats to pg_stat_bgwriter

When do you think it makes sense to tackle these wrt committing some of the
patches?


> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 6926fc5742..67447f997a 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -2164,7 +2164,6 @@ BufferSync(int flags)
>              if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
>              {
>                  TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
> -                PendingCheckpointerStats.m_buf_written_checkpoints++;
>                  num_written++;
>              }
>          }
> @@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
>       */
>      strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
>  
> -    /* Report buffer alloc counts to pgstat */
> -    PendingBgWriterStats.m_buf_alloc += recent_alloc;
> -
>      /*
>       * If we're not running the LRU scan, just stop after doing the stats
>       * stuff.  We mark the saved state invalid so that we can recover sanely
> @@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
>              reusable_buffers++;
>      }
>  
> -    PendingBgWriterStats.m_buf_written_clean += num_written;
> -

Isn't num_written unused now, unless tracepoints are enabled? I'd expect some
compilers to warn... Perhaps we should just remove information from the
tracepoint?


Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

December 16, 2021, 00:40:27

v18 attached.

On Thu, Dec 9, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-12-03 15:02:24 -0500, Melanie Plageman wrote:
> > From e0f7f3dfd60a68fa01f3c023bcdb69305ade3738 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Mon, 11 Oct 2021 16:15:06 -0400
> > Subject: [PATCH v17 1/7] Read-only atomic backend write function
> >
> > For counters in shared memory which can be read by any backend but only
> > written to by one backend, an atomic is still needed to protect against
> > torn values, however, pg_atomic_fetch_add_u64() is overkill for
> > incrementing the counter. pg_atomic_inc_counter() is a helper function
> > which can be used to increment these values safely but without
> > unnecessary overhead.
> >
> > Author: Thomas Munro
> > ---
> >  src/include/port/atomics.h | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
> > index 856338f161..39ffff24dd 100644
> > --- a/src/include/port/atomics.h
> > +++ b/src/include/port/atomics.h
> > @@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
> >       return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
> >  }
> >
> > +/*
> > + * On modern systems this is really just *counter++.  On some older systems
> > + * there might be more to it, due to inability to read and write 64 bit values
> > + * atomically.
> > + */
> > +static inline void
> > +pg_atomic_inc_counter(pg_atomic_uint64 *counter)
> > +{
> > +     pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
> > +}
>
> I wonder if it's worth putting something in the name indicating that this is
> not an actual atomic RMW operation. Perhaps adding _unlocked?
>

Done.

>
> > From b0e193cfa08f0b8cf1be929f26fe38f06a39aeae Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 24 Nov 2021 10:32:56 -0500
> > Subject: [PATCH v17 2/7] Add IO operation counters to PgBackendStatus
> >
> > Add an array of counters in PgBackendStatus which count the buffers
> > allocated, extended, fsynced, and written by a given backend. Each "IO
> > Op" (alloc, fsync, extend, write) is counted per "IO Path" (direct,
> > local, shared, or strategy). "local" and "shared" IO Path counters count
> > operations on local and shared buffers. The "strategy" IO Path counts
> > buffers alloc'd/written/read/fsync'd as part of a BufferAccessStrategy.
> > The "direct" IO Path counts blocks of IO which are read, written, or
> > fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
> > [Local]BufferAlloc()).
> >
> > With this commit, all backends increment a counter in their
> > PgBackendStatus when performing an IO operation. This is in preparation
> > for future commits which will persist these stats upon backend exit and
> > use the counters to provide observability of database IO operations.
> >
> > Note that this commit does not add code to increment the "direct" path.
> > A separate proposed patch [1] which would add wrappers for smgrwrite(),
> > smgrextend(), and smgrimmedsync() would provide a good location to call
> > pgstat_inc_ioop() for unbuffered IO and avoid regressions for future
> > users of these functions.
> >
> > [1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com
>
> On longer threads it's nice for committers to already have Reviewed-By: in the
> commit message.

Done.

> > diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
> > index 7229598822..413cc605f8 100644
> > --- a/src/backend/utils/activity/backend_status.c
> > +++ b/src/backend/utils/activity/backend_status.c
> > @@ -399,6 +399,15 @@ pgstat_bestart(void)
> >       lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
> >       lbeentry.st_progress_command_target = InvalidOid;
> >       lbeentry.st_query_id = UINT64CONST(0);
> > +     for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
> > +     {
> > +             IOOps      *io_ops = &lbeentry.io_path_stats[io_path];
> > +
> > +             pg_atomic_init_u64(&io_ops->allocs, 0);
> > +             pg_atomic_init_u64(&io_ops->extends, 0);
> > +             pg_atomic_init_u64(&io_ops->fsyncs, 0);
> > +             pg_atomic_init_u64(&io_ops->writes, 0);
> > +     }
> >
> >       /*
> >        * we don't zero st_progress_param here to save cycles; nobody should
>
> nit: I think we nearly always have a blank line before loops

Done.

> > diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
> > index 646126edee..93f1b4bcfc 100644
> > --- a/src/backend/utils/init/postinit.c
> > +++ b/src/backend/utils/init/postinit.c
> > @@ -623,6 +623,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
> >               RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
> >       }
> >
> > +     pgstat_beinit();
> >       /*
> >        * Initialize local process's access to XLOG.
> >        */
>
> nit: same with multi-line comments.

Done.

> > @@ -649,6 +650,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
> >                */
> >               CreateAuxProcessResourceOwner();
> >
> > +             pgstat_bestart();
> >               StartupXLOG();
> >               /* Release (and warn about) any buffer pins leaked in StartupXLOG */
> >               ReleaseAuxProcessResources(true);
> > @@ -676,7 +678,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
> >       EnablePortalManager();
> >
> >       /* Initialize status reporting */
> > -     pgstat_beinit();
>
> I'd like to see changes like moving this kind of thing around broken out
> and committed separately. It's much easier to pinpoint breakage if the CF
> breaks after moving just pgstat_beinit() around, rather than when committing
> this considerably larger patch. And reordering subsystem initialization has
> the habit of causing problems...

Done

> > +/* ----------
> > + * IO Stats reporting utility types
> > + * ----------
> > + */
> > +
> > +typedef enum IOOp
> > +{
> > +     IOOP_ALLOC,
> > +     IOOP_EXTEND,
> > +     IOOP_FSYNC,
> > +     IOOP_WRITE,
> > +} IOOp;
> > [...]
> > +/*
> > + * Structure for counting all types of IOOps for a live backend.
> > + */
> > +typedef struct IOOps
> > +{
> > +     pg_atomic_uint64 allocs;
> > +     pg_atomic_uint64 extends;
> > +     pg_atomic_uint64 fsyncs;
> > +     pg_atomic_uint64 writes;
> > +} IOOps;
>
> To me IOOp and IOOps sound too much alike - even though they're really kind of
> separate things. s/IOOps/IOOpCounters/ maybe?

Done.

> > @@ -3152,6 +3156,14 @@ pgstat_shutdown_hook(int code, Datum arg)
> >  {
> >       Assert(!pgstat_is_shutdown);
> >
> > +     /*
> > +      * Only need to send stats on IO Ops for IO Paths when a process exits.
> > +      * Users requiring IO Ops for both live and exited backends can read from
> > +      * live backends' PgBackendStatus and sum this with totals from exited
> > +      * backends persisted by the stats collector.
> > +      */
> > +     pgstat_send_buffers();
>
> Perhaps something like this comment belongs somewhere at the top of the file,
> or in the header, or ...? It's a fairly central design piece, and it's not
> obvious one would need to look in the shutdown hook for it?
>

Now in pgstat.h, above the declaration of pgstat_send_buffers().

> > +/*
> > + * Before exiting, a backend sends its IO op statistics to the collector so
> > + * that they may be persisted.
> > + */
> > +void
> > +pgstat_send_buffers(void)
> > +{
> > +     PgStat_MsgIOPathOps msg;
> > +
> > +     PgBackendStatus *beentry = MyBEEntry;
> > +
> > +     /*
> > +      * Though some backends with type B_INVALID (such as the single-user mode
> > +      * process) do initialize and increment IO operations stats, there is no
> > +      * spot in the array of IO operations for backends of type B_INVALID. As
> > +      * such, do not send these to the stats collector.
> > +      */
> > +     if (!beentry || beentry->st_backendType == B_INVALID)
> > +             return;
>
> Why does single user mode use B_INVALID? That doesn't seem quite right.

I think PgBackendStatus->st_backendType is set from MyBackendType which
isn't set for the single user mode process. What BackendType would you
expect to see?

> > +     memset(&msg, 0, sizeof(msg));
> > +     msg.backend_type = beentry->st_backendType;
> > +
> > +     pgstat_sum_io_path_ops(msg.iop.io_path_ops,
> > +                                                (IOOps *) &beentry->io_path_stats);
> > +
> > +     pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
> > +     pgstat_send(&msg, sizeof(msg));
> > +}
>
> It seems worth having a path skipping sending the message if there was no IO?

Makes sense. I've updated pgstat_send_buffers() to loop over the counters
after calling pgstat_sum_io_path_ops() and skip sending if they are all zero.

I also thought about having pgstat_sum_io_path_ops() return a value to
indicate whether everything was 0, which could be useful to future
callers.

I didn't do this because I am not sure what the return value would be.
It could be a bool that is true if any IO was done and false if none was
done, but that doesn't really make sense given the function's name. It
would be called like
if (!pgstat_sum_io_path_ops())
  return
which I'm not sure is very clear.
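
One way to sidestep the naming problem described above is to keep the summing helper as it is and add a separate predicate. The following is only a minimal sketch under stated assumptions, not code from the patch: IOCounters, io_counters_add() and io_counters_any() are hypothetical names, IOPATH_NUM_TYPES is given an arbitrary value, and plain uint64_t stands in for pg_atomic_uint64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed number of IO paths, chosen only for this illustration. */
#define IOPATH_NUM_TYPES 3

/* Hypothetical, simplified stand-in for the per-path counters. */
typedef struct IOCounters
{
    uint64_t allocs;
    uint64_t extends;
    uint64_t fsyncs;
    uint64_t writes;
} IOCounters;

/* Add src into dest for every IO path; this adds and does not set. */
static void
io_counters_add(IOCounters *dest, const IOCounters *src)
{
    assert(dest != NULL && src != NULL);

    for (int i = 0; i < IOPATH_NUM_TYPES; i++)
    {
        dest[i].allocs += src[i].allocs;
        dest[i].extends += src[i].extends;
        dest[i].fsyncs += src[i].fsyncs;
        dest[i].writes += src[i].writes;
    }
}

/* Return true if any counter of any IO path is nonzero. */
static bool
io_counters_any(const IOCounters *counters)
{
    assert(counters != NULL);

    for (int i = 0; i < IOPATH_NUM_TYPES; i++)
    {
        if (counters[i].allocs || counters[i].extends ||
            counters[i].fsyncs || counters[i].writes)
            return true;
    }

    return false;
}
```

A caller would sum first and then test with something like "if (!io_counters_any(totals)) return;", which reads more naturally than overloading the return value of the summing function.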

> > +/*
> > + * Helper function to sum all live IO Op stats for all IO Paths (e.g. shared,
> > + * local) to those in the equivalent stats structure for exited backends. Note
> > + * that this adds and doesn't set, so the destination stats structure should be
> > + * zeroed out by the caller initially. This would commonly be used to transfer
> > + * all IO Op stats for all IO Paths for a particular backend type to the
> > + * pgstats structure.
> > + */
> > +void
> > +pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
> > +{
> > +     for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
> > +     {
>
> Sacrilegious, but I find io_path a harder to understand variable name for the
> counter than i (or io_path_off or ...) ;)

I've updated almost all my non-standard loop index variable names.

> > +static void
> > +pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
> > +{
> > +     PgStatIOOps *src_io_path_ops;
> > +     PgStatIOOps *dest_io_path_ops;
> > +
> > +     /*
> > +      * Subtract 1 from message's BackendType to get a valid index into the
> > +      * array of IO Ops which does not include an entry for B_INVALID
> > +      * BackendType.
> > +      */
> > +     Assert(msg->backend_type > B_INVALID);
>
> Probably worth also asserting the upper boundary?

Done.

> > From f972ea87270feaed464a74fb6541ac04b4fc7d98 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 24 Nov 2021 11:39:48 -0500
> > Subject: [PATCH v17 4/7] Add "buffers" to pgstat_reset_shared_counters
> >
> > Backends count IO operations for various IO paths in their PgBackendStatus.
> > Upon exit, they send these counts to the stats collector. Prior to this commit,
> > these IO Ops stats would have been reset when the target was "bgwriter".
> >
> > With this commit, target "bgwriter" no longer will cause the IO operations
> > stats to be reset, and the IO operations stats can be reset with new target,
> > "buffers".
> > ---
> >  doc/src/sgml/monitoring.sgml                |  2 +-
> >  src/backend/postmaster/pgstat.c             | 83 +++++++++++++++++++--
> >  src/backend/utils/activity/backend_status.c | 29 +++++++
> >  src/include/pgstat.h                        |  8 +-
> >  src/include/utils/backend_status.h          |  2 +
> >  5 files changed, 117 insertions(+), 7 deletions(-)
> >
> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > index 62f2a3332b..bda3eef309 100644
> > --- a/doc/src/sgml/monitoring.sgml
> > +++ b/doc/src/sgml/monitoring.sgml
> > @@ -3604,7 +3604,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
> >         <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
> >        </para>
> >        <para>
> > -       Time at which these statistics were last reset
> > +       Time at which these statistics were last reset.
> >        </para></entry>
> >       </row>
> >      </tbody>
>
> Hm?
>
> Shouldn't this new reset target be documented?

It is in the commit adding the view. I didn't include it in this commit
because the pg_stat_buffers view doesn't exist yet, as of this commit,
and I thought it would be odd to mention it in the docs (in this
commit).
As an aside, I shouldn't have left this correction in this commit. I
moved it now to the other one.

> > +/*
> > + * Helper function to collect and send live backends' current IO operations
> > + * stats counters when a stats reset is initiated so that they may be deducted
> > + * from future totals.
> > + */
> > +static void
> > +pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
> > +{
> > +     PgStatIOPathOps ops[BACKEND_NUM_TYPES];
> > +
> > +     memset(ops, 0, sizeof(ops));
> > +     pgstat_report_live_backend_io_path_ops(ops);
> > +
> > +     /*
> > +      * Iterate through the array of IO Ops for all IO Paths for each
> > +      * BackendType. Because the array does not include a spot for BackendType
> > +      * B_INVALID, add 1 to the index when setting backend_type so that there is
> > +      * no confusion as to the BackendType with which this reset message
> > +      * corresponds.
> > +      */
> > +     for (int backend_type_idx = 0; backend_type_idx < BACKEND_NUM_TYPES; backend_type_idx++)
> > +     {
> > +             msg->m_backend_resets.backend_type = backend_type_idx + 1;
> > +             memcpy(&msg->m_backend_resets.iop, &ops[backend_type_idx],
> > +                             sizeof(msg->m_backend_resets.iop));
> > +             pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
> > +     }
> > +}
>
> Probably worth explaining why multiple messages are sent?

Done.

> > @@ -5583,10 +5621,45 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
> >  {
> >       if (msg->m_resettarget == RESET_BGWRITER)
> >       {
> > -             /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
> > -             memset(&globalStats, 0, sizeof(globalStats));
> > +             /*
> > +              * Reset the global bgwriter and checkpointer statistics for the
> > +              * cluster.
> > +              */
> > +             memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
> > +             memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
> >               globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
> >       }
>
> Oh, is this a live bug?

I don't think it is a bug. globalStats only contained bgwriter and
checkpointer stats and those were all only displayed in
pg_stat_bgwriter, so memsetting the whole thing seems fine.

> > +             /*
> > +              * Subtract 1 from the BackendType to arrive at a valid index in the
> > +              * array, as it does not contain a spot for B_INVALID BackendType.
> > +              */
>
> Instead of repeating a comment about +- 1 in a bunch of places, would it look
> better to have two helper inline functions for this purpose?

Done.
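
For illustration, the two helpers could look roughly like the sketch below, which also asserts the upper boundary as suggested earlier in this review. The names backend_type_get_idx() and idx_get_backend_type() are the ones that show up in later versions of the patch in this thread; the shortened BackendType enum and the BACKEND_NUM_TYPES definition are assumptions made only to keep the example self-contained.

```c
#include <assert.h>

/* Shortened, hypothetical BackendType list; the real enum has more members. */
typedef enum BackendType
{
    B_INVALID = 0,
    B_AUTOVAC_WORKER,
    B_BACKEND,
    B_BG_WRITER,
    B_CHECKPOINTER
} BackendType;

/* Number of valid backend types, i.e. everything except B_INVALID. */
#define BACKEND_NUM_TYPES ((int) B_CHECKPOINTER)

/* Map a valid BackendType to its slot in an array without a B_INVALID entry. */
static inline int
backend_type_get_idx(BackendType backend_type)
{
    assert(backend_type > B_INVALID);
    assert((int) backend_type <= BACKEND_NUM_TYPES);

    return (int) backend_type - 1;
}

/* Inverse mapping: array slot back to the corresponding BackendType. */
static inline BackendType
idx_get_backend_type(int idx)
{
    assert(idx >= 0 && idx < BACKEND_NUM_TYPES);

    return (BackendType) (idx + 1);
}
```

Keeping the +1/-1 arithmetic in one place means the explanatory comment only needs to exist once, next to the helpers.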

> > +/*
> > +* When adding a new column to the pg_stat_buffers view, add a new enum
> > +* value here above COLUMN_LENGTH.
> > +*/
> > +enum
> > +{
> > +     COLUMN_BACKEND_TYPE,
> > +     COLUMN_IO_PATH,
> > +     COLUMN_ALLOCS,
> > +     COLUMN_EXTENDS,
> > +     COLUMN_FSYNCS,
> > +     COLUMN_WRITES,
> > +     COLUMN_RESET_TIME,
> > +     COLUMN_LENGTH,
> > +};
>
> COLUMN_LENGTH seems like a fairly generic name...

Changed.

> > From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 24 Nov 2021 12:20:10 -0500
> > Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats
> >
> > Remove stats from pg_stat_bgwriter which are now more clearly expressed
> > in pg_stat_buffers.
> >
> > TODO:
> > - make pg_stat_checkpointer view and move relevant stats into it
> > - add additional stats to pg_stat_bgwriter
>
> When do you think it makes sense to tackle these wrt committing some of the
> patches?

Well, the new stats are a superset of the old stats (no stats have been
removed that are not represented in the new or old views). So, I don't
see that as a blocker for committing these patches.

Since it is weird that pg_stat_bgwriter had mostly checkpointer stats,
I've edited this commit to rename that view to pg_stat_checkpointer.

I have not made a separate view just for maxwritten_clean (presumably
called pg_stat_bgwriter), but I would not be opposed to doing this if
you thought having a view with a single column isn't a problem (in the
event that we don't get around to adding more bgwriter stats right
away).

I noticed after changing the docs on the "bgwriter" target for
pg_stat_reset_shared to say "checkpointer", that it still said "bgwriter" in
  src/backend/po/ko.po
  src/backend/po/it.po
  ...
I presume these are automatically updated with some incantation, but I wasn't
sure what it was nor could I find documentation on this.

> > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> > index 6926fc5742..67447f997a 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -2164,7 +2164,6 @@ BufferSync(int flags)
> >                       if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
> >                       {
> >                               TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
> > -                             PendingCheckpointerStats.m_buf_written_checkpoints++;
> >                               num_written++;
> >                       }
> >               }
> > @@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
> >        */
> >       strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
> >
> > -     /* Report buffer alloc counts to pgstat */
> > -     PendingBgWriterStats.m_buf_alloc += recent_alloc;
> > -
> >       /*
> >        * If we're not running the LRU scan, just stop after doing the stats
> >        * stuff.  We mark the saved state invalid so that we can recover sanely
> > @@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
> >                       reusable_buffers++;
> >       }
> >
> > -     PendingBgWriterStats.m_buf_written_clean += num_written;
> > -
>
> Isn't num_written unused now, unless tracepoints are enabled? I'd expect some
> compilers to warn... Perhaps we should just remove information from the
> tracepoint?

The local variable num_written is used in BgBufferSync() to determine
whether or not to increment maxwritten_clean which is still represented
in the view pg_stat_checkpointer (formerly pg_stat_bgwriter).

A local variable num_written is used in BufferSync() to increment
CheckpointStats.ckpt_bufs_written which is logged in LogCheckpointEnd(),
so I'm not sure that can be removed.

- Melanie

On Thu, Dec 30, 2021 at 3:30 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Dec 21, 2021 at 8:32 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > On Thu, Dec 16, 2021 at 3:18 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
> > > > > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > > > > Date: Wed, 24 Nov 2021 12:20:10 -0500
> > > > > > Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats
> > > > > >
> > > > > > Remove stats from pg_stat_bgwriter which are now more clearly expressed
> > > > > > in pg_stat_buffers.
> > > > > >
> > > > > > TODO:
> > > > > > - make pg_stat_checkpointer view and move relevant stats into it
> > > > > > - add additional stats to pg_stat_bgwriter
> > > > >
> > > > > When do you think it makes sense to tackle these wrt committing some of the
> > > > > patches?
> > > >
> > > > Well, the new stats are a superset of the old stats (no stats have been
> > > > removed that are not represented in the new or old views). So, I don't
> > > > see that as a blocker for committing these patches.
> > >
> > > > Since it is weird that pg_stat_bgwriter had mostly checkpointer stats,
> > > > I've edited this commit to rename that view to pg_stat_checkpointer.
> > >
> > > > I have not made a separate view just for maxwritten_clean (presumably
> > > > called pg_stat_bgwriter), but I would not be opposed to doing this if
> > > > you thought having a view with a single column isn't a problem (in the
> > > > event that we don't get around to adding more bgwriter stats right
> > > > away).
> > >
> > > How about keeping old bgwriter values in place in the view , but generated
> > > from the new stats stuff?
> >
> > I tried this, but I actually don't think it is the right way to go. In
> > order to maintain the old view with the new source code, I had to add
> > new code to maintain a separate resets array just for the bgwriter view.
> > It adds some fiddly code that will be annoying to maintain (the reset
> > logic is confusing enough as is).
> > And, besides the implementation complexity, if a user resets
> > pg_stat_bgwriter and not pg_stat_buffers (or vice versa), they will
> > see totally different numbers for "buffers_backend" in pg_stat_bgwriter
> > than shared buffers written by B_BACKEND in pg_stat_buffers. I would
> > find that confusing.
>
> In a quick chat off-list, Andres suggested it might be okay to have a
> single reset target for both the pg_stat_buffers view and legacy
> pg_stat_bgwriter view. So, I am planning to share a new patchset which
> has only the new "buffers" target which will also reset the legacy
> pg_stat_bgwriter view.
>
> I'll also remove the bgwriter stats I proposed and the
> pg_stat_checkpointer view to keep things simple for now.
>

I've done the above in v20, attached.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

February 19, 2022, 19:06:18

v21 rebased with compile errors fixed is attached.

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

March 22, 2022, 03:15:05

Hi,

On 2022-02-19 11:06:18 -0500, Melanie Plageman wrote:
> v21 rebased with compile errors fixed is attached.

This currently doesn't apply (mea culpa likely): http://cfbot.cputube.org/patch_37_3272.log

Could you rebase? Marked as waiting-on-author for now.

- Andres

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Justin Pryzby

Date:

April 6, 2022, 19:16:44

I already rebased this in a local branch, so here it is.
I don't expect it to survive the day.

This should be updated to use the tuplestore helper.

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

July 5, 2022, 20:24:55

On Mon, Mar 21, 2022 at 8:15 PM Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2022-02-19 11:06:18 -0500, Melanie Plageman wrote:
> > v21 rebased with compile errors fixed is attached.
>
> This currently doesn't apply (mea culpa likely): http://cfbot.cputube.org/patch_37_3272.log
>
> Could you rebase? Marked as waiting-on-author for now.

Attached is the rebased/rewritten version of the pg_stat_buffers patch
which uses the cumulative stats system instead of stats collector.

I've moved to the model of backend-local pending stats which get
accumulated into shared memory by pgstat_report_stat().

It is worth noting that, with this method, other backends will no longer
have access to each other's individual IO operation statistics. An
argument could be made to keep the statistics in each backend in
PgBackendStatus before accumulating them to the cumulative stats system
so that they can be accessed at the per-backend level of detail.

There are two TODOs related to when pgstat_report_io_ops() should be
called. pgstat_report_io_ops() is meant for backends that will not
commonly call pgstat_report_stat(). I was unsure whether it made sense
for BootstrapModeMain() to call pgstat_report_io_ops() explicitly, and
whether an autovacuum worker should call it explicitly and, if so,
whether right after do_autovacuum() is the correct place for the call.

Archiver and syslogger do not increment or report IO operations.

I did not change pg_stat_bgwriter fields to derive from the IO
operations statistics structures since the reset targets differ.

Also, I added one test, but I'm not sure if it will be flaky. It tests
that the "writes" for checkpointer are tracked when data is inserted
into a table and then CHECKPOINT is explicitly invoked directly after. I
don't know if this will have a problem if the checkpointer is busy and
somehow the backend which dirtied the buffer is forced to write out its
own buffer, causing the test to potentially fail (even if the
checkpointer is doing other writes [causing it to be busy], it may not
do them in between the INSERT and the SELECT from pg_stat_buffers).

I am wondering how to add a non-flaky test. For regular backends, I
couldn't think of a way to suspend checkpointer to make them do their
own writes and fsyncs in the context of a regression or isolation test.
In fact for many of the dirty buffers it seems like it will be difficult
to keep bgwriter, checkpointer, and regular backends from competing and
sometimes causing test failures.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

July 6, 2022, 22:20:47

Hi,

On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
> From 2d089e26236c55d1be5b93833baa0cf7667ba38d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 28 Jun 2022 11:33:04 -0400
> Subject: [PATCH v22 1/3] Add BackendType for standalone backends
> 
> All backends should have a BackendType to enable statistics reporting
> per BackendType.
> 
> Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
> alphabetize the BackendTypes). Both the bootstrap backend and single
> user mode backends will have BackendType B_STANDALONE_BACKEND.
> 
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
> ---
>  src/backend/utils/init/miscinit.c | 17 +++++++++++------
>  src/include/miscadmin.h           |  5 +++--
>  2 files changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
> index eb43b2c5e5..07e6db1a1c 100644
> --- a/src/backend/utils/init/miscinit.c
> +++ b/src/backend/utils/init/miscinit.c
> @@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
>  {
>      Assert(!IsPostmasterEnvironment);
>  
> +    MyBackendType = B_STANDALONE_BACKEND;

Hm. This is used for singleuser mode as well as bootstrap. Should we
split those? It's not like bootstrap mode really matters for stats, so
I'm inclined not to.


> @@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
>       * out the initial relation mapping files.
>       */
>      RelationMapFinishBootstrap();
> +    // TODO: should this be done for bootstrap?
> +    pgstat_report_io_ops();

Hm. Not particularly useful, but also not harmful. But we don't need an
explicit call, because it'll be done at process exit too. At least I
think, it could be that it's different for bootstrap.



> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 2e146aac93..e6dbb1c4bb 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1712,6 +1712,9 @@ AutoVacWorkerMain(int argc, char *argv[])
>          recentXid = ReadNextTransactionId();
>          recentMulti = ReadNextMultiXactId();
>          do_autovacuum();
> +
> +        // TODO: should this be done more often somewhere in do_autovacuum()?
> +        pgstat_report_io_ops();
>      }

Don't think you need all these calls before process exit - it'll happen
via pgstat_shutdown_hook().

IMO it'd be a good idea to add pgstat_report_io_ops() to
pgstat_report_vacuum()/analyze(), so that the stats for a long-running
autovac worker get updated more regularly.


> diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
> index 91e6f6ea18..87e4b9e9bd 100644
> --- a/src/backend/postmaster/bgwriter.c
> +++ b/src/backend/postmaster/bgwriter.c
> @@ -242,6 +242,7 @@ BackgroundWriterMain(void)
>  
>          /* Report pending statistics to the cumulative stats system */
>          pgstat_report_bgwriter();
> +        pgstat_report_io_ops();
>  
>          if (FirstCallSinceLastCheckpoint())
>          {

How about moving the pgstat_report_io_ops() into
pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
unnecessary to have multiple pgstat_* calls in these places.



> +/*
> + * Flush out locally pending IO Operation statistics entries
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true. Writer processes are mutually excluded
> + * using LWLock, but readers are expected to use change-count protocol to avoid
> + * interference with writers.
> + *
> + * If nowait is true, this function returns true if the lock could not be
> + * acquired. Otherwise return false.
> + *
> + */
> +bool
> +pgstat_flush_io_ops(bool nowait)
> +{
> +    PgStat_IOPathOps *dest_io_path_ops;
> +    PgStatShared_BackendIOPathOps *stats_shmem;
> +
> +    PgBackendStatus *beentry = MyBEEntry;
> +
> +    if (!have_ioopstats)
> +        return false;
> +
> +    if (!beentry || beentry->st_backendType == B_INVALID)
> +        return false;
> +
> +    stats_shmem = &pgStatLocal.shmem->io_ops;
> +
> +    if (!nowait)
> +        LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
> +    else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
> +        return true;

Wonder if it's worth making the lock specific to the backend type?


> +    dest_io_path_ops =
> +        &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
> +

This could be done before acquiring the lock, right?


> +void
> +pgstat_io_ops_snapshot_cb(void)
> +{
> +    PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
> +    PgStat_IOPathOps *snapshot_ops = pgStatLocal.snapshot.io_path_ops;
> +    PgStat_IOPathOps *reset_ops;
> +
> +    PgStat_IOPathOps *reset_offset = stats_shmem->reset_offset;
> +    PgStat_IOPathOps reset[BACKEND_NUM_TYPES];
> +
> +    pgstat_copy_changecounted_stats(snapshot_ops,
> +            &stats_shmem->stats, sizeof(stats_shmem->stats),
> +            &stats_shmem->changecount);

This doesn't make sense - with multiple writers you can't use the
changecount approach (and you don't in the flush part above).


> +    LWLockAcquire(&stats_shmem->lock, LW_SHARED);
> +    memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
> +    LWLockRelease(&stats_shmem->lock);

Which then also means that you don't need the reset offset stuff. It's
only there because with the changecount approach we can't take a lock to
reset the stats (since there is no lock). With a lock you can just reset
the shared state.
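
To make the point above concrete, here is a minimal sketch of a lock-protected flush and reset. It is not the patch's code: SharedIOStats, PendingIOStats and all function names are hypothetical, and a C11 atomic_flag spinlock stands in for the LWLock so that the example is self-contained. The nowait return convention (true means "could not flush") mirrors the quoted pgstat_flush_io_ops() body.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical shared stats area; the atomic_flag stands in for an LWLock. */
typedef struct SharedIOStats
{
    atomic_flag lock;
    uint64_t    writes;
    uint64_t    fsyncs;
} SharedIOStats;

/* Hypothetical backend-local pending counters. */
typedef struct PendingIOStats
{
    uint64_t    writes;
    uint64_t    fsyncs;
} PendingIOStats;

static void
init_io_stats(SharedIOStats *shared)
{
    atomic_flag_clear(&shared->lock);
    shared->writes = 0;
    shared->fsyncs = 0;
}

/* Try to take the lock once; returns true if it was acquired. */
static bool
stats_lock_try(SharedIOStats *shared)
{
    return !atomic_flag_test_and_set_explicit(&shared->lock,
                                              memory_order_acquire);
}

static void
stats_lock(SharedIOStats *shared)
{
    while (!stats_lock_try(shared))
        ;                       /* spin; a real LWLock would sleep instead */
}

static void
stats_unlock(SharedIOStats *shared)
{
    atomic_flag_clear_explicit(&shared->lock, memory_order_release);
}

/*
 * Add pending counts to the shared counters under the lock. Returns true if
 * nowait was set and the lock was busy, so the pending counts were kept.
 */
static bool
flush_io_stats(SharedIOStats *shared, PendingIOStats *pending, bool nowait)
{
    assert(shared != NULL && pending != NULL);

    if (!nowait)
        stats_lock(shared);
    else if (!stats_lock_try(shared))
        return true;

    shared->writes += pending->writes;
    shared->fsyncs += pending->fsyncs;
    stats_unlock(shared);

    memset(pending, 0, sizeof(*pending));
    return false;
}

/* With a lock, a reset can simply zero the shared state; no reset offsets. */
static void
reset_io_stats(SharedIOStats *shared)
{
    stats_lock(shared);
    shared->writes = 0;
    shared->fsyncs = 0;
    stats_unlock(shared);
}
```

Because every writer and the reset path serialize on the same lock, readers can take the lock in shared mode (not modeled here) and copy the counters without any changecount protocol.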


> +void
> +pgstat_count_io_op(IOOp io_op, IOPath io_path)
> +{
> +    PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
> +    PgStat_IOOpCounters *cumulative_counters =
> +            &cumulative_IOOpStats.data[io_path];

The pending_/cumulative_ prefix before an uppercase-first camelcase name
seems ugly...

> +    switch (io_op)
> +    {
> +        case IOOP_ALLOC:
> +            pending_counters->allocs++;
> +            cumulative_counters->allocs++;
> +            break;
> +        case IOOP_EXTEND:
> +            pending_counters->extends++;
> +            cumulative_counters->extends++;
> +            break;
> +        case IOOP_FSYNC:
> +            pending_counters->fsyncs++;
> +            cumulative_counters->fsyncs++;
> +            break;
> +        case IOOP_WRITE:
> +            pending_counters->writes++;
> +            cumulative_counters->writes++;
> +            break;
> +    }
> +
> +    have_ioopstats = true;
> +}

Doing two math ops / memory accesses every time seems off. Seems better
to maintain cumulative_counters whenever reporting stats, just before
zeroing pending_counters?


> +/*
> + * Report IO operation statistics
> + *
> + * This works in much the same way as pgstat_flush_io_ops() but is meant for
> + * BackendTypes like bgwriter for whom pgstat_report_stat() will not be called
> + * frequently enough to keep shared memory stats fresh.
> + * Backends not typically calling pgstat_report_stat() can invoke
> + * pgstat_report_io_ops() explicitly.
> + */
> +void
> +pgstat_report_io_ops(void)
> +{

This shouldn't be needed - the flush function above can be used.


> +    PgStat_IOPathOps *dest_io_path_ops;
> +    PgStatShared_BackendIOPathOps *stats_shmem;
> +
> +    PgBackendStatus *beentry = MyBEEntry;
> +
> +    Assert(!pgStatLocal.shmem->is_shutdown);
> +    pgstat_assert_is_up();
> +
> +    if (!have_ioopstats)
> +        return;
> +
> +    if (!beentry || beentry->st_backendType == B_INVALID)
> +        return;

Is there a case where this may be called where we have no beentry?

Why not just use MyBackendType?


> +    stats_shmem = &pgStatLocal.shmem->io_ops;
> +
> +    dest_io_path_ops =
> +        &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
> +
> +    pgstat_begin_changecount_write(&stats_shmem->changecount);

As mentioned before, the changecount stuff doesn't apply here. You need a
lock.


> +PgStat_IOPathOps *
> +pgstat_fetch_backend_io_path_ops(void)
> +{
> +    pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
> +    return pgStatLocal.snapshot.io_path_ops;
> +}
> +
> +PgStat_Counter
> +pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op)
> +{
> +    PgStat_IOOpCounters *counters = &cumulative_IOOpStats.data[io_path];
> +
> +    switch (io_op)
> +    {
> +        case IOOP_ALLOC:
> +            return counters->allocs;
> +        case IOOP_EXTEND:
> +            return counters->extends;
> +        case IOOP_FSYNC:
> +            return counters->fsyncs;
> +        case IOOP_WRITE:
> +            return counters->writes;
> +        default:
> +            elog(ERROR, "IO Operation %s for IO Path %s is undefined.",
> +                    pgstat_io_op_desc(io_op), pgstat_io_path_desc(io_path));
> +    }
> +}

There's currently no user for this, right? Maybe let's just defer the
cumulative stuff until we need it?


> +const char *
> +pgstat_io_path_desc(IOPath io_path)
> +{
> +    const char *io_path_desc = "Unknown IO Path";
> +

This should be unreachable, right?
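
One common way to express "this should be unreachable" is to drop the default string and the default: label, handle every enum member explicitly, and fail loudly after the switch. The sketch below assumes an IOPath enum with IOPATH_LOCAL, IOPATH_SHARED and IOPATH_STRATEGY; only the "Shared" description is confirmed by the regression test in this thread, the other strings and the function name io_path_desc() are guesses. In PostgreSQL itself the trailing assert would more likely be elog(ERROR, ...) or pg_unreachable().

```c
#include <assert.h>
#include <stddef.h>

/* Assumed set of IO paths; the patch's enum may have more members. */
typedef enum IOPath
{
    IOPATH_LOCAL,
    IOPATH_SHARED,
    IOPATH_STRATEGY
} IOPath;

/*
 * Return a description for an IO path. There is deliberately no default:
 * label, so that compilers which warn about unhandled enum values in a
 * switch can flag a newly added IOPath member.
 */
static const char *
io_path_desc(IOPath io_path)
{
    switch (io_path)
    {
        case IOPATH_LOCAL:
            return "Local";
        case IOPATH_SHARED:
            return "Shared";
        case IOPATH_STRATEGY:
            return "Strategy";
    }

    /* Not reached for valid input. */
    assert(0 && "unrecognized IOPath");
    return NULL;
}
```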


> From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 4 Jul 2022 15:44:17 -0400
> Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type

> Add pg_stat_buffers, a system view which tracks the number of IO
> operations (allocs, writes, fsyncs, and extends) done through each IO
> path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
> backend.

I think I like pg_stat_io a bit better? Nearly everything in here seems
to fit better in that.

I guess we could split out buffers allocated, but that's actually
interesting in the context of the kind of IO too.


>       <row>
>        <entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
>        <entry>One row only, showing statistics about WAL activity. See
> @@ -3595,7 +3604,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
>         <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
>        </para>
>        <para>
> -       Time at which these statistics were last reset
> +       Time at which these statistics were last reset.
> +      </para></entry>

Grammar critique time :)


> +CREATE VIEW pg_stat_buffers AS
> +SELECT
> +       b.backend_type,
> +       b.io_path,
> +       b.alloc,
> +       b.extend,
> +       b.fsync,
> +       b.write,
> +       b.stats_reset
> +FROM pg_stat_get_buffers() b;

Do we want to expose all data to all users? I guess pg_stat_bgwriter
does? But this does split things out a lot more...



> +    for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> +    {
> +        PgStat_IOOpCounters *counters = io_path_ops->data;
> +        Datum        backend_type_desc =
> +            CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
> +            /* const char *log_name = GetBackendTypeDesc(idx_get_backend_type(i)); */
> +
> +        for (int j = 0; j < IOPATH_NUM_TYPES; j++)
> +        {
> +            Datum values[BUFFERS_NUM_COLUMNS];
> +            bool nulls[BUFFERS_NUM_COLUMNS];
> +            memset(values, 0, sizeof(values));
> +            memset(nulls, 0, sizeof(nulls));
> +
> +            values[BUFFERS_COLUMN_BACKEND_TYPE] = backend_type_desc;
> +            values[BUFFERS_COLUMN_IO_PATH] = CStringGetTextDatum(pgstat_io_path_desc(j));

Random musing: I wonder if we should start to use SQL level enums for
this kind of thing.


>  DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
>  DROP TABLE prevstats;
> +SELECT pg_stat_reset_shared('buffers');
> + pg_stat_reset_shared 
> +----------------------
> + 
> +(1 row)
> +
> +SELECT pg_stat_force_next_flush();
> + pg_stat_force_next_flush 
> +--------------------------
> + 
> +(1 row)
> +
> +SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
> + ?column? 
> +----------
> + t
> +(1 row)


Don't think you can rely on that. The lookup of the view, functions
might have needed to load catalog data, which might have needed to evict
buffers.  I think you can do something more reliable by checking that
there's more written buffers after a checkpoint than before, or such.


Would be nice to have something testing that the ringbuffer stats stuff
does something sensible - that feels not entirely trivial.

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

July 12, 2022, 05:22:28

Hi,

In the attached patch set, I've added the missing IO operations for
certain IO Paths and enumerated in the commit message which IO Paths
and IO Operations are currently not counted and/or not possible.

There is a TODO in HandleWalWriterInterrupts() about removing
pgstat_report_wal(), since it comes immediately before a proc_exit().

I was wondering if LocalBufferAlloc() should increment the counter or if
I should wait until GetLocalBufferStorage() to increment the counter.

I also realized that I am not differentiating between IOPATH_SHARED and
IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
of buffer we are fsync'ing by the time we call register_dirty_segment(),
I'm not sure how we would fix this.

On Wed, Jul 6, 2022 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
> From 2d089e26236c55d1be5b93833baa0cf7667ba38d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 28 Jun 2022 11:33:04 -0400
> Subject: [PATCH v22 1/3] Add BackendType for standalone backends
>
> All backends should have a BackendType to enable statistics reporting
> per BackendType.
>
> Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
> alphabetize the BackendTypes). Both the bootstrap backend and single
> user mode backends will have BackendType B_STANDALONE_BACKEND.
>
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
> ---
> src/backend/utils/init/miscinit.c | 17 +++++++++++------
> src/include/miscadmin.h | 5 +++--
> 2 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
> index eb43b2c5e5..07e6db1a1c 100644
> --- a/src/backend/utils/init/miscinit.c
> +++ b/src/backend/utils/init/miscinit.c
> @@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
> {
> Assert(!IsPostmasterEnvironment);
>
> + MyBackendType = B_STANDALONE_BACKEND;

Hm. This is used for singleuser mode as well as bootstrap. Should we
split those? It's not like bootstrap mode really matters for stats, so
I'm inclined not to.

I have no opinion currently.
It depends on how commonly you think developers might want separate
bootstrap and single user mode IO stats.

> @@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
> * out the initial relation mapping files.
> */
> RelationMapFinishBootstrap();
> + // TODO: should this be done for bootstrap?
> + pgstat_report_io_ops();

Hm. Not particularly useful, but also not harmful. But we don't need an
explicit call, because it'll be done at process exit too. At least I
think, it could be that it's different for bootstrap.

I've removed this and other occurrences which were before proc_exit()
(and thus redundant). (Though I did not explicitly check if it was
different for bootstrap.)

> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 2e146aac93..e6dbb1c4bb 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1712,6 +1712,9 @@ AutoVacWorkerMain(int argc, char *argv[])
> recentXid = ReadNextTransactionId();
> recentMulti = ReadNextMultiXactId();
> do_autovacuum();
> +
> + // TODO: should this be done more often somewhere in do_autovacuum()?
> + pgstat_report_io_ops();
> }

Don't think you need all these calls before process exit - it'll happen
via pgstat_shutdown_hook().

IMO it'd be a good idea to add pgstat_report_io_ops() to
pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
autovac worker get updated more regularly.

noted and fixed.

> diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
> index 91e6f6ea18..87e4b9e9bd 100644
> --- a/src/backend/postmaster/bgwriter.c
> +++ b/src/backend/postmaster/bgwriter.c
> @@ -242,6 +242,7 @@ BackgroundWriterMain(void)
>
> /* Report pending statistics to the cumulative stats system */
> pgstat_report_bgwriter();
> + pgstat_report_io_ops();
>
> if (FirstCallSinceLastCheckpoint())
> {

How about moving the pgstat_report_io_ops() into
pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
unnecessary to have multiple pgstat_* calls in these places.

noted and fixed.

> +/*
> + * Flush out locally pending IO Operation statistics entries
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true. Writer processes are mutually excluded
> + * using LWLock, but readers are expected to use change-count protocol to avoid
> + * interference with writers.
> + *
> + * If nowait is true, this function returns true if the lock could not be
> + * acquired. Otherwise return false.
> + *
> + */
> +bool
> +pgstat_flush_io_ops(bool nowait)
> +{
> + PgStat_IOPathOps *dest_io_path_ops;
> + PgStatShared_BackendIOPathOps *stats_shmem;
> +
> + PgBackendStatus *beentry = MyBEEntry;
> +
> + if (!have_ioopstats)
> + return false;
> +
> + if (!beentry || beentry->st_backendType == B_INVALID)
> + return false;
> +
> + stats_shmem = &pgStatLocal.shmem->io_ops;
> +
> + if (!nowait)
> + LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
> + else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
> + return true;

Wonder if it's worth making the lock specific to the backend type?

I've added another Lock into PgStat_IOPathOps so that each BackendType
can be locked separately. But, I've also kept the lock in
PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
done easily.

> + dest_io_path_ops =
> + &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
> +

This could be done before acquiring the lock, right?

> +void
> +pgstat_io_ops_snapshot_cb(void)
> +{
> + PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
> + PgStat_IOPathOps *snapshot_ops = pgStatLocal.snapshot.io_path_ops;
> + PgStat_IOPathOps *reset_ops;
> +
> + PgStat_IOPathOps *reset_offset = stats_shmem->reset_offset;
> + PgStat_IOPathOps reset[BACKEND_NUM_TYPES];
> +
> + pgstat_copy_changecounted_stats(snapshot_ops,
> + &stats_shmem->stats, sizeof(stats_shmem->stats),
> + &stats_shmem->changecount);

This doesn't make sense - with multiple writers you can't use the
changecount approach (and you don't in the flush part above).
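To make the objection concrete, here is a minimal sketch of a changecount (seqlock-style) protocol. It is a standalone illustration built on C11 atomics, not PostgreSQL's actual pgstat_begin_changecount_write() machinery. The names SharedCounters, cc_init, cc_begin_write, cc_end_write and cc_try_read are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared stats block guarded only by a change counter. */
typedef struct SharedCounters
{
    atomic_uint changecount;    /* even: stable, odd: a write is in progress */
    uint64_t    writes;
    uint64_t    extends;
} SharedCounters;

static void
cc_init(SharedCounters *s)
{
    atomic_init(&s->changecount, 0);
    s->writes = 0;
    s->extends = 0;
}

/* Writer side: bump the counter before and after touching the data. */
static void
cc_begin_write(SharedCounters *s)
{
    atomic_fetch_add_explicit(&s->changecount, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

static void
cc_end_write(SharedCounters *s)
{
    atomic_thread_fence(memory_order_release);
    atomic_fetch_add_explicit(&s->changecount, 1, memory_order_relaxed);
}

/*
 * Reader side: copy the data and accept the copy only if the counter was
 * even and unchanged across the copy. A real reader would retry in a loop
 * until this returns true.
 */
static bool
cc_try_read(SharedCounters *s, uint64_t *writes, uint64_t *extends)
{
    unsigned before = atomic_load_explicit(&s->changecount, memory_order_acquire);

    *writes = s->writes;
    *extends = s->extends;
    atomic_thread_fence(memory_order_acquire);

    unsigned after = atomic_load_explicit(&s->changecount, memory_order_relaxed);

    return before == after && (before & 1) == 0;
}
```

With a single writer, an odd counter reliably means "write in progress". If two writers overlap, both cc_begin_write() calls leave the counter even while both are still modifying the data, so cc_try_read() can accept a torn copy. That is why concurrent flushes from many backends need a lock instead.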

> + LWLockAcquire(&stats_shmem->lock, LW_SHARED);
> + memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
> + LWLockRelease(&stats_shmem->lock);

Which then also means that you don't need the reset offset stuff. It's
only there because with the changecount approach we can't take a lock to
reset the stats (since there is no lock). With a lock you can just reset
the shared state.

Yes, I believe I have cleaned up all of this embarrassing mess. I use the
lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the
locks in PgStat_IOPathOps for flush.

> +void
> +pgstat_count_io_op(IOOp io_op, IOPath io_path)
> +{
> + PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
> + PgStat_IOOpCounters *cumulative_counters =
> + &cumulative_IOOpStats.data[io_path];

The pending_/cumulative_ prefix before an uppercase-first camelcase name
seems ugly...

> + switch (io_op)
> + {
> + case IOOP_ALLOC:
> + pending_counters->allocs++;
> + cumulative_counters->allocs++;
> + break;
> + case IOOP_EXTEND:
> + pending_counters->extends++;
> + cumulative_counters->extends++;
> + break;
> + case IOOP_FSYNC:
> + pending_counters->fsyncs++;
> + cumulative_counters->fsyncs++;
> + break;
> + case IOOP_WRITE:
> + pending_counters->writes++;
> + cumulative_counters->writes++;
> + break;
> + }
> +
> + have_ioopstats = true;
> +}

Doing two math ops / memory accesses every time seems off. Seems better
to maintain cumulative_counters whenever reporting stats, just before
zeroing pending_counters?

I've gone ahead and cut the cumulative counters concept.
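A minimal sketch of the pending-only scheme, assuming simplified stand-ins for the patch's types: the hot path does a single increment on backend-local counters, and totals are only touched at flush time. The names IOCounters, count_io_op, flush_io_ops and shared_totals are hypothetical, and locking of the shared entry is left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the patch's IOOp / IOPath enums. */
typedef enum { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_WRITE, IOOP_NUM_TYPES } IOOp;
typedef enum { IOPATH_LOCAL, IOPATH_SHARED, IOPATH_STRATEGY, IOPATH_NUM_TYPES } IOPath;

typedef struct IOCounters
{
    uint64_t counts[IOPATH_NUM_TYPES][IOOP_NUM_TYPES];
} IOCounters;

static IOCounters pending;          /* backend-local, not yet flushed */
static IOCounters shared_totals;    /* stand-in for the shared-memory entry */
static bool have_pending = false;

/* Hot path: one increment per IO operation. */
static void
count_io_op(IOOp io_op, IOPath io_path)
{
    pending.counts[io_path][io_op]++;
    have_pending = true;
}

/* Cold path: fold pending counts into the totals, then zero them. */
static void
flush_io_ops(void)
{
    if (!have_pending)
        return;

    for (int p = 0; p < IOPATH_NUM_TYPES; p++)
        for (int o = 0; o < IOOP_NUM_TYPES; o++)
            shared_totals.counts[p][o] += pending.counts[p][o];

    memset(&pending, 0, sizeof(pending));
    have_pending = false;
}
```

If a per-backend cumulative view is ever needed, it could be maintained inside flush_io_ops() in the same loop, which keeps the per-operation cost at one increment.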

> +/*
> + * Report IO operation statistics
> + *
> + * This works in much the same way as pgstat_flush_io_ops() but is meant for
> + * BackendTypes like bgwriter for whom pgstat_report_stat() will not be called
> + * frequently enough to keep shared memory stats fresh.
> + * Backends not typically calling pgstat_report_stat() can invoke
> + * pgstat_report_io_ops() explicitly.
> + */
> +void
> +pgstat_report_io_ops(void)
> +{

This shouldn't be needed - the flush function above can be used.

Fixed.

> + PgStat_IOPathOps *dest_io_path_ops;
> + PgStatShared_BackendIOPathOps *stats_shmem;
> +
> + PgBackendStatus *beentry = MyBEEntry;
> +
> + Assert(!pgStatLocal.shmem->is_shutdown);
> + pgstat_assert_is_up();
> +
> + if (!have_ioopstats)
> + return;
> +
> + if (!beentry || beentry->st_backendType == B_INVALID)
> + return;

Is there a case where this may be called where we have no beentry?

Why not just use MyBackendType?

Fixed.

> + stats_shmem = &pgStatLocal.shmem->io_ops;
> +
> + dest_io_path_ops =
> + &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
> +
> + pgstat_begin_changecount_write(&stats_shmem->changecount);

As mentioned before, the changecount stuff doesn't apply here. You need a
lock.

Fixed.

> +PgStat_IOPathOps *
> +pgstat_fetch_backend_io_path_ops(void)
> +{
> + pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
> + return pgStatLocal.snapshot.io_path_ops;
> +}
> +
> +PgStat_Counter
> +pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op)
> +{
> + PgStat_IOOpCounters *counters = &cumulative_IOOpStats.data[io_path];
> +
> + switch (io_op)
> + {
> + case IOOP_ALLOC:
> + return counters->allocs;
> + case IOOP_EXTEND:
> + return counters->extends;
> + case IOOP_FSYNC:
> + return counters->fsyncs;
> + case IOOP_WRITE:
> + return counters->writes;
> + default:
> + elog(ERROR, "IO Operation %s for IO Path %s is undefined.",
> + pgstat_io_op_desc(io_op), pgstat_io_path_desc(io_path));
> + }
> +}

There's currently no user for this, right? Maybe let's just defer the
cumulative stuff until we need it?

Removed.

> +const char *
> +pgstat_io_path_desc(IOPath io_path)
> +{
> + const char *io_path_desc = "Unknown IO Path";
> +

This should be unreachable, right?

Changed it to an error.

> From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 4 Jul 2022 15:44:17 -0400
> Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type

> Add pg_stat_buffers, a system view which tracks the number of IO
> operations (allocs, writes, fsyncs, and extends) done through each IO
> path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
> backend.

I think I like pg_stat_io a bit better? Nearly everything in here seems
to fit better in that.

I guess we could split out buffers allocated, but that's actually
interesting in the context of the kind of IO too.

changed it to pg_stat_io

> +CREATE VIEW pg_stat_buffers AS
> +SELECT
> + b.backend_type,
> + b.io_path,
> + b.alloc,
> + b.extend,
> + b.fsync,
> + b.write,
> + b.stats_reset
> +FROM pg_stat_get_buffers() b;

Do we want to expose all data to all users? I guess pg_stat_bgwriter
does? But this does split things out a lot more...

I didn't see another similar example limiting access.

> DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
> DROP TABLE prevstats;
> +SELECT pg_stat_reset_shared('buffers');
> + pg_stat_reset_shared
> +----------------------
> +
> +(1 row)
> +
> +SELECT pg_stat_force_next_flush();
> + pg_stat_force_next_flush
> +--------------------------
> +
> +(1 row)
> +
> +SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
> + ?column?
> +----------
> + t
> +(1 row)

Don't think you can rely on that. The lookup of the view, functions
might have needed to load catalog data, which might have needed to evict
buffers. I think you can do something more reliable by checking that
there's more written buffers after a checkpoint than before, or such.

Yes, per an off list suggestion by you, I have changed the tests to use a
sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
the missing calls to count IO Operations for IOPATH_LOCAL and
IOPATH_STRATEGY.

I struggled to come up with a way to test writes for a particular
type of backend are counted correctly since a dirty buffer could be
written out by another type of backend before the target BackendType has
a chance to write it out.

I also struggled to come up with a way to test IO operations for
background workers. I'm not sure of a way to deterministically have a
background worker do a particular kind of IO in a test scenario.

I'm not sure how to cause a strategy "extend" for testing.

Would be nice to have something testing that the ringbuffer stats stuff
does something sensible - that feels not entirely trivial.

I've added a test to test that reused strategy buffers are counted as
allocs. I would like to add a test which checks that if a buffer in the
ring is pinned and thus not reused, that it is not counted as a strategy
alloc, but I found it challenging without a way to pause vacuuming, pin
a buffer, then resume vacuuming.

Thanks,

Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Kyotaro Horiguchi

Date:

July 12, 2022, 11:06:21

At Mon, 11 Jul 2022 22:22:28 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in 
> Hi,
> 
> In the attached patch set, I've added in missing IO operations for
> certain IO Paths as well as enumerating in the commit message which IO
> Paths and IO Operations are not currently counted and/or not possible.
> 
> There is a TODO in HandleWalWriterInterrupts() about removing
> pgstat_report_wal() since it is immediately before a proc_exit()

Right. walwriter does that without needing the explicit call.

> I was wondering if LocalBufferAlloc() should increment the counter or if
> I should wait until GetLocalBufferStorage() to increment the counter.

It depends on what "allocate" means. Unlike shared buffers, local
buffers are taken from the OS and then allocated to pages.  OS-allocated
pages are limited by num_temp_buffers, so I think what we're interested in
is the count incremented by LocalBufferAlloc(). (That is the counterpart
of alloc for shared buffers.)

> I also realized that I am not differentiating between IOPATH_SHARED and
> IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> of buffer we are fsync'ing by the time we call register_dirty_segment(),
> I'm not sure how we would fix this.

I think flushes rarely happen for strategy-loaded buffers.  If
that is right, IOOP_FSYNC would not make much sense for
IOPATH_STRATEGY.

> On Wed, Jul 6, 2022 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
> > > @@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
> > >  {
> > >       Assert(!IsPostmasterEnvironment);
> > >
> > > +     MyBackendType = B_STANDALONE_BACKEND;
> >
> > Hm. This is used for singleuser mode as well as bootstrap. Should we
> > split those? It's not like bootstrap mode really matters for stats, so
> > I'm inclined not to.
> >
> >
> I have no opinion currently.
> It depends on how commonly you think developers might want separate
> bootstrap and single user mode IO stats.

Regarding stats, I don't think separating them makes much sense.

> > > @@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool
> > check_only)
> > >        * out the initial relation mapping files.
> > >        */
> > >       RelationMapFinishBootstrap();
> > > +     // TODO: should this be done for bootstrap?
> > > +     pgstat_report_io_ops();
> >
> > Hm. Not particularly useful, but also not harmful. But we don't need an
> > explicit call, because it'll be done at process exit too. At least I
> > think, it could be that it's different for bootstrap.
>
> I've removed this and other occurrences which were before proc_exit()
> (and thus redundant). (Though I did not explicitly check if it was
> different for bootstrap.)

pgstat_report_stat(true) is supposed to be called as needed via
before_shmem_hook so I think that's the right thing.

> > IMO it'd be a good idea to add pgstat_report_io_ops() to
> > pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
> > autovac worker get updated more regularly.
> >
> 
> noted and fixed.

> > How about moving the pgstat_report_io_ops() into
> > pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
> > unnecessary to have multiple pgstat_* calls in these places.
> >
> >
> >
> noted and fixed.

+     * Also report IO Operations statistics

I think that the function comment also should mention this.

> > Wonder if it's worth making the lock specific to the backend type?
> >
> 
> I've added another Lock into PgStat_IOPathOps so that each BackendType
> can be locked separately. But, I've also kept the lock in
> PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> done easily.

Looks fine about the lock separation.
By the way, in the following line:

+        &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];

backend_type_get_idx(x) is actually (x - 1) plus an assertion on the
value range, and the only use is here. There's a reverse
function, which is also used in only one place.

+        Datum        backend_type_desc =
+            CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));

In this usage, GetBackendTypeDesc() handles out-of-domain values
gracefully, but idx_get_backend_type() kills the process for the
same input. This is inconsistent.

My humble opinion is that we shouldn't define the two functions and
should instead replace the calls to them with (x +/- 1).  In addition, I
think we should not abort() on invalid backend types.  In that sense, I
wonder if we could use the B_INVALID'th element for this purpose.
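As a rough illustration of the (x - 1) mapping without a helper that aborts, the sketch below range-checks the backend type and returns NULL for out-of-domain values. The enum is a hypothetical subset of BackendType, and IOStatsEntry and io_stats_for are invented names. It does not attempt to interpret the B_INVALID'th-element idea.

```c
#include <stddef.h>

/* Hypothetical subset of BackendType; only B_INVALID == 0 matters here. */
typedef enum BackendType
{
    B_INVALID = 0,
    B_AUTOVAC_WORKER,
    B_BACKEND,
    B_BG_WRITER,
    B_CHECKPOINTER
} BackendType;

/* Valid types are 1..B_CHECKPOINTER, so the array has that many slots. */
#define BACKEND_NUM_TYPES B_CHECKPOINTER

typedef struct IOStatsEntry
{
    unsigned long writes;
} IOStatsEntry;

static IOStatsEntry io_stats[BACKEND_NUM_TYPES];

/*
 * Map a backend type to its stats slot using (type - 1).
 * Out-of-domain values yield NULL instead of killing the process.
 */
static IOStatsEntry *
io_stats_for(BackendType type)
{
    int t = (int) type;

    if (t <= B_INVALID || t > BACKEND_NUM_TYPES)
        return NULL;
    return &io_stats[t - 1];
}
```

Callers then decide how to react to NULL (skip the row, raise an error), which keeps the policy consistent in one place.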

> > > +     LWLockAcquire(&stats_shmem->lock, LW_SHARED);
> > > +     memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
> > > +     LWLockRelease(&stats_shmem->lock);
> >
> > Which then also means that you don't need the reset offset stuff. It's
> > only there because with the changecount approach we can't take a lock to
> > reset the stats (since there is no lock). With a lock you can just reset
> > the shared state.
> >
> 
> Yes, I believe I have cleaned up all of this embarrassing mess. I use the
> lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the
> locks in PgStat_IOPathOps for flush.

Looks fine, but I think pgstat_flush_io_ops() needs more comments like
other pgstat_flush_* functions.

+    for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+        stats_shmem->stats[i].stat_reset_timestamp = ts;

I'm not sure we need a separate reset timestamp for each backend type
but SLRU counter does the same thing..

> > > +pgstat_report_io_ops(void)
> > > +{
> >
> > This shouldn't be needed - the flush function above can be used.
> >
> 
> Fixed.

The commit message of 0002 contains that name:p

> > > +const char *
> > > +pgstat_io_path_desc(IOPath io_path)
> > > +{
> > > +     const char *io_path_desc = "Unknown IO Path";
> > > +
> >
> > This should be unreachable, right?
> >
> 
> Changed it to an error.

+    elog(ERROR, "Attempt to describe an unknown IOPath");

I think we usually spell it as ("unrecognized IOPath value: %d", io_path).

> > > From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > Date: Mon, 4 Jul 2022 15:44:17 -0400
> > > Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type
> >
> > > Add pg_stat_buffers, a system view which tracks the number of IO
> > > operations (allocs, writes, fsyncs, and extends) done through each IO
> > > path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
> > > backend.
> >
> > I think I like pg_stat_io a bit better? Nearly everything in here seems
> > to fit better in that.
> >
> > I guess we could split out buffers allocated, but that's actually
> > interesting in the context of the kind of IO too.
> >
> 
> changed it to pg_stat_io

A bit different thing, but I felt a little uneasy about some uses of
"pgstat_io_ops". IOOp looks like a neighbouring word of IOPath. On the
other hand, actually iopath is used as an attribute of io_ops in many
places.  Couldn't we be more consistent about the relationship between
the names?

IOOp   -> PgStat_IOOpType
IOPath -> PgStat_IOPath
PgStat_IOOpCounters -> PgStat_IOCounters
PgStat_IOPathOps    -> PgStat_IO
pgstat_count_io_op  -> pgstat_count_io
...

(Better wordings are welcome.)

> > > +CREATE VIEW pg_stat_buffers AS
> > > +SELECT
> > > +       b.backend_type,
> > > +       b.io_path,
> > > +       b.alloc,
> > > +       b.extend,
> > > +       b.fsync,
> > > +       b.write,
> > > +       b.stats_reset
> > > +FROM pg_stat_get_buffers() b;
> >
> > Do we want to expose all data to all users? I guess pg_stat_bgwriter
> > does? But this does split things out a lot more...
> >
> >
> I didn't see another similar example limiting access.

(The doc told me that) pg_buffercache view is restricted to
pg_monitor. But other activity-stats(aka stats collector:)-related
pg_stat_* views are not restricted to pg_monitor.

doc> pg_monitor    Read/execute various monitoring views and functions.

Hmm....

> > Don't think you can rely on that. The lookup of the view, functions
> > might have needed to load catalog data, which might have needed to evict
> > buffers.  I think you can do something more reliable by checking that
> > there's more written buffers after a checkpoint than before, or such.
> >
> >
> Yes, per an off list suggestion by you, I have changed the tests to use a
> sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
> the missing calls to count IO Operations for IOPATH_LOCAL and
> IOPATH_STRATEGY.
> 
> I struggled to come up with a way to test writes for a particular
> type of backend are counted correctly since a dirty buffer could be
> written out by another type of backend before the target BackendType has
> a chance to write it out.
> 
> I also struggled to come up with a way to test IO operations for
> background workers. I'm not sure of a way to deterministically have a
> background worker do a particular kind of IO in a test scenario.
> 
> I'm not sure how to cause a strategy "extend" for testing.

I'm not sure what you are expecting, but for example, "create table t
as select generate_series(0, 99999)" increments Strategy-extend by
about 400.  (I'm surprised that autovac worker-shared-extend has
a non-zero number.)


> > Would be nice to have something testing that the ringbuffer stats stuff
> > does something sensible - that feels not entirely trivial.
> >
> >
> I've added a test to test that reused strategy buffers are counted as
> allocs. I would like to add a test which checks that if a buffer in the
> ring is pinned and thus not reused, that it is not counted as a strategy
> alloc, but I found it challenging without a way to pause vacuuming, pin
> a buffer, then resume vacuuming.

===

If I'm not missing something, in BufferAlloc, when a strategy is not
used and the victim is dirty, iopath is determined based on the
uninitialized from_ring.  It seems to me from_ring is equivalent to
strategy_current_was_in_ring.  And if StrategyGetBuffer has set
from_ring to false, StrategyRejectBuffer may set it to true, which is
wrong. The logic around there seems to need a rethink.

What can we read from the values separated into Shared and Strategy?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

July 12, 2022, 19:19:06

Thanks for the review!

On Tue, Jul 12, 2022 at 4:06 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Mon, 11 Jul 2022 22:22:28 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in
> Hi,
>
> In the attached patch set, I've added in missing IO operations for
> certain IO Paths as well as enumerating in the commit message which IO
> Paths and IO Operations are not currently counted and/or not possible.
>
> There is a TODO in HandleWalWriterInterrupts() about removing
> pgstat_report_wal() since it is immediately before a proc_exit()

Right. walwriter does that without needing the explicit call.

I have deleted it.

> I was wondering if LocalBufferAlloc() should increment the counter or if
> I should wait until GetLocalBufferStorage() to increment the counter.

It depends on what "allocate" means. Unlike shared buffers, local
buffers are taken from the OS and then allocated to pages. OS-allocated
pages are limited by num_temp_buffers, so I think what we're interested in
is the count incremented by LocalBufferAlloc(). (That is the counterpart
of alloc for shared buffers.)

I've left it in LocalBufferAlloc().

> I also realized that I am not differentiating between IOPATH_SHARED and
> IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> of buffer we are fsync'ing by the time we call register_dirty_segment(),
> I'm not sure how we would fix this.

I think flushes rarely happen for strategy-loaded buffers. If
that is right, IOOP_FSYNC would not make much sense for
IOPATH_STRATEGY.

Why would it be less likely for a backend to do its own fsync when
flushing a dirty strategy buffer than a regular dirty shared buffer?

> > IMO it'd be a good idea to add pgstat_report_io_ops() to
> > pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
> > autovac worker get updated more regularly.
> >
>
> noted and fixed.

> > How about moving the pgstat_report_io_ops() into
> > pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
> > unnecessary to have multiple pgstat_* calls in these places.
> >
> >
> >
> noted and fixed.

+ * Also report IO Operations statistics

I think that the function comment also should mention this.

I've added comments at the top of all these functions.

> > Wonder if it's worth making the lock specific to the backend type?
> >
>
> I've added another Lock into PgStat_IOPathOps so that each BackendType
> can be locked separately. But, I've also kept the lock in
> PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> done easily.

Looks fine about the lock separation.

Actually, I think it is not safe to use both of these locks. If I have to
pick one method, it is probably better to go with the locks in
PgStat_IOPathOps. That will be more efficient for flush (though not for
fetching and resetting), so that is probably the way to go, right?
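A minimal sketch of that single-level scheme, assuming one lock per backend-type entry and no outer lock. A toy spinlock built on C11 atomic_flag stands in for an LWLock, and all names here (TypeStats, flush_writes, reset_all, snapshot_total, NUM_BACKEND_TYPES) are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_BACKEND_TYPES 4     /* hypothetical; the patch uses BACKEND_NUM_TYPES */

typedef struct TypeStats
{
    atomic_flag lock;           /* toy spinlock standing in for an LWLock */
    uint64_t    writes;
} TypeStats;

static TypeStats stats[NUM_BACKEND_TYPES];

static void
stats_init(void)
{
    for (int i = 0; i < NUM_BACKEND_TYPES; i++)
    {
        atomic_flag_clear(&stats[i].lock);
        stats[i].writes = 0;
    }
}

static void
entry_lock(TypeStats *e)
{
    while (atomic_flag_test_and_set_explicit(&e->lock, memory_order_acquire))
        ;                       /* spin until the holder releases */
}

static void
entry_unlock(TypeStats *e)
{
    atomic_flag_clear_explicit(&e->lock, memory_order_release);
}

/* Flush: contends only with other backends of the same type. */
static void
flush_writes(int backend_idx, uint64_t pending_writes)
{
    TypeStats *e = &stats[backend_idx];

    entry_lock(e);
    e->writes += pending_writes;
    entry_unlock(e);
}

/* Reset-all: walk the array, holding one entry lock at a time. */
static void
reset_all(void)
{
    for (int i = 0; i < NUM_BACKEND_TYPES; i++)
    {
        entry_lock(&stats[i]);
        stats[i].writes = 0;
        entry_unlock(&stats[i]);
    }
}

/* Snapshot: same walk; each entry is consistent, the set as a whole is not atomic. */
static uint64_t
snapshot_total(void)
{
    uint64_t total = 0;

    for (int i = 0; i < NUM_BACKEND_TYPES; i++)
    {
        entry_lock(&stats[i]);
        total += stats[i].writes;
        entry_unlock(&stats[i]);
    }
    return total;
}
```

The trade-off matches the one described above: flush takes a single uncontended-by-other-types lock, while fetch and reset pay for one lock acquisition per backend type.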

By the way, in the following line:

+ &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];

backend_type_get_idx(x) is actually (x - 1) plus an assertion on the
value range, and the only use is here. There's a reverse
function, which is also used in only one place.

+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));

In this usage, GetBackendTypeDesc() handles out-of-domain values
gracefully, but idx_get_backend_type() kills the process for the
same input. This is inconsistent.

My humble opinion is that we shouldn't define the two functions and
should instead replace the calls to them with (x +/- 1). In addition, I
think we should not abort() on invalid backend types. In that sense, I
wonder if we could use the B_INVALID'th element for this purpose.

I think that GetBackendTypeDesc() should probably also error out for an
unknown value.

I would be open to not using the helper functions. I thought it would be
less error-prone, but since it is limited to the code in
pgstat_io_ops.c, it is probably okay. Let me think a bit more.

Could you explain more about what you mean about using B_INVALID
BackendType?

> > > + LWLockAcquire(&stats_shmem->lock, LW_SHARED);
> > > + memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
> > > + LWLockRelease(&stats_shmem->lock);
> >
> > Which then also means that you don't need the reset offset stuff. It's
> > only there because with the changecount approach we can't take a lock to
> > reset the stats (since there is no lock). With a lock you can just reset
> > the shared state.
> >
>
> Yes, I believe I have cleaned up all of this embarrassing mess. I use the
> lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the
> locks in PgStat_IOPathOps for flush.

Looks fine, but I think pgstat_flush_io_ops() needs more comments like
other pgstat_flush_* functions.

+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ stats_shmem->stats[i].stat_reset_timestamp = ts;

I'm not sure we need a separate reset timestamp for each backend type
but SLRU counter does the same thing..

Yes, I think for SLRU stats it is because you can reset individual SLRU
stats. Also there is no wrapper data structure to put it in. I could
keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
operation stats at once, but I am thinking of getting rid of
PgStatShared_BackendIOPathOps since it is not needed if I only keep the
locks in PgStat_IOPathOps and make the global shared value an array of
PgStat_IOPathOps.

> > > +pgstat_report_io_ops(void)
> > > +{
> >
> > This shouldn't be needed - the flush function above can be used.
> >
>
> Fixed.

The commit message of 0002 contains that name:p

Thanks! Fixed.

> > > +const char *
> > > +pgstat_io_path_desc(IOPath io_path)
> > > +{
> > > + const char *io_path_desc = "Unknown IO Path";
> > > +
> >
> > This should be unreachable, right?
> >
>
> Changed it to an error.

+ elog(ERROR, "Attempt to describe an unknown IOPath");

I think we usually spell it as ("unrecognized IOPath value: %d", io_path).

I have changed to this.

> > > From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > Date: Mon, 4 Jul 2022 15:44:17 -0400
> > > Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type
> >
> > > Add pg_stat_buffers, a system view which tracks the number of IO
> > > operations (allocs, writes, fsyncs, and extends) done through each IO
> > > path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
> > > backend.
> >
> > I think I like pg_stat_io a bit better? Nearly everything in here seems
> > to fit better in that.
> >
> > I guess we could split out buffers allocated, but that's actually
> > interesting in the context of the kind of IO too.
> >
>
> changed it to pg_stat_io

A bit different thing, but I felt a little uneasy about some uses of
"pgstat_io_ops". IOOp looks like a neighbouring word of IOPath. On the
other hand, actually iopath is used as an attribute of io_ops in many
places. Couldn't we be more consistent about the relationship between
the names?

IOOp -> PgStat_IOOpType
IOPath -> PgStat_IOPath
PgStat_IOOpCounters -> PgStat_IOCounters
PgStat_IOPathOps -> PgStat_IO
pgstat_count_io_op -> pgstat_count_io
...

(Better wordings are welcome.)

Let me think about naming and make changes in the next version.

> > Would be nice to have something testing that the ringbuffer stats stuff
> > does something sensible - that feels not entirely trivial.
> >
> >
> I've added a test to test that reused strategy buffers are counted as
> allocs. I would like to add a test which checks that if a buffer in the
> ring is pinned and thus not reused, that it is not counted as a strategy
> alloc, but I found it challenging without a way to pause vacuuming, pin
> a buffer, then resume vacuuming.

===

If I'm not missing something, in BufferAlloc, when strategy is not
used and the victim is dirty, iopath is determined based on the
uninitialized from_ring. It seems to me from_ring is equivalent to
strategy_current_was_in_ring. And if StrategyGetBuffer has set
from_ring to false, StrategyRejectBuffer may set it to true, which is
wrong. The logic around there seems to need a rethink.

What can we read from the values separated to Shared and Strategy?

Date:

July 13, 2022, 20:14:52

Attached patch set is substantially different enough from previous
versions that I kept it as a new patch set.
Note that local buffer allocations are now correctly tracked.

On Tue, Jul 12, 2022 at 1:01 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote:
> > > I also realized that I am not differentiating between IOPATH_SHARED and
> > > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> > > of buffer we are fsync'ing by the time we call register_dirty_segment(),
> > > I'm not sure how we would fix this.
> >
> > I think there scarcely happens flush for strategy-loaded buffers. If
> > that is sensible, IOOP_FSYNC would not make much sense for
> > IOPATH_STRATEGY.
> >
>
> Why would it be less likely for a backend to do its own fsync when
> flushing a dirty strategy buffer than a regular dirty shared buffer?

We really just don't expect a backend to do many segment fsyncs at
all. Otherwise there's something wrong with the forwarding mechanism.

When a dirty strategy buffer is written out, if the pendingOps sync queue
is full and the backend has to fsync the segment itself instead of relying
on the checkpointer, this will show in the statistics as an IOOP_FSYNC
for IOPATH_SHARED, not IOPATH_STRATEGY.

IOPATH_STRATEGY + IOOP_FSYNC will always be 0 for all BackendTypes.

Does this seem right?

It'd be different if we tracked WAL fsyncs more granularly - which would be
quite interesting - but that's something for another day^Wpatch.

I do have a question about this.
So, if we were to start tracking WAL IO would it fit within this
paradigm to have a new IOPATH_WAL for WAL or would it add a separate
dimension?

I was thinking that we might want to consider calling this view
pg_stat_io_data because we might want to have a separate view,
pg_stat_io_wal and then, maybe eventually, convert pg_stat_slru to
pg_stat_io_slru (or a subset of what is in pg_stat_slru).
And maybe then later add pg_stat_io_[archiver/other]

Is pg_stat_io_data a good name that gives us flexibility to
introduce views which expose per-backend IO operation stats (maybe that
goes in pg_stat_activity, though [or maybe not because it wouldn't
include exited backends?]) and per query IO operation stats?

I would like to add roughly the same additional columns to all of
these during AIO development (basically the columns from iostat):
- average block size (will usually be 8kB for pg_stat_io_data but won't
necessarily for the others)
- IOPS/BW
- avg read/write wait time
- demand rate/completion rate
- merges
- maybe queue depth

And I would like to be able to see all of these per query, per backend,
per relation, per BackendType, per IOPath, per SLRU type, etc.

Basically, what I'm asking is
1) what can we name the view to enable these future stats to exist with
the least confusing/wordy view names?
2) will the current view layout and column titles work with minimal
changes for future stats extensions like what I mention above?

> > > > Wonder if it's worth making the lock specific to the backend type?
> > > >
> > >
> > > I've added another Lock into PgStat_IOPathOps so that each BackendType
> > > can be locked separately. But, I've also kept the lock in
> > > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> > > done easily.
> >
> > Looks fine about the lock separation.
> >
>
> Actually, I think it is not safe to use both of these locks. So for
> picking one method, it is probably better to go with the locks in
> PgStat_IOPathOps, it will be more efficient for flush (and not for
> fetching and resetting), so that is probably the way to go, right?

I think it's good to just use one kind of lock, and efficiency of snapshotting
/ resetting is nearly irrelevant. But I don't see why it's not safe to use
both kinds of locks?

The way I implemented it was not safe because I didn't use both locks
when resetting the stats.

In this new version of the patch, I've done the following: In shared
memory I've put the lock in PgStatShared_IOPathOps -- the data structure
which contains an array of PgStat_IOOpCounters for all IOOp types for
all IOPaths. Thus, different BackendType + IOPath combinations can be
updated concurrently without contending for the same lock.

To make this work, I made two versions of the PgStat_IOPathOps -- one
that has the lock, PgStatShared_IOPathOps, and one without,
PgStat_IOPathOps, so that I can persist it to the stats file without
writing and reading the LWLock and can have a local and snapshot version
of the data structure without the lock.

This also necessitated two versions of the data structure wrapping
PgStat_IOPathOps, PgStat_BackendIOPathOps, which contains an array with
a PgStat_IOPathOps for each BackendType, and
PgStatShared_BackendIOPathOps, containing an array of
PgStatShared_IOPathOps.
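
To make the layout above concrete, here is a minimal sketch of the two parallel sets of structures and a snapshot copy that skips the lock. All Sketch* names, the array sizes, and the integer stand-in for an LWLock are assumptions made for illustration; they are not the patch's actual definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in sizes; the real enums live in the patch and may differ. */
#define SKETCH_NUM_BACKEND_TYPES 4
#define SKETCH_NUM_IOPATHS 3
#define SKETCH_NUM_IOOPS 5

/* Stand-in for an LWLock: it only records whether it is held. */
typedef struct SketchLock
{
	int			held;
} SketchLock;

/* Counters for every IOOp of one IOPath. */
typedef struct SketchIOOpCounters
{
	uint64_t	counts[SKETCH_NUM_IOOPS];
} SketchIOOpCounters;

/* Lock-free form: usable locally, in snapshots, and in the stats file. */
typedef struct SketchIOPathOps
{
	SketchIOOpCounters data[SKETCH_NUM_IOPATHS];
} SketchIOPathOps;

/* Shared-memory form: same payload plus one lock per BackendType. */
typedef struct SketchSharedIOPathOps
{
	SketchLock	lock;
	SketchIOOpCounters data[SKETCH_NUM_IOPATHS];
} SketchSharedIOPathOps;

/* Wrappers holding one entry per BackendType and the reset timestamp. */
typedef struct SketchBackendIOPathOps
{
	int64_t		stat_reset_timestamp;
	SketchIOPathOps stats[SKETCH_NUM_BACKEND_TYPES];
} SketchBackendIOPathOps;

typedef struct SketchSharedBackendIOPathOps
{
	int64_t		stat_reset_timestamp;
	SketchSharedIOPathOps stats[SKETCH_NUM_BACKEND_TYPES];
} SketchSharedBackendIOPathOps;

void
sketch_lock(SketchLock *l)
{
	assert(!l->held);
	l->held = 1;
}

void
sketch_unlock(SketchLock *l)
{
	assert(l->held);
	l->held = 0;
}

/*
 * Copy only the counters, one BackendType at a time, never the lock.
 * The timestamp is read unlocked here for brevity; real code would need
 * to protect it with some lock.
 */
void
sketch_snapshot(SketchSharedBackendIOPathOps *shared,
				SketchBackendIOPathOps *snap)
{
	snap->stat_reset_timestamp = shared->stat_reset_timestamp;

	for (int i = 0; i < SKETCH_NUM_BACKEND_TYPES; i++)
	{
		sketch_lock(&shared->stats[i].lock);
		memcpy(snap->stats[i].data, shared->stats[i].data,
			   sizeof(snap->stats[i].data));
		sketch_unlock(&shared->stats[i].lock);
	}
}
```

Because the lock sits next to the payload rather than inside it, the same memcpy works for writing the stats file, for snapshots, and for backend-local copies.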

> > Looks fine, but I think pgstat_flush_io_ops() need more comments like
> > other pgstat_flush_* functions.
> >
> > + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> > + stats_shmem->stats[i].stat_reset_timestamp = ts;
> >
> > I'm not sure we need a separate reset timestamp for each backend type
> > but SLRU counter does the same thing..
> >
>
> Yes, I think for SLRU stats it is because you can reset individual SLRU
> stats. Also there is no wrapper data structure to put it in. I could
> keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
> operation stats at once, but I am thinking of getting rid of
> PgStatShared_BackendIOPathOps since it is not needed if I only keep the
> locks in PgStat_IOPathOps and make the global shared value an array of
> PgStat_IOPathOps.

I'm strongly against introducing super granular reset timestamps. I think that
was a mistake for SLRU stats, but we can't fix that as easily.

Since all stats in pg_stat_io must be reset at the same time, I've put
the reset timestamp in the PgStat[Shared]_BackendIOPathOps and
removed it from each PgStat[Shared]_IOPathOps.

> Currently, strategy allocs count only reuses of a strategy buffer (not
> initial shared buffers which are added to the ring).
> strategy writes count only the writing out of dirty buffers which are
> already in the ring and are being reused.

That seems right to me.

> Alternatively, we could also count as strategy allocs all those buffers
> which are added to the ring and count as strategy writes all those
> shared buffers which are dirty when initially added to the ring.

I don't think that'd provide valuable information. The whole reason that
strategy writes are interesting is that they can lead to writing out data a
lot sooner than they would be written out without a strategy being used.

Then I agree that strategy writes should only count strategy buffers
that are written out in order to reuse the buffer (which is in lieu of
getting a new, potentially clean, shared buffer). This patch implements
that behavior.

However, for strategy allocs, it seems like we would want to count all
demand for buffers as part of a BufferAccessStrategy. So, that would
include allocating buffers to initially fill the ring, allocations of
new shared buffers after the ring was already full that are added to the
ring because all existing buffers in the ring are pinned, and buffers
already in the ring which are being reused.

This version of the patch only counts the third scenario as a strategy
allocation, but I think it would make more sense to count all three as
strategy allocs.

The downside of this behavior is that strategy allocs count different
scenarios than strategy writes, reads, and extends. But, I think that
this is okay.

I'll clarify it in the docs once there is a decision.
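
The two counting policies for strategy allocs discussed above can be put side by side in a small sketch. The scenario names and the function are hypothetical and exist only to illustrate the difference; the patch does not contain them.

```c
#include <assert.h>
#include <stdbool.h>

/* The three ways a buffer can be handed out under a BufferAccessStrategy. */
typedef enum SketchAllocScenario
{
	SKETCH_ALLOC_RING_FILL,		/* buffer taken to initially fill the ring */
	SKETCH_ALLOC_RING_REPLACE,	/* new shared buffer added to a full ring
								 * because the ring's buffers were pinned */
	SKETCH_ALLOC_RING_REUSE		/* buffer already in the ring is reused */
} SketchAllocScenario;

typedef enum SketchIOPath
{
	SKETCH_IOPATH_SHARED,
	SKETCH_IOPATH_STRATEGY
} SketchIOPath;

/*
 * count_all_as_strategy = false models this version of the patch: only
 * reuse counts as a strategy alloc, everything else as a shared alloc.
 * count_all_as_strategy = true models the proposed alternative.
 */
SketchIOPath
sketch_alloc_io_path(SketchAllocScenario scenario, bool count_all_as_strategy)
{
	if (count_all_as_strategy)
		return SKETCH_IOPATH_STRATEGY;

	return scenario == SKETCH_ALLOC_RING_REUSE ?
		SKETCH_IOPATH_STRATEGY : SKETCH_IOPATH_SHARED;
}
```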

Also, note that, as stated above, there will never be any strategy
fsyncs (that is, IOPATH_STRATEGY + IOOP_FSYNC will always be 0) because
the code path starting with register_dirty_segment() which ends with a
regular backend doing its own fsync when pendingOps is full does not
know what the current IOPATH is and checkpointer does not use a
BufferAccessStrategy.

> Subject: [PATCH v24 2/3] Track IO operation statistics
>
> Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> location or type of IO done by a backend. For example, the checkpointer
> may write a shared buffer out. This would be counted as an IOOp write on
> an IOPath IOPATH_SHARED by BackendType "checkpointer".

I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
the file path. What about 'origin'?

Enough has changed in this version of the patch that I decided to defer
renaming until some of the other issues are resolved.

> Each IOOp (alloc, fsync, extend, write) is counted per IOPath
> (direct, local, shared, or strategy) through a call to
> pgstat_count_io_op().

It seems we should track reads too - it's quite interesting to know whether
reads happened because of a strategy, for example. You do reference reads in a
later part of the commit message even :)

I've added reads to what is counted.

> The primary concern of these statistics is IO operations on data blocks
> during the course of normal database operations. IO done by, for
> example, the archiver or syslogger is not counted in these statistics.

We could extend this at a later stage, if we really want to. But I'm not sure
it's interesting or fully possible. E.g. the archiver's writes are largely not
done by the archiver itself, but by a command (or module these days) it shells
out to.

I've added note of this to some of the comments and the commit message.
I also omit rows for these BackendTypes from the view. See my later
comment in this email for more detail on that.

> Note that this commit does not add code to increment IOPATH_DIRECT. A
> future patch adding wrappers for smgrwrite(), smgrextend(), and
> smgrimmedsync() would provide a good location to call
> pgstat_count_io_op() for unbuffered IO and avoid regressions for future
> users of these functions.

Hm. Perhaps we should defer introducing IOPATH_DIRECT for now then?

It's gone.

> Stats on IOOps for all IOPaths for a backend are initially accumulated
> locally.
>
> Later they are flushed to shared memory and accumulated with those from
> all other backends, exited and live.

Perhaps mention here that this later could be extended to make per-connection
stats visible?

Mentioned.

> Some BackendTypes will not execute pgstat_report_stat() and thus must
> explicitly call pgstat_flush_io_ops() in order to flush their backend
> local IO operation statistics to shared memory.

Maybe add "flush ... during ongoing operation" or such? Because they'd all
flush at commit, IIRC.

Added.
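
For readers new to the accumulate-then-flush pattern in the commit message, a rough single-process sketch follows. The names, the boolean stand-in for the lock, and the return convention (true meaning that some stats could not be flushed) are assumptions for illustration, not the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_IOOPS 5

/* Stand-in for the shared-memory counters and their lock. */
typedef struct SketchSharedIOStats
{
	bool		locked;
	uint64_t	counts[SKETCH_NUM_IOOPS];
} SketchSharedIOStats;

/* Backend-local pending counters; no lock is needed to update these. */
static uint64_t sketch_pending[SKETCH_NUM_IOOPS];
static bool sketch_have_pending = false;

/* Cheap hot-path call: bump a local counter only. */
void
sketch_count_io_op(int io_op)
{
	assert(io_op >= 0 && io_op < SKETCH_NUM_IOOPS);
	sketch_pending[io_op]++;
	sketch_have_pending = true;
}

/*
 * Add the pending counts to the shared counters. Returns true if the
 * stats could not be flushed and remain pending.
 */
bool
sketch_flush_io_ops(SketchSharedIOStats *shared, bool nowait)
{
	if (!sketch_have_pending)
		return false;

	if (shared->locked)
	{
		/*
		 * Real code would block when nowait is false; this
		 * single-threaded sketch only models the nowait case.
		 */
		assert(nowait);
		return true;
	}

	shared->locked = true;
	for (int i = 0; i < SKETCH_NUM_IOOPS; i++)
		shared->counts[i] += sketch_pending[i];
	shared->locked = false;

	memset(sketch_pending, 0, sizeof(sketch_pending));
	sketch_have_pending = false;
	return false;
}
```

A long-running process that never reaches the usual reporting point would have to call the flush function itself during ongoing operation, which is the case the commit message singles out.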

> diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
> index 088556ab54..963b05321e 100644
> --- a/src/backend/bootstrap/bootstrap.c
> +++ b/src/backend/bootstrap/bootstrap.c
> @@ -33,6 +33,7 @@
> #include "miscadmin.h"
> #include "nodes/makefuncs.h"
> #include "pg_getopt.h"
> +#include "pgstat.h"
> #include "storage/bufmgr.h"
> #include "storage/bufpage.h"
> #include "storage/condition_variable.h"

Hm?

Removed

> diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
> index e926f8c27c..beb46dcb55 100644
> --- a/src/backend/postmaster/walwriter.c
> +++ b/src/backend/postmaster/walwriter.c
> @@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
> }
>
> if (ShutdownRequestPending)
> - {
> - /*
> - * Force reporting remaining WAL statistics at process exit.
> - *
> - * Since pgstat_report_wal is invoked with 'force' is false in main
> - * loop to avoid overloading the cumulative stats system, there may
> - * exist unreported stats counters for the WAL writer.
> - */
> - pgstat_report_wal(true);
> -
> proc_exit(0);
> - }
>
> /* Perform logging of memory contexts of this process */
> if (LogMemoryContextPending)

Let's do this in a separate commit and get it out of the way...

I've put it in a separate commit.

> @@ -682,16 +694,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
> * if this buffer should be written and re-used.
> */
> bool
> -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
> +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
> {
> - /* We only do this in bulkread mode */
> +
> + /*
> + * We only reject reusing and writing out the strategy buffer this in
> + * bulkread mode.
> + */
> if (strategy->btype != BAS_BULKREAD)
> + {
> + /*
> + * If the buffer was from the ring and we are not rejecting it, consider it
> + * a write of a strategy buffer.
> + */
> + if (strategy->current_was_in_ring)
> + *write_from_ring = true;

Hm. This is set even if the buffer wasn't dirty? I guess we don't expect
StrategyRejectBuffer() to be called for clean buffers...

Yes, we do not expect it to be called for clean buffers.

I've added a comment about this assumption.

> /*
> diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
> index d9275611f0..d3963f59d0 100644
> --- a/src/backend/utils/activity/pgstat_database.c
> +++ b/src/backend/utils/activity/pgstat_database.c
> @@ -47,7 +47,8 @@ pgstat_drop_database(Oid databaseid)
> }
>
> /*
> - * Called from autovacuum.c to report startup of an autovacuum process.
> + * Called from autovacuum.c to report startup of an autovacuum process and
> + * flush IO Operation statistics.
> * We are called before InitPostgres is done, so can't rely on MyDatabaseId;
> * the db OID must be passed in, instead.
> */
> @@ -72,6 +73,11 @@ pgstat_report_autovac(Oid dboid)
> dbentry->stats.last_autovac_time = GetCurrentTimestamp();
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operation statistics
> + */
> + pgstat_flush_io_ops(false);
> }

Hm. I suspect this will always be zero - at this point we haven't connected to
a database, so there really can't have been much, if any, IO. I think I
suggested doing something here, but on a second look it really doesn't make
much sense.

Note that that's different from doing something in
pgstat_report_(vacuum|analyze) - clearly we've done something at that point.

I've removed this.

> /*
> - * Report that the table was just vacuumed.
> + * Report that the table was just vacuumed and flush IO Operation statistics.
> */
> void
> pgstat_report_vacuum(Oid tableoid, bool shared,
> @@ -257,10 +257,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
> }
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operations statistics
> + */
> + pgstat_flush_io_ops(false);
> }
>
> /*
> - * Report that the table was just analyzed.
> + * Report that the table was just analyzed and flush IO Operation statistics.
> *
> * Caller must provide new live- and dead-tuples estimates, as well as a
> * flag indicating whether to reset the changes_since_analyze counter.
> @@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
> }
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operations statistics
> + */
> + pgstat_flush_io_ops(false);
> }

Think it'd be good to amend these comments to say that otherwise stats would
only get flushed after a multi-relation autovacuum cycle is done / a
VACUUM/ANALYZE command processed all tables. Perhaps add the comment to one
of the two functions, and just reference it in the other place?

Done

> --- a/src/include/utils/backend_status.h
> +++ b/src/include/utils/backend_status.h
> @@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
> int buflen);
> extern uint64 pgstat_get_my_query_id(void);
>
> +/* Utility functions */
> +
> +/*
> + * When maintaining an array of information about all valid BackendTypes, in
> + * order to avoid wasting the 0th spot, use this helper to convert a valid
> + * BackendType to a valid location in the array (given that no spot is
> + * maintained for B_INVALID BackendType).
> + */
> +static inline int backend_type_get_idx(BackendType backend_type)
> +{
> + /*
> + * backend_type must be one of the valid backend types. If caller is
> + * maintaining backend information in an array that includes B_INVALID,
> + * this function is unnecessary.
> + */
> + Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
> + return backend_type - 1;
> +}

In function definitions (vs declarations) we put the 'static inline int' in a
separate line from the rest of the function signature.

Fixed.

> +/*
> + * When using a value from an array of information about all valid
> + * BackendTypes, add 1 to the index before using it as a BackendType to adjust
> + * for not maintaining a spot for B_INVALID BackendType.
> + */
> +static inline BackendType idx_get_backend_type(int idx)
> +{
> + int backend_type = idx + 1;
> + /*
> + * If the array includes a spot for B_INVALID BackendType this function is
> + * not required.

The comments around this seem a bit over the top, but I also don't mind them
much.

Feel free to change them to something shorter. I couldn't think of something I liked.

> Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
> writes, fsyncs, and extends) done through each IOPath (e.g. shared
> buffers, local buffers, unbuffered IO) by each type of backend.

Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
latter, except that we already have a bunch of views with that prefix.

I have thoughts on this but thought it best deferred until after the _data decision.

> Some of these should always be zero. For example, checkpointer does not
> use a BufferAccessStrategy (currently), so the "strategy" IOPath for
> checkpointer will be 0 for all IOOps.

What do you think about returning NULL for the values that we expect to never
be non-zero? Perhaps with an assert against non-zero values? Seems like it
might be helpful for understanding the view.

Yes, I like this idea.

Beyond just setting individual cells to NULL, if an entire row would be
NULL, I have now dropped it from the view.

So far, I have omitted from the view all rows for BackendTypes
B_ARCHIVER, B_LOGGER, and B_STARTUP.

Should I also omit rows for B_WAL_RECEIVER and B_WAL_WRITER for now?

I have also omitted rows for IOPATH_STRATEGY for all BackendTypes
*except* B_AUTOVAC_WORKER, B_BACKEND, B_STANDALONE_BACKEND, and
B_BG_WORKER.

Do these seem correct?

I think there are some BackendTypes which will never do IO Operations on
IOPATH_LOCAL but I am not sure which. Do you know which?

As for individual cells which should be NULL, so far what I have is:
- IOPATH_LOCAL + IOOP_FSYNC
I am sure there are others as well. Can you think of any?
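
The omission rules listed so far can be summarized as two predicates. This is only a sketch of the rules as stated in this email: the enums are abbreviated stand-ins, the open questions (B_WAL_RECEIVER, B_WAL_WRITER, IOPATH_LOCAL per BackendType) are deliberately left out, and IOPATH_STRATEGY + IOOP_FSYNC is treated as a zero-valued cell rather than NULL because only IOPATH_LOCAL + IOOP_FSYNC is named as NULL above.

```c
#include <assert.h>
#include <stdbool.h>

/* Abbreviated stand-ins for the real enums; values are illustrative. */
typedef enum SketchBackendType
{
	SK_B_ARCHIVER,
	SK_B_AUTOVAC_WORKER,
	SK_B_BACKEND,
	SK_B_BG_WORKER,
	SK_B_CHECKPOINTER,
	SK_B_LOGGER,
	SK_B_STANDALONE_BACKEND,
	SK_B_STARTUP
} SketchBackendType;

typedef enum SketchIOPath
{
	SK_IOPATH_LOCAL,
	SK_IOPATH_SHARED,
	SK_IOPATH_STRATEGY
} SketchIOPath;

typedef enum SketchIOOp
{
	SK_IOOP_ALLOC,
	SK_IOOP_EXTEND,
	SK_IOOP_FSYNC,
	SK_IOOP_READ,
	SK_IOOP_WRITE
} SketchIOOp;

/* Should the view contain a row for this BackendType and IOPath? */
bool
sketch_row_shown(SketchBackendType bt, SketchIOPath path)
{
	/* These BackendTypes are omitted from the view entirely. */
	if (bt == SK_B_ARCHIVER || bt == SK_B_LOGGER || bt == SK_B_STARTUP)
		return false;

	/* Only these BackendTypes get an IOPATH_STRATEGY row. */
	if (path == SK_IOPATH_STRATEGY)
		return bt == SK_B_AUTOVAC_WORKER ||
			bt == SK_B_BACKEND ||
			bt == SK_B_STANDALONE_BACKEND ||
			bt == SK_B_BG_WORKER;

	return true;
}

/* Within a shown row, should this cell be NULL instead of a count? */
bool
sketch_cell_is_null(SketchIOPath path, SketchIOOp op)
{
	return path == SK_IOPATH_LOCAL && op == SK_IOOP_FSYNC;
}
```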

> +/*
> +* When adding a new column to the pg_stat_io view, add a new enum
> +* value here above IO_NUM_COLUMNS.
> +*/
> +enum
> +{
> + IO_COLUMN_BACKEND_TYPE,
> + IO_COLUMN_IO_PATH,
> + IO_COLUMN_ALLOCS,
> + IO_COLUMN_EXTENDS,
> + IO_COLUMN_FSYNCS,
> + IO_COLUMN_WRITES,
> + IO_COLUMN_RESET_TIME,
> + IO_NUM_COLUMNS,
> +};

We typedef pretty much every enum so the enum can be referenced without the
'enum' prefix. I'd do that here, even if we don't need it.

So, I left it anonymous because I didn't want it being used as a type
or referenced anywhere else.

I am interested to hear more about your SQL enums idea from upthread.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

July 15, 2022, 01:44:48

In addition to adding several new tests, the attached version 26 fixes a
major bug in constructing the view.

The only valid combination of IOPATH/IOOP that is not tested now is
IOPATH_STRATEGY + IOOP_WRITE. In most of my regress runs, the
checkpointer wrote out the dirty strategy buffer before VACUUM got
around to reusing it and writing it out.

I've also changed the BACKEND_NUM_TYPES definition. Now arrays will have
that dead spot for B_INVALID, but I feel like it is much easier to
understand without trying to skip that spot and use those special helper
functions.
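
The trade-off between the two array layouts can be seen in a small sketch. The enum is a shortened stand-in and every name here is hypothetical; the point is only that the direct layout spends one unused slot to avoid the index conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Shortened stand-in for BackendType; 0 is the invalid value. */
typedef enum SketchBackendType
{
	SK_B_INVALID = 0,
	SK_B_BACKEND,
	SK_B_CHECKPOINTER,
	SK_B_BG_WRITER
} SketchBackendType;

#define SK_NUM_VALID_TYPES 3	/* old layout: no slot for SK_B_INVALID */
#define SK_BACKEND_NUM_TYPES 4	/* new layout: includes the dead slot */

/* Old layout: every access must convert the type to an array index. */
static inline int
sk_backend_type_get_idx(SketchBackendType bt)
{
	assert(bt > SK_B_INVALID && bt <= SK_NUM_VALID_TYPES);
	return bt - 1;
}

void
sk_count_compact(uint64_t counts[SK_NUM_VALID_TYPES], SketchBackendType bt)
{
	counts[sk_backend_type_get_idx(bt)]++;
}

/* New layout: the type is the index; slot 0 simply stays unused. */
void
sk_count_direct(uint64_t counts[SK_BACKEND_NUM_TYPES], SketchBackendType bt)
{
	assert(bt > SK_B_INVALID && bt < SK_BACKEND_NUM_TYPES);
	counts[bt]++;
}
```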

I also stopped adding rows to the view for WAL_RECEIVER and WAL_WRITER
and, for IOPATH_LOCAL, for all BackendTypes except B_BACKEND and
WAL_SENDER.

On Tue, Jul 12, 2022 at 1:18 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-07-11 22:22:28 -0400, Melanie Plageman wrote:
> Yes, per an off list suggestion by you, I have changed the tests to use a
> sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
> the missing calls to count IO Operations for IOPATH_LOCAL and
> IOPATH_STRATEGY.
>
> I struggled to come up with a way to test writes for a particular
> type of backend are counted correctly since a dirty buffer could be
> written out by another type of backend before the target BackendType has
> a chance to write it out.

I guess temp file writes would be reliably done by one backend... Don't have a
good idea otherwise.

This was mainly an issue for IOPATH_STRATEGY writes as I mentioned. I
still have not solved this.

> I'm not sure how to cause a strategy "extend" for testing.

COPY into a table should work. But might be unattractive due to the size of
the COPY ringbuffer.

Did it with a CTAS as Horiguchi-san suggested.

> > Would be nice to have something testing that the ringbuffer stats stuff
> > does something sensible - that feels not entirely trivial.
> >
> >
> I've added a test to test that reused strategy buffers are counted as
> allocs. I would like to add a test which checks that if a buffer in the
> ring is pinned and thus not reused, that it is not counted as a strategy
> alloc, but I found it challenging without a way to pause vacuuming, pin
> a buffer, then resume vacuuming.

Yea, that's probably too hard to make reliable to be worth it.

Yes, I have skipped this.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

July 15, 2022, 18:59:41

I am consolidating the various naming points from this thread into one
email:

From Horiguchi-san:

> A bit different thing, but I felt a little uneasy about some uses of
> "pgstat_io_ops". IOOp looks like a neighbouring word of IOPath. On the
> other hand, actually iopath is used as an attribute of io_ops in many
> places. Couldn't we be more consistent about the relationship between
> the names?
>
> IOOp -> PgStat_IOOpType
> IOPath -> PgStat_IOPath
> PgStat_IOOpCounters -> PgStat_IOCounters
> PgStat_IOPathOps -> PgStat_IO
> pgstat_count_io_op -> pgstat_count_io

So, because of the way the data structures contain arrays of each other
the naming was meant to specify all the information contained in the
data structure:

PgStat_IOOpCounters are all IOOp (I could see removing the word
"counters" from the name for more consistency)

PgStat_IOPathOps are all IOOp for all IOPath

PgStat_BackendIOPathOps are all IOOp for all IOPath for all BackendType

The downside of this naming is that, when choosing a local variable name
for all of the IOOp for all IOPath for a single BackendType,
"backend_io_path_ops" seems accurate but is actually confusing if the
type name for all IOOp for all IOPath for all BackendType is
PgStat_BackendIOPathOps.

I would be open to changing PgStat_BackendIOPathOps to PgStat_IO, but I
don't see how I could omit Path or Op from PgStat_IOPathOps without
making its meaning unclear.

I'm not sure about the idea of prefixing the IOOp and IOPath enums with
Pg_Stat. I could see them being used outside of statistics (though they
are defined in pgstat.h) and could see myself using them in, for
example, calculations for the prefetcher.

From Andres:

Quoting me (Melanie):

> > Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> > location or type of IO done by a backend. For example, the checkpointer
> > may write a shared buffer out. This would be counted as an IOOp write on
> > an IOPath IOPATH_SHARED by BackendType "checkpointer".

> I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
> the file path. What about 'origin'?

I can see the point about IOPATH.
I'm not wild about origin mostly because of the number of O's given that
IO Operation already has two O's. It gets kind of hard to read when
using Pascal Case: IOOrigin and IOOp.
Also, it doesn't totally make sense for alloc. I could be convinced,
though.

IOSOURCE doesn't have the O problem but does still not make sense for
alloc. I also thought of IOSITE and IOVENUE.

> Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
> latter, except that we already have a bunch of views with that prefix.

As far as pg_stat_io vs pg_statio, they are the only stats views which
don't have an underscore between stat and the rest of the view name, so
perhaps we should move away from statio to stat_io going forward anyway.
I am imagining adding to them with other iostat type metrics once direct
IO is introduced, so they may well be changing soon anyway.

- Melanie

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

July 15, 2022, 21:52:45

Hi,

On 2022-07-15 11:59:41 -0400, Melanie Plageman wrote:
> I'm not sure about the idea of prefixing the IOOp and IOPath enums with
> Pg_Stat. I could see them being used outside of statistics (though they
> are defined in pgstat.h)

+1


> From Andres:
> 
> Quoting me (Melanie):
> > > Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> > > location or type of IO done by a backend. For example, the checkpointer
> > > may write a shared buffer out. This would be counted as an IOOp write on
> > > an IOPath IOPATH_SHARED by BackendType "checkpointer".
> 
> > I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
> > the file path. What about 'origin'?
> 
> I can see the point about IOPATH.
> I'm not wild about origin mostly because of the number of O's given that
> IO Operation already has two O's. It gets kind of hard to read when
> using Pascal Case: IOOrigin and IOOp.
> Also, it doesn't totally make sense for alloc. I could be convinced,
> though.
> 
> IOSOURCE doesn't have the O problem but does still not make sense for
> alloc. I also thought of IOSITE and IOVENUE.

I like "source" - not too bothered by the alloc aspect. I can also see
"context" working.


> > Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
> > latter, except that we already have a bunch of views with that prefix.
> 
> As far as pg_stat_io vs pg_statio, they are the only stats views which
> don't have an underscore between stat and the rest of the view name, so
> perhaps we should move away from statio to stat_io going forward anyway.
> I am imagining adding to them with other iostat type metrics once direct
> IO is introduced, so they may well be changing soon anyway.

I don't think I have strong opinions on this one. I can see arguments for
either naming.

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

July 20, 2022, 19:50:50

Hi,

On 2022-07-14 18:44:48 -0400, Melanie Plageman wrote:
> Subject: [PATCH v26 1/4] Add BackendType for standalone backends
> Subject: [PATCH v26 2/4] Remove unneeded call to pgstat_report_wal()

LGTM.


> Subject: [PATCH v26 3/4] Track IO operation statistics

> @@ -978,8 +979,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>  
>      bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
>  
> +    if (isLocalBuf)
> +        io_path = IOPATH_LOCAL;
> +    else if (strategy != NULL)
> +        io_path = IOPATH_STRATEGY;
> +    else
> +        io_path = IOPATH_SHARED;

Seems a bit ugly to have an if (isLocalBuf) just after an isLocalBuf ?.
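
One possible way to fold the two isLocalBuf tests into a single branch is sketched below. It is not necessarily what was meant by the comment above: the struct, the function, and the use of plain pointers in place of the buffer header lookups are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum SketchIOPath
{
	SK_IOPATH_LOCAL,
	SK_IOPATH_SHARED,
	SK_IOPATH_STRATEGY
} SketchIOPath;

/* Both values that depend on whether the buffer is local. */
typedef struct SketchTarget
{
	const char *block;
	SketchIOPath io_path;
} SketchTarget;

/*
 * Decide the block pointer and the IOPath in one place, so that the
 * local/shared distinction is tested only once.
 */
SketchTarget
sketch_pick_target(bool is_local_buf, bool has_strategy,
				   const char *local_block, const char *shared_block)
{
	SketchTarget t;

	if (is_local_buf)
	{
		t.block = local_block;
		t.io_path = SK_IOPATH_LOCAL;
	}
	else
	{
		t.block = shared_block;
		t.io_path = has_strategy ? SK_IOPATH_STRATEGY : SK_IOPATH_SHARED;
	}

	return t;
}
```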


> +            /*
> +             * When a strategy is in use, reused buffers from the strategy ring will
> +             * be counted as allocations for the purposes of IO Operation statistics
> +             * tracking.
> +             *
> +             * However, even when a strategy is in use, if a new buffer must be
> +             * allocated from shared buffers and added to the ring, this is counted
> +             * as a IOPATH_SHARED allocation.
> +             */

There's a bit too much duplication between the paragraphs...

> @@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
>      /* flush database / relation / function / ... stats */
>      partial_flush |= pgstat_flush_pending_entries(nowait);
>  
> +    /* flush IO Operations stats */
> +    partial_flush |= pgstat_flush_io_ops(nowait);

Could you either add a note to the commit message that the stats file
version needs to be increased, or just include that in the patch?




> @@ -1427,8 +1445,10 @@ pgstat_read_statsfile(void)
>      FILE       *fpin;
>      int32        format_id;
>      bool        found;
> +    PgStat_BackendIOPathOps io_stats;
>      const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
>      PgStat_ShmemControl *shmem = pgStatLocal.shmem;
> +    PgStatShared_BackendIOPathOps *io_stats_shmem = &shmem->io_ops;
>  
>      /* shouldn't be called from postmaster */
>      Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
> @@ -1486,6 +1506,22 @@ pgstat_read_statsfile(void)
>      if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
>          goto error;
>  
> +    /*
> +     * Read IO Operations stats struct
> +     */
> +    if (!read_chunk_s(fpin, &io_stats))
> +        goto error;
> +
> +    io_stats_shmem->stat_reset_timestamp = io_stats.stat_reset_timestamp;
> +
> +    for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> +    {
> +        PgStat_IOPathOps *stats = &io_stats.stats[i];
> +        PgStatShared_IOPathOps *stats_shmem = &io_stats_shmem->stats[i];
> +
> +        memcpy(stats_shmem->data, stats->data, sizeof(stats->data));
> +    }

Why can't the data be read directly into shared memory?


>      /*


> +void
> +pgstat_io_ops_snapshot_cb(void)
> +{
> +    PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
> +    PgStat_BackendIOPathOps *all_backend_stats_snap = &pgStatLocal.snapshot.io_ops;
> +
> +    for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> +    {
> +        PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
> +        PgStat_IOPathOps *stats_snap = &all_backend_stats_snap->stats[i];
> +
> +        LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);

Why acquire the same lock repeatedly for each type, rather than once for
the whole?


> +        /*
> +         * Use the lock in the first BackendType's PgStat_IOPathOps to protect the
> +         * reset timestamp as well.
> +         */
> +        if (i == 0)
> +            all_backend_stats_snap->stat_reset_timestamp = all_backend_stats_shmem->stat_reset_timestamp;

Which also would make this look a bit less awkward.

Starting to look pretty good...

- Andres

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

July 20, 2022, 20:40:40

On Wed, Jul 20, 2022 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-07-14 18:44:48 -0400, Melanie Plageman wrote:

> @@ -1427,8 +1445,10 @@ pgstat_read_statsfile(void)
> FILE *fpin;
> int32 format_id;
> bool found;
> + PgStat_BackendIOPathOps io_stats;
> const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
> PgStat_ShmemControl *shmem = pgStatLocal.shmem;
> + PgStatShared_BackendIOPathOps *io_stats_shmem = &shmem->io_ops;
>
> /* shouldn't be called from postmaster */
> Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
> @@ -1486,6 +1506,22 @@ pgstat_read_statsfile(void)
> if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
> goto error;
>
> + /*
> + * Read IO Operations stats struct
> + */
> + if (!read_chunk_s(fpin, &io_stats))
> + goto error;
> +
> + io_stats_shmem->stat_reset_timestamp = io_stats.stat_reset_timestamp;
> +
> + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> + {
> + PgStat_IOPathOps *stats = &io_stats.stats[i];
> + PgStatShared_IOPathOps *stats_shmem = &io_stats_shmem->stats[i];
> +
> + memcpy(stats_shmem->data, stats->data, sizeof(stats->data));
> + }

> Why can't the data be read directly into shared memory?

It is not the same lock. Each PgStatShared_IOPathOps has a lock so that
they can be accessed individually (per BackendType in
PgStatShared_BackendIOPathOps). It is optimized for the more common
operation of flushing at the expense of the snapshot operation (which
should be less common) and reset operation.

> +void
> +pgstat_io_ops_snapshot_cb(void)
> +{
> + PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
> + PgStat_BackendIOPathOps *all_backend_stats_snap = &pgStatLocal.snapshot.io_ops;
> +
> + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> + {
> + PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
> + PgStat_IOPathOps *stats_snap = &all_backend_stats_snap->stats[i];
> +
> + LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);

> Why acquire the same lock repeatedly for each type, rather than once for
> the whole?

This is also because of having a LWLock in each PgStatShared_IOPathOps.
Because I don't want a lock in the backend local stats, I have two data
structures PgStatShared_IOPathOps and PgStat_IOPathOps. I thought it was
odd to write out the lock to the file, so when persisting the stats, I
write out the relevant data only and when reading it back in to shared
memory, I read in the data member of PgStatShared_IOPathOps.
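The layout described in this reply can be sketched in a few standalone lines. This is only an illustration, not the patch's code: every `Demo*`/`demo_*` name is hypothetical, the "lock" is a placeholder flag rather than an LWLock, and a byte buffer stands in for the stats file.

```c
#include <assert.h>
#include <string.h>

#define DEMO_NUM_BACKEND_TYPES 3    /* stands in for BACKEND_NUM_TYPES */
#define DEMO_NUM_COUNTERS 5         /* alloc, extend, fsync, read, write */

/* Placeholder for an LWLock; it only records whether it is held. */
typedef struct DemoLock
{
    int         held;
} DemoLock;

/* Lock-free layout: backend-local stats, snapshots and the persisted form. */
typedef struct DemoIOPathOps
{
    long        data[DEMO_NUM_COUNTERS];
} DemoIOPathOps;

/* Shared-memory layout: the same counters plus one lock per backend type. */
typedef struct DemoSharedIOPathOps
{
    DemoLock    lock;
    long        data[DEMO_NUM_COUNTERS];
} DemoSharedIOPathOps;

/* Copy every entry into the snapshot, taking each entry's own lock in turn. */
static void
demo_snapshot(DemoSharedIOPathOps *shared, DemoIOPathOps *snap)
{
    for (int i = 0; i < DEMO_NUM_BACKEND_TYPES; i++)
    {
        assert(!shared[i].lock.held);
        shared[i].lock.held = 1;
        memcpy(snap[i].data, shared[i].data, sizeof(snap[i].data));
        shared[i].lock.held = 0;
    }
}

/* Persist only the counters; the lock is never written out. */
static size_t
demo_serialize(const DemoSharedIOPathOps *shared, unsigned char *buf)
{
    size_t      off = 0;

    for (int i = 0; i < DEMO_NUM_BACKEND_TYPES; i++)
    {
        memcpy(buf + off, shared[i].data, sizeof(shared[i].data));
        off += sizeof(shared[i].data);
    }
    return off;
}

/* Read the counters straight into the data member of each shared entry. */
static void
demo_deserialize(DemoSharedIOPathOps *shared, const unsigned char *buf)
{
    size_t      off = 0;

    for (int i = 0; i < DEMO_NUM_BACKEND_TYPES; i++)
    {
        memcpy(shared[i].data, buf + off, sizeof(shared[i].data));
        off += sizeof(shared[i].data);
    }
}
```

Because each entry carries its own lock, a flush for one backend type only has to take that entry's lock; the snapshot pays for this by locking once per entry.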

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

August 12, 2022, 02:53:09

I've attached v27 of the patch.

I've renamed IOPATH to IOCONTEXT. I also have added assertions to
confirm that unexpected statistics are not being accumulated.

There are also assorted other cleanups and changes.

It would be good to confirm that the rows being skipped and cells that
are NULL in the view are the correct ones.
The startup process will never use a BufferAccessStrategy, right?

On Wed, Jul 20, 2022 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:

> Subject: [PATCH v26 3/4] Track IO operation statistics

> @@ -978,8 +979,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>
> bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
>
> + if (isLocalBuf)
> + io_path = IOPATH_LOCAL;
> + else if (strategy != NULL)
> + io_path = IOPATH_STRATEGY;
> + else
> + io_path = IOPATH_SHARED;

> Seems a bit ugly to have an if (isLocalBuf) just after an isLocalBuf ?.

Changed this.

> + /*
> + * When a strategy is in use, reused buffers from the strategy ring will
> + * be counted as allocations for the purposes of IO Operation statistics
> + * tracking.
> + *
> + * However, even when a strategy is in use, if a new buffer must be
> + * allocated from shared buffers and added to the ring, this is counted
> + * as a IOPATH_SHARED allocation.
> + */

> There's a bit too much duplication between the paragraphs...

I actually think the two paragraphs are making separate points. I've
edited this, so see if you like it better now.

> @@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
> /* flush database / relation / function / ... stats */
> partial_flush |= pgstat_flush_pending_entries(nowait);
>
> + /* flush IO Operations stats */
> + partial_flush |= pgstat_flush_io_ops(nowait);

> Could you either add a note to the commit message that the stats file
> version needs to be increased, or just include that in the patch.

Bumped the stats file version in attached patchset.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

August 22, 2022, 20:15:18

v28 attached.

I've added the new structs I added to typedefs.list.

I've split the commit which adds all of the logic to track
IO operation statistics into two commits -- one which includes all of
the code to count IOOps for IOContexts locally in a backend and a second
which includes all of the code to accumulate and manage these with the
cumulative stats system.

A few notes about the commit which adds local IO Operation stats:

- There is a comment above pgstat_io_op_stats_collected() which mentions
the cumulative stats system even though this commit doesn't engage the
cumulative stats system. I wasn't sure if it was more or less
confusing to have two different versions of this comment.

- should pgstat_count_io_op() take BackendType as a parameter instead of
using MyBackendType internally?

- pgstat_count_io_op() Assert()s that the passed-in IOOp and IOContext
are valid for this BackendType, but it doesn't check that all of the
pending stats which should be zero are zero. I thought this was okay
because if I did add that zero-check, it would be added to
pgstat_count_ioop() as well, and we already Assert() there that we can
count the op. Thus, it doesn't seem like checking that the stats are
zero would add any additional regression protection.

- I've kept pgstat_io_context_desc() and pgstat_io_op_desc() in the
commit which adds those types (the local stats commit), however they
are not used in that commit. I wasn't sure if I should keep them in
that commit or move them to the first commit using them (the commit
adding the new view).

Notes on the commit which accumulates IO Operation stats in shared
memory:

- I've extended the usage of the Assert()s that IO Operation stats that
should be zero are. Previously we only checked the stats validity when
querying the view. Now we check it when flushing pending stats and
when reading the stats file into shared memory.

Note that the three locations with these validity checks (when
flushing pending stats, when reading stats file into shared memory,
and when querying the view) have similar looking code to loop through
and validate the stats. However, the actual action they perform if the
stats are valid is different for each site (adding counters together,
doing a read, setting nulls in a tuple column to true). Also, some of
these instances have other code interspersed in the loops which would
require additional looping if separated from this logic. So it was
difficult to see a way of combining these into a single helper
function.

- I've left pgstat_fetch_backend_io_context_ops() in the shared stats
commit, however it is not used until the commit which adds the view in
pg_stat_get_io(). I wasn't sure which way seemed better.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

August 23, 2022, 06:31:22

Hi,

On 2022-08-22 13:15:18 -0400, Melanie Plageman wrote:
> v28 attached.

Pushed 0001, 0002. Thanks!

- Andres

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

August 25, 2022, 22:15:27

Hi,

On 2022-08-22 13:15:18 -0400, Melanie Plageman wrote:
> v28 attached.
> 
> I've added the new structs I added to typedefs.list.
> 
> I've split the commit which adds all of the logic to track
> IO operation statistics into two commits -- one which includes all of
> the code to count IOOps for IOContexts locally in a backend and a second
> which includes all of the code to accumulate and manage these with the
> cumulative stats system.

Thanks!


> A few notes about the commit which adds local IO Operation stats:
> 
> - There is a comment above pgstat_io_op_stats_collected() which mentions
> the cumulative stats system even though this commit doesn't engage the
> cumulative stats system. I wasn't sure if it was more or less
> confusing to have two different versions of this comment.

Not worth being worried about...


> - should pgstat_count_io_op() take BackendType as a parameter instead of
> using MyBackendType internally?

I don't foresee a case where a different value would be passed in.


> - pgstat_count_io_op() Assert()s that the passed-in IOOp and IOContext
> are valid for this BackendType, but it doesn't check that all of the
> pending stats which should be zero are zero. I thought this was okay
> because if I did add that zero-check, it would be added to
> pgstat_count_ioop() as well, and we already Assert() there that we can
> count the op. Thus, it doesn't seem like checking that the stats are
> zero would add any additional regression protection.

It's probably ok.


> - I've kept pgstat_io_context_desc() and pgstat_io_op_desc() in the
> commit which adds those types (the local stats commit), however they
> are not used in that commit. I wasn't sure if I should keep them in
> that commit or move them to the first commit using them (the commit
> adding the new view).

> - I've left pgstat_fetch_backend_io_context_ops() in the shared stats
> commit, however it is not used until the commit which adds the view in
> pg_stat_get_io(). I wasn't sure which way seemed better.


Think that's fine.


> Notes on the commit which accumulates IO Operation stats in shared
> memory:
> 
> - I've extended the usage of the Assert()s that IO Operation stats that
> should be zero are. Previously we only checked the stats validity when
> querying the view. Now we check it when flushing pending stats and
> when reading the stats file into shared memory.

> Note that the three locations with these validity checks (when
> flushing pending stats, when reading stats file into shared memory,
> and when querying the view) have similar looking code to loop through
> and validate the stats. However, the actual action they perform if the
> stats are valid is different for each site (adding counters together,
> doing a read, setting nulls in a tuple column to true). Also, some of
> these instances have other code interspersed in the loops which would
> require additional looping if separated from this logic. So it was
> difficult to see a way of combining these into a single helper
> function.

All of them seem to repeat something like

> +                if (!pgstat_bktype_io_op_valid(bktype, io_op) ||
> +                    !pgstat_io_context_io_op_valid(io_context, io_op))

perhaps those could be combined? Afaics nothing uses pgstat_bktype_io_op_valid
separately.
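For illustration, a minimal sketch of what such a combined check might look like. The two rules are the ones visible in the quoted patch (bgwriter and checkpointer do not read; local buffers are never fsync'ed); the `Demo*` types and `demo_*` function names are hypothetical, and the reduced enums are assumptions made to keep the example standalone.

```c
#include <stdbool.h>

typedef enum DemoBackendType
{
    DEMO_B_BACKEND,
    DEMO_B_BG_WRITER,
    DEMO_B_CHECKPOINTER
} DemoBackendType;

typedef enum DemoIOContext
{
    DEMO_IOCONTEXT_LOCAL,
    DEMO_IOCONTEXT_SHARED,
    DEMO_IOCONTEXT_STRATEGY
} DemoIOContext;

typedef enum DemoIOOp
{
    DEMO_IOOP_FSYNC,
    DEMO_IOOP_READ,
    DEMO_IOOP_WRITE
} DemoIOOp;

/* bgwriter and checkpointer are not expected to read buffers. */
static bool
demo_bktype_io_op_valid(DemoBackendType bktype, DemoIOOp io_op)
{
    if ((bktype == DEMO_B_BG_WRITER || bktype == DEMO_B_CHECKPOINTER) &&
        io_op == DEMO_IOOP_READ)
        return false;
    return true;
}

/* Local (temporary table) buffers are never fsync'ed. */
static bool
demo_io_context_io_op_valid(DemoIOContext io_context, DemoIOOp io_op)
{
    if (io_context == DEMO_IOCONTEXT_LOCAL && io_op == DEMO_IOOP_FSYNC)
        return false;
    return true;
}

/* Single entry point, so call sites need not repeat the "a || b" pair. */
static bool
demo_io_op_valid(DemoBackendType bktype, DemoIOContext io_context,
                 DemoIOOp io_op)
{
    return demo_bktype_io_op_valid(bktype, io_op) &&
        demo_io_context_io_op_valid(io_context, io_op);
}
```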


> Subject: [PATCH v28 3/5] Track IO operation statistics locally
> 
> Introduce "IOOp", an IO operation done by a backend, and "IOContext",
> the IO location source or target or IO type done by a backend. For
> example, the checkpointer may write a shared buffer out. This would be
> counted as an IOOp "write" on an IOContext IOCONTEXT_SHARED by
> BackendType "checkpointer".
> 
> Each IOOp (alloc, extend, fsync, read, write) is counted per IOContext
> (local, shared, or strategy) through a call to pgstat_count_io_op().
> 
> The primary concern of these statistics is IO operations on data blocks
> during the course of normal database operations. IO done by, for
> example, the archiver or syslogger is not counted in these statistics.

s/is/are/?


> Stats on IOOps for all IOContexts for a backend are counted in a
> backend's local memory. This commit does not expose any functions for
> aggregating or viewing these stats.

s/This commit does not/A subsequent commit will expose/...


> @@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>      BufferDesc *bufHdr;
>      Block        bufBlock;
>      bool        found;
> +    IOContext    io_context;
>      bool        isExtend;
>      bool        isLocalBuf = SmgrIsTemp(smgr);
>  
> @@ -986,10 +987,25 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>       */
>      Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));    /* spinlock not needed */
>  
> -    bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
> +    if (isLocalBuf)
> +    {
> +        bufBlock = LocalBufHdrGetBlock(bufHdr);
> +        io_context = IOCONTEXT_LOCAL;
> +    }
> +    else
> +    {
> +        bufBlock = BufHdrGetBlock(bufHdr);
> +
> +        if (strategy != NULL)
> +            io_context = IOCONTEXT_STRATEGY;
> +        else
> +            io_context = IOCONTEXT_SHARED;
> +    }

There's an isLocalBuf block earlier on, couldn't we just determine the context
there? I guess there's a branch here already, so it's probably fine as is.
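The decision in the hunk above boils down to a three-way choice. A standalone sketch follows; `demo_io_context_for()` and the `Demo*` enum are hypothetical names, and the strategy is reduced to an opaque pointer for the purpose of the example.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum DemoIOContext
{
    DEMO_IOCONTEXT_LOCAL,
    DEMO_IOCONTEXT_SHARED,
    DEMO_IOCONTEXT_STRATEGY
} DemoIOContext;

/*
 * Local buffers always map to the local context. Otherwise the presence of
 * a buffer access strategy decides between the strategy and shared contexts.
 */
static DemoIOContext
demo_io_context_for(bool is_local_buf, const void *strategy)
{
    if (is_local_buf)
        return DEMO_IOCONTEXT_LOCAL;

    return strategy != NULL ? DEMO_IOCONTEXT_STRATEGY : DEMO_IOCONTEXT_SHARED;
}
```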


>      if (isExtend)
>      {
> +
> +        pgstat_count_io_op(IOOP_EXTEND, io_context);

Spurious newline.


> @@ -2820,9 +2857,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
>   *
>   * If the caller has an smgr reference for the buffer's relation, pass it
>   * as the second parameter.  If not, pass NULL.
> + *
> + * IOContext will always be IOCONTEXT_SHARED except when a buffer access strategy is
> + * used and the buffer being flushed is a buffer from the strategy ring.
>   */
>  static void
> -FlushBuffer(BufferDesc *buf, SMgrRelation reln)
> +FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)

Too long line?

But also, why document the possible values here? Seems likely to get out of
date at some point, and it doesn't seem important to know?


> @@ -3549,6 +3591,8 @@ FlushRelationBuffers(Relation rel)
>                            localpage,
>                            false);
>  
> +                pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
> +
>                  buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
>                  pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
>  

Probably not worth doing, but these made me wonder whether there should be a
function for counting N operations at once.
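A sketch of what such a bulk-counting function could look like, purely as an assumption: nothing like `demo_count_io_op_n()` exists in the quoted patch, and the counter array below is a simplified stand-in for the backend-local pending stats.

```c
#include <assert.h>
#include <stdint.h>

typedef enum DemoIOOp
{
    DEMO_IOOP_READ,
    DEMO_IOOP_WRITE,
    DEMO_IOOP_NUM_TYPES
} DemoIOOp;

/* Simplified stand-in for the backend-local pending counters. */
static uint64_t demo_pending[DEMO_IOOP_NUM_TYPES];

/* Hypothetical bulk variant: record n occurrences of io_op in one call. */
static void
demo_count_io_op_n(DemoIOOp io_op, uint32_t n)
{
    assert((int) io_op >= 0 && io_op < DEMO_IOOP_NUM_TYPES);
    demo_pending[io_op] += n;
}

/* The single-operation form is then just the n = 1 case. */
static void
demo_count_io_op(DemoIOOp io_op)
{
    demo_count_io_op_n(io_op, 1);
}
```

A loop that writes out many buffers could then count once after the loop instead of once per iteration.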



> @@ -212,8 +215,23 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
>      if (strategy != NULL)
>      {
>          buf = GetBufferFromRing(strategy, buf_state);
> -        if (buf != NULL)
> +        *from_ring = buf != NULL;
> +        if (*from_ring)
> +        {

Don't really like the if (*from_ring) - why not keep it as buf != NULL? Seems
a bit confusing this way, making it less obvious what's being changed.


> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 014f644bf9..a3d76599bf 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -15,6 +15,7 @@
>   */
>  #include "postgres.h"
>  
> +#include "pgstat.h"
>  #include "access/parallel.h"
>  #include "catalog/catalog.h"
>  #include "executor/instrument.h"

Do most other places not put pgstat.h in the alphabetical order of headers?


> @@ -432,6 +432,15 @@ ProcessSyncRequests(void)
>                      total_elapsed += elapsed;
>                      processed++;
>  
> +                    /*
> +                     * Note that if a backend using a BufferAccessStrategy is
> +                     * forced to do its own fsync (as opposed to the
> +                     * checkpointer doing it), it will not be counted as an
> +                     * IOCONTEXT_STRATEGY IOOP_FSYNC and instead will be
> +                     * counted as an IOCONTEXT_SHARED IOOP_FSYNC.
> +                     */
> +                    pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);

Why is this noted here? Perhaps just point to the place where that happens
instead? I think it's also documented in ForwardSyncRequest()? Or just only
mention it there...


> @@ -0,0 +1,191 @@
> +/* -------------------------------------------------------------------------
> + *
> + * pgstat_io_ops.c
> + *      Implementation of IO operation statistics.
> + *
> + * This file contains the implementation of IO operation statistics. It is kept
> + * separate from pgstat.c to enforce the line between the statistics access /
> + * storage implementation and the details about individual types of
> + * statistics.
> + *
> + * Copyright (c) 2001-2022, PostgreSQL Global Development Group

Arguably this would just be 2021-2022


> +void
> +pgstat_count_io_op(IOOp io_op, IOContext io_context)
> +{
> +    PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_context];
> +
> +    Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
> +
> +    switch (io_op)
> +    {
> +        case IOOP_ALLOC:
> +            pending_counters->allocs++;
> +            break;
> +        case IOOP_EXTEND:
> +            pending_counters->extends++;
> +            break;
> +        case IOOP_FSYNC:
> +            pending_counters->fsyncs++;
> +            break;
> +        case IOOP_READ:
> +            pending_counters->reads++;
> +            break;
> +        case IOOP_WRITE:
> +            pending_counters->writes++;
> +            break;
> +    }
> +
> +}

How about replacing the breaks with a return and then erroring out if we reach
the end of the function? You did that below, and I think it makes sense.
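A standalone sketch of the suggested shape: each case returns, and falling out of the switch is treated as an error. The names are hypothetical, and `abort()` stands in for whatever error reporting the real code would use.

```c
#include <stdio.h>
#include <stdlib.h>

typedef enum DemoIOOp
{
    DEMO_IOOP_ALLOC,
    DEMO_IOOP_EXTEND,
    DEMO_IOOP_FSYNC,
    DEMO_IOOP_READ,
    DEMO_IOOP_WRITE
} DemoIOOp;

typedef struct DemoIOOpCounters
{
    unsigned long allocs;
    unsigned long extends;
    unsigned long fsyncs;
    unsigned long reads;
    unsigned long writes;
} DemoIOOpCounters;

static void
demo_count_io_op(DemoIOOpCounters *counters, DemoIOOp io_op)
{
    /* No default case, so compilers can warn about unhandled enum values. */
    switch (io_op)
    {
        case DEMO_IOOP_ALLOC:
            counters->allocs++;
            return;
        case DEMO_IOOP_EXTEND:
            counters->extends++;
            return;
        case DEMO_IOOP_FSYNC:
            counters->fsyncs++;
            return;
        case DEMO_IOOP_READ:
            counters->reads++;
            return;
        case DEMO_IOOP_WRITE:
            counters->writes++;
            return;
    }

    /* Only reached if io_op holds a value outside the enum. */
    fprintf(stderr, "unrecognized IOOp value: %d\n", (int) io_op);
    abort();
}
```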


> +bool
> +pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
> +{

Maybe add a tiny comment about what 'valid' means here? Something like
'return whether the backend type counts io in io_context'.


> +    /*
> +     * Only regular backends and WAL Sender processes executing queries should
> +     * use local buffers.
> +     */
> +    no_local = bktype == B_AUTOVAC_LAUNCHER || bktype ==
> +        B_BG_WRITER || bktype == B_CHECKPOINTER || bktype ==
> +        B_AUTOVAC_WORKER || bktype == B_BG_WORKER || bktype ==
> +        B_STANDALONE_BACKEND || bktype == B_STARTUP;

I think BG_WORKERS could end up using local buffers, extensions can do just
about everything in them.


> +bool
> +pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op)
> +{
> +    if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op ==
> +        IOOP_READ)
> +        return false;

Perhaps we should add an assertion about the backend type making sense here?
I.e. that it's not archiver, walwriter etc?


> +bool
> +pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op)
> +{
> +    /*
> +     * Temporary tables using local buffers are not logged and thus do not
> +     * require fsync'ing. Set this cell to NULL to differentiate between an
> +     * invalid combination and 0 observed IO Operations.

This comment feels a bit out of place?


> +bool
> +pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
> +{
> +    if (!pgstat_io_op_stats_collected(bktype))
> +        return false;
> +
> +    if (!pgstat_bktype_io_context_valid(bktype, io_context))
> +        return false;
> +
> +    if (!pgstat_bktype_io_op_valid(bktype, io_op))
> +        return false;
> +
> +    if (!pgstat_io_context_io_op_valid(io_context, io_op))
> +        return false;
> +
> +    /*
> +     * There are currently no cases of a BackendType, IOContext, IOOp
> +     * combination that are specifically invalid.
> +     */

"specifically"?


> From 0f141fa7f97a57b8628b1b6fd6029bd3782f16a1 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 22 Aug 2022 11:35:20 -0400
> Subject: [PATCH v28 4/5] Aggregate IO operation stats per BackendType
> 
> Stats on IOOps for all IOContexts for a backend are tracked locally. Add
> functionality for backends to flush these stats to shared memory and
> accumulate them with those from all other backends, exited and live.
> Also add reset and snapshot functions used by cumulative stats system
> for management of these statistics.
> 
> The aggregated stats in shared memory could be extended in the future
> with per-backend stats -- useful for per connection IO statistics and
> monitoring.
> 
> Some BackendTypes will not flush their pending statistics at regular
> intervals and explicitly call pgstat_flush_io_ops() during the course of
> normal operations to flush their backend-local IO Operation statistics
> to shared memory in a timely manner.

> Because not all BackendType, IOOp, IOContext combinations are valid, the
> validity of the stats are checked before flushing pending stats and
> before reading in the existing stats file to shared memory.

s/are checked/is checked/?



> @@ -1486,6 +1507,42 @@ pgstat_read_statsfile(void)
>      if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
>          goto error;
>  
> +    /*
> +     * Read IO Operations stats struct
> +     */
> +    if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
> +        goto error;
> +
> +    for (int backend_type = 0; backend_type < BACKEND_NUM_TYPES; backend_type++)
> +    {
> +        PgStatShared_IOContextOps *backend_io_context_ops = &shmem->io_ops.stats[backend_type];
> +        bool        expect_backend_stats = true;
> +
> +        if (!pgstat_io_op_stats_collected(backend_type))
> +            expect_backend_stats = false;
> +
> +        for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> +        {
> +            if (!expect_backend_stats ||
> +                !pgstat_bktype_io_context_valid(backend_type, io_context))
> +            {
> +                pgstat_io_context_ops_assert_zero(&backend_io_context_ops->data[io_context]);
> +                continue;
> +            }
> +
> +            for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> +            {
> +                if (!pgstat_bktype_io_op_valid(backend_type, io_op) ||
> +                    !pgstat_io_context_io_op_valid(io_context, io_op))
> +                    pgstat_io_op_assert_zero(&backend_io_context_ops->data[io_context],
> +                                             io_op);
> +            }
> +        }
> +
> +        if (!read_chunk_s(fpin, &backend_io_context_ops->data))
> +            goto error;
> +    }

Could we put the validation out of line? That's a lot of io stats specific
code to be in pgstat_read_statsfile().
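One possible shape for an out-of-line validator, as a sketch only: the caller would be left with a single check. All names are hypothetical, the enums are cut down to two backend types, and the three rules encoded here are the ones visible in the quoted patch hunks.

```c
#include <stdbool.h>
#include <stdint.h>

enum { DEMO_B_BACKEND, DEMO_B_CHECKPOINTER, DEMO_NUM_BKTYPES };
enum { DEMO_CTX_LOCAL, DEMO_CTX_SHARED, DEMO_NUM_CONTEXTS };
enum { DEMO_OP_FSYNC, DEMO_OP_READ, DEMO_OP_WRITE, DEMO_NUM_OPS };

typedef struct DemoIOStats
{
    uint64_t    counts[DEMO_NUM_BKTYPES][DEMO_NUM_CONTEXTS][DEMO_NUM_OPS];
} DemoIOStats;

/* Rules mirrored from the quoted patch, reduced to this example's enums. */
static bool
demo_combination_valid(int bktype, int io_context, int io_op)
{
    /* The checkpointer does not use local buffers. */
    if (bktype == DEMO_B_CHECKPOINTER && io_context == DEMO_CTX_LOCAL)
        return false;
    /* The checkpointer does not read buffers. */
    if (bktype == DEMO_B_CHECKPOINTER && io_op == DEMO_OP_READ)
        return false;
    /* Local buffers are never fsync'ed. */
    if (io_context == DEMO_CTX_LOCAL && io_op == DEMO_OP_FSYNC)
        return false;
    return true;
}

/* Returns true when every counter for an invalid combination is zero. */
static bool
demo_io_stats_valid(const DemoIOStats *stats)
{
    for (int b = 0; b < DEMO_NUM_BKTYPES; b++)
        for (int c = 0; c < DEMO_NUM_CONTEXTS; c++)
            for (int o = 0; o < DEMO_NUM_OPS; o++)
                if (!demo_combination_valid(b, c, o) &&
                    stats->counts[b][c][o] != 0)
                    return false;
    return true;
}
```

With a helper like this, the stats-file reader would only need one call after reading the data, for example wrapped in an assertion.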

> +/*
> + * Helper function to accumulate PgStat_IOOpCounters. If either of the
> + * passed-in PgStat_IOOpCounters are members of PgStatShared_IOContextOps, the
> + * caller is responsible for ensuring that the appropriate lock is held. This
> + * is not asserted because this function could plausibly be used to accumulate
> + * two local/pending PgStat_IOOpCounters.

What's "this" here?


> + */
> +static void
> +pgstat_accum_io_op(PgStat_IOOpCounters *shared, PgStat_IOOpCounters *local, IOOp io_op)

Given that the comment above says both of them may be local, it's a bit odd to
call it 'shared' here...
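As an illustration of neutral parameter naming, a sketch with `target` and `source` instead of `shared` and `local`. These names are a suggestion made here, not what the patch uses; the types are hypothetical and reduced to two counters.

```c
#include <assert.h>
#include <stdint.h>

typedef enum DemoIOOp
{
    DEMO_IOOP_READ,
    DEMO_IOOP_WRITE
} DemoIOOp;

typedef struct DemoIOOpCounters
{
    uint64_t    reads;
    uint64_t    writes;
} DemoIOOpCounters;

/*
 * Add source's counter for io_op to target's. If either argument lives in
 * shared memory, the caller is responsible for holding the matching lock.
 */
static void
demo_accum_io_op(DemoIOOpCounters *target, const DemoIOOpCounters *source,
                 DemoIOOp io_op)
{
    switch (io_op)
    {
        case DEMO_IOOP_READ:
            target->reads += source->reads;
            return;
        case DEMO_IOOP_WRITE:
            target->writes += source->writes;
            return;
    }

    assert(0 && "unrecognized IOOp value");
}
```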


> +PgStat_BackendIOContextOps *
> +pgstat_fetch_backend_io_context_ops(void)
> +{
> +    pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
> +
> +    return &pgStatLocal.snapshot.io_ops;
> +}

Not for this patch series, but we really should replace this set of functions
with storing the relevant offset in the kind_info.


> @@ -496,6 +503,8 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
>   */
>  
>  extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
> +extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
> +extern bool pgstat_flush_io_ops(bool nowait);
>  extern const char *pgstat_io_context_desc(IOContext io_context);
>  extern const char *pgstat_io_op_desc(IOOp io_op);
>  

Is there any call to pgstat_flush_io_ops() from outside pgstat*.c? So possibly
it could be in pgstat_internal.h? Not that it's particularly important...


> @@ -506,6 +515,43 @@ extern bool pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op);
>  extern bool pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op);
>  extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
>  
> +/*
> + * Functions to assert that invalid IO Operation counters are zero. Used with
> + * the validation functions in pgstat_io_ops.c
> + */
> +static inline void
> +pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
> +{
> +    Assert(counters->allocs == 0 && counters->extends == 0 &&
> +           counters->fsyncs == 0 && counters->reads == 0 &&
> +           counters->writes == 0);
> +}
> +
> +static inline void
> +pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
> +{
> +    switch (io_op)
> +    {
> +        case IOOP_ALLOC:
> +            Assert(counters->allocs == 0);
> +            return;
> +        case IOOP_EXTEND:
> +            Assert(counters->extends == 0);
> +            return;
> +        case IOOP_FSYNC:
> +            Assert(counters->fsyncs == 0);
> +            return;
> +        case IOOP_READ:
> +            Assert(counters->reads == 0);
> +            return;
> +        case IOOP_WRITE:
> +            Assert(counters->writes == 0);
> +            return;
> +    }
> +
> +    elog(ERROR, "unrecognized IOOp value: %d", io_op);

Hm. This means it'll emit code even in non-assertion builds - this should
probably just be an Assert(false) or pg_unreachable().
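The same idea in plain C, as a rough sketch: with the standard `assert()` macro standing in for `Assert()`, the fall-through check expands to nothing when `NDEBUG` is defined, so no error-reporting call is left behind in non-assertion builds. The `Demo*` names are hypothetical and the counters are cut down to two.

```c
#include <assert.h>
#include <stdint.h>

typedef enum DemoIOOp
{
    DEMO_IOOP_READ,
    DEMO_IOOP_WRITE
} DemoIOOp;

typedef struct DemoIOOpCounters
{
    uint64_t    reads;
    uint64_t    writes;
} DemoIOOpCounters;

static inline void
demo_io_op_assert_zero(const DemoIOOpCounters *counters, DemoIOOp io_op)
{
    switch (io_op)
    {
        case DEMO_IOOP_READ:
            assert(counters->reads == 0);
            return;
        case DEMO_IOOP_WRITE:
            assert(counters->writes == 0);
            return;
    }

    /*
     * Unknown value: aborts in assertion builds; with NDEBUG this statement
     * expands to a no-op instead of a call into error reporting.
     */
    assert(0 && "unrecognized IOOp value");
}
```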


> Subject: [PATCH v28 5/5] Add system view tracking IO ops per backend type

> View stats are fetched from statistics incremented when a backend
> performs an IO Operation and maintained by the cumulative statistics
> subsystem.

"fetched from statistics incremented"?


> Each row of the view is stats for a particular BackendType for a
> particular IOContext (e.g. shared buffer accesses by checkpointer) and
> each column in the view is the total number of IO Operations done (e.g.
> writes).

s/is/shows/?

s/for a particular BackendType for a particular IOContext/for a particular
BackendType and IOContext/? Somehow the repetition is weird.


> Note that some of the cells in the view are redundant with fields in
> pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
> pg_stat_bgwriter for backwards compatibility. Deriving the redundant
> pg_stat_bgwriter stats from the IO operations stats structures was also
> problematic due to the separate reset targets for 'bgwriter' and
> 'io'.

I suspect we should still consider doing that in the future, perhaps by
documenting that the relevant fields in pg_stat_bgwriter aren't reset by the
'bgwriter' target anymore? And noting that reliance on those fields is
"deprecated" and that pg_stat_io should be used instead?


> Suggested by Andres Freund
> 
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
> ---
>  doc/src/sgml/monitoring.sgml         | 115 ++++++++++++++-
>  src/backend/catalog/system_views.sql |  12 ++
>  src/backend/utils/adt/pgstatfuncs.c  | 100 +++++++++++++
>  src/include/catalog/pg_proc.dat      |   9 ++
>  src/test/regress/expected/rules.out  |   9 ++
>  src/test/regress/expected/stats.out  | 201 +++++++++++++++++++++++++++
>  src/test/regress/sql/stats.sql       | 103 ++++++++++++++
>  7 files changed, 548 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 9440b41770..9949011ba3 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -448,6 +448,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
>       </entry>
>       </row>
>  
> +     <row>
> +      <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
> +      <entry>A row for each IO Context for each backend type showing
> +      statistics about backend IO operations. See
> +       <link linkend="monitoring-pg-stat-io-view">
> +       <structname>pg_stat_io</structname></link> for details.
> +     </entry>
> +     </row>

The "for each for each" thing again :)


> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>io_context</structfield> <type>text</type>
> +      </para>
> +      <para>
> +       IO Context used (e.g. shared buffers, direct).
> +      </para></entry>
> +     </row>

Wrong list of contexts.


> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>alloc</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of buffers allocated.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>extend</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of blocks extended.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>fsync</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of blocks fsynced.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>read</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of blocks read.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>write</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of blocks written.
> +      </para></entry>
> +     </row>

> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
> +      </para>
> +      <para>
> +       Time at which these statistics were last reset.
>        </para></entry>
>       </row>
>      </tbody>

Part of me thinks it'd be nicer if it were "allocated, read, written, extended,
fsynced, stats_reset", instead of alphabetical order. The order already isn't
alphabetical.


> +    /*
> +     * When adding a new column to the pg_stat_io view, add a new enum value
> +     * here above IO_NUM_COLUMNS.
> +     */
> +    enum
> +    {
> +        IO_COLUMN_BACKEND_TYPE,
> +        IO_COLUMN_IO_CONTEXT,
> +        IO_COLUMN_ALLOCS,
> +        IO_COLUMN_EXTENDS,
> +        IO_COLUMN_FSYNCS,
> +        IO_COLUMN_READS,
> +        IO_COLUMN_WRITES,
> +        IO_COLUMN_RESET_TIME,
> +        IO_NUM_COLUMNS,
> +    };

Given it's local and some of the lines are long, maybe just use COL?


> +#define IO_COLUMN_IOOP_OFFSET (IO_COLUMN_IO_CONTEXT + 1)

Undef'ing it probably worth doing.
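A sketch combining both suggestions (shorter `COL` names and undefining the offset macro once it is no longer needed). How the patch actually uses the offset is not shown in the quoted hunk; mapping an IOOp to its column by adding the offset is an assumption here, and all `DEMO_*` names are hypothetical.

```c
#include <assert.h>

typedef enum DemoIOOp
{
    DEMO_IOOP_ALLOC,
    DEMO_IOOP_EXTEND,
    DEMO_IOOP_FSYNC,
    DEMO_IOOP_READ,
    DEMO_IOOP_WRITE,
    DEMO_IOOP_NUM_TYPES
} DemoIOOp;

/* Columns of the view; the IOOp columns are in the same order as DemoIOOp. */
enum
{
    DEMO_COL_BACKEND_TYPE,
    DEMO_COL_IO_CONTEXT,
    DEMO_COL_ALLOCS,
    DEMO_COL_EXTENDS,
    DEMO_COL_FSYNCS,
    DEMO_COL_READS,
    DEMO_COL_WRITES,
    DEMO_COL_RESET_TIME,
    DEMO_NUM_COLS
};

#define DEMO_COL_IOOP_OFFSET (DEMO_COL_IO_CONTEXT + 1)

/* Map an IOOp to the index of its column in a row of the view. */
static int
demo_column_for_io_op(DemoIOOp io_op)
{
    int         col = (int) io_op + DEMO_COL_IOOP_OFFSET;

    assert(col >= DEMO_COL_ALLOCS && col <= DEMO_COL_WRITES);
    return col;
}

/* The macro is only meaningful next to the enum above, so drop it here. */
#undef DEMO_COL_IOOP_OFFSET
```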


> +    SetSingleFuncCall(fcinfo, 0);
> +    rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> +
> +    backends_io_stats = pgstat_fetch_backend_io_context_ops();
> +
> +    reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
> +
> +    for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
> +    {
> +        Datum        bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
> +        bool        expect_backend_stats = true;
> +        PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
> +
> +        /*
> +         * For those BackendTypes without IO Operation stats, skip
> +         * representing them in the view altogether.
> +         */
> +        if (!pgstat_io_op_stats_collected(bktype))
> +            expect_backend_stats = false;

Why not just expect_backend_stats = pgstat_io_op_stats_collected()?


> +        for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> +        {
> +            PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
> +            Datum        values[IO_NUM_COLUMNS];
> +            bool        nulls[IO_NUM_COLUMNS];
> +
> +            /*
> +             * Some combinations of IOCONTEXT and BackendType are not valid
> +             * for any type of IO Operation. In such cases, omit the entire
> +             * row from the view.
> +             */
> +            if (!expect_backend_stats ||
> +                !pgstat_bktype_io_context_valid(bktype, io_context))
> +            {
> +                pgstat_io_context_ops_assert_zero(counters);
> +                continue;
> +            }
> +
> +            memset(values, 0, sizeof(values));
> +            memset(nulls, 0, sizeof(nulls));

I'd replace the memset with values[...] = {0} etc.
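
For illustration, a small standalone sketch of the initializer form. NUM_COLS and the uintptr_t element type are stand-ins for IO_NUM_COLUMNS and Datum, not the patch's definitions. It only checks that "= {0}" yields the same all-zero arrays as the two memset() calls:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_COLS 8              /* stand-in for IO_NUM_COLUMNS */

/*
 * Build one pair of arrays with an initializer and one pair with memset(),
 * then report whether both pairs have identical (all-zero) contents.
 */
static bool
zero_init_matches_memset(void)
{
    uintptr_t   values[NUM_COLS] = {0};     /* stand-in for Datum values[] */
    bool        nulls[NUM_COLS] = {0};
    uintptr_t   values_ms[NUM_COLS];
    bool        nulls_ms[NUM_COLS];

    memset(values_ms, 0, sizeof(values_ms));
    memset(nulls_ms, 0, sizeof(nulls_ms));

    return memcmp(values, values_ms, sizeof(values)) == 0 &&
        memcmp(nulls, nulls_ms, sizeof(nulls)) == 0;
}
```

The initializer also keeps declaration and zeroing on one line, so the arrays cannot be read before being cleared.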


> +            values[IO_COLUMN_BACKEND_TYPE] = bktype_desc;
> +            values[IO_COLUMN_IO_CONTEXT] = CStringGetTextDatum(
> +                                                               pgstat_io_context_desc(io_context));

Pgindent, I hate you.

Perhaps put the context desc in a local var, so it doesn't look quite this
ugly?


> +            values[IO_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
> +            values[IO_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
> +            values[IO_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
> +            values[IO_COLUMN_READS] = Int64GetDatum(counters->reads);
> +            values[IO_COLUMN_WRITES] = Int64GetDatum(counters->writes);
> +            values[IO_COLUMN_RESET_TIME] = TimestampTzGetDatum(reset_time);
> +
> +
> +            /*
> +             * Some combinations of BackendType and IOOp and of IOContext and
> +             * IOOp are not valid. Set these cells in the view NULL and assert
> +             * that these stats are zero as expected.
> +             */
> +            for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> +            {
> +                if (!pgstat_bktype_io_op_valid(bktype, io_op) ||
> +                    !pgstat_io_context_io_op_valid(io_context, io_op))
> +                {
> +                    pgstat_io_op_assert_zero(counters, io_op);
> +                    nulls[io_op + IO_COLUMN_IOOP_OFFSET] = true;
> +                }
> +            }

A bit weird that we first assign a value and then set nulls separately. But
it's not obvious how to make it look nice otherwise.

> +-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
> +-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
> +SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +-- Create a regular table and insert some data to generate IOCONTEXT_SHARED allocs and extends.
> +CREATE TABLE test_io_shared(a int);
> +INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
> +SELECT pg_stat_force_next_flush();
> + pg_stat_force_next_flush 
> +--------------------------
> + 
> +(1 row)
> +
> +-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes and fsyncs.
> +CHECKPOINT;

Does that work reliably? A checkpoint could have started just before the
CREATE TABLE, I think? Then it'd not have flushed those writes yet. I think
doing two checkpoints would protect against that.


> +DROP TABLE test_io_shared;
> +DROP TABLESPACE test_io_shared_stats_tblspc;

Tablespace creation is somewhat expensive, do we really need that? There
should be one set up in setup.sql or such.


> +-- Test that allocs, extends, reads, and writes of temporary tables are tracked
> +-- in pg_stat_io.
> +CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
> +SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +-- Insert enough values that we need to reuse and write out dirty local
> +-- buffers.
> +INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
> +'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';

Could be abbreviated with repeat('a', some-number) :P

Can the table be smaller than this? That might show up on a slow machine.


> +SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;

Random q: Why are we uppercasing the first letter of the context?



> +CREATE TABLE test_io_strategy(a INT, b INT);
> +ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');

I think you can specify that as part of the CREATE TABLE. Not sure if
otherwise there's not a race where autovac could start before you do the ALTER.


> +INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
> +-- Ensure that the next VACUUM will need to perform IO by rewriting the table
> +-- first with VACUUM (FULL).

... because VACUUM FULL currently doesn't set all-visible etc on the pages,
which the subsequent vacuum will then do.


> +-- Hope that the previous value of wal_skip_threshold was the default. We
> +-- can't use BEGIN...SET LOCAL since VACUUM can't be run inside a transaction
> +-- block.
> +RESET wal_skip_threshold;

Nothing in this file set it before, so that's a pretty sure-to-be-fulfilled
hope.


> +-- Test that, when using a Strategy, if creating a relation, Strategy extends

s/if/when/?


Looks good!

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

August 26, 2022, 22:34:06

v29 attached

On Thu, Aug 25, 2022 at 3:15 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-08-22 13:15:18 -0400, Melanie Plageman wrote:

> Notes on the commit which accumulates IO Operation stats in shared
> memory:
>
> - I've extended the usage of the Assert()s which check that IO Operation
> stats that should be zero are zero. Previously we only checked the stats
> validity when querying the view. Now we check it when flushing pending
> stats and when reading the stats file into shared memory.

> Note that the three locations with these validity checks (when
> flushing pending stats, when reading stats file into shared memory,
> and when querying the view) have similar looking code to loop through
> and validate the stats. However, the actual action they perform if the
> stats are valid is different for each site (adding counters together,
> doing a read, setting nulls in a tuple column to true). Also, some of
> these instances have other code interspersed in the loops which would
> require additional looping if separated from this logic. So it was
> difficult to see a way of combining these into a single helper
> function.

All of them seem to repeat something like

> + if (!pgstat_bktype_io_op_valid(bktype, io_op) ||
> + !pgstat_io_context_io_op_valid(io_context, io_op))

perhaps those could be combined? Afaics nothing uses pgstat_bktype_io_op_valid
separately.

I've combined these into pgstat_io_op_valid().
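
A minimal sketch of what such a combined check could look like. The enums are simplified stand-ins, the function name is hypothetical, and only two rules quoted in this thread are modeled (bgwriter/checkpointer do not read; local buffers are not fsync'd). The real pgstat_io_op_valid() may differ:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins; the real enums have more members. */
typedef enum BackendType { B_BACKEND, B_BG_WRITER, B_CHECKPOINTER } BackendType;
typedef enum IOContext { IOCONTEXT_LOCAL, IOCONTEXT_SHARED, IOCONTEXT_STRATEGY } IOContext;
typedef enum IOOp { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;

/*
 * Return whether io_op is expected for this backend type in this IO context.
 * Both former checks (bktype vs. op, context vs. op) live in one function.
 */
static bool
io_op_valid_sketch(BackendType bktype, IOContext io_context, IOOp io_op)
{
    /* bgwriter and checkpointer never read buffers in */
    if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
        io_op == IOOP_READ)
        return false;

    /* temp tables using local buffers do not require fsync */
    if (io_context == IOCONTEXT_LOCAL && io_op == IOOP_FSYNC)
        return false;

    return true;
}
```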

> Subject: [PATCH v28 3/5] Track IO operation statistics locally
>
> Introduce "IOOp", an IO operation done by a backend, and "IOContext",
> the IO location source or target or IO type done by a backend. For
> example, the checkpointer may write a shared buffer out. This would be
> counted as an IOOp "write" on an IOContext IOCONTEXT_SHARED by
> BackendType "checkpointer".
>
> Each IOOp (alloc, extend, fsync, read, write) is counted per IOContext
> (local, shared, or strategy) through a call to pgstat_count_io_op().
>
> The primary concern of these statistics is IO operations on data blocks
> during the course of normal database operations. IO done by, for
> example, the archiver or syslogger is not counted in these statistics.

s/is/are/?

changed

> Stats on IOOps for all IOContexts for a backend are counted in a
> backend's local memory. This commit does not expose any functions for
> aggregating or viewing these stats.

s/This commit does not/A subsequent commit will expose/...

changed

> @@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> BufferDesc *bufHdr;
> Block bufBlock;
> bool found;
> + IOContext io_context;
> bool isExtend;
> bool isLocalBuf = SmgrIsTemp(smgr);
>
> @@ -986,10 +987,25 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> */
> Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
>
> - bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
> + if (isLocalBuf)
> + {
> + bufBlock = LocalBufHdrGetBlock(bufHdr);
> + io_context = IOCONTEXT_LOCAL;
> + }
> + else
> + {
> + bufBlock = BufHdrGetBlock(bufHdr);
> +
> + if (strategy != NULL)
> + io_context = IOCONTEXT_STRATEGY;
> + else
> + io_context = IOCONTEXT_SHARED;
> + }

There's a isLocalBuf block earlier on, couldn't we just determine the context
there? I guess there's a branch here already, so it's probably fine as is.

I've added this as close as possible to the code where we use the
io_context. If I were to move it, it would make sense to move it all the
way to the top of ReadBuffer_common() where we first define isLocalBuf.

I've left it as is.
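
For reference, the decision in that hunk boils down to the following. This is a standalone sketch with a hypothetical helper name and an opaque pointer standing in for BufferAccessStrategy; the patch itself does this inline in ReadBuffer_common():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum IOContext { IOCONTEXT_LOCAL, IOCONTEXT_SHARED, IOCONTEXT_STRATEGY } IOContext;

/*
 * Pick the IOContext for a buffer read. "strategy" is an opaque pointer
 * here; NULL means no buffer access strategy is in use.
 */
static IOContext
io_context_for_read_sketch(bool is_local_buf, const void *strategy)
{
    if (is_local_buf)
        return IOCONTEXT_LOCAL;

    return (strategy != NULL) ? IOCONTEXT_STRATEGY : IOCONTEXT_SHARED;
}
```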

> if (isExtend)
> {
> +
> + pgstat_count_io_op(IOOP_EXTEND, io_context);

Spurious newline.

fixed

> @@ -2820,9 +2857,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
> *
> * If the caller has an smgr reference for the buffer's relation, pass it
> * as the second parameter. If not, pass NULL.
> + *
> + * IOContext will always be IOCONTEXT_SHARED except when a buffer access strategy is
> + * used and the buffer being flushed is a buffer from the strategy ring.
> */
> static void
> -FlushBuffer(BufferDesc *buf, SMgrRelation reln)
> +FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)

Too long line?

But also, why document the possible values here? Seems likely to get out of
date at some point, and it doesn't seem important to know?

Deleted.

> @@ -3549,6 +3591,8 @@ FlushRelationBuffers(Relation rel)
> localpage,
> false);
>
> + pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
> +
> buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
> pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
>

Probably not worth doing, but these made me wonder whether there should be a
function for counting N operations at once.

Would it be worth it here? We would need a local variable to track how
many local buffers we end up writing. Do you think that
pgstat_count_io_op() will not be inlined and thus we will end up with
lots of extra function calls if we do a pgstat_count_io_op() on every
iteration? And that it will matter in FlushRelationBuffers()?
The other times that pgstat_count_io_op() is used in a loop, it is
part of the branch that will exit the loop and only be called once-ish.

Or are you thinking that just generally it might be nice to have?
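
To make the trade-off concrete, here is a rough sketch of the "count N at once" shape. Both function names are hypothetical and the counter struct is a stand-in; nothing like this is in v28:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum IOOp { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;

typedef struct IOOpCounters
{
    int64_t     allocs;
    int64_t     extends;
    int64_t     fsyncs;
    int64_t     reads;
    int64_t     writes;
} IOOpCounters;

/* Hypothetical bulk variant: add n operations of one kind in a single call. */
static void
count_io_op_n_sketch(IOOpCounters *counters, IOOp io_op, int64_t n)
{
    switch (io_op)
    {
        case IOOP_ALLOC:
            counters->allocs += n;
            break;
        case IOOP_EXTEND:
            counters->extends += n;
            break;
        case IOOP_FSYNC:
            counters->fsyncs += n;
            break;
        case IOOP_READ:
            counters->reads += n;
            break;
        case IOOP_WRITE:
            counters->writes += n;
            break;
    }
}

/*
 * Loop shaped like a local-buffer flush: tally dirty buffers in a local
 * variable and report them with one call after the loop.
 */
static IOOpCounters
flush_dirty_sketch(const bool *dirty, int nbuffers)
{
    IOOpCounters counters = {0};
    int64_t     nwritten = 0;

    for (int i = 0; i < nbuffers; i++)
    {
        if (!dirty[i])
            continue;
        /* the real code would write the page out here */
        nwritten++;
    }

    count_io_op_n_sketch(&counters, IOOP_WRITE, nwritten);
    return counters;
}
```

The cost is exactly the extra local variable mentioned above; whether that beats a per-iteration call depends on whether the per-op function gets inlined.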

> @@ -212,8 +215,23 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
> if (strategy != NULL)
> {
> buf = GetBufferFromRing(strategy, buf_state);
> - if (buf != NULL)
> + *from_ring = buf != NULL;
> + if (*from_ring)
> + {

Don't really like the if (*from_ring) - why not keep it as buf != NULL? Seems
a bit confusing this way, making it less obvious what's being changed.

Changed

> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 014f644bf9..a3d76599bf 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -15,6 +15,7 @@
> */
> #include "postgres.h"
>
> +#include "pgstat.h"
> #include "access/parallel.h"
> #include "catalog/catalog.h"
> #include "executor/instrument.h"

Do most other places not put pgstat.h in the alphabetical order of headers?

Fixed

> @@ -432,6 +432,15 @@ ProcessSyncRequests(void)
> total_elapsed += elapsed;
> processed++;
>
> + /*
> + * Note that if a backend using a BufferAccessStrategy is
> + * forced to do its own fsync (as opposed to the
> + * checkpointer doing it), it will not be counted as an
> + * IOCONTEXT_STRATEGY IOOP_FSYNC and instead will be
> + * counted as an IOCONTEXT_SHARED IOOP_FSYNC.
> + */
> + pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);

Why is this noted here? Perhaps just point to the place where that happens
instead? I think it's also documented in ForwardSyncRequest()? Or just only
mention it there...

Removed

> @@ -0,0 +1,191 @@
> +/* -------------------------------------------------------------------------
> + *
> + * pgstat_io_ops.c
> + * Implementation of IO operation statistics.
> + *
> + * This file contains the implementation of IO operation statistics. It is kept
> + * separate from pgstat.c to enforce the line between the statistics access /
> + * storage implementation and the details about individual types of
> + * statistics.
> + *
> + * Copyright (c) 2001-2022, PostgreSQL Global Development Group

Arguably this would just be 2021-2022

Changed

> +void
> +pgstat_count_io_op(IOOp io_op, IOContext io_context)
> +{
> + PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_context];
> +
> + Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
> +
> + switch (io_op)
> + {
> + case IOOP_ALLOC:
> + pending_counters->allocs++;
> + break;
> + case IOOP_EXTEND:
> + pending_counters->extends++;
> + break;
> + case IOOP_FSYNC:
> + pending_counters->fsyncs++;
> + break;
> + case IOOP_READ:
> + pending_counters->reads++;
> + break;
> + case IOOP_WRITE:
> + pending_counters->writes++;
> + break;
> + }
> +
> +}

How about replacing the breaks with a return and then erroring out if we reach
the end of the function? You did that below, and I think it makes sense.

I used breaks because in the subsequent commit I introduce the variable
"have_ioopstats", and I set have_ioopstats to true in
pgstat_count_io_op() after counting.
It is probably safe to set have_ioopstats to true before incrementing it
since this backend is the only one that can see have_ioopstats and it
shouldn't fail while incrementing the counter but it seems less clear
than doing it after.

Instead of erroring out for an unknown IOOp, I decided to add Asserts
about the IOContext and IOOp being valid and that the combination of
MyBackendType, IOContext, and IOOp are valid. I think it will be good to
assert that the IOContext is valid before using it as an array index for
lookup in pending stats.
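
Roughly, the shape described above is as follows. This is a standalone sketch with stand-in types and hypothetical names (pending_stats, have_pending_stats); the backend-type validity assertion is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum IOContext { IOCONTEXT_LOCAL, IOCONTEXT_SHARED, IOCONTEXT_STRATEGY } IOContext;
#define IOCONTEXT_NUM_TYPES (IOCONTEXT_STRATEGY + 1)

typedef enum IOOp { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;
#define IOOP_NUM_TYPES (IOOP_WRITE + 1)

typedef struct IOOpCounters
{
    int64_t     allocs;
    int64_t     extends;
    int64_t     fsyncs;
    int64_t     reads;
    int64_t     writes;
} IOOpCounters;

/* Backend-local pending counters, one slot per IOContext. */
static IOOpCounters pending_stats[IOCONTEXT_NUM_TYPES];
static bool have_pending_stats = false;

static void
count_io_op_sketch(IOOp io_op, IOContext io_context)
{
    IOOpCounters *pending;

    /* Validate both values before io_context is used as an array index. */
    assert((int) io_context >= 0 && (int) io_context < IOCONTEXT_NUM_TYPES);
    assert((int) io_op >= 0 && (int) io_op < IOOP_NUM_TYPES);

    pending = &pending_stats[io_context];

    switch (io_op)
    {
        case IOOP_ALLOC:
            pending->allocs++;
            break;
        case IOOP_EXTEND:
            pending->extends++;
            break;
        case IOOP_FSYNC:
            pending->fsyncs++;
            break;
        case IOOP_READ:
            pending->reads++;
            break;
        case IOOP_WRITE:
            pending->writes++;
            break;
    }

    /* Reached from every case because the cases break rather than return. */
    have_pending_stats = true;
}
```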

> +bool
> +pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
> +{

Maybe add a tiny comment about what 'valid' means here? Something like
'return whether the backend type counts io in io_context'.

Changed

> + /*
> + * Only regular backends and WAL Sender processes executing queries should
> + * use local buffers.
> + */
> + no_local = bktype == B_AUTOVAC_LAUNCHER || bktype ==
> + B_BG_WRITER || bktype == B_CHECKPOINTER || bktype ==
> + B_AUTOVAC_WORKER || bktype == B_BG_WORKER || bktype ==
> + B_STANDALONE_BACKEND || bktype == B_STARTUP;

I think BG_WORKERS could end up using local buffers, extensions can do just
about everything in them.

Fixed and added comment.

> +bool
> +pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op)
> +{
> + if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op ==
> + IOOP_READ)
> + return false;

Perhaps we should add an assertion about the backend type making sense here?
I.e. that it's not archiver, walwriter etc?

Done

> +bool
> +pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op)
> +{
> + /*
> + * Temporary tables using local buffers are not logged and thus do not
> + * require fsync'ing. Set this cell to NULL to differentiate between an
> + * invalid combination and 0 observed IO Operations.

This comment feels a bit out of place?

Deleted

> +bool
> +pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
> +{
> + if (!pgstat_io_op_stats_collected(bktype))
> + return false;
> +
> + if (!pgstat_bktype_io_context_valid(bktype, io_context))
> + return false;
> +
> + if (!pgstat_bktype_io_op_valid(bktype, io_op))
> + return false;
> +
> + if (!pgstat_io_context_io_op_valid(io_context, io_op))
> + return false;
> +
> + /*
> + * There are currently no cases of a BackendType, IOContext, IOOp
> + * combination that are specifically invalid.
> + */

"specifically"?

I removed this and mentioned it (rephrased) above pgstat_io_op_valid()

> From 0f141fa7f97a57b8628b1b6fd6029bd3782f16a1 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 22 Aug 2022 11:35:20 -0400
> Subject: [PATCH v28 4/5] Aggregate IO operation stats per BackendType
>
> Stats on IOOps for all IOContexts for a backend are tracked locally. Add
> functionality for backends to flush these stats to shared memory and
> accumulate them with those from all other backends, exited and live.
> Also add reset and snapshot functions used by cumulative stats system
> for management of these statistics.
>
> The aggregated stats in shared memory could be extended in the future
> with per-backend stats -- useful for per connection IO statistics and
> monitoring.
>
> Some BackendTypes will not flush their pending statistics at regular
> intervals and explicitly call pgstat_flush_io_ops() during the course of
> normal operations to flush their backend-local IO Operation statistics
> to shared memory in a timely manner.

> Because not all BackendType, IOOp, IOContext combinations are valid, the
> validity of the stats are checked before flushing pending stats and
> before reading in the existing stats file to shared memory.

s/are checked/is checked/?

Fixed

> @@ -1486,6 +1507,42 @@ pgstat_read_statsfile(void)
> if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
> goto error;
>
> + /*
> + * Read IO Operations stats struct
> + */
> + if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
> + goto error;
> +
> + for (int backend_type = 0; backend_type < BACKEND_NUM_TYPES; backend_type++)
> + {
> + PgStatShared_IOContextOps *backend_io_context_ops = &shmem->io_ops.stats[backend_type];
> + bool expect_backend_stats = true;
> +
> + if (!pgstat_io_op_stats_collected(backend_type))
> + expect_backend_stats = false;
> +
> + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> + {
> + if (!expect_backend_stats ||
> + !pgstat_bktype_io_context_valid(backend_type, io_context))
> + {
> + pgstat_io_context_ops_assert_zero(&backend_io_context_ops->data[io_context]);
> + continue;
> + }
> +
> + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> + {
> + if (!pgstat_bktype_io_op_valid(backend_type, io_op) ||
> + !pgstat_io_context_io_op_valid(io_context, io_op))
> + pgstat_io_op_assert_zero(&backend_io_context_ops->data[io_context],
> + io_op);
> + }
> + }
> +
> + if (!read_chunk_s(fpin, &backend_io_context_ops->data))
> + goto error;
> + }

Could we put the validation out of line? That's a lot of io stats specific
code to be in pgstat_read_statsfile().

Done.
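
As a rough picture of what "out of line" could mean here: one helper owns the triple loop, and the stats-file reader just calls it. The enums, struct layout, rules and names below are simplified assumptions for illustration, and the helper returns a bool where the patch uses assertions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins; the real enums have more members. */
enum { B_BACKEND, B_CHECKPOINTER, NUM_BKTYPES };
enum { IOCONTEXT_LOCAL, IOCONTEXT_SHARED, NUM_CONTEXTS };
enum { IOOP_FSYNC, IOOP_READ, IOOP_WRITE, NUM_OPS };

typedef struct IOStatsSketch
{
    int64_t     counts[NUM_BKTYPES][NUM_CONTEXTS][NUM_OPS];
} IOStatsSketch;

/* A few of the rules mentioned in this thread, collapsed into one predicate. */
static bool
combo_valid_sketch(int bktype, int io_context, int io_op)
{
    if (bktype == B_CHECKPOINTER && io_context == IOCONTEXT_LOCAL)
        return false;           /* checkpointer does not use local buffers */
    if (bktype == B_CHECKPOINTER && io_op == IOOP_READ)
        return false;           /* checkpointer does not read */
    if (io_context == IOCONTEXT_LOCAL && io_op == IOOP_FSYNC)
        return false;           /* local buffers are not fsync'd */
    return true;
}

/* Out-of-line check: every counter for an invalid combination must be zero. */
static bool
io_stats_valid_sketch(const IOStatsSketch *stats)
{
    for (int bktype = 0; bktype < NUM_BKTYPES; bktype++)
        for (int io_context = 0; io_context < NUM_CONTEXTS; io_context++)
            for (int io_op = 0; io_op < NUM_OPS; io_op++)
                if (!combo_valid_sketch(bktype, io_context, io_op) &&
                    stats->counts[bktype][io_context][io_op] != 0)
                    return false;

    return true;
}
```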

> +/*
> + * Helper function to accumulate PgStat_IOOpCounters. If either of the
> + * passed-in PgStat_IOOpCounters are members of PgStatShared_IOContextOps, the
> + * caller is responsible for ensuring that the appropriate lock is held. This
> + * is not asserted because this function could plausibly be used to accumulate
> + * two local/pending PgStat_IOOpCounters.

What's "this" here?

I rephrased it.

> @@ -496,6 +503,8 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
> */
>
> extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
> +extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
> +extern bool pgstat_flush_io_ops(bool nowait);
> extern const char *pgstat_io_context_desc(IOContext io_context);
> extern const char *pgstat_io_op_desc(IOOp io_op);
>

Is there any call to pgstat_flush_io_ops() from outside pgstat*.c? So possibly
it could be in pgstat_internal.h? Not that it's particularly important...

Moved it.

> @@ -506,6 +515,43 @@ extern bool pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op);
> extern bool pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op);
> extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
>
> +/*
> + * Functions to assert that invalid IO Operation counters are zero. Used with
> + * the validation functions in pgstat_io_ops.c
> + */
> +static inline void
> +pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
> +{
> + Assert(counters->allocs == 0 && counters->extends == 0 &&
> + counters->fsyncs == 0 && counters->reads == 0 &&
> + counters->writes == 0);
> +}
> +
> +static inline void
> +pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
> +{
> + switch (io_op)
> + {
> + case IOOP_ALLOC:
> + Assert(counters->allocs == 0);
> + return;
> + case IOOP_EXTEND:
> + Assert(counters->extends == 0);
> + return;
> + case IOOP_FSYNC:
> + Assert(counters->fsyncs == 0);
> + return;
> + case IOOP_READ:
> + Assert(counters->reads == 0);
> + return;
> + case IOOP_WRITE:
> + Assert(counters->writes == 0);
> + return;
> + }
> +
> + elog(ERROR, "unrecognized IOOp value: %d", io_op);

Hm. This means it'll emit code even in non-assertion builds - this should
probably just be an Assert(false) or pg_unreachable().

Fixed.
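
A sketch of the assertion-only variant, using plain assert() in place of Assert() and stand-in types. With NDEBUG defined (the analogue of a non-assert build) every check here, including the trailing "not reached" one, compiles away:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum IOOp { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;

typedef struct IOOpCounters
{
    int64_t     allocs;
    int64_t     extends;
    int64_t     fsyncs;
    int64_t     reads;
    int64_t     writes;
} IOOpCounters;

/* Assert that the counter for one (invalid) IOOp is still zero. */
static void
io_op_assert_zero_sketch(const IOOpCounters *counters, IOOp io_op)
{
    switch (io_op)
    {
        case IOOP_ALLOC:
            assert(counters->allocs == 0);
            return;
        case IOOP_EXTEND:
            assert(counters->extends == 0);
            return;
        case IOOP_FSYNC:
            assert(counters->fsyncs == 0);
            return;
        case IOOP_READ:
            assert(counters->reads == 0);
            return;
        case IOOP_WRITE:
            assert(counters->writes == 0);
            return;
    }

    /* Not reached for a valid IOOp; no elog() call is emitted here. */
    assert(false);
}
```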

> Subject: [PATCH v28 5/5] Add system view tracking IO ops per backend type

> View stats are fetched from statistics incremented when a backend
> performs an IO Operation and maintained by the cumulative statistics
> subsystem.

"fetched from statistics incremented"?

Rephrased it.

> Each row of the view is stats for a particular BackendType for a
> particular IOContext (e.g. shared buffer accesses by checkpointer) and
> each column in the view is the total number of IO Operations done (e.g.
> writes).

s/is/shows/?

s/for a particular BackendType for a particular IOContext/for a particular
BackendType and IOContext/? Somehow the repetition is weird.

Both of the above wordings are now changed.

> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 9440b41770..9949011ba3 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
> </entry>
> </row>
>
> + <row>
> + <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
> + <entry>A row for each IO Context for each backend type showing
> + statistics about backend IO operations. See
> + <link linkend="monitoring-pg-stat-io-view">
> + <structname>pg_stat_io</structname></link> for details.
> + </entry>
> + </row>

The "for each for each" thing again :)

Changed it.

> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>io_context</structfield> <type>text</type>
> + </para>
> + <para>
> + IO Context used (e.g. shared buffers, direct).
> + </para></entry>
> + </row>

Wrong list of contexts.

Fixed it.

> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>alloc</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Number of buffers allocated.
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>extend</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Number of blocks extended.
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>fsync</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Number of blocks fsynced.
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>read</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Number of blocks read.
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>write</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Number of blocks written.
> + </para></entry>
> + </row>

> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
> + </para>
> + <para>
> + Time at which these statistics were last reset.
> </para></entry>
> </row>
> </tbody>

Part of me thinks it'd be nicer if it were "allocated, read, written, extended,
fsynced, stats_reset", instead of alphabetical order. The order already isn't
alphabetical.

I've updated the order in the view and docs.

> + /*
> + * When adding a new column to the pg_stat_io view, add a new enum value
> + * here above IO_NUM_COLUMNS.
> + */
> + enum
> + {
> + IO_COLUMN_BACKEND_TYPE,
> + IO_COLUMN_IO_CONTEXT,
> + IO_COLUMN_ALLOCS,
> + IO_COLUMN_EXTENDS,
> + IO_COLUMN_FSYNCS,
> + IO_COLUMN_READS,
> + IO_COLUMN_WRITES,
> + IO_COLUMN_RESET_TIME,
> + IO_NUM_COLUMNS,
> + };

Given it's local and some of the lines are long, maybe just use COL?

I've shortened COLUMN to COL. However, I've also moved this enum outside
of the function and typedef'd it. I did this because, upon changing the
order of the columns in the view, I could no longer use
IO_COLUMN_IOOP_OFFSET and the IOOp value in the loop at the bottom of
pg_stat_get_io() to set the correct column to NULL. So, I created a
helper function which translates IOOp to io_stat_col.
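
A sketch of what such a translation helper might look like. The IO_COL_* names, the column order (following "allocated, read, written, extended, fsynced") and the helper name are guesses for illustration, not copied from v29:

```c
#include <assert.h>

typedef enum IOOp { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;

/* Assumed column order after the reordering discussed above. */
typedef enum io_stat_col
{
    IO_COL_BACKEND_TYPE,
    IO_COL_IO_CONTEXT,
    IO_COL_ALLOCS,
    IO_COL_READS,
    IO_COL_WRITES,
    IO_COL_EXTENDS,
    IO_COL_FSYNCS,
    IO_COL_RESET_TIME,
    IO_NUM_COLUMNS
} io_stat_col;

/* Hypothetical helper: map an IOOp to its column in the view. */
static io_stat_col
io_op_get_col_sketch(IOOp io_op)
{
    switch (io_op)
    {
        case IOOP_ALLOC:
            return IO_COL_ALLOCS;
        case IOOP_EXTEND:
            return IO_COL_EXTENDS;
        case IOOP_FSYNC:
            return IO_COL_FSYNCS;
        case IOOP_READ:
            return IO_COL_READS;
        case IOOP_WRITE:
            return IO_COL_WRITES;
    }

    assert(0);                  /* not reached for a valid IOOp */
    return IO_NUM_COLUMNS;      /* keep the compiler quiet */
}
```

Under these assumed orders, IOOP_READ plus the old offset (IO_COL_IO_CONTEXT + 1) would land on the extends column, which is why a plain offset stops working once the columns are no longer in IOOp order.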

> +#define IO_COLUMN_IOOP_OFFSET (IO_COLUMN_IO_CONTEXT + 1)

Undef'ing it is probably worth doing.

It's gone now anyway.

> + SetSingleFuncCall(fcinfo, 0);
> + rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> +
> + backends_io_stats = pgstat_fetch_backend_io_context_ops();
> +
> + reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
> +
> + for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
> + {
> + Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
> + bool expect_backend_stats = true;
> + PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
> +
> + /*
> + * For those BackendTypes without IO Operation stats, skip
> + * representing them in the view altogether.
> + */
> + if (!pgstat_io_op_stats_collected(bktype))
> + expect_backend_stats = false;

Why not just expect_backend_stats = pgstat_io_op_stats_collected()?

Updated this everywhere it occurred.

> + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> + {
> + PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
> + Datum values[IO_NUM_COLUMNS];
> + bool nulls[IO_NUM_COLUMNS];
> +
> + /*
> + * Some combinations of IOCONTEXT and BackendType are not valid
> + * for any type of IO Operation. In such cases, omit the entire
> + * row from the view.
> + */
> + if (!expect_backend_stats ||
> + !pgstat_bktype_io_context_valid(bktype, io_context))
> + {
> + pgstat_io_context_ops_assert_zero(counters);
> + continue;
> + }
> +
> + memset(values, 0, sizeof(values));
> + memset(nulls, 0, sizeof(nulls));

I'd replace the memset with values[...] = {0} etc.

Done.

> + values[IO_COLUMN_BACKEND_TYPE] = bktype_desc;
> + values[IO_COLUMN_IO_CONTEXT] = CStringGetTextDatum(
> + pgstat_io_context_desc(io_context));

Pgindent, I hate you.

Perhaps put the context desc in a local var, so it doesn't look quite this
ugly?

Did this.

> +-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
> +-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
> +SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +-- Create a regular table and insert some data to generate IOCONTEXT_SHARED allocs and extends.
> +CREATE TABLE test_io_shared(a int);
> +INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
> +SELECT pg_stat_force_next_flush();
> + pg_stat_force_next_flush
> +--------------------------
> +
> +(1 row)
> +
> +-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes and fsyncs.
> +CHECKPOINT;

Does that work reliably? A checkpoint could have started just before the
CREATE TABLE, I think? Then it'd not have flushed those writes yet. I think
doing two checkpoints would protect against that.

If the first checkpoint starts just before creating the table and those
buffers are dirtied during that checkpoint and thus not written out by
checkpointer during that checkpoint, then the test's (single) explicit
checkpoint would end up picking up those dirty buffers and writing them
out, right?

> +DROP TABLE test_io_shared;
> +DROP TABLESPACE test_io_shared_stats_tblspc;

Tablespace creation is somewhat expensive, do we really need that? There
should be one set up in setup.sql or such.

The only ones I see in regress are for tablespace.sql which drops them
in the same test and is testing dropping tablespaces.

> +-- Test that allocs, extends, reads, and writes of temporary tables are tracked
> +-- in pg_stat_io.
> +CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
> +SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +-- Insert enough values that we need to reuse and write out dirty local
> +-- buffers.
> +INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
> +'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';

Could be abbreviated with repeat('a', some-number) :P

Done.

Can the table be smaller than this? That might show up on a slow machine.

Setting temp_buffers to 1MB, 7500 tuples of this width seem like enough.
I inserted 8000 to be safe -- seems like an order of magnitude less
should be good.

> +SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;

Random q: Why are we uppercasing the first letter of the context?

hmm. dunno. I changed it to be lowercase now.

> +CREATE TABLE test_io_strategy(a INT, b INT);
> +ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');

I think you can specify that as part of the CREATE TABLE. Not sure if
otherwise there's not a race where autovac could start before you do the ALTER.

Done.

> +INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
> +-- Ensure that the next VACUUM will need to perform IO by rewriting the table
> +-- first with VACUUM (FULL).

... because VACUUM FULL currently doesn't set all-visible etc on the pages,
which the subsequent vacuum will then do.

It is true that the second VACUUM will set all-visible while VACUUM FULL
will not. However, I didn't think that that writing was what allowed us
to test strategy reads and allocs. It would theoretically allow us to
test strategy writes, however, in practice, checkpointer or background
writer often wrote out these dirty pages with all-visible set before
this backend had a chance to reuse them and write them out itself.

Unless you are saying that the subsequent VACUUM would be a no-op were
VACUUM FULL to set all-visible on the rewritten pages?

> +-- Hope that the previous value of wal_skip_threshold was the default. We
> +-- can't use BEGIN...SET LOCAL since VACUUM can't be run inside a transaction
> +-- block.
> +RESET wal_skip_threshold;

Nothing in this file set it before, so that's a pretty sure-to-be-fulfilled
hope.

I've removed the comment.

> +-- Test that, when using a Strategy, if creating a relation, Strategy extends

s/if/when/?

Changed this.

Thanks for the detailed review!

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

27 September 2022, 21:20:44

v30 attached
rebased and pgstat_io_ops.c builds with meson now
also, I tested with pgstat_report_stat() only flushing when forced and
tests still pass

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Lukas Fittl

Date:

01 October 2022, 02:17:25

On Tue, Sep 27, 2022 at 11:20 AM Melanie Plageman <melanieplageman@gmail.com> wrote:

v30 attached
rebased and pgstat_io_ops.c builds with meson now
also, I tested with pgstat_report_stat() only flushing when forced and
tests still pass

First of all, I'm excited about this patch, and I think it will be a big help to understand better which part of Postgres is producing I/O (and why).

I've paired up with Maciek (CCed) on a review of this patch and had a few comments, focused on the user experience:

The term "strategy" as an "io_context" is hard to understand, as it's not a concept an end-user / DBA would be familiar with. Since this comes from BufferAccessStrategyType (i.e. anything not NULL/BAS_NORMAL is treated as "strategy"), maybe we could instead split this out into the individual strategy types? i.e. making "strategy" three different I/O contexts instead: "shared_bulkread", "shared_bulkwrite" and "shared_vacuum", retaining "shared" to mean NULL / BAS_NORMAL.

Separately, could we also track buffer hits without incurring extra overhead? (not just allocs and reads) -- Whilst we already have shared read and hit counters in a few other places, this would help make the common "What's my cache hit ratio" question more accurate to answer in the presence of different shared buffer access strategies. Tracking hits could also help for local buffers (e.g. to tune temp_buffers based on seeing a low cache hit ratio).

Additionally, some minor notes:

- Since the stats are counting blocks, it would make sense to prefix the view columns with "blks_", and word them in the past tense (to match current style), i.e. "blks_written", "blks_read", "blks_extended", "blks_fsynced" (realistically one would combine this new view with other data e.g. from pg_stat_database or pg_stat_statements, which all use the "blks_" prefix, and stop using pg_stat_bgwriter for this which does not use such a prefix)

- "alloc" as a name doesn't seem intuitive (and it may be confused with memory allocations) - whilst this is already named this way in pg_stat_bgwriter, it feels like this is an opportunity to eventually deprecate the column there and make this easier to understand - specifically, maybe we can clarify that this means buffer *acquisitions*? (either by renaming the field to "blks_acquired", or clarifying in the documentation)

- Assuming we think this view could realistically cover all I/O produced by Postgres in the future (thus warranting the name "pg_stat_io"), it may be best to have an explicit list of things that are not currently tracked in the documentation, to reduce user confusion (i.e. WAL writes are not tracked, temporary files are not tracked, and some forms of direct writes are not tracked, e.g. when a table moves to a different tablespace)

- In the view documentation, it would be good to explain the different values for "io_strategy" (and what they mean)

- Overall it would be helpful if we had a dedicated documentation page on I/O statistics that's linked from the pg_stat_io view description, and explains how the I/O statistics tie into the various concepts of shared buffers / buffer access strategies / etc (and what is not tracked today)

Thanks,

Lukas

Lukas Fittl

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

02 October 2022, 20:24:04

Hi,

On 2022-09-27 14:20:44 -0400, Melanie Plageman wrote:
> v30 attached
> rebased and pgstat_io_ops.c builds with meson now
> also, I tested with pgstat_report_stat() only flushing when forced and
> tests still pass

Unfortunately tests fail in CI / cfbot. E.g.,
https://cirrus-ci.com/task/5816109319323648

https://api.cirrus-ci.com/v1/artifact/task/5816109319323648/testrun/build/testrun/main/regress/regression.diffs
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/stats.out
/tmp/cirrus-ci-build/build/testrun/main/regress/results/stats.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/stats.out    2022-10-01 12:07:47.779183501 +0000
+++ /tmp/cirrus-ci-build/build/testrun/main/regress/results/stats.out    2022-10-01 12:11:38.686433303 +0000
@@ -997,6 +997,8 @@
 -- Set temp_buffers to a low value so that we can trigger writes with fewer
 -- inserted tuples.
 SET temp_buffers TO '1MB';
+ERROR:  invalid value for parameter "temp_buffers": 128
+DETAIL:  "temp_buffers" cannot be changed after any temporary tables have been accessed in the session.
 CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
 SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'local' \gset
 SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
@@ -1037,7 +1039,7 @@
 SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
  ?column? 
 ----------
- t
+ f
 (1 row)
 
 SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;


So the problem is just that something else accesses temp buffers earlier in
the same test.

That's likely because since you sent your email

commit d7e39d72ca1c6f188b400d7d58813ff5b5b79064
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2022-09-29 12:14:39 -0400
 
    Use actual backend IDs in pg_stat_get_backend_idset() and friends.

was applied, which adds a temp table earlier in the same session.


I think the easiest way to make this robust would be to just add a reconnect
before the place you need to set temp_buffers, that way additional temp tables
won't cause a problem.

Setting the patch to waiting-for-author for now.

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

06 October 2022, 20:42:09

v31 attached
I've also addressed the failing test mentioned by Andres in [1].

On Fri, Sep 30, 2022 at 7:18 PM Lukas Fittl <lukas@fittl.com> wrote:
>
> On Tue, Sep 27, 2022 at 11:20 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> First of all, I'm excited about this patch, and I think it will be a big
> help to understand better which part of Postgres is producing I/O (and why).
>

Thanks! I'm happy to hear that.

> I've paired up with Maciek (CCed) on a review of this patch and had a few comments, focused on the user experience:
>

Thanks for taking the time to review!

> The term "strategy" as an "io_context" is hard to understand, as it's not a
> concept an end-user / DBA would be familiar with. Since this comes from
> BufferAccessStrategyType (i.e. anything not NULL/BAS_NORMAL is treated as
> "strategy"), maybe we could instead split this out into the individual
> strategy types? i.e. making "strategy" three different I/O contexts instead:
> "shared_bulkread", "shared_bulkwrite" and "shared_vacuum", retaining
> "shared" to mean NULL / BAS_NORMAL.

I have split strategy out into "vacuum", "bulkread", and "bulkwrite". I
thought it was less clear with shared as a prefix. If we were to have
BufferAccessStrategies in the future which acquire local buffers (for
example), we could start prefixing the columns to differentiate.

This opened up some new questions about which BufferAccessStrategies
will be employed by which BackendTypes and which IOOps will be valid in
a given BufferAccessStrategy.

I've excluded IOCONTEXT_BULKREAD and IOCONTEXT_BULKWRITE for autovacuum
worker -- though those may not be inherently invalid, they seem not to
be done now and added extra rows to the view.

I've also disallowed IOOP_EXTEND for IOCONTEXT_BULKREAD.

> Separately, could we also track buffer hits without incurring extra
> overhead? (not just allocs and reads) -- Whilst we already have shared read
> and hit counters in a few other places, this would help make the common
> "What's my cache hit ratio" question more accurate to answer in the presence
> of different shared buffer access strategies. Tracking hits could also help
> for local buffers (e.g. to tune temp_buffers based on seeing a low cache hit
> ratio).

I've started tracking hits and added "hit" to the view.
I added IOOP_HIT and IOOP_ACQUIRE to those IOOps disallowed for
checkpointer and bgwriter.

I have added tests for hit, but I'm not sure I can keep them. It seems
like they might fail if the blocks are evicted between the first and
second time I try to read them.
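
To make the "cache hit ratio" use case concrete, here is a minimal sketch in
C of how a monitoring tool might combine a hit counter and a read counter
for one io_context. The function name and the -1.0 "no data" sentinel are
assumptions made for illustration; nothing like this is in the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: fraction of buffer accesses served from cache.
 * "hits" and "reads" stand in for the per-context hit and read counters
 * discussed in this thread. Returns -1.0 when there were no accesses, so
 * callers can tell "no data" apart from a 0% hit ratio.
 */
double
sketch_hit_ratio(uint64_t hits, uint64_t reads)
{
	uint64_t	total = hits + reads;

	if (total == 0)
		return -1.0;
	return (double) hits / (double) total;
}
```

Computing this per io_context rather than globally is what would keep a
large bulk read from skewing the ratio for ordinary shared-buffer access.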

> Additionally, some minor notes:
>
> - Since the stats are counting blocks, it would make sense to prefix the
> view columns with "blks_", and word them in the past tense (to match current
> style), i.e. "blks_written", "blks_read", "blks_extended", "blks_fsynced"
> (realistically one would combine this new view with other data e.g. from
> pg_stat_database or pg_stat_statements, which all use the "blks_" prefix,
> and stop using pg_stat_bgwriter for this which does not use such a prefix)

I have changed the column names to be in the past tense.

There are no columns equivalent to "dirty" or "misses" from the other
views containing information on buffer hits/block reads/writes/etc. I'm
not sure whether or not those make sense in this context.

Because we want to add non-block-oriented IO in the future (like
temporary file IO) to this view and want to use the same "read",
"written", "extended" columns, I would prefer not to prefix the columns
with "blks_". I have added a column "unit" which would contain the unit
in which read, written, and extended are expressed. Unfortunately, fsyncs
are not per block, so "unit" doesn't really work for them. I documented
this.

The most correct thing to do to accommodate block-oriented and
non-block-oriented IO would be to specify all the values in bytes.
However, I would like this view to be usable visually (as opposed to
just in scripts and by tools). The only current value of unit is
"block_size" which could potentially be combined with the value of the
GUC to get bytes.

I've hard-coded the string "block_size" into the view generation
function pg_stat_get_io(), so, if this idea makes sense, perhaps I
should do something better there.
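
As a rough illustration of combining a block-count column with the block
size to get bytes, here is a small C sketch. The function and macro names
are hypothetical, and 8192 is only the common default block size (BLCKSZ is
chosen at build time), so treat the constant as an assumption.

```c
#include <stdint.h>

/* Assumed block size in bytes; the real value is fixed when the server is built. */
#define SKETCH_BLCKSZ 8192

/*
 * Hypothetical conversion from a count of blocks, as the view would report
 * it, to a byte total that could be compared with non-block-oriented IO.
 */
uint64_t
sketch_blocks_to_bytes(uint64_t nblocks, uint32_t block_size)
{
	return nblocks * (uint64_t) block_size;
}
```

For example, 512 blocks at the assumed block size come to 4194304 bytes
(4 MB).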

> - "alloc" as a name doesn't seem intuitive (and it may be confused with
> memory allocations) - whilst this is already named this way in
> pg_stat_bgwriter, it feels like this is an opportunity to eventually
> deprecate the column there and make this easier to understand -
> specifically, maybe we can clarify that this means buffer *acquisitions*?
> (either by renaming the field to "blks_acquired", or clarifying in the
> documentation)

I have renamed it to acquired. It doesn't overlap completely with
buffers_alloc in pg_stat_bgwriter, so I didn't mention that in docs.

> - Assuming we think this view could realistically cover all I/O produced by
> Postgres in the future (thus warranting the name "pg_stat_io"), it may be
> best to have an explicit list of things that are not currently tracked in
> the documentation, to reduce user confusion (i.e. WAL writes are not
> tracked, temporary files are not tracked, and some forms of direct writes
> are not tracked, e.g. when a table moves to a different tablespace)

I have added this to the docs. The list is not exhaustive, so I would
love to get feedback on whether there are other specific examples of IO
that uses smgr* directly which users will wonder about and I should call
out.

> - In the view documentation, it would be good to explain the different values for "io_strategy" (and what they mean)

I have added this and would love feedback on my docs additions.

> - Overall it would be helpful if we had a dedicated documentation page on
> I/O statistics that's linked from the pg_stat_io view description, and
> explains how the I/O statistics tie into the various concepts of shared
> buffers / buffer access strategies / etc (and what is not tracked today)

I haven't done this yet. How specific were you thinking -- like
interpretations of all the combinations and what to do with what you
see? Like you should run pg_prewarm if you see X? Specific checkpointer
or bgwriter GUCs to change? Or just links to other docs pages on
recommended tunings?

Were you imagining the other IO statistics views (like
pg_statio_all_tables and pg_stat_database) also being included in this
page? Like would it be a comprehensive guide to IO statistics and what
their significance/purposes are?

- Melanie

[1] https://www.postgresql.org/message-id/20221002172404.xyzhftbedh4zpio2%40awork3.anarazel.de

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

07 October 2022, 01:23:53

v31 failed in CI, so
I've attached v32 which has a few issues fixed:
- addressed some compiler warnings I hadn't noticed locally
- autovac launcher and worker do indeed use bulkread strategy if they
  end up starting before critical indexes have loaded and end up doing a
  sequential scan of some catalog tables, so I have changed the
  restrictions on BackendTypes allowed to track IO Operations in
  IOCONTEXT_BULKREAD
- changed the name of the column "fsynced" to "files_synced" to make it
  more clear what unit it is in (and that the unit differs from that of
  the "unit" column)

In an off-list discussion with Andres, he mentioned that he thought
buffers reused by a BufferAccessStrategy should be split from buffers
"acquired" and that "acquired" should be renamed "clocksweeps".

I have started doing this, but for BufferAccessStrategy IO there are a
few choices about how we want to count the clocksweeps:

Currently the following situations are counted under the following
IOContexts and IOOps:

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_ACQUIRE
- reuse a buffer from the ring

IOCONTEXT_SHARED, IOOP_ACQUIRE
- add a buffer to the strategy ring initially
- add a new shared buffer to the ring when all the existing buffers in
  the ring are pinned

And in the new paradigm, I think these are two good options:

1)
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
- add a buffer to the strategy ring initially
- add a new shared buffer to the ring when all the existing buffers in
  the ring are pinned

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
- reuse a buffer from the ring

2)
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
- add a buffer to the strategy ring initially

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
- reuse a buffer from the ring

IOCONTEXT_SHARED, IOOP_CLOCKSWEEP
- add a new shared buffer to the ring when all the existing buffers in
  the ring are pinned

However, if we want to differentiate between buffers initially added to
the ring and buffers taken from shared buffers and added to the ring
because all strategy ring buffers are pinned or have a usage count above
one, then we would need to either do so inside of GetBufferFromRing() or
propagate this distinction out somehow (easy enough if we care to do
it).
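
The difference between the two options can be sketched in C as follows.
This is only an illustration under stated assumptions: the enum and struct
names are invented for the sketch (they merely echo the IOCONTEXT_* and
IOOP_* names used above), and the counter array is a stand-in for whatever
the patch actually uses.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the IOContext and IOOp values named above. */
typedef enum SketchIOContext
{
	SKETCH_IOCONTEXT_SHARED,
	SKETCH_IOCONTEXT_VACUUM,
	SKETCH_IOCONTEXT_BULKREAD,
	SKETCH_IOCONTEXT_BULKWRITE,
	SKETCH_NUM_IOCONTEXTS
} SketchIOContext;

typedef enum SketchIOOp
{
	SKETCH_IOOP_CLOCKSWEEP,
	SKETCH_IOOP_REUSE,
	SKETCH_NUM_IOOPS
} SketchIOOp;

/* The three situations the two options have to place somewhere. */
typedef enum SketchRingEvent
{
	SKETCH_RING_INITIAL_ADD,	/* buffer added to the ring initially */
	SKETCH_RING_REPLACE_PINNED,	/* new shared buffer added, ring buffers pinned */
	SKETCH_RING_REUSE			/* existing ring buffer reused */
} SketchRingEvent;

typedef struct SketchIOCounters
{
	unsigned long counts[SKETCH_NUM_IOCONTEXTS][SKETCH_NUM_IOOPS];
} SketchIOCounters;

/*
 * Count one ring event. With option2 == false (option 1) every clocksweep is
 * charged to the strategy context; with option2 == true, replacing a pinned
 * ring buffer is charged to the shared context instead.
 */
void
sketch_count_ring_event(SketchIOCounters *c, SketchIOContext strategy_ctx,
						SketchRingEvent ev, bool option2)
{
	switch (ev)
	{
		case SKETCH_RING_REUSE:
			c->counts[strategy_ctx][SKETCH_IOOP_REUSE]++;
			break;
		case SKETCH_RING_INITIAL_ADD:
			c->counts[strategy_ctx][SKETCH_IOOP_CLOCKSWEEP]++;
			break;
		case SKETCH_RING_REPLACE_PINNED:
			if (option2)
				c->counts[SKETCH_IOCONTEXT_SHARED][SKETCH_IOOP_CLOCKSWEEP]++;
			else
				c->counts[strategy_ctx][SKETCH_IOOP_CLOCKSWEEP]++;
			break;
	}
}
```

The only place the two options diverge is the pinned-replacement case, which
is also the case that needs the extra information from GetBufferFromRing().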

There are other combinations that I could come up with a justification
for as well, but I wanted to know what other people thought made sense
(and would make sense to users).

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

10 October 2022, 21:48:49

I've gone ahead and implemented option 1 (commented below).

On Thu, Oct 6, 2022 at 6:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> v31 failed in CI, so
> I've attached v32 which has a few issues fixed:
> - addressed some compiler warnings I hadn't noticed locally
> - autovac launcher and worker do indeed use bulkread strategy if they
>   end up starting before critical indexes have loaded and end up doing a
>   sequential scan of some catalog tables, so I have changed the
>   restrictions on BackendTypes allowed to track IO Operations in
>   IOCONTEXT_BULKREAD
> - changed the name of the column "fsynced" to "files_synced" to make it
>   more clear what unit it is in (and that the unit differs from that of
>   the "unit" column)
>
> In an off-list discussion with Andres, he mentioned that he thought
> buffers reused by a BufferAccessStrategy should be split from buffers
> "acquired" and that "acquired" should be renamed "clocksweeps".
>
> I have started doing this, but for BufferAccessStrategy IO there are a
> few choices about how we want to count the clocksweeps:
>
> Currently the following situations are counted under the following
> IOContexts and IOOps:
>
> IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_ACQUIRE
> - reuse a buffer from the ring
>
> IOCONTEXT_SHARED, IOOP_ACQUIRE
> - add a buffer to the strategy ring initially
> - add a new shared buffer to the ring when all the existing buffers in
>   the ring are pinned
>
> And in the new paradigm, I think these are two good options:
>
> 1)
> IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
> - add a buffer to the strategy ring initially
> - add a new shared buffer to the ring when all the existing buffers in
>   the ring are pinned
>
> IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
> - reuse a buffer from the ring
>

I've implemented this option in attached v33.

> 2)
> IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
> - add a buffer to the strategy ring initially
>
> IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
> - reuse a buffer from the ring
>
> IOCONTEXT_SHARED, IOOP_CLOCKSWEEP
> - add a new shared buffer to the ring when all the existing buffers in
>   the ring are pinned


- Melanie

v34 is attached.
I think the column names need discussion. Also, the docs need more work
(I added a lot of new content there). I could use feedback on the column
names and definitions and review/rephrasing ideas for the docs
additions.

On Mon, Oct 17, 2022 at 1:28 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> On Thu, Oct 13, 2022 at 10:29 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I think that it makes sense to count both the initial buffers added to
> > the ring and subsequent shared buffers added to the ring (either when
> > the current strategy buffer is pinned or in use or when a bulkread
> > rejects dirty strategy buffers in favor of new shared buffers) as
> > strategy clocksweeps because of how the statistic would be used.
> >
> > Clocksweeps give you an idea of how much of your working set is cached
> > (setting aside initially reading data into shared buffers when you are
> > warming up the db). You may use clocksweeps to determine if you need to
> > make shared buffers larger.
> >
> > Distinguishing strategy buffer clocksweeps from shared buffer
> > clocksweeps allows us to avoid enlarging shared buffers if most of the
> > clocksweeps are to bring in blocks for the strategy operation.
> >
> > However, I could see an argument that discounting strategy clocksweeps
> > done because the current strategy buffer is pinned makes the number of
> > shared buffer clocksweeps artificially low since those other queries
> > using the buffer would have suffered a cache miss were it not for the
> > strategy. And, in this case, you would take strategy clocksweeps
> > together with shared clocksweeps to make your decision. And if we
> > include buffers initially added to the strategy ring in the strategy
> > clocksweep statistic, this number may be off because those blocks may
> > not be needed in the main shared working set. But you won't know that
> > until you try to reuse the buffer and it is pinned. So, I think we don't
> > have a better option than counting initial buffers added to the ring as
> > strategy clocksweeps (as opposed to as reuses).
> >
> > So, in answer to your question, no, I cannot think of a scenario like
> > that.
>
> That analysis makes sense to me; thanks.

I have made some major changes in this area to make the columns more
useful. I have renamed and split "clocksweeps". It is now "evicted" and
"freelist acquired". This makes it clear when a block must be evicted
from a shared buffer and may help to identify misconfiguration of shared
buffers.

There is some nuance here that I tried to make clear in the docs.
"freelist acquired" in a shared context is straightforward.
"freelist acquired" in a strategy context is counted when a shared
buffer is added to the strategy ring (not when it is reused).

"freelist acquired" in the local buffer context is actually the initial
allocation of a local buffer (in contrast with reuse).

"evicted" in the shared IOContext is a block being evicted from a shared
buffer in order to reuse that buffer when not using a strategy.

"evicted" in a strategy IOContext is a block being evicted from
a shared buffer in order to add that shared buffer to the strategy ring.

This is in contrast with "reused" in a strategy IOContext which is when
an existing buffer in the strategy ring has a block evicted in order to
reuse that buffer in a strategy context.

"evicted" in a local IOContext is when an existing local buffer has a
block evicted in order to reuse that local buffer.

"freelist_acquired" is confusing for local buffers but I wanted to
distinguish between reuse/eviction of local buffers and initial
allocation. "freelist_acquired" seemed more fitting because there is a
clocksweep to find a local buffer and if it hasn't been allocated yet it
is allocated in a place similar to where shared buffers acquire a buffer
from the freelist. If I didn't count it here, I would need to make a new
column only for local buffers called "allocated" or something like that.

I chose not to call "evicted" "sb_evicted"
because then we would need a separate "local_evicted". I could instead
make "local_evicted", "sb_evicted", and rename "reused" to
"strat_evicted". If I did that we would end up with separate columns for
every IO Context describing behavior when a buffer is initially acquired
vs when it is reused.

It would look something like this:

shared buffers:
    initial: freelist_acquired
    reused: sb_evicted

local buffers:
    initial: allocated
    reused: local_evicted

strategy buffers:
    initial: sb_evicted | freelist_acquired
    reused: strat_evicted
    replaced: sb_evicted | freelist_acquired

This seems not too bad at first, but if you consider that later we will
add other kinds of IO -- eg WAL IO or temporary file IO, we won't be
able to use these existing columns and will need to add even more
columns describing the exact behavior in those cases.

I wanted to devise a paradigm which allowed for reuse of columns across
IOContexts even if with slightly different meanings.

I have also added the columns "repossessed" and "rejected". "rejected"
is when a bulkread rejects a strategy buffer because it is dirty and
requires flush. Seeing a lot of rejections could indicate you need to
vacuum. "repossessed" is the number of times a strategy buffer was
pinned or in use by another backend and had to be removed from the
strategy ring and replaced with a new shared buffer. This gives you some
indication that there is contention on blocks recently used by a
strategy.

I've also added some descriptions to the docs of how these columns might
be used or what a large value in one of them may mean.

I haven't added tests for repossessed or rejected yet. I can add tests
for repossessed if we decide to keep it. Rejected is hard to write a
test for because we can't guarantee checkpointer won't clean up the
buffer before we can reject it.

>
> > It also made me remember that I am incorrectly counting rejected buffers
> > as reused. I'm not sure if it is a good idea to subtract from reuses
> > when a buffer is rejected. Waiting until after it is rejected to count
> > the reuse will take some other code changes. Perhaps we could also count
> > rejections in the stats?
>
> I'm not sure what makes sense here.

I have fixed the counting of rejected and have made a new column
dedicated to rejected.

>
> > > From the io_context column description:
> > >
> > > +       The autovacuum daemon, explicit <command>VACUUM</command>, explicit
> > > +       <command>ANALYZE</command>, many bulk reads, and many bulk writes use a
> > > +       fixed amount of memory, acquiring the equivalent number of shared
> > > +       buffers and reusing them circularly to avoid occupying an undue portion
> > > +       of the main shared buffer pool.
> > > +      </para></entry>
> > >
> > > I don't understand how this is relevant to the io_context column.
> > > Could you expand on that, or am I just missing something obvious?
> > >
> >
> > I'm trying to explain why those other IO Contexts exist (bulkread,
> > bulkwrite, vacuum) and why they are separate from shared buffers.
> > Should I cut it altogether or preface it with something like: these are
> > counted separate from shared buffers because...?
>
> Oh I see. That makes sense; it just wasn't obvious to me this was
> talking about the last three values of io_context. I think a brief
> preface like that would be helpful (maybe explicitly with "these last
> three values", and I think "counted separately").

I've done this. Thanks for the suggested wording.

>
> > > +     <row>
> > > +      <entry role="catalog_table_entry"><para role="column_definition">
> > > +       <structfield>extended</structfield> <type>bigint</type>
> > > +      </para>
> > > +      <para>
> > > +       Extends of relations done by this <varname>backend_type</varname> in
> > > +       order to write data in this <varname>io_context</varname>.
> > > +      </para></entry>
> > > +     </row>
> > >
> > > I understand what this is, but not why this is something I might want
> > > to know about.
> >
> > Unlike writes, backends largely have to do their own extends, so
> > separating this from writes lets us determine whether or not we need to
> > change checkpointer/bgwriter to be more aggressive using the writes
> > without the distraction of the extends. Should I mention this in the
> > docs? The other stats views don't seem to editorialize at all, and I
> > wasn't sure if this was an objective enough point to include in docs.
>
> Thanks for the clarification. Just to make sure I understand, you mean
> that if I see a high extended count, that may be interesting in terms
> of write activity, but I can't fix that by tuning--it's just the
> nature of my workload?

That is correct.

>
> > > That seems broadly reasonable, but pg_settings also has a 'unit'
> > > field, and in that view, unit is '8kB' on my system--i.e., it
> > > (presumably) reflects the block size. Is that something we should try
> > > to be consistent with (not sure if that's a good idea, but thought it
> > > was worth asking)?
> > >
> >
> > I think this idea is a good option. I am wondering if it would be clear
> > when mixed with non-block-oriented IO. Block-oriented IO would say 8kB
> > (or whatever the build-time value of a block was) and non-block-oriented
> > IO would say B or kB. The math would work out.
>
> Right, yeah. Although maybe that's a little confusing? When you
> originally added "unit", you had said:
>
> >The most correct thing to do to accommodate block-oriented and
> >non-block-oriented IO would be to specify all the values in bytes.
> >However, I would like this view to be usable visually (as opposed to
> >just in scripts and by tools). The only current value of unit is
> >"block_size" which could potentially be combined with the value of the
> >GUC to get bytes.
>
> Is this still usable visually if you have to compare values across
> units? I don't really have any great ideas here (and maybe this is
> still the best option), just pointing it out.
>
> > Looking at pg_settings now though, I am confused about
> > how the units for wal_buffers is 8kB but then the value of wal_buffers
> > when I show it in psql is "16MB"...
>
> You mean the difference between
>
> maciek=# select setting, unit from pg_settings where name = 'wal_buffers';
>  setting | unit
> ---------+------
>  512     | 8kB
> (1 row)
>
> and
>
> maciek=# show wal_buffers;
>  wal_buffers
> -------------
>  4MB
> (1 row)
>
> ?
>
> Poking around, I think it looks like that's due to
> convert_int_from_base_unit (indirectly called from SHOW /
> current_setting):
>
> /*
>  * Convert an integer value in some base unit to a human-friendly unit.
>  *
>  * The output unit is chosen so that it's the greatest unit that can represent
>  * the value without loss.  For example, if the base unit is GUC_UNIT_KB, 1024
>  * is converted to 1 MB, but 1025 is represented as 1025 kB.
>  */

I've implemented a change using the same function pg_settings uses to
turn the build-time parameter BLCKSZ into 8kB (get_config_unit_name())
using the flag GUC_UNIT_BLOCKS. I am unsure if this is better or worse
than "block_size". I am feeling very conflicted about this column.
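
For readers following the wal_buffers example, the "greatest unit without
loss" rule from the quoted comment can be sketched in C like this. It is a
simplified illustration, not the guc.c implementation: the names are
invented, and it only handles kilobyte-based values with a fixed factor of
1024 between units.

```c
#include <stdint.h>

/* Units considered by this sketch, smallest first. */
typedef enum SketchUnit
{
	SKETCH_UNIT_KB,
	SKETCH_UNIT_MB,
	SKETCH_UNIT_GB,
	SKETCH_UNIT_TB
} SketchUnit;

/*
 * Given a value in kB, pick the largest unit that represents it exactly and
 * store the value expressed in that unit in *scaled.
 */
SketchUnit
sketch_pick_unit(int64_t value_kb, int64_t *scaled)
{
	SketchUnit	unit = SKETCH_UNIT_KB;

	/* Step up one unit at a time, but only while nothing would be lost. */
	while (unit < SKETCH_UNIT_TB && value_kb != 0 && value_kb % 1024 == 0)
	{
		value_kb /= 1024;
		unit = (SketchUnit) (unit + 1);
	}
	*scaled = value_kb;
	return unit;
}
```

With the numbers from the pg_settings output above, a setting of 512 in
units of 8kB is 4096 kB, which this rule reports as 4 MB, matching what SHOW
printed.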

>
> > Though the units for the pg_stat_io view for block-oriented IO would be
> > the build-time values for block size, so it wouldn't line up exactly
> > with pg_settings.
>
> I don't follow--what would be the discrepancy?

I got confused.
You are right -- pg_settings does seem to use the build-time value of
BLCKSZ to derive this. I was confused because the description of
pg_settings says:

"The view pg_settings provides access to run-time parameters of the server."

- Melanie

v35 is attached

On Mon, Oct 24, 2022 at 2:38 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Oct 20, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote:
> >   I wonder if we should add a "source" output argument to
> >   StrategyGetBuffer(). Then nearly all the counting can happen in
> >   BufferAlloc().
>
> I think we can just check for BM_VALID being set before invalidating it
> in order to claim the buffer at the end of BufferAlloc(). Then we can
> count it as an eviction or reuse.

Done this in attached version
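
A minimal C sketch of the BM_VALID idea, for illustration only. The flag
value, the enum, and the function are hypothetical (the real BM_VALID is
defined in the server's buffer headers and is not reproduced here), and the
from_ring parameter is an assumption about how a caller would know the
buffer came from a strategy ring.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bit standing in for BM_VALID. */
#define SKETCH_BM_VALID (1U << 0)

typedef enum SketchClaimKind
{
	SKETCH_CLAIM_UNUSED,		/* buffer held no valid block, nothing displaced */
	SKETCH_CLAIM_EVICTED,		/* valid block displaced from a shared buffer */
	SKETCH_CLAIM_REUSED			/* valid block displaced from a ring buffer */
} SketchClaimKind;

/*
 * Classify a buffer claim by looking at the flags the buffer had before it
 * was invalidated. Only a buffer that held a valid block counts as an
 * eviction or a reuse.
 */
SketchClaimKind
sketch_classify_claim(uint32_t old_flags, bool from_ring)
{
	if ((old_flags & SKETCH_BM_VALID) == 0)
		return SKETCH_CLAIM_UNUSED;
	return from_ring ? SKETCH_CLAIM_REUSED : SKETCH_CLAIM_EVICTED;
}
```

The flags have to be inspected before the buffer is invalidated, since the
valid bit is exactly what invalidation clears.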

>
> > On 2022-10-19 15:26:51 -0400, Melanie Plageman wrote:
> > > I have made some major changes in this area to make the columns more
> > > useful. I have renamed and split "clocksweeps". It is now "evicted" and
> > > "freelist acquired". This makes it clear when a block must be evicted
> > > from a shared buffer and may help to identify misconfiguration of
> > > shared buffers.
> >
> > I'm not sure freelist acquired is really that useful? If we don't add it, we
> > should however definitely not count buffers from the freelist as evictions.
> >
> >
> > > There is some nuance here that I tried to make clear in the docs.
> > > "freelist acquired" in a shared context is straightforward.
> > > "freelist acquired" in a strategy context is counted when a shared
> > > buffer is added to the strategy ring (not when it is reused).
> >
> > Not sure what the second half here means - why would a buffer that's not from
> > the freelist ever be counted as being from the freelist?
> >
> >
> > > "freelist_acquired" is confusing for local buffers but I wanted to
> > > distinguish between reuse/eviction of local buffers and initial
> > > allocation. "freelist_acquired" seemed more fitting because there is a
> > > clocksweep to find a local buffer and if it hasn't been allocated yet it
> > > is allocated in a place similar to where shared buffers acquire a buffer
> > > from the freelist. If I didn't count it here, I would need to make a new
> > > column only for local buffers called "allocated" or something like that.
> >
> > I think you're making this too granular. We need to have more detail than
> > today. But we don't necessarily need to catch every nuance.

I cut freelist_acquired in attached version.

> I am fine with cutting freelist_acquired. The same actionable
> information that it could provide could be provided by "read", right?
> Also, removing it means I can remove the complicated explanation of how
> freelist_acquired should be interpreted in IOCONTEXT_LOCAL.
>
> Speaking of IOCONTEXT_LOCAL, I was wondering if it is confusing to call
> it IOCONTEXT_LOCAL since it refers to IO done for temporary tables. What
> if, in the future, we want to track other IO done using data in local
> memory? Also, what if we want to track other IO done using data from
> shared memory that is not in shared buffers? Would IOCONTEXT_SB and
> IOCONTEXT_TEMP be better? Should IOContext literally describe the
> context of the IO being done and there be a separate column which
> indicates the source of the data for the IO?
> Like wal_buffer, local_buffer, shared_buffer? Then if it is not
> block-oriented, it could be shared_mem, local_mem, or bypass?

pg_stat_statements uses local_blks_read and temp_blks_read for local
buffers for temp tables and temp file IO respectively -- so perhaps we
should stick to that

Other updates in this version:

I've also updated the unit column to bytes_conversion.

I've made quite a few updates to the docs including more information
on overlaps between pg_stat_database, pg_statio_*, and
pg_stat_statements.

Let me know if there are other configuration tip resources from the
existing docs that I could link in the column "files_synced".

I still need to look at the docs with fresh eyes and do another round of
cleanup (probably).

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

October 26, 2022, 20:54:44

Okay, so I realized v35 had an issue where I wasn't counting strategy
evictions correctly. Fixed in the attached v36. This made me wonder if
there is actually a way to add a test for evictions (in strategy and
shared contexts) that is not flaky.

On Sun, Oct 23, 2022 at 6:48 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> On Thu, Oct 20, 2022 at 10:31 AM Andres Freund <andres@anarazel.de> wrote:
> > - "repossession" is a very unintuitive name for me. If we want something like
> >   it, can't we just name it reuse_failed or such?
>
> +1, I think "repossessed" is awkward. I think "reuse_failed" works,
> but no strong opinions on an alternate name.

Also, re: repossessed, I can change it to reuse_failed but I do think it
is important to give users a way to distinguish between bulkread
rejections of dirty buffers and strategies failing to reuse buffers due
to concurrent pinning (since the reaction to these two scenarios would
likely be different).

If we added another column called something like "claim_failed" which
counts buffers which we failed to reuse because of concurrent pinning or
usage, we could recommend use of this column together with
"reuse_failed" to determine the cause of the failed reuses for a
bulkread. We could also use "claim_failed" in IOContext shared to
provide information on shared buffer contention.

- Melanie

v37 attached

On Sun, Oct 30, 2022 at 9:09 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> On Wed, Oct 26, 2022 at 10:55 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>
> + The <structname>pg_statio_</structname> and
> + <structname>pg_stat_io</structname> views are primarily useful to determine
> + the effectiveness of the buffer cache. When the number of actual disk reads
>
> Totally nitpicking, but this reads a little funny to me. Previously
> the trailing underscore suggested this is a group, and now with
> pg_stat_io itself added (stupid question: should this be
> "pg_statio"?), it sounds like we're talking about two views:
> pg_stat_io and "pg_statio_". Maybe something like "The pg_stat_io view
> and the pg_statio_ set of views are primarily..."?

I decided not to call it pg_statio because all of the other stats views
have an underscore after stat and I thought it was an opportunity to be
consistent with them.

> + by that backend type in that IO context. Currently only a subset of IO
> + operations are tracked here. WAL IO, IO on temporary files, and some forms
> + of IO outside of shared buffers (such as when building indexes or moving a
> + table from one tablespace to another) could be added in the future.
>
> Again nitpicking, but should this be "may be added"? I think "could"
> suggests the possibility of implementation, whereas "may" feels more
> like a hint as to how the feature could evolve.

I've adopted the wording you suggested.

> + portion of the main shared buffer pool. This pattern is called a
> + <quote>Buffer Access Strategy</quote> in the
> + <productname>PostgreSQL</productname> source code and the fixed-size
> + ring buffer is referred to as a <quote>strategy ring buffer</quote> for
> + the purposes of this view's documentation.
> + </para></entry>
>
> Nice, I think this explanation is very helpful. You also use the term
> "strategy context" and "strategy operation" below. I think it's fairly
> obvious what those mean, but pointing it out in case we want to note
> that here, too.

Thanks! I've added definitions of those as well.

> + <varname>read</varname> and <varname>extended</varname> for
>
> Maybe "plus" instead of "and" here for clarity (I'm assuming that's
> what the "and" means)?

Modified this -- in some cases by adding the lists mentioned below

> + <varname>backend_type</varname>s <literal>autovacuum launcher</literal>,
> + <literal>autovacuum worker</literal>, <literal>client backend</literal>,
> + <literal>standalone backend</literal>, <literal>background
> + worker</literal>, and <literal>walsender</literal> for all
> + <varname>io_context</varname>s is similar to the sum of
>
> I'm reviewing the rendered docs now, and I noticed sentences like this
> are a bit hard to scan: they force the reader to parse a big list of
> backend types before even getting to the meat of what this is talking
> about. Should we maybe reword this so that the backend list comes at
> the end of the sentence? Or maybe even use a list (e.g., like in the
> "state" column description in pg_stat_activity)?

Good idea with the bullet points.
For the lengthy lists, I've added bullet point lists to the docs for
several of the columns. It is quite long now but, hopefully, clearer?
Let me know if you think it improves the readability.

> + <varname>heap_blks_read</varname>, <varname>idx_blks_read</varname>,
> + <varname>tidx_blks_read</varname>, and
> + <varname>toast_blks_read</varname> in <link
> + linkend="monitoring-pg-statio-all-tables-view">
> + <structname>pg_statio_all_tables</structname></link>. and
> + <varname>blks_read</varname> from <link
>
> I think that's a stray period before the "and."

Fixed!

> + Normal client backends should be able to rely on maintenance processes
> + like the checkpointer and background writer to write out dirty data as
>
> Nice--it's great to see this mentioned. But I think these are
> generally referred to as "auxiliary" not "maintenance" processes, no?

Thanks! Fixed.

> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>bytes_conversion</structfield> <type>bigint</type>
> + </para>
>
> I think this general approach works (instead of unit). I'm not wild
> about the name, but I don't really have a better suggestion. Maybe
> "op_bytes" (since each cell is counting the number of I/O operations)?
> But I think bytes_conversion is okay.

I really like op_bytes and have changed it to this. Thanks for the
suggestion!

> Also, is this (in the middle of the table) the right place for this
> column? I would have expected to see it before or after all the actual
> I/O op cells.

I put it after read, write, and extend columns because it applies to
them. It doesn't apply to files_synced. For reused and evicted, I didn't
think bytes reused and evicted made sense. Also, when we add non-block
oriented IO, reused and evicted won't be used but op_bytes will be. So I
thought it made more sense to place it after the operations it applies
to.
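
To make the role of op_bytes concrete, here is a small sketch. It assumes op_bytes holds the size in bytes of one counted operation (for block-oriented IO, the block size) and that it applies only to the read, write, and extend counts, as described above. The struct and function names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical in-memory picture of one pg_stat_io row. */
typedef struct SketchIORow
{
    int64_t     reads;          /* number of read operations */
    int64_t     writes;         /* number of write operations */
    int64_t     extends;        /* number of extend operations */
    int64_t     op_bytes;       /* assumed: bytes per counted operation */
} SketchIORow;

/* Convert the operation counts that op_bytes applies to into bytes. */
static int64_t
sketch_bytes_transferred(const SketchIORow *row)
{
    return (row->reads + row->writes + row->extends) * row->op_bytes;
}
```

Counts such as reused, evicted, and files_synced are deliberately left out of the conversion, which matches the column placement argued for here.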

> + <varname>io_context</varname>s. When a <quote>Buffer Access
> + Strategy</quote> reuses a buffer in the strategy ring, it must evict its
> + contents, incrementing <varname>reused</varname>. When a <quote>Buffer
> + Access Strategy</quote> adds a new shared buffer to the strategy ring
> + and this shared buffer is occupied, the <quote>Buffer Access
> + Strategy</quote> must evict the contents of the shared buffer,
> + incrementing <varname>evicted</varname>.
>
> I think the parallel phrasing here makes this a little hard to follow.
> Specifically, I think "must evict its contents" for the strategy case
> sounds like a bad thing, but in fact this is a totally normal thing
> that happens as part of strategy access, no? The idea is you probably
> won't need that buffer again, so it's fine to evict it. I'm not sure
> how to reword, but I think the current phrasing is misleading.

I had trouble rephrasing this. I changed a few words. I see what you
mean. It is worth noting that reusing strategy buffers when there are
buffers on the freelist may not be the best behavior, so I wouldn't
necessarily consider "reused" a good thing. However, I'm not sure how
much the user could really do about this. I would at least like this
phrasing to be clear (evicted is for shared buffers, reused is for
strategy buffers), so, perhaps this section requires more work.

> + The number of times a <literal>bulkread</literal> found the current
> + buffer in the fixed-size strategy ring dirty and requiring flush.
>
> Maybe "...found ... to be dirty..."?

Changed to this wording.

> + frequent vacuuming or more aggressive autovacuum settings, as buffers are
> + dirtied during a bulkread operation when updating the hint bit or when
> + performing on-access pruning.
>
> Are there docs to cross-reference here, especially for pruning? I
> couldn't find much except a few un-explained mentions in the page
> layout docs [2], and most of the search results refer to partition
> pruning. Searching for hint bits at least gives some info in blog
> posts and the wiki.

yes, I don't see anything explaining this either -- below the page
layout it discusses tuple layout but that doesn't mention hint bits.

> + again. A high number of repossessions is a sign of contention for the
> + blocks operated on by the strategy operation.
>
> This (and in general the repossession description) makes sense, but
> I'm not sure what to do with the information. Maybe Andres is right
> that we could skip this in the first version?

I've removed repossessed and rejected in attached v37. I am a bit sad
about this because I don't see a good way forward and I think those
could be useful for users.

I have added the new column Andres recommended in [1] ("io_object") to
clarify temp and local buffers and pave the way for bypass IO (IO not
done through a buffer pool), which can be done on temp or permanent
files for temp or permanent relations, and spill file IO which is done
on temporary files but isn't related to temporary tables.

IOObject has increased the memory footprint and complexity of the code
around tracking and accumulating the statistics, though it has not
increased the number of rows in the view.

One question I still have about this additional dimension is how much
enumeration we need of the various combinations of IO operations, IO
objects, IO ops, and backend types which are allowed and not allowed.
Currently because it is only valid to operate on both IOOBJECT_RELATION
and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the
various functions asserting and validating what is "allowed" in terms of
combinations of ops, objects, contexts, and backend types aren't much
different than they were without IO Object. However, once we begin
adding other objects and contexts, we will need to make this logic more
comprehensive. I'm not sure whether or not I should do that
preemptively.

> On Mon, Oct 24, 2022 at 12:39 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > > I don't quite follow this: does this mean that I should expect
> > > 'reused' and 'evicted' to be equal in the 'shared' context, because
> > > they represent the same thing? Or will 'reused' just be null because
> > > it's not distinct from 'evicted'? It looks like it's null right now,
> > > but I find the wording here confusing.
> >
> > You should only see evictions when the strategy evicts shared buffers
> > and reuses when the strategy evicts existing strategy buffers.
> >
> > How about this instead in the docs?
> >
> > the number of times an existing buffer in the strategy ring was reused
> > as part of an operation in the <literal>bulkread</literal>,
> > <literal>bulkwrite</literal>, or <literal>vacuum</literal>
> > <varname>io_context</varname>s. when a buffer access strategy
> > <quote>reuses</quote> a buffer in the strategy ring, it must evict its
> > contents, incrementing <varname>reused</varname>. when a buffer access
> > strategy adds a new shared buffer to the strategy ring and this shared
> > buffer is occupied, the buffer access strategy must evict the contents
> > of the shared buffer, incrementing <varname>evicted</varname>.
>
> It looks like you ended up with different wording in the patch, but
> both this explanation and what's in the patch now make sense to me.
> Thanks for clarifying.

Yes, I tried to rework it, and your suggestions and feedback were very
helpful.

> Also, I noticed that the commit message explains missing rows for some
> backend_type / io_context combinations and NULL (versus 0) in some
> cells, but the docs don't really talk about that. Do you think that
> should be in there as well?

Thanks for pointing this out. I have added notes about this to the
relevant columns in the docs.

- Melanie

[1] https://www.postgresql.org/message-id/20221026185808.4qnxowtn35x43u7u%40awork3.anarazel.de

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Maciek Sakrejda

Date:

November 7, 2022, 21:26:06

On Thu, Nov 3, 2022 at 10:00 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> I decided not to call it pg_statio because all of the other stats views
> have an underscore after stat and I thought it was an opportunity to be
> consistent with them.

Oh, got it. Makes sense.

> > I'm reviewing the rendered docs now, and I noticed sentences like this
> > are a bit hard to scan: they force the reader to parse a big list of
> > backend types before even getting to the meat of what this is talking
> > about. Should we maybe reword this so that the backend list comes at
> > the end of the sentence? Or maybe even use a list (e.g., like in the
> > "state" column description in pg_stat_activity)?
>
> Good idea with the bullet points.
> For the lengthy lists, I've added bullet point lists to the docs for
> several of the columns. It is quite long now but, hopefully, clearer?
> Let me know if you think it improves the readability.

Hmm, I should have tried this before suggesting it. I think the lists
break up the flow of the column description too much. What do you
think about the attached (on top of your patches--attaching it as a
.diff to hopefully not confuse cfbot)? I kept the lists for backend
types but inlined the others as a middle ground. I also added a few
omitted periods and reworded "read plus extended" to avoid starting
the sentence with a (lowercase) varname (I think in general it's fine
to do that, but the more complicated sentence structure here makes it
easier to follow if the sentence starts with a capital).

Alternately, what do you think about pulling equivalencies to existing
views out of the main column descriptions, and adding them after the
main table as a sort of footnote? Most view docs don't have anything
like that, but pg_stat_replication does and it might be a good pattern
to follow.

Thoughts?

> > Also, is this (in the middle of the table) the right place for this
> > column? I would have expected to see it before or after all the actual
> > I/O op cells.
>
> I put it after read, write, and extend columns because it applies to
> them. It doesn't apply to files_synced. For reused and evicted, I didn't
> think bytes reused and evicted made sense. Also, when we add non-block
> oriented IO, reused and evicted won't be used but op_bytes will be. So I
> thought it made more sense to place it after the operations it applies
> to.

Got it, makes sense.

> > + <varname>io_context</varname>s. When a <quote>Buffer Access
> > + Strategy</quote> reuses a buffer in the strategy ring, it must evict its
> > + contents, incrementing <varname>reused</varname>. When a <quote>Buffer
> > + Access Strategy</quote> adds a new shared buffer to the strategy ring
> > + and this shared buffer is occupied, the <quote>Buffer Access
> > + Strategy</quote> must evict the contents of the shared buffer,
> > + incrementing <varname>evicted</varname>.
> >
> > I think the parallel phrasing here makes this a little hard to follow.
> > Specifically, I think "must evict its contents" for the strategy case
> > sounds like a bad thing, but in fact this is a totally normal thing
> > that happens as part of strategy access, no? The idea is you probably
> > won't need that buffer again, so it's fine to evict it. I'm not sure
> > how to reword, but I think the current phrasing is misleading.
>
> I had trouble rephrasing this. I changed a few words. I see what you
> mean. It is worth noting that reusing strategy buffers when there are
> buffers on the freelist may not be the best behavior, so I wouldn't
> necessarily consider "reused" a good thing. However, I'm not sure how
> much the user could really do about this. I would at least like this
> phrasing to be clear (evicted is for shared buffers, reused is for
> strategy buffers), so, perhaps this section requires more work.

Oh, I see. I think the updated wording works better. Although I think
we can drop the quotes around "Buffer Access Strategy" here. They're
useful when defining the term originally, but after that I think it's
clearer to use the term unquoted.

Just to understand this better myself, though: can you clarify when
"reused" is not a normal, expected part of the strategy execution? I
was under the impression that a ring buffer is used because each page
is needed only "once" (i.e., for one set of operations) for the
command using the strategy ring buffer. Naively, in that situation, it
seems better to reuse a no-longer-needed buffer than to claim another
buffer from the freelist (where other commands may eventually make
better use of it).

> > + again. A high number of repossessions is a sign of contention for the
> > + blocks operated on by the strategy operation.
> >
> > This (and in general the repossession description) makes sense, but
> > I'm not sure what to do with the information. Maybe Andres is right
> > that we could skip this in the first version?
>
> I've removed repossessed and rejected in attached v37. I am a bit sad
> about this because I don't see a good way forward and I think those
> could be useful for users.

I can see that, but I think as long as we're not doing anything to
preclude adding this in the future, it's better to get something out
there and expand it later. For what it's worth, I don't feel it needs
to be excluded, just that it's not worth getting hung up on.

> I have added the new column Andres recommended in [1] ("io_object") to
> clarify temp and local buffers and pave the way for bypass IO (IO not
> done through a buffer pool), which can be done on temp or permanent
> files for temp or permanent relations, and spill file IO which is done
> on temporary files but isn't related to temporary tables.
>
> IOObject has increased the memory footprint and complexity of the code
> around tracking and accumulating the statistics, though it has not
> increased the number of rows in the view.
>
> One question I still have about this additional dimension is how much
> enumeration we need of the various combinations of IO operations, IO
> objects, IO ops, and backend types which are allowed and not allowed.
> Currently because it is only valid to operate on both IOOBJECT_RELATION
> and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the
> various functions asserting and validating what is "allowed" in terms of
> combinations of ops, objects, contexts, and backend types aren't much
> different than they were without IO Object. However, once we begin
> adding other objects and contexts, we will need to make this logic more
> comprehensive. I'm not sure whether or not I should do that
> preemptively.

It's definitely something to consider, but I have no useful input here.

Some more notes on the docs patch:

+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context or location of an IO operation.
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>buffer pool</literal> refers to
+ IO operations on data in both the shared buffer pool and process-local
+ buffer pools used for temporary relation data.
+ </para>
+ <para>
+ Operations on temporary relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+ <para>
+ Operations on permanent relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>relation</literal>.
+ </para>
+ </listitem>

For this column, you repeat "io_context" in the list describing the
possible values of the column. Enum-style columns in other tables
don't do that (e.g., the pg_stat_activity "state" column). I think it
might read better to omit "io_context" from the list.

+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Object operated on in a given <varname>io_context</varname> by a given
+ <varname>backend_type</varname>.
+ </para>

Is this a fixed set of objects we should list, like for io_context?

Thanks,
Maciek

Attachments

v37-pg_stat_io-delta.diff

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

November 21, 2022, 03:38:15

Hi,


One good follow up patch will be to rip out the accounting for
pg_stat_bgwriter's buffers_backend, buffers_backend_fsync and perhaps
buffers_alloc and replace it with a subselect getting the equivalent data from
pg_stat_io.  It might not be quite worth doing for buffers_alloc because of
the way that's tied into bgwriter pacing.


On 2022-11-03 13:00:24 -0400, Melanie Plageman wrote:
> > + again. A high number of repossessions is a sign of contention for the
> > + blocks operated on by the strategy operation.
> >
> > This (and in general the repossession description) makes sense, but
> > I'm not sure what to do with the information. Maybe Andres is right
> > that we could skip this in the first version?
>
> I've removed repossessed and rejected in attached v37. I am a bit sad
> about this because I don't see a good way forward and I think those
> could be useful for users.

Let's get the basic patch in and then check whether we can find a way to have
something providing at least some more information like repossessed and
rejected. I think it'll be easier to analyze in isolation.


> I have added the new column Andres recommended in [1] ("io_object") to
> clarify temp and local buffers and pave the way for bypass IO (IO not
> done through a buffer pool), which can be done on temp or permanent
> files for temp or permanent relations, and spill file IO which is done
> on temporary files but isn't related to temporary tables.

> IOObject has increased the memory footprint and complexity of the code
> around tracking and accumulating the statistics, though it has not
> increased the number of rows in the view.

It doesn't look too bad from here. Is there a specific portion of the code
where it concerns you the most?


> One question I still have about this additional dimension is how much
> enumeration we need of the various combinations of IO operations, IO
> objects, IO ops, and backend types which are allowed and not allowed.
>
> Currently because it is only valid to operate on both IOOBJECT_RELATION
> and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the
> various functions asserting and validating what is "allowed" in terms of
> combinations of ops, objects, contexts, and backend types aren't much
> different than they were without IO Object. However, once we begin
> adding other objects and contexts, we will need to make this logic more
> comprehensive. I'm not sure whether or not I should do that
> preemptively.

I'd not do it preemptively.



> @@ -833,6 +836,22 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>
>      isExtend = (blockNum == P_NEW);
>
> +    if (isLocalBuf)
> +    {
> +        /*
> +         * Though a strategy object may be passed in, no strategy is employed
> +         * when using local buffers. This could happen when doing, for example,
> +         * CREATE TEMPORARY TABLE AS ...
> +         */
> +        io_context = IOCONTEXT_BUFFER_POOL;
> +        io_object = IOOBJECT_TEMP_RELATION;
> +    }
> +    else
> +    {
> +        io_context = IOContextForStrategy(strategy);
> +        io_object = IOOBJECT_RELATION;
> +    }

I think given how frequently ReadBuffer_common() is called in some workloads,
it'd be good to make IOContextForStrategy inlinable. But I guess that's not
easily doable, because struct BufferAccessStrategyData is only defined in
freelist.c.

Could we defer this until later, given that we don't currently need this in
case of buffer hits afaict?
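
A rough sketch of the deferral idea, under the assumption that the IO context is only needed once a buffer miss forces actual IO. All names here (DeferStrategy, defer_io_context_for_strategy, defer_read_buffer) are hypothetical stand-ins and do not mirror the real ReadBuffer_common() code path.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum DeferIOContext
{
    DEFER_CTX_BUFFER_POOL,
    DEFER_CTX_BULKREAD
} DeferIOContext;

/* Stand-in for an opaque strategy object. */
typedef struct DeferStrategy
{
    DeferIOContext context;
} DeferStrategy;

/* Counts lookups so the effect of deferring is observable. */
static int  defer_context_lookups = 0;

static DeferIOContext
defer_io_context_for_strategy(const DeferStrategy *strategy)
{
    defer_context_lookups++;
    return strategy != NULL ? strategy->context : DEFER_CTX_BUFFER_POOL;
}

/* Returns true if IO was needed; the context is resolved only in that case. */
static bool
defer_read_buffer(const DeferStrategy *strategy, bool hit,
                  DeferIOContext *io_context)
{
    if (hit)
        return false;           /* fast path: no context lookup at all */

    *io_context = defer_io_context_for_strategy(strategy);
    return true;
}
```

With this shape, a workload dominated by buffer hits never pays for the lookup, which is the concern raised above.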


> @@ -1121,6 +1144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>              BufferAccessStrategy strategy,
>              bool *foundPtr)
>  {
> +    bool        from_ring;
> +    IOContext    io_context;
>      BufferTag    newTag;            /* identity of requested block */
>      uint32        newHash;        /* hash value for newTag */
>      LWLock       *newPartitionLock;    /* buffer partition lock for it */
> @@ -1187,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>       */
>      LWLockRelease(newPartitionLock);
>
> +    io_context = IOContextForStrategy(strategy);

Hm - doesn't this mean we do IOContextForStrategy() twice? Once in
ReadBuffer_common() and then again here?


>      /* Loop here in case we have to try another victim buffer */
>      for (;;)
>      {
> +
>          /*
>           * Ensure, while the spinlock's not yet held, that there's a free
>           * refcount entry.
> @@ -1200,7 +1228,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>           * Select a victim buffer.  The buffer is returned with its header
>           * spinlock still held!
>           */
> -        buf = StrategyGetBuffer(strategy, &buf_state);
> +        buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
>
>          Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
>

I think patch 0001 relies on this change already having been made, if I am not misunderstanding?


> @@ -1263,13 +1291,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>                      }
>                  }
>
> +                /*
> +                 * When a strategy is in use, only flushes of dirty buffers
> +                 * already in the strategy ring are counted as strategy writes
> +                 * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
> +                 * purpose of IO operation statistics tracking.
> +                 *
> +                 * If a shared buffer initially added to the ring must be
> +                 * flushed before being used, this is counted as an
> +                 * IOCONTEXT_BUFFER_POOL IOOP_WRITE.
> +                 *
> +                 * If a shared buffer added to the ring later because the

Missing word?


> +                 * current strategy buffer is pinned or in use or because all
> +                 * strategy buffers were dirty and rejected (for BAS_BULKREAD
> +                 * operations only) requires flushing, this is counted as an
> +                 * IOCONTEXT_BUFFER_POOL IOOP_WRITE (from_ring will be false).

I think this makes sense for now, but it'd be good if somebody else could
chime in on this...

> +                 *
> +                 * When a strategy is not in use, the write can only be a
> +                 * "regular" write of a dirty shared buffer (IOCONTEXT_BUFFER_POOL
> +                 * IOOP_WRITE).
> +                 */
> +
>                  /* OK, do the I/O */
>                  TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
>                                                            smgr->smgr_rlocator.locator.spcOid,
>                                                            smgr->smgr_rlocator.locator.dbOid,
>                                                            smgr->smgr_rlocator.locator.relNumber);
>
> -                FlushBuffer(buf, NULL);
> +                FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION);
>                  LWLockRelease(BufferDescriptorGetContentLock(buf));
>                  ScheduleBufferTagForWriteback(&BackendWritebackContext,



> +    if (oldFlags & BM_VALID)
> +    {
> +        /*
> +        * When a BufferAccessStrategy is in use, evictions adding a
> +        * shared buffer to the strategy ring are counted in the
> +        * corresponding strategy's context.

Perhaps "adding a shared buffer to the ring are counted in the corresponding
context"? "strategy's context" sounds off to me.


> This includes the evictions
> +        * done to add buffers to the ring initially as well as those
> +        * done to add a new shared buffer to the ring when current
> +        * buffer is pinned or otherwise in use.

I think this sentence could use a few commas, but not sure.

s/current/the current/?



> +        * We wait until this point to count reuses and evictions in order to
> +        * avoid incorrectly counting a buffer as reused or evicted when it was
> +        * released because it was concurrently pinned or in use or counting it
> +        * as reused when it was rejected or when we errored out.
> +        */

I can't quite parse this sentence.


> +        IOOp io_op = from_ring ? IOOP_REUSE : IOOP_EVICT;
> +
> +        pgstat_count_io_op(io_op, IOOBJECT_RELATION, io_context);
> +    }

I'd just inline the variable, but ...


> @@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>                  LocalRefCount[b]++;
>                  ResourceOwnerRememberBuffer(CurrentResourceOwner,
>                                              BufferDescriptorGetBuffer(bufHdr));
> +
>                  break;
>              }
>          }

Spurious change.


>      pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
>
>      *foundPtr = false;
> +
>      return bufHdr;
>  }

Ditto.



> +/*
> +* IO Operation statistics are not collected for all BackendTypes.
> +*
> +* The following BackendTypes do not participate in the cumulative stats
> +* subsystem or do not do IO operations worth reporting statistics on:

s/worth reporting/we currently report/?


> +    /*
> +     * In core Postgres, only regular backends and WAL Sender processes
> +     * executing queries will use local buffers and operate on temporary
> +     * relations. Parallel workers will not use local buffers (see
> +     * InitLocalBuffers()); however, extensions leveraging background workers
> +     * have no such limitation, so track IO Operations on
> +     * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER.
> +     */
> +    no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
> +        == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
> +        B_STANDALONE_BACKEND || bktype == B_STARTUP;
> +
> +    if (no_temp_rel && io_context == IOCONTEXT_BUFFER_POOL && io_object ==
> +            IOOBJECT_TEMP_RELATION)
> +        return false;

Personally I don't like line breaks on the == and would rather break earlier
on the && or ||.
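
As a sketch of that layout, the same condition with every break placed after
an || or &&. The enums are reduced stand-ins and temp_rel_io_expected() is a
hypothetical wrapper added only so the fragment compiles by itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in enums, reduced to what this sketch needs. */
typedef enum BackendType
{
    B_BACKEND,
    B_AUTOVAC_LAUNCHER,
    B_AUTOVAC_WORKER,
    B_BG_WORKER,
    B_BG_WRITER,
    B_CHECKPOINTER,
    B_STANDALONE_BACKEND,
    B_STARTUP
} BackendType;

typedef enum IOContext { IOCONTEXT_BUFFER_POOL, IOCONTEXT_BULKREAD } IOContext;
typedef enum IOObject { IOOBJECT_RELATION, IOOBJECT_TEMP_RELATION } IOObject;

/* Hypothetical helper wrapping the condition from the hunk above. */
static bool
temp_rel_io_expected(BackendType bktype, IOContext io_context,
                     IOObject io_object)
{
    bool        no_temp_rel;

    /* Break after the operator so each comparison stays on one line. */
    no_temp_rel = bktype == B_AUTOVAC_LAUNCHER ||
        bktype == B_BG_WRITER ||
        bktype == B_CHECKPOINTER ||
        bktype == B_AUTOVAC_WORKER ||
        bktype == B_STANDALONE_BACKEND ||
        bktype == B_STARTUP;

    if (no_temp_rel &&
        io_context == IOCONTEXT_BUFFER_POOL &&
        io_object == IOOBJECT_TEMP_RELATION)
        return false;

    return true;
}
```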



> +    for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> +    {
> +        PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context];
> +        PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context];
> +
> +        for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
> +        {

Is there any compiler that'd complain if you used IOContext/IOObject/IOOp as the
type in the for loop? I don't think so? Then you'd not need the casts in other
places, which I think would make the code easier to read.
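
A minimal sketch of an enum-typed loop variable, using a stand-in IOContext
enum and a hypothetical io_context_uses_strategy() helper rather than the
patch's real definitions. C accepts ++ on an enum-typed variable; C++ does
not, so this form only works in code that is compiled as C.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the patch's enum; the values are assumptions. */
typedef enum IOContext
{
    IOCONTEXT_BUFFER_POOL,
    IOCONTEXT_BULKREAD,
    IOCONTEXT_BULKWRITE,
    IOCONTEXT_VACUUM
} IOContext;

#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)

/* Hypothetical helper that takes the enum type directly. */
static bool
io_context_uses_strategy(IOContext io_context)
{
    return io_context != IOCONTEXT_BUFFER_POOL;
}

static int
count_strategy_contexts(void)
{
    int         n = 0;

    /* The loop variable is an IOContext, so the call needs no cast. */
    for (IOContext io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
    {
        if (io_context_uses_strategy(io_context))
            n++;
    }

    return n;
}
```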


> +            PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object];
> +            PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object];
> +
> +            if (!expect_backend_stats ||
> +                !pgstat_bktype_io_context_io_object_valid(MyBackendType,
> +                    (IOContext) io_context, (IOObject) io_object))
> +            {
> +                pgstat_io_context_ops_assert_zero(sharedent);
> +                pgstat_io_context_ops_assert_zero(pendingent);
> +                continue;
> +            }
> +
> +            for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> +            {
> +                if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
> +                                (IOObject) io_object, (IOOp) io_op)))

Superfluous parens after the !, I think?


>  void
>  pgstat_report_vacuum(Oid tableoid, bool shared,
> @@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
>      }
>
>      pgstat_unlock_entry(entry_ref);
> +
> +    /*
> +     * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
> +     * Operation stats, however this will not be called after an entire

Missing "until"?

> +static inline void
> +pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
> +{

Does this need to be in pgstat.h? Perhaps pgstat_internal.h would suffice,
afaict it's not used outside of pgstat code?

> +
> +/*
> + * Assert that stats have not been counted for any combination of IOContext,
> + * IOObject, and IOOp which is not valid for the passed-in BackendType. The
> + * passed-in array of PgStat_IOOpCounters must contain stats from the
> + * BackendType specified by the second parameter. Caller is responsible for
> + * locking of the passed-in PgStatShared_IOContextOps, if needed.
> + */
> +static inline void
> +pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops,
> +        BackendType bktype)
> +{

This doesn't look like it should be an inline function - it's quite long.

I think it's also too complicated for the compiler to optimize out if
assertions are disabled. So you'd need to handle this with an explicit #ifdef
USE_ASSERT_CHECKING.
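
Roughly, the pattern is an out-of-line checker that only exists in assert
builds, called from inside the same guard. In the sketch below the counter
array, io_counters_all_zero() and flush_pending() are invented for
illustration, assert() from <assert.h> stands in for PostgreSQL's Assert(),
and USE_ASSERT_CHECKING is defined by hand only so the check is exercised (in
PostgreSQL it comes from building with assertions enabled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Defined here only for the sketch; normally set by the build. */
#define USE_ASSERT_CHECKING 1

#ifdef USE_ASSERT_CHECKING
/* Assert-only helper: not compiled at all in non-assert builds. */
static bool
io_counters_all_zero(const long *counters, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (counters[i] != 0)
            return false;
    }
    return true;
}
#endif

/* Sums the pending counters, resets them, and verifies the reset. */
static long
flush_pending(long *pending, size_t n)
{
    long        total = 0;

    for (size_t i = 0; i < n; i++)
    {
        total += pending[i];
        pending[i] = 0;
    }

#ifdef USE_ASSERT_CHECKING
    assert(io_counters_all_zero(pending, n));
#endif

    return total;
}
```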



> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>io_context</structfield> <type>text</type>
> +      </para>
> +      <para>
> +       The context or location of an IO operation.
> +      </para>
> +       <itemizedlist>
> +        <listitem>
> +        <para>
> +        <varname>io_context</varname> <literal>buffer pool</literal> refers to
> +        IO operations on data in both the shared buffer pool and process-local
> +        buffer pools used for temporary relation data.
> +        </para>
> +         <para>

The indentation in the sgml part of the patch seems to be a bit wonky.


> +       <para>
> +       These last three <varname>io_context</varname>s are counted separately
> +       because the autovacuum daemon, explicit <command>VACUUM</command>,
> +       explicit <command>ANALYZE</command>, many bulk reads, and many bulk
> +       writes use a fixed amount of memory, acquiring the equivalent number of

s/memory/buffers/? The amount of memory isn't really fixed.



> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>read</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Reads by this <varname>backend_type</varname> into buffers in this
> +       <varname>io_context</varname>.
> +       <varname>read</varname> plus <varname>extended</varname> for
> +       <varname>backend_type</varname>s
> +
> +       <itemizedlist>
> +
> +       <listitem>
> +        <para>
> +       <literal>autovacuum launcher</literal>
> +        </para>
> +       </listitem>

Hm. ISTM that we should not document the set of valid backend types as part of
this view. Couldn't we share it with pg_stat_activity.backend_type?


> +       The difference is that reads done as part of <command>CREATE
> +       DATABASE</command> are not counted in
> +       <structname>pg_statio_all_tables</structname> and
> +       <structname>pg_stat_database</structname>
> +       </para>

Hm, this seems a bit far into the weeds?




> +Datum
> +pg_stat_get_io(PG_FUNCTION_ARGS)
> +{
> +    PgStat_BackendIOContextOps *backends_io_stats;
> +    ReturnSetInfo *rsinfo;
> +    Datum        reset_time;
> +
> +    InitMaterializedSRF(fcinfo, 0);
> +    rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> +
> +    backends_io_stats = pgstat_fetch_backend_io_context_ops();
> +
> +    reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
> +
> +    for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
> +    {
> +        Datum        bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
> +        bool        expect_backend_stats = true;
> +        PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
> +
> +        /*
> +         * For those BackendTypes without IO Operation stats, skip
> +         * representing them in the view altogether.
> +         */
> +        expect_backend_stats = pgstat_io_op_stats_collected((BackendType)
> +                                                            bktype);
> +
> +        for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> +        {
> +            const char *io_context_str = pgstat_io_context_desc(io_context);
> +            PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
> +
> +            for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
> +            {
> +                PgStat_IOOpCounters *counters = &io_objs->data[io_object];
> +                const char *io_obj_str = pgstat_io_object_desc(io_object);
> +
> +                Datum        values[IO_NUM_COLUMNS] = {0};
> +                bool        nulls[IO_NUM_COLUMNS] = {0};
> +
> +                /*
> +                * Some combinations of IOContext, IOObject, and BackendType are
> +                * not valid for any type of IOOp. In such cases, omit the
> +                * entire row from the view.
> +                */
> +                if (!expect_backend_stats ||
> +                    !pgstat_bktype_io_context_io_object_valid((BackendType) bktype,
> +                        (IOContext) io_context, (IOObject) io_object))
> +                {
> +                    pgstat_io_context_ops_assert_zero(counters);
> +                    continue;
> +                }

Perhaps mention in a comment two loops up that we don't skip the nested loops
despite !expect_backend_stats because we want to assert here?



Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Justin Pryzby

Date:

23 November 2022, 08:43:29

Note that 001 fails to compile without 002:

../src/backend/storage/buffer/bufmgr.c:1257:43: error: ‘from_ring’ undeclared (first use in this function)
 1257 |       StrategyRejectBuffer(strategy, buf, from_ring))

My "warnings" script informed me about these gripes from MSVC:

[03:42:30.607] c:\cirrus>call sh -c 'if grep ": warning " build.txt; then exit 1; fi; exit 0' 
[03:42:30.749] c:\cirrus\src\backend\storage\buffer\freelist.c(699) : warning C4715: 'IOContextForStrategy': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(190) : warning C4715: 'pgstat_io_context_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(204) : warning C4715: 'pgstat_io_object_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(226) : warning C4715: 'pgstat_io_op_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\adt\pgstatfuncs.c(1816) : warning C4715: 'pgstat_io_op_get_index': not all control paths return a value
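
C4715 typically shows up for a function whose switch returns in every enum
case but which has no statement after the switch. A sketch of one way to
quiet it, assuming a stand-in IOObject enum and made-up description strings
(io_object_desc() is a hypothetical name here). PostgreSQL code often ends
such functions with an elog(ERROR, ...) instead of the assert and dummy
return used below.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in enum; the strings returned below are illustrative only. */
typedef enum IOObject { IOOBJECT_RELATION, IOOBJECT_TEMP_RELATION } IOObject;

static const char *
io_object_desc(IOObject io_object)
{
    switch (io_object)
    {
        case IOOBJECT_RELATION:
            return "relation";
        case IOOBJECT_TEMP_RELATION:
            return "temp relation";
    }

    /*
     * Only reached if io_object holds a value outside the enum. The
     * trailing return gives every control path a return value.
     */
    assert(!"unrecognized IOObject value");
    return NULL;
}
```

Leaving out a default: label keeps the compiler's missing-case warning useful
when a new enum value is added later.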
 

In the docs table, you say things like:
| io_context vacuum refers to the IO operations incurred while vacuuming and analyzing.

..but it's a bit unclear (maybe due to the way the docs are rendered).
I think it may be more clear to say "when <io_context> is
<vacuum>, ..."

| acquiring the equivalent number of shared buffers

I don't think "equivalent" fits here, since it's actually acquiring a
different number of buffers.

There's a missing period before " The difference is"

The sentence beginning "read plus extended for backend_types" is difficult to
parse due to having a bulleted list in its middle.

There aren't many references to "IOOps", which is good, because I
started to read it as "I oops".

+        * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+        * Operation stats, however this will not be called after an entire

=> I think that's intended to say *until* after ?

+ * Functions to assert that invalid IO Operation counters are zero.

=> There's a missing newline above this comment.

+       Assert(counters->evictions == 0 && counters->extends == 0 &&
+                       counters->fsyncs == 0 && counters->reads == 0 && counters->reuses
+                       == 0 && counters->writes == 0);

=> It'd be more readable and also maybe help debugging if these were separate
assertions.  I wondered in the past if that should be a general policy
for all assertions.
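
For example, split up like this, a failure names the exact counter that was
nonzero. The struct below is a stand-in that reuses the field names from the
quoted Assert; io_counters_assert_zero() is a hypothetical name, and assert()
stands in for PostgreSQL's Assert().

```c
#include <assert.h>

/* Stand-in for the patch's counters struct (field names as quoted). */
typedef struct IOOpCounters
{
    long long   evictions;
    long long   extends;
    long long   fsyncs;
    long long   reads;
    long long   reuses;
    long long   writes;
} IOOpCounters;

/* One assertion per counter: the failing expression identifies the field. */
static void
io_counters_assert_zero(const IOOpCounters *counters)
{
    assert(counters->evictions == 0);
    assert(counters->extends == 0);
    assert(counters->fsyncs == 0);
    assert(counters->reads == 0);
    assert(counters->reuses == 0);
    assert(counters->writes == 0);
}
```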

+pgstat_io_op_stats_collected(BackendType bktype)
+{
+       return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+               bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;

Similar: I'd prefer to see this as 5 "ifs" or a "switch" to return
false, else return true.  But YMMV.
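
A sketch of the switch form, with a stand-in BackendType enum that lists only
a few values (the real enum has more members, and B_BACKEND and
B_CHECKPOINTER are included here just as examples of types that are
collected):

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced stand-in for the real BackendType enum. */
typedef enum BackendType
{
    B_INVALID = 0,
    B_ARCHIVER,
    B_BACKEND,
    B_CHECKPOINTER,
    B_LOGGER,
    B_WAL_RECEIVER,
    B_WAL_WRITER
} BackendType;

static bool
pgstat_io_op_stats_collected(BackendType bktype)
{
    switch (bktype)
    {
            /* Backend types for which no IO statistics are collected. */
        case B_INVALID:
        case B_ARCHIVER:
        case B_LOGGER:
        case B_WAL_RECEIVER:
        case B_WAL_WRITER:
            return false;

        default:
            return true;
    }
}
```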

+                * CREATE TEMPORRARY TABLE AS ...

=> typo: temporary

+       if (strategy_io_context  && io_op == IOOP_FSYNC)

=> Extra space.

pgstat_count_io_op() has a superfluous newline before "}".

I think there may be a problem/deficiency with hint bits:

|postgres=# DROP TABLE u2; CREATE TABLE u2 AS SELECT generate_series(1,999999)a; SELECT pg_stat_reset_shared('io'); explain(analyze,buffers) SELECT * FROM u2;
|...
| Seq Scan on u2  (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.111..458.239 rows=999999 loops=1)
|   Buffers: shared hit=2048 read=2377 dirtied=2377 written=2345

|postgres=# SELECT COUNT(1), relname, COUNT(1) FILTER(WHERE isdirty) FROM pg_buffercache b LEFT JOIN pg_class c ON pg_relation_filenode(c.oid)=b.relfilenode GROUP BY 2 ORDER BY 1 DESC LIMIT 11;
| count |             relname             | count
|-------+---------------------------------+-------
| 13619 |                                 |     0
|  2080 | u2                              |  2080
|   104 | pg_attribute                    |     4
|    71 | pg_statistic                    |     1
|    51 | pg_class                        |     1

It says that SELECT caused 2377 buffers to be dirtied, of which 2080 are
associated with the new table in pg_buffercache.

|postgres=# SELECT * FROM pg_stat_io WHERE backend_type!~'autovac|archiver|logger|standalone|startup|^wal|background worker' or true ORDER BY 2;
|    backend_type     | io_context  |   io_object   | read | written | extended | op_bytes | evicted | reused | files_synced |          stats_reset
|...
| client backend      | bulkread    | relation      | 2377 |    2345 |          |     8192 |       0 |   2345 |              | 2022-11-22 22:32:33.044552-06

I think it's a known behavior that hint bits do not use the strategy
ring buffer.  For BAS_BULKREAD, ring_size = 256kB (32, 8kB pages), but
there's 2080 dirty pages in the buffercache (~16MB).

But the IO view says that 2345 of the pages were "reused", which seems
misleading to me.  Maybe that just follows from the behavior and the view is
fine.  If the view is fine, maybe this case should still be specifically
mentioned in the docs.

-- 
Justin

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

26 November 2022, 01:46:11

Hi,

On 2022-11-22 23:43:29 -0600, Justin Pryzby wrote:
> I think there may be a problem/deficiency with hint bits:
> 
> |postgres=# DROP TABLE u2; CREATE TABLE u2 AS SELECT generate_series(1,999999)a; SELECT pg_stat_reset_shared('io'); explain(analyze,buffers) SELECT * FROM u2;
> |...
> | Seq Scan on u2  (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.111..458.239 rows=999999 loops=1)
> |   Buffers: shared hit=2048 read=2377 dirtied=2377 written=2345
> 
> |postgres=# SELECT COUNT(1), relname, COUNT(1) FILTER(WHERE isdirty) FROM pg_buffercache b LEFT JOIN pg_class c ON pg_relation_filenode(c.oid)=b.relfilenode GROUP BY 2 ORDER BY 1 DESC LIMIT 11;
> | count |             relname             | count
> |-------+---------------------------------+-------
> | 13619 |                                 |     0
> |  2080 | u2                              |  2080
> |   104 | pg_attribute                    |     4
> |    71 | pg_statistic                    |     1
> |    51 | pg_class                        |     1
> 
> It says that SELECT caused 2377 buffers to be dirtied, of which 2080 are
> associated with the new table in pg_buffercache.

Note that there's 2048 dirty buffers for u2 in shared_buffers before the
SELECT, despite the relation being 4425 blocks long, due to the CTAS using
BAS_BULKWRITE.


> |postgres=# SELECT * FROM pg_stat_io WHERE backend_type!~'autovac|archiver|logger|standalone|startup|^wal|background worker' or true ORDER BY 2;
> |    backend_type     | io_context  |   io_object   | read | written | extended | op_bytes | evicted | reused | files_synced |          stats_reset
> |...
> | client backend      | bulkread    | relation      | 2377 |    2345 |          |     8192 |       0 |   2345 |              | 2022-11-22 22:32:33.044552-06
> 
> I think it's a known behavior that hint bits do not use the strategy
> ring buffer.  For BAS_BULKREAD, ring_size = 256kB (32, 8kB pages), but
> there's 2080 dirty pages in the buffercache (~16MB).

I don't think there's any "circumvention" of the ringbuffer here. There's 2048
buffers for u2 in s_b before, all dirty, there's 2080 after, also all
dirty. So the ringbuffer restricted the increase in shared buffers used for u2
to 2080-2048=32 additional buffers.
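
The numbers can be checked with a little arithmetic. This sketch assumes the
default 8kB block size; ring_size_in_buffers() and buffers_to_kb() are
hypothetical helpers written only for this check.

```c
#include <assert.h>

#define BLCKSZ 8192             /* assumed default block size, in bytes */

/* Number of buffers in a strategy ring of the given size in bytes. */
static int
ring_size_in_buffers(int ring_size_bytes)
{
    assert(ring_size_bytes % BLCKSZ == 0);
    return ring_size_bytes / BLCKSZ;
}

/* Total size in kB of the given number of buffers. */
static int
buffers_to_kb(int nbuffers)
{
    return nbuffers * (BLCKSZ / 1024);
}
```

A 256kB BAS_BULKREAD ring is ring_size_in_buffers(256 * 1024) = 32 buffers,
matching 2080 - 2048 = 32, and buffers_to_kb(2080) = 16640kB, which is the
"~16MB" quoted above.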

The reason hint bits don't prevent pages from being written out here is that a
BAS_BULKREAD strategy doesn't cause all buffer writes to be rejected, it just
causes buffer writes to be rejected when the page LSN would require a WAL
flush. And that's not typically the case when you just set a hint bit, unless
you use wal_log_hints = true.

If I turn on wal_log_hints=true and add a CHECKPOINT after the CTAS I see 0
reuses (and 4425 dirty buffers), which is what I'd expect.
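
The rule described above can be modelled as follows. This is a simplified
sketch, not the bufmgr.c code: the function name, the StrategyKind enum and
the flushed_upto parameter are invented for illustration, and comparing the
page LSN with a flushed-up-to position stands in for the real WAL-flush
check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the real WAL position type */

/* Stand-in for the buffer access strategy types. */
typedef enum StrategyKind
{
    BAS_NORMAL,
    BAS_BULKREAD,
    BAS_BULKWRITE,
    BAS_VACUUM
} StrategyKind;

/*
 * Should a dirty victim buffer be rejected instead of written out?
 * In this model that happens only for a bulk-read ring buffer whose
 * page LSN lies beyond the WAL that has already been flushed.
 */
static bool
reject_dirty_ring_buffer(StrategyKind kind, bool from_ring,
                         XLogRecPtr page_lsn, XLogRecPtr flushed_upto)
{
    bool        needs_wal_flush = page_lsn > flushed_upto;

    return kind == BAS_BULKREAD && from_ring && needs_wal_flush;
}
```

A page that was only dirtied by a hint bit usually keeps its old LSN, so
needs_wal_flush is false and the buffer is written and reused rather than
rejected.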


> But the IO view says that 2345 of the pages were "reused", which seems
> misleading to me.  Maybe that just follows from the behavior and the view is
> fine.  If the view is fine, maybe this case should still be specifically
> mentioned in the docs.

I think that's just confusing due to the reset. 2048 + 2345 = 4393, but we
only have 2080 buffers for u2 in s_b.

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

29 November 2022, 05:05:33

v38 attached.

On Sun, Nov 20, 2022 at 7:38 PM Andres Freund <andres@anarazel.de> wrote:
> One good follow up patch will be to rip out the accounting for
> pg_stat_bgwriter's buffers_backend, buffers_backend_fsync and perhaps
> buffers_alloc and replace it with a subselect getting the equivalent data from
> pg_stat_io.  It might not be quite worth doing for buffers_alloc because of
> the way that's tied into bgwriter pacing.

I don't see how it will make sense to have buffers_backend and
buffers_backend_fsync respond to a different reset target than the rest
of the fields in pg_stat_bgwriter.

> On 2022-11-03 13:00:24 -0400, Melanie Plageman wrote:
> > @@ -833,6 +836,22 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >
> >       isExtend = (blockNum == P_NEW);
> >
> > +     if (isLocalBuf)
> > +     {
> > +             /*
> > +              * Though a strategy object may be passed in, no strategy is employed
> > +              * when using local buffers. This could happen when doing, for example,
> > +              * CREATE TEMPORRARY TABLE AS ...
> > +              */
> > +             io_context = IOCONTEXT_BUFFER_POOL;
> > +             io_object = IOOBJECT_TEMP_RELATION;
> > +     }
> > +     else
> > +     {
> > +             io_context = IOContextForStrategy(strategy);
> > +             io_object = IOOBJECT_RELATION;
> > +     }
>
> I think given how frequently ReadBuffer_common() is called in some workloads,
> it'd be good to make IOContextForStrategy inlinable. But I guess that's not
> easily doable, because struct BufferAccessStrategyData is only defined in
> freelist.c.

Correct

> Could we defer this until later, given that we don't currently need this in
> case of buffer hits afaict?

Yes, you are right. In ReadBuffer_common(), we can easily move the
IOContextForStrategy() call to directly before using io_context. I've
done that in the attached version.
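
Schematically, the effect of deferring the call is shown below. This is only
a sketch: the Strategy struct, io_context_for_strategy(), read_buffer() and
the counters are hypothetical stand-ins, and mapping a NULL strategy to
IOCONTEXT_BUFFER_POOL is an assumption of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum IOContext
{
    IOCONTEXT_BUFFER_POOL,
    IOCONTEXT_BULKREAD,
    IOCONTEXT_NUM_TYPES
} IOContext;

/* Hypothetical, visible stand-in for the opaque strategy object. */
typedef struct Strategy
{
    IOContext   io_context;
} Strategy;

static int  context_lookups;    /* how often the lookup actually ran */
static int  reads_counted[IOCONTEXT_NUM_TYPES];

static IOContext
io_context_for_strategy(const Strategy *strategy)
{
    context_lookups++;
    return strategy ? strategy->io_context : IOCONTEXT_BUFFER_POOL;
}

static void
read_buffer(const Strategy *strategy, bool found_in_cache)
{
    IOContext   io_context;

    /* Buffer hit: no IO is counted, so the context is never computed. */
    if (found_in_cache)
        return;

    /* Miss: look up the context directly before it is used. */
    io_context = io_context_for_strategy(strategy);
    assert(io_context < IOCONTEXT_NUM_TYPES);
    reads_counted[io_context]++;
}
```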

> > @@ -1121,6 +1144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >                       BufferAccessStrategy strategy,
> >                       bool *foundPtr)
> >  {
> > +     bool            from_ring;
> > +     IOContext       io_context;
> >       BufferTag       newTag;                 /* identity of requested block */
> >       uint32          newHash;                /* hash value for newTag */
> >       LWLock     *newPartitionLock;   /* buffer partition lock for it */
> > @@ -1187,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >        */
> >       LWLockRelease(newPartitionLock);
> >
> > +     io_context = IOContextForStrategy(strategy);
>
> Hm - doesn't this mean we do IOContextForStrategy() twice? Once in
> ReadBuffer_common() and then again here?

Yes. So, there are a few options for addressing this.

- if the goal is to call IOContextForStrategy() exactly once in a
  given codepath, BufferAlloc() can set IOContext
  (passed by reference as an output parameter). I don't like this much
  because it doesn't make sense to me that BufferAlloc() would set the
  "io_context" parameter -- especially given that strategy is already
  passed as a parameter and is obviously available to the caller.
  I also don't see a good way of waiting until BufferAlloc() returns to count
  the IO operations counted in FlushBuffer() and BufferAlloc() itself.

- if the goal is to avoid calling IOContextForStrategy() in more common
  codepaths or to call it as close to its use as possible, then we can
  push down its call in BufferAlloc() to the two locations where it is
  used -- when a dirty buffer must be flushed and when a block was
  evicted or reused. This will avoid calling it when we are not evicting
  a block from a valid buffer.

  However, if we do that, I don't know how to avoid calling it twice in
  that codepath. Even though we can assume io_context was set in the
  first location by the time we get to the second location, we would
  need to initialize the variable with something if we only plan to set
  it in some branches and there is no "invalid" or "default" value of
  the IOContext enum.

  Given the above, I've left the call in BufferAlloc() as is in the
  attached version.

>
>
> >       /* Loop here in case we have to try another victim buffer */
> >       for (;;)
> >       {
> > +
> >               /*
> >                * Ensure, while the spinlock's not yet held, that there's a free
> >                * refcount entry.
> > @@ -1200,7 +1228,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >                * Select a victim buffer.  The buffer is returned with its header
> >                * spinlock still held!
> >                */
> > -             buf = StrategyGetBuffer(strategy, &buf_state);
> > +             buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
> >
> >               Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
> >
>
> I think patch 0001 relies on this change already having been made, if I am not misunderstanding?

Fixed.

>
>
> > @@ -1263,13 +1291,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >                                       }
> >                               }
> >
> > +                             /*
> > +                              * When a strategy is in use, only flushes of dirty buffers
> > +                              * already in the strategy ring are counted as strategy writes
> > +                              * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
> > +                              * purpose of IO operation statistics tracking.
> > +                              *
> > +                              * If a shared buffer initially added to the ring must be
> > +                              * flushed before being used, this is counted as an
> > +                              * IOCONTEXT_BUFFER_POOL IOOP_WRITE.
> > +                              *
> > +                              * If a shared buffer added to the ring later because the
>
> Missing word?

Fixed.

>
>
> > +                              * current strategy buffer is pinned or in use or because all
> > +                              * strategy buffers were dirty and rejected (for BAS_BULKREAD
> > +                              * operations only) requires flushing, this is counted as an
> > +                              * IOCONTEXT_BUFFER_POOL IOOP_WRITE (from_ring will be false).
>
> I think this makes sense for now, but it'd be good if somebody else could
> chime in on this...
>
> > +                              *
> > +                              * When a strategy is not in use, the write can only be a
> > +                              * "regular" write of a dirty shared buffer (IOCONTEXT_BUFFER_POOL
> > +                              * IOOP_WRITE).
> > +                              */
> > +
> >                               /* OK, do the I/O */
> >                               TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
> >                                                       smgr->smgr_rlocator.locator.spcOid,
> >                                                       smgr->smgr_rlocator.locator.dbOid,
> >                                                       smgr->smgr_rlocator.locator.relNumber);
> >
> > -                             FlushBuffer(buf, NULL);
> > +                             FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION);
> >                               LWLockRelease(BufferDescriptorGetContentLock(buf));
> >                               ScheduleBufferTagForWriteback(&BackendWritebackContext,
>
>
>
> > +     if (oldFlags & BM_VALID)
> > +     {
> > +             /*
> > +             * When a BufferAccessStrategy is in use, evictions adding a
> > +             * shared buffer to the strategy ring are counted in the
> > +             * corresponding strategy's context.
>
> Perhaps "adding a shared buffer to the ring are counted in the corresponding
> context"? "strategy's context" sounds off to me.

Fixed.

> > This includes the evictions
> > +             * done to add buffers to the ring initially as well as those
> > +             * done to add a new shared buffer to the ring when current
> > +             * buffer is pinned or otherwise in use.
>
> I think this sentence could use a few commas, but not sure.
>
> s/current/the current/?

Reworded.

>
> > +             * We wait until this point to count reuses and evictions in order to
> > +             * avoid incorrectly counting a buffer as reused or evicted when it was
> > +             * released because it was concurrently pinned or in use or counting it
> > +             * as reused when it was rejected or when we errored out.
> > +             */
>
> I can't quite parse this sentence.

I've reworded the whole comment.
I think it is clearer now.

>
> > +             IOOp io_op = from_ring ? IOOP_REUSE : IOOP_EVICT;
> > +
> > +             pgstat_count_io_op(io_op, IOOBJECT_RELATION, io_context);
> > +     }
>
> I'd just inline the variable, but ...

Done.

> > @@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> >                               LocalRefCount[b]++;
> >                               ResourceOwnerRememberBuffer(CurrentResourceOwner,
> >                                                       BufferDescriptorGetBuffer(bufHdr));
> > +
> >                               break;
> >                       }
> >               }
>
> Spurious change.

Removed.

> >       pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
> >
> >       *foundPtr = false;
> > +
> >       return bufHdr;
> >  }
>
> Ditto.

Removed.

> > +/*
> > +* IO Operation statistics are not collected for all BackendTypes.
> > +*
> > +* The following BackendTypes do not participate in the cumulative stats
> > +* subsystem or do not do IO operations worth reporting statistics on:
>
> s/worth reporting/we currently report/?

Updated

> > +     /*
> > +      * In core Postgres, only regular backends and WAL Sender processes
> > +      * executing queries will use local buffers and operate on temporary
> > +      * relations. Parallel workers will not use local buffers (see
> > +      * InitLocalBuffers()); however, extensions leveraging background workers
> > +      * have no such limitation, so track IO Operations on
> > +      * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER.
> > +      */
> > +     no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
> > +             == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
> > +             B_STANDALONE_BACKEND || bktype == B_STARTUP;
> > +
> > +     if (no_temp_rel && io_context == IOCONTEXT_BUFFER_POOL && io_object ==
> > +                     IOOBJECT_TEMP_RELATION)
> > +             return false;
>
> Personally I don't like line breaks on the == and would rather break earlier
> on the && or ||.

I've gone through and fixed all of these that I could find.

> > +     for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> > +     {
> > +             PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context];
> > +             PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context];
> > +
> > +             for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
> > +             {
>
> Is there any compiler that'd complain if you used IOContext/IOObject/IOOp as the
> type in the for loop? I don't think so? Then you'd not need the casts in other
> places, which I think would make the code easier to read.

I changed the type and currently get no compiler warnings. However, on a
previous CI run with the type changed to an enum, I got the following
error:

/tmp/cirrus-ci-build/src/include/utils/pgstat_internal.h:605:48: error: no ‘operator++(int)’ declared for postfix ‘++’ [-fpermissive]
 605 |    io_context < IOCONTEXT_NUM_TYPES; io_context++)

I'm not sure why I am no longer getting it.

> > +                     PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object];
> > +                     PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object];
> > +
> > +                     if (!expect_backend_stats ||
> > +                             !pgstat_bktype_io_context_io_object_valid(MyBackendType,
> > +                                     (IOContext) io_context, (IOObject) io_object))
> > +                     {
> > +                             pgstat_io_context_ops_assert_zero(sharedent);
> > +                             pgstat_io_context_ops_assert_zero(pendingent);
> > +                             continue;
> > +                     }
> > +
> > +                     for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> > +                     {
> > +                             if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
> > +                                                             (IOObject) io_object, (IOOp) io_op)))
>
> Superfluous parens after the !, I think?

Thanks! I've looked for other occurrences as well and fixed them.

> >  void
> >  pgstat_report_vacuum(Oid tableoid, bool shared,
> > @@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
> >       }
> >
> >       pgstat_unlock_entry(entry_ref);
> > +
> > +     /*
> > +      * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
> > +      * Operation stats, however this will not be called after an entire
>
> Missing "until"?

Fixed.

> > +static inline void
> > +pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
> > +{
>
> Does this need to be in pgstat.h? Perhaps pgstat_internal.h would suffice,
> afaict it's not used outside of pgstat code?

It is used in pgstatfuncs.c during the view creation.

> > +
> > +/*
> > + * Assert that stats have not been counted for any combination of IOContext,
> > + * IOObject, and IOOp which is not valid for the passed-in BackendType. The
> > + * passed-in array of PgStat_IOOpCounters must contain stats from the
> > + * BackendType specified by the second parameter. Caller is responsible for
> > + * locking of the passed-in PgStatShared_IOContextOps, if needed.
> > + */
> > +static inline void
> > +pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops,
> > +             BackendType bktype)
> > +{
>
> This doesn't look like it should be an inline function - it's quite long.
>
> I think it's also too complicated for the compiler to optimize out if
> assertions are disabled. So you'd need to handle this with an explicit #ifdef
> USE_ASSERT_CHECKING.

I've made it a static helper function in pgstat.c.

>
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>io_context</structfield> <type>text</type>
> > +      </para>
> > +      <para>
> > +       The context or location of an IO operation.
> > +      </para>
> > +       <itemizedlist>
> > +        <listitem>
> > +        <para>
> > +        <varname>io_context</varname> <literal>buffer pool</literal> refers to
> > +        IO operations on data in both the shared buffer pool and process-local
> > +        buffer pools used for temporary relation data.
> > +        </para>
> > +         <para>
>
> The indentation in the sgml part of the patch seems to be a bit wonky.

I'll address this and the other docs feedback in a separate patchset and email.

> > +Datum
> > +pg_stat_get_io(PG_FUNCTION_ARGS)
> > +{
> > +     PgStat_BackendIOContextOps *backends_io_stats;
> > +     ReturnSetInfo *rsinfo;
> > +     Datum           reset_time;
> > +
> > +     InitMaterializedSRF(fcinfo, 0);
> > +     rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> > +
> > +     backends_io_stats = pgstat_fetch_backend_io_context_ops();
> > +
> > +     reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
> > +
> > +     for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
> > +     {
> > +             Datum           bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
> > +             bool            expect_backend_stats = true;
> > +             PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
> > +
> > +             /*
> > +              * For those BackendTypes without IO Operation stats, skip
> > +              * representing them in the view altogether.
> > +              */
> > +             expect_backend_stats = pgstat_io_op_stats_collected((BackendType) bktype);
> > +
> > +             for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> > +             {
> > +                     const char *io_context_str = pgstat_io_context_desc(io_context);
> > +                     PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
> > +
> > +                     for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
> > +                     {
> > +                             PgStat_IOOpCounters *counters = &io_objs->data[io_object];
> > +                             const char *io_obj_str = pgstat_io_object_desc(io_object);
> > +
> > +                             Datum           values[IO_NUM_COLUMNS] = {0};
> > +                             bool            nulls[IO_NUM_COLUMNS] = {0};
> > +
> > +                             /*
> > +                             * Some combinations of IOContext, IOObject, and BackendType are
> > +                             * not valid for any type of IOOp. In such cases, omit the
> > +                             * entire row from the view.
> > +                             */
> > +                             if (!expect_backend_stats ||
> > +                                     !pgstat_bktype_io_context_io_object_valid((BackendType) bktype,
> > +                                             (IOContext) io_context, (IOObject) io_object))
> > +                             {
> > +                                     pgstat_io_context_ops_assert_zero(counters);
> > +                                     continue;
> > +                             }
>
> Perhaps mention in a comment two loops up that we don't skip the nested loops
> despite !expect_backend_stats because we want to assert here?

Done.

I've also removed the test for bulkread reads from regress because
CREATE DATABASE is expensive, and added it to the verify_heapam test
instead, since verify_heapam is one of the few callers that
unconditionally uses a BULKREAD strategy.

Thanks,
Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

November 29, 2022, 05:08:36

On Wed, Nov 23, 2022 at 12:43 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> Note that 001 fails to compile without 002:
>
> ../src/backend/storage/buffer/bufmgr.c:1257:43: error: ‘from_ring’ undeclared (first use in this function)
>  1257 |       StrategyRejectBuffer(strategy, buf, from_ring))

Thanks!
I fixed this in version 38 attached in response to Andres upthread [1].

> My "warnings" script informed me about these gripes from MSVC:
>
> [03:42:30.607] c:\cirrus>call sh -c 'if grep ": warning " build.txt; then exit 1; fi; exit 0'
> [03:42:30.749] c:\cirrus\src\backend\storage\buffer\freelist.c(699) : warning C4715: 'IOContextForStrategy': not all control paths return a value
> [03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(190) : warning C4715: 'pgstat_io_context_desc': not all control paths return a value
> [03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(204) : warning C4715: 'pgstat_io_object_desc': not all control paths return a value
> [03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(226) : warning C4715: 'pgstat_io_op_desc': not all control paths return a value
> [03:42:30.749] c:\cirrus\src\backend\utils\adt\pgstatfuncs.c(1816) : warning C4715: 'pgstat_io_op_get_index': not all control paths return a value

Thanks, I forgot to look at those warnings in CI.
I added pg_unreachable() and think it silenced the warnings.
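
A minimal sketch of the pattern behind those C4715 warnings, for context. A switch that returns from every enum case still leaves a path the compiler considers reachable after the switch, and a no-return marker placed there silences the warning. The enum, the function name, the returned strings, and the SKETCH_UNREACHABLE macro are stand-ins invented for illustration; PostgreSQL provides its own pg_unreachable().

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for pg_unreachable(): trips an assertion in debug
 * builds and aborts otherwise. abort() never returns, which is what lets
 * the compiler see that no path falls off the end of the function.
 */
#define SKETCH_UNREACHABLE() do { assert(0); abort(); } while (0)

/* Hypothetical set of IO contexts, for illustration only. */
typedef enum SketchIOContext
{
    SKETCH_IOCONTEXT_BULKREAD,
    SKETCH_IOCONTEXT_BULKWRITE,
    SKETCH_IOCONTEXT_NORMAL,
    SKETCH_IOCONTEXT_VACUUM
} SketchIOContext;

static const char *
sketch_io_context_desc(SketchIOContext io_context)
{
    switch (io_context)
    {
        case SKETCH_IOCONTEXT_BULKREAD:
            return "bulkread";
        case SKETCH_IOCONTEXT_BULKWRITE:
            return "bulkwrite";
        case SKETCH_IOCONTEXT_NORMAL:
            return "normal";
        case SKETCH_IOCONTEXT_VACUUM:
            return "vacuum";
    }

    /* Only reached if io_context holds a value outside the enum. */
    SKETCH_UNREACHABLE();
}
```

Leaving out a default case keeps the compiler's missing-enum-value warning useful when a new context is added later.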

> In the docs table, you say things like:
> | io_context vacuum refers to the IO operations incurred while vacuuming and analyzing.
>
> ..but it's a bit unclear (maybe due to the way the docs are rendered).
> I think it may be more clear to say "when <io_context> is
> <vacuum>, ..."

So, because I use this language [column name] [column value] so often in
the docs, I would prefer a pattern that is as concise as possible. I
agree it may be hard to see due to the rendering. Currently, I am using
<varname> tags for the column name and <literal> tags for the column
value. Is there another tag type I could use to perhaps make this more
clear without adding additional words?

This is what the code looks like for the above docs text:
<varname>io_context</varname> <literal>vacuum</literal> refers to the IO

> | acquiring the equivalent number of shared buffers
>
> I don't think "equivalent" fits here, since it's actually acquiring a
> different number of buffers.

I'm planning to do docs changes in a separate patchset after addressing
code feedback. I plan to change "equivalent" to "corresponding" here.

> There's a missing period before " The difference is"
>
> The sentence beginning "read plus extended for backend_types" is difficult to
> parse due to having a bulleted list in its middle.

Will address in future version.

> There aren't many references to "IOOps", which is good, because I
> started to read it as "I oops".

Grepping the code, I only use the word IOOp(s) when I very clearly
want to use the type name -- and never in the docs.
But, yes, it does look like "I oops" :)

>
> +        * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
> +        * Operation stats, however this will not be called after an entire
>
> => I think that's intended to say *until* after ?

Fixed in v38.

> + * Functions to assert that invalid IO Operation counters are zero.
>
> => There's a missing newline above this comment.

Fixed in v38.

> +       Assert(counters->evictions == 0 && counters->extends == 0 &&
> +                       counters->fsyncs == 0 && counters->reads == 0 && counters->reuses
> +                       == 0 && counters->writes == 0);
>
> => It'd be more readable and also maybe help debugging if these were separate
> assertions.

I have made this change.
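
A minimal sketch of what splitting the combined assertion could look like. The struct mirrors the six counters named in the quoted Assert, but its name, the helper name, and the use of the standard assert() instead of PostgreSQL's Assert() are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical struct carrying the counters named in the quoted Assert. */
typedef struct SketchIOOpCounters
{
    int64_t evictions;
    int64_t extends;
    int64_t fsyncs;
    int64_t reads;
    int64_t reuses;
    int64_t writes;
} SketchIOOpCounters;

/*
 * One assertion per counter: when a check fails, the reported expression
 * names the exact field that was unexpectedly non-zero.
 */
static void
sketch_counters_assert_zero(const SketchIOOpCounters *counters)
{
    assert(counters->evictions == 0);
    assert(counters->extends == 0);
    assert(counters->fsyncs == 0);
    assert(counters->reads == 0);
    assert(counters->reuses == 0);
    assert(counters->writes == 0);
}
```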

> +pgstat_io_op_stats_collected(BackendType bktype)
> +{
> +       return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
> +               bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
>
> Similar: I'd prefer to see this as 5 "ifs" or a "switch" to return
> false, else return true.  But YMMV.

I don't know that separating it into multiple if statements or a switch
would make it more clear to me or help me with debugging here.

Separately, since this is used in non-assert builds, I would like to
ensure it is efficient. Do you know if a switch or if statements will
be compiled to the exact same thing as this at useful optimization
levels?
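
To make the two candidate forms concrete, here is a minimal sketch of both side by side, with a loop asserting that they agree for every value. The enum is a hypothetical, shortened stand-in for BackendType. The sketch only shows that the two forms behave identically; it does not settle whether a given compiler emits identical machine code for them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, shortened stand-in for BackendType. */
typedef enum SketchBackendType
{
    SKETCH_B_INVALID = 0,
    SKETCH_B_ARCHIVER,
    SKETCH_B_BACKEND,
    SKETCH_B_CHECKPOINTER,
    SKETCH_B_LOGGER,
    SKETCH_B_WAL_RECEIVER,
    SKETCH_B_WAL_WRITER,
    SKETCH_B_NUM_TYPES
} SketchBackendType;

/* Form 1: a single boolean expression, as in the quoted patch hunk. */
static bool
sketch_collected_expr(SketchBackendType bktype)
{
    return bktype != SKETCH_B_INVALID && bktype != SKETCH_B_ARCHIVER &&
        bktype != SKETCH_B_LOGGER && bktype != SKETCH_B_WAL_RECEIVER &&
        bktype != SKETCH_B_WAL_WRITER;
}

/* Form 2: the switch the reviewer suggested. */
static bool
sketch_collected_switch(SketchBackendType bktype)
{
    switch (bktype)
    {
        case SKETCH_B_INVALID:
        case SKETCH_B_ARCHIVER:
        case SKETCH_B_LOGGER:
        case SKETCH_B_WAL_RECEIVER:
        case SKETCH_B_WAL_WRITER:
            return false;
        default:
            return true;
    }
}

/* Check that both forms give the same answer for every enum value. */
static void
sketch_assert_forms_agree(void)
{
    for (int bktype = 0; bktype < SKETCH_B_NUM_TYPES; bktype++)
        assert(sketch_collected_expr((SketchBackendType) bktype) ==
               sketch_collected_switch((SketchBackendType) bktype));
}
```

Comparing the generated assembly of the two functions at the optimization level of interest would be one way to answer the efficiency question directly.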

>
> +                * CREATE TEMPORRARY TABLE AS ...
>
> => typo: temporary

Fixed in v38.

>
> +       if (strategy_io_context  && io_op == IOOP_FSYNC)
>
> => Extra space.

Fixed.

>
> pgstat_count_io_op() has a superfluous newline before "}".

I couldn't find the one you are referencing.
Do you mind pasting in the code?

Thanks,
Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Zvaj_yFA_eiSRrLZsjhT0J8cJ044QhZfKuXq6WN5bu5g%40mail.gmail.com

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

November 30, 2022, 04:12:47

Thanks for the review, Maciek!

I've attached a new version 39 of the patch which addresses your docs
feedback from this email as well as docs feedback from Andres in [1] and
Justin in [2].

I've made some additional code changes addressing a few of their other
points as well, and I've moved the verify_heapam test to a plain sql
test in contrib/amcheck instead of putting it in the perl test.

This patchset also includes various cleanup, pgindenting, and addressing
the sgml indentation issue brought up in the thread.

On Mon, Nov 7, 2022 at 1:26 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> On Thu, Nov 3, 2022 at 10:00 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>
> > > I'm reviewing the rendered docs now, and I noticed sentences like this
> > > are a bit hard to scan: they force the reader to parse a big list of
> > > backend types before even getting to the meat of what this is talking
> > > about. Should we maybe reword this so that the backend list comes at
> > > the end of the sentence? Or maybe even use a list (e.g., like in the
> > > "state" column description in pg_stat_activity)?
> >
> > Good idea with the bullet points.
> > For the lengthy lists, I've added bullet point lists to the docs for
> > several of the columns. It is quite long now but, hopefully, clearer?
> > Let me know if you think it improves the readability.
>
> Hmm, I should have tried this before suggesting it. I think the lists
> break up the flow of the column description too much. What do you
> think about the attached (on top of your patches--attaching it as a
> .diff to hopefully not confuse cfbot)? I kept the lists for backend
> types but inlined the others as a middle ground. I also added a few
> omitted periods and reworded "read plus extended" to avoid starting
> the sentence with a (lowercase) varname (I think in general it's fine
> to do that, but the more complicated sentence structure here makes it
> easier to follow if the sentence starts with a capital).
>
> Alternately, what do you think about pulling equivalencies to existing
> views out of the main column descriptions, and adding them after the
> main table as a sort of footnote? Most view docs don't have anything
> like that, but pg_stat_replication does and it might be a good pattern
> to follow.
>
> Thoughts?

Thanks for including a patch!
In the attached v39, I've taken your suggestion of flattening some of
the lists and done some rewording as well. I have also moved the note
about equivalence with pg_stat_statements columns to the
pg_stat_statements documentation. The result is quite a bit different
than what I had before, so I would be interested to hear your thoughts.

My concern with the blue "note" section like you mentioned is that it
would be harder to read the lists of backend types than it was in the
tabular format.

> > > + <varname>io_context</varname>s. When a <quote>Buffer Access
> > > + Strategy</quote> reuses a buffer in the strategy ring, it must evict its
> > > + contents, incrementing <varname>reused</varname>. When a <quote>Buffer
> > > + Access Strategy</quote> adds a new shared buffer to the strategy ring
> > > + and this shared buffer is occupied, the <quote>Buffer Access
> > > + Strategy</quote> must evict the contents of the shared buffer,
> > > + incrementing <varname>evicted</varname>.
> > >
> > > I think the parallel phrasing here makes this a little hard to follow.
> > > Specifically, I think "must evict its contents" for the strategy case
> > > sounds like a bad thing, but in fact this is a totally normal thing
> > > that happens as part of strategy access, no? The idea is you probably
> > > won't need that buffer again, so it's fine to evict it. I'm not sure
> > > how to reword, but I think the current phrasing is misleading.
> >
> > I had trouble rephrasing this. I changed a few words. I see what you
> > mean. It is worth noting that reusing strategy buffers when there are
> > buffers on the freelist may not be the best behavior, so I wouldn't
> > necessarily consider "reused" a good thing. However, I'm not sure how
> > much the user could really do about this. I would at least like this
> > phrasing to be clear (evicted is for shared buffers, reused is for
> > strategy buffers), so, perhaps this section requires more work.
>
> Oh, I see. I think the updated wording works better. Although I think
> we can drop the quotes around "Buffer Access Strategy" here. They're
> useful when defining the term originally, but after that I think it's
> clearer to use the term unquoted.

Thanks! I've fixed this.

> Just to understand this better myself, though: can you clarify when
> "reused" is not a normal, expected part of the strategy execution? I
> was under the impression that a ring buffer is used because each page
> is needed only "once" (i.e., for one set of operations) for the
> command using the strategy ring buffer. Naively, in that situation, it
> seems better to reuse a no-longer-needed buffer than to claim another
> buffer from the freelist (where other commands may eventually make
> better use of it).

You are right: reused is a normal, expected part of strategy
execution. And you are correct: the idea behind reusing existing
strategy buffers instead of taking buffers off the freelist is to leave
those buffers for blocks that we might expect to be accessed more than
once.

In practice, however, if you happen to not be using many shared buffers,
and then do a large COPY, for example, you will end up doing a bunch of
writes (in order to reuse the strategy buffers) that you perhaps didn't
need to do at that time had you leveraged the freelist. I think the
decision about which tradeoff to make is quite contentious, though.
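
The tradeoff described above can be illustrated with a deliberately simplified toy model. It is not PostgreSQL's buffer manager: the ring size, the struct, and the rule that every ring buffer is dirty when reused (as in a bulk write) are all assumptions made for the sketch. It only shows why a large bulk write can incur writes through reuse even when plenty of shared buffers are free.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_RING_SIZE 4

/* Toy model of a strategy ring plus a count of free shared buffers. */
typedef struct SketchRing
{
    bool slot_filled[SKETCH_RING_SIZE];
    int  next;                  /* next ring slot to use */
    int  free_shared_buffers;   /* buffers available on the "freelist" */
    int  reuses;
    int  evictions;
    int  writes;
} SketchRing;

/* Account for one block written through the ring (e.g. during a COPY). */
static void
sketch_bulkwrite_block(SketchRing *ring)
{
    int slot = ring->next;

    assert(slot >= 0 && slot < SKETCH_RING_SIZE);
    ring->next = (ring->next + 1) % SKETCH_RING_SIZE;

    if (ring->slot_filled[slot])
    {
        /*
         * The ring already owns a buffer in this slot. In this toy model it
         * is always dirty, so it is written out and then reused.
         */
        ring->writes++;
        ring->reuses++;
        return;
    }

    /* The slot is empty: add a shared buffer to the ring. */
    if (ring->free_shared_buffers > 0)
        ring->free_shared_buffers--;    /* a free buffer, nothing to evict */
    else
        ring->evictions++;              /* occupied buffer, contents evicted */

    ring->slot_filled[slot] = true;
}
```

With 100 free shared buffers and 10 blocks written, this model fills the 4 ring slots from free buffers and then reuses them 6 times, each reuse costing a write, while 96 shared buffers stay untouched.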

> Some more notes on the docs patch:
>
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>io_context</structfield> <type>text</type>
> + </para>
> + <para>
> + The context or location of an IO operation.
> + </para>
> + <itemizedlist>
> + <listitem>
> + <para>
> + <varname>io_context</varname> <literal>buffer pool</literal> refers to
> + IO operations on data in both the shared buffer pool and process-local
> + buffer pools used for temporary relation data.
> + </para>
> + <para>
> + Operations on temporary relations are tracked in
> + <varname>io_context</varname> <literal>buffer pool</literal> and
> + <varname>io_object</varname> <literal>temp relation</literal>.
> + </para>
> + <para>
> + Operations on permanent relations are tracked in
> + <varname>io_context</varname> <literal>buffer pool</literal> and
> + <varname>io_object</varname> <literal>relation</literal>.
> + </para>
> + </listitem>
>
> For this column, you repeat "io_context" in the list describing the
> possible values of the column. Enum-style columns in other tables
> don't do that (e.g., the pg_stat_activity "state" column). I think it
> might read better to omit "io_context" from the list.

I changed this.

> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>io_object</structfield> <type>text</type>
> + </para>
> + <para>
> + Object operated on in a given <varname>io_context</varname> by a given
> + <varname>backend_type</varname>.
> + </para>
>
> Is this a fixed set of objects we should list, like for io_context?

I've added this.

- Melanie

[1] https://www.postgresql.org/message-id/20221121003815.qnwlnz2lhkow2e5w%40awork3.anarazel.de
[2] https://www.postgresql.org/message-id/20221123054329.GG11463%40telsasoft.com

Attached is v40.

I have addressed the feedback from Justin [1] and Maciek [2] as well.
I took all of the suggestions regarding the docs that Maciek made,
including the following:

>  +       default. Future values could include those derived from
>  +       <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and
>  +       constant multipliers once non-block-oriented IO (e.g. temporary file IO)
>  +       is tracked here.
>
>
>  I know Lukas had commented that we should communicate that the goal is
>  to eventually provide relatively comprehensive I/O stats in this view
>  (you do that in the view description and I think that works), and this
>  is sort of along those lines, but I think speculative documentation
>  like this is not all that helpful. I'd drop this last sentence. Just
>  my two cents.

I have removed this and added the relevant part of this as a comment to
the view generating function pg_stat_get_io().

On Mon, Dec 5, 2022 at 2:32 PM Andres Freund <andres@anarazel.de> wrote:
> - I think it might be worth renaming IOCONTEXT_BUFFER_POOL to
>   IOCONTEXT_{NORMAL, PLAIN, DEFAULT}. I'd like at some point to track WAL IO,
>   temporary file IO etc, and it doesn't seem useful to define a version of
>   BUFFER_POOL for each of them. And it'd make it less confusing, because all
>   the other existing contexts are also in the buffer pool (for now, can't wait
>   for "bypass" or whatever to be tracked as well).

In attached v40, I've renamed IOCONTEXT_BUFFER_POOL to IOCONTEXT_NORMAL.

> - given that IOContextForStrategy() is defined in freelist.c, and that
>   declaring it in pgstat.h requires including buf.h, I think it's probably
>   better to move IOContextForStrategy()'s declaration to freelist.h (doesn't
>   exist, but whatever the right one is)

I have moved it to buf_internals.h.

> - pgstat_backend_io_stats_assert_well_formed() doesn't seem to belong in
>   pgstat.c. Why not pgstat_io_ops.c?

I put it in pgstat.c because it is only used there -- so I made it
static. I've now moved it to pgstat_io_ops.c and declared it in
pgstat_internal.h.

> - Do pgstat_io_context_ops_assert_zero(), pgstat_io_op_assert_zero() have to
>   be in pgstat.h?

They are used in pgstatfuncs.c, which I presume should not include
pgstat_internal.h. Or did you mean that I should not put them in a
header file at all?

- Melanie

[1] https://www.postgresql.org/message-id/20221130025113.GD24131%40telsasoft.com
[2] https://www.postgresql.org/message-id/CAOtHd0BfFdMqO7-zDOk%3DiJTatzSDgVcgYcaR1_wk0GS4NN%2BRUQ%40mail.gmail.com

On Wed, Dec 28, 2022 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:
>
> FWIW, I've been hacking on this code a bunch, mostly around renaming things
> and changing the 'stacking' of the patches. My current state is at
> https://github.com/anarazel/postgres/tree/pg_stat_io
> A bit more to do before posting the edited version...

Here is the bit more done.
I've attached a new version 42 which incorporates all of Andres' changes
on his branch (which I am considering version 41).
I have fixed various issues with counting fsyncs and added more comments
and done cosmetic cleanup.

The docs have substantial changes but still require more work:

- The comparisons between columns in pg_stat_io and pg_stat_statements
  have been removed, since the granularity and lifetime are so
  different, comparing them isn't quite correct.

- The lists of backend types still take up a lot of visual space in the
  definitions, which doesn't look great. I'm not sure what to do about
  that.

- Andres has pointed out that it is difficult to read the definitions of
  the columns because of the added clutter of the interpretations and
  the comparisons to other stats views. I'm not sure if I should cut
  these. He and I tried adding that information as a note and in other
  various table types, however none of the alternatives were an
  improvement.

Besides docs, there is one large change to the code which I am currently
working on, which is to change PgStat_IOOpCounters into an array of
PgStatCounters instead of having individual members for each IOOp type.
I hadn't done this previously because the additional level of nesting
seemed confusing. However, it seems it would simplify the code quite a
bit and is probably worth doing.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

January 3, 2023, 04:15:54

On Mon, Jan 2, 2023 at 5:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Besides docs, there is one large change to the code which I am currently
> working on, which is to change PgStat_IOOpCounters into an array of
> PgStatCounters instead of having individual members for each IOOp type.
> I hadn't done this previously because the additional level of nesting
> seemed confusing. However, it seems it would simplify the code quite a
> bit and is probably worth doing.

As described above, attached v43 uses an array for the PgStatCounters of
IOOps instead of struct members.

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

January 5, 2023, 01:56:07

On Mon, Jan 2, 2023 at 8:15 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Jan 2, 2023 at 5:46 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > Besides docs, there is one large change to the code which I am currently
> > working on, which is to change PgStat_IOOpCounters into an array of
> > PgStatCounters instead of having individual members for each IOOp type.
> > I hadn't done this previously because the additional level of nesting
> > seemed confusing. However, it seems it would simplify the code quite a
> > bit and is probably worth doing.
>
> As described above, attached v43 uses an array for the PgStatCounters of
> IOOps instead of struct members.

This wasn't quite a multi-dimensional array. Attached is v44, in which I
have removed all of the granular struct types -- PgStat_IOOps,
PgStat_IOContext, and PgStat_IOObject by collapsing them into a single
array of PgStat_Counters in a new struct PgStat_BackendIO. I needed to
keep this in addition to PgStat_IO to have a data type for backends to
track their stats in locally.
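
A minimal sketch of the "single array of counters" layout described above, assuming a three-dimensional array indexed by IO object, IO context, and IO operation. The type names, enum members, and dimension order are hypothetical and chosen only for illustration; the attached patch is the authority on the real definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t SketchCounter;

/* Hypothetical enums; each ends with a NUM_TYPES member sizing the array. */
typedef enum SketchIOObject
{
    SKETCH_IOOBJECT_RELATION,
    SKETCH_IOOBJECT_TEMP_RELATION,
    SKETCH_IOOBJECT_NUM_TYPES
} SketchIOObject;

typedef enum SketchIOContext
{
    SKETCH_IOCONTEXT_BULKREAD,
    SKETCH_IOCONTEXT_BULKWRITE,
    SKETCH_IOCONTEXT_NORMAL,
    SKETCH_IOCONTEXT_VACUUM,
    SKETCH_IOCONTEXT_NUM_TYPES
} SketchIOContext;

typedef enum SketchIOOp
{
    SKETCH_IOOP_EVICT,
    SKETCH_IOOP_EXTEND,
    SKETCH_IOOP_FSYNC,
    SKETCH_IOOP_READ,
    SKETCH_IOOP_REUSE,
    SKETCH_IOOP_WRITE,
    SKETCH_IOOP_NUM_TYPES
} SketchIOOp;

/* One flat struct replaces the nested per-context/per-object structs. */
typedef struct SketchBackendIO
{
    SketchCounter data[SKETCH_IOOBJECT_NUM_TYPES]
                      [SKETCH_IOCONTEXT_NUM_TYPES]
                      [SKETCH_IOOP_NUM_TYPES];
} SketchBackendIO;

/* Counting an operation becomes a single indexed increment. */
static void
sketch_count_io_op(SketchBackendIO *io, SketchIOObject io_object,
                   SketchIOContext io_context, SketchIOOp io_op)
{
    assert(io_object < SKETCH_IOOBJECT_NUM_TYPES);
    assert(io_context < SKETCH_IOCONTEXT_NUM_TYPES);
    assert(io_op < SKETCH_IOOP_NUM_TYPES);

    io->data[io_object][io_context][io_op]++;
}

/* Aggregating one operation type is a plain nested loop. */
static SketchCounter
sketch_sum_op(const SketchBackendIO *io, SketchIOOp io_op)
{
    SketchCounter total = 0;

    for (int obj = 0; obj < SKETCH_IOOBJECT_NUM_TYPES; obj++)
        for (int ctx = 0; ctx < SKETCH_IOCONTEXT_NUM_TYPES; ctx++)
            total += io->data[obj][ctx][io_op];

    return total;
}
```

Compared with one struct member per operation, code that walks every operation no longer needs a switch or a per-field helper.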

I've also done another round of cleanup.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

January 10, 2023, 00:10:47

Attached is v45 of the patchset. I've done some additional code cleanup
and changes. The most significant change, however, is the docs. I've
separated the docs into its own patch for ease of review.

The docs patch here was edited and co-authored by Samay Sharma.
I'm not sure if the order of pg_stat_io in the docs is correct.

The significant changes are removal of all "correspondence" or
"equivalence"-related sections (those explaining how other IO stats were
the same or different from pg_stat_io columns).

I've tried to remove references to "strategies" and "Buffer Access
Strategy" as much as possible.

I've moved the advice and interpretation section to the bottom --
outside of the table of definitions. Since this page is primarily a
reference page, I agree with Samay that incorporating interpretation
into the column definitions adds clutter and confusion.

I think the best course would be to have an "Interpreting Statistics"
section.

I suggest a structure like the following for this section:
    - Statistics Collection Configuration
    - Viewing Statistics
    - Statistics Views Reference
    - Statistics Functions Reference
    - Interpreting Statistics

As an aside, this section of the docs has some other structural issues
as well.

For example, I'm not sure it makes sense to have the dynamic statistics
views as sub-sections under 28.2, which is titled "The Cumulative
Statistics System."

In fact the docs say this under Section 28.2
https://www.postgresql.org/docs/current/monitoring-stats.html

"PostgreSQL also supports reporting dynamic information about exactly
what is going on in the system right now, such as the exact command
currently being executed by other server processes, and which other
connections exist in the system. This facility is independent of the
cumulative statistics system."

So, it is a bit weird that they are defined under the section titled
"The Cumulative Statistics System".

In this version of the patchset, I have not attempted a new structure
but instead moved the advice/interpretation for pg_stat_io to below the
table containing the column definitions.

- Melanie

Attached is v46.

On Wed, Dec 28, 2022 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-10-06 13:42:09 -0400, Melanie Plageman wrote:
> > > Additionally, some minor notes:
> > >
> > > - Since the stats are counting blocks, it would make sense to prefix the view columns with "blks_", and word them
> > >   in the past tense (to match current style), i.e. "blks_written", "blks_read", "blks_extended", "blks_fsynced"
> > >   (realistically one would combine this new view with other data e.g. from pg_stat_database or pg_stat_statements, which
> > >   all use the "blks_" prefix, and stop using pg_stat_bgwriter for this which does not use such a prefix)
> >
> > I have changed the column names to be in the past tense.
>
> For a while I was convinced by the consistency argument (after Melanie
> pointing it out to me). But the more I look, the less convinced I am. The
> existing IO related stats in pg_stat_database, pg_stat_bgwriter aren't past
> tense, just the ones in pg_stat_statements. pg_stat_database uses past tense
> for tup_*, but not xact_*, deadlocks, checksum_failures etc.
>
> And even pg_stat_statements isn't consistent about it - otherwise it'd be
> 'planned' instead of 'plans', 'called' instead of 'calls' etc.
>
> I started to look at the naming "tense" issue again, after I got "confused"
> about "extended", because that somehow makes me think about more detailed
> stats or such, rather than files getting extended.
>
> ISTM that 'evictions', 'extends', 'fsyncs', 'reads', 'reuses', 'writes' are
> clearer than the past tense versions, and about as consistent with existing
> columns.

I have updated the column names to the above recommendation.

On Wed, Jan 11, 2023 at 11:32 AM vignesh C <vignesh21@gmail.com> wrote:
>
> For some reason cfbot is not able to apply this patch as in [1],
> please have a look and post an updated patch if required:
> === Applying patches on top of PostgreSQL commit ID
> 3c6fc58209f24b959ee18f5d19ef96403d08f15c ===
> === applying patch
> ./v45-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch
> patching file src/backend/storage/buffer/bufmgr.c
> patching file src/backend/storage/buffer/localbuf.c
> patching file src/backend/utils/activity/pgstat.c
> patching file src/backend/utils/activity/pgstat_relation.c
> patching file src/backend/utils/adt/pgstatfuncs.c
> patching file src/include/pgstat.h
> patching file src/include/utils/pgstat_internal.h
> === applying patch ./v45-0002-pgstat-Infrastructure-to-track-IO-operations.patch
> gpatch: **** Only garbage was found in the patch input.
>
> [1] - http://cfbot.cputube.org/patch_41_3272.log
>

This was an issue with cfbot that Thomas has now fixed as he describes
in [1].

On Wed, Jan 11, 2023 at 4:58 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> > Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type
>
> The patch can/will fail with:
>
> CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
> +WARNING:  tablespaces created by regression test cases should have names starting with "regress_"
>
> CREATE TABLESPACE test_stats LOCATION '';
> +WARNING:  tablespaces created by regression test cases should have names starting with "regress_"
>
> (I already sent patches to address the omission in cirrus.yml)

Thanks. I've fixed this.
I also create a tablespace in amcheck -- are there recommendations for
naming tablespaces in contrib as well?

>
> 1760             :                  errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
> => Do you want to put these in order?

Thanks. Fixed.

> pgstat_get_io_op_name() isn't currently being hit by tests; actually,
> it's completely unused.

Deleted it.

> FlushRelationBuffers() isn't being hit for local buffers.

I added a test.

> > +      <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
> > +      <entry>
> > +       One row per backend type, context, target object combination showing
> > +       cluster-wide I/O statistics.
>
> I suggest: "One row for each combination of .."

I have made this change.

> > +   The <structname>pg_stat_io</structname> and
> > +   <structname>pg_statio_</structname> set of views are especially useful for
> > +   determining the effectiveness of the buffer cache.  When the number of actual
> > +   disk reads is much smaller than the number of buffer hits, then the cache is
> > +   satisfying most read requests without invoking a kernel call.
>
> I would change this to say "Postgres' own buffer cache is satisfying ..."

So, this is existing copy to which I added the pg_stat_io view name and
re-flowed the indentation.
However, I think your suggestions are a good idea, so I've taken them
and just rewritten this paragraph altogether.

>
> > However, these
> > +   statistics do not give the entire story: due to the way in which
> > +   <productname>PostgreSQL</productname> handles disk I/O, data that is not in
> > +   the <productname>PostgreSQL</productname> buffer cache might still reside in
> > +   the kernel's I/O cache, and might therefore still be fetched without
>
> I suggest to refer to "the kernel's page cache"

The same applies here.

>
> > +   The <structname>pg_stat_io</structname> view will contain one row for each
> > +   backend type, I/O context, and target I/O object combination showing
> > +   cluster-wide I/O statistics. Combinations which do not make sense are
> > +   omitted.
>
> "..for each combination of .."

I have changed this.

>
> > +          <varname>io_context</varname> for a type of I/O operation. For
>
> "for I/O operations"

So I actually mean for a type of I/O operation -- that is, relation data
is normally written to a shared buffer but sometimes we bypass shared
buffers and just call write and sometimes we use a buffer access
strategy and write it to a special ring buffer (made up of buffers
stolen from shared buffers, but still). So I don't want to say "for I/O
operations" because I think that would imply that writes of relation
data will always be in the same IO Context.

>
> > +          <literal>vacuum</literal>: I/O operations done outside of shared
> > +          buffers incurred while vacuuming and analyzing permanent relations.
>
> s/incurred/performed/

I changed this.

>
> > +          <literal>bulkread</literal>: Qualifying large read I/O operations
> > +          done outside of shared buffers, for example, a sequential scan of a
> > +          large table.
>
> I don't think it's correct to say that it's "outside of" shared-buffers.

I suppose "outside of" gives the wrong idea. But I need to make clear
that this I/O is to and from buffers which are not a part of shared
buffers right now -- they may still be accessible from the same data
structures which access shared buffers but they are currently being used
in a different way.

> s/Qualifying/Certain/

I feel like qualifying is more specific than certain, but I would be open
to changing it if there was a specific reason you don't like it.

>
> > +          <literal>bulkwrite</literal>: Qualifying large write I/O operations
> > +          done outside of shared buffers, such as <command>COPY</command>.
>
> Same
>
> > +        Target object of an I/O operation. Possible values are:
> > +       <itemizedlist>
> > +        <listitem>
> > +         <para>
> > +          <literal>relation</literal>: This includes permanent relations.
>
> It says "includes permanent" but what it seems to mean is that it is
> "exclusive of temporary relations".

I've changed this.

>
> > +     <row>
> > +      <entry role="catalog_table_entry">
> > +       <para role="column_definition">
> > +        <structfield>read</structfield> <type>bigint</type>
> > +       </para>
> > +       <para>
> > +        Number of read operations in units of <varname>op_bytes</varname>.
>
> This looks too much like it means "bytes".
> Should say: "in number of blocks of size >op_bytes<"
>
> But wait - is it the number of read operations "in units of op_bytes"
> (which would mean this is already multiplied by op_bytes, and is in units
> of bytes).
>
> Or the "number of read operations" *of* op_bytes chunks ?  Which would
> mean this is a "pure" number, and could be multiplied by op_bytes to
> obtain a size in bytes.

It is the number of read operations of op_bytes size -- thanks so much
for pointing this out. The wording was really unclear.
The idea is that you can do something like:
SELECT pg_size_pretty(reads * op_bytes) FROM pg_stat_io;
and get it in bytes.

The view will contain other types of IO that are not in BLCKSZ chunks,
which is where this column will be handy.
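
As an illustration of the count-versus-bytes distinction above, here is a minimal C sketch. The IoRow struct and the io_row_read_bytes() helper are hypothetical names invented for this note and are not part of the patch; the sketch only assumes that the read count is a pure number of operations and that op_bytes is the size of each operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction of one pg_stat_io row to the two columns discussed. */
typedef struct IoRow
{
    int64_t reads;      /* number of read operations (a pure count) */
    int64_t op_bytes;   /* size in bytes of each operation */
} IoRow;

/* Convert the operation count into bytes: count * size per operation. */
static int64_t
io_row_read_bytes(const IoRow *row)
{
    assert(row != NULL);
    assert(row->reads >= 0 && row->op_bytes >= 0);
    return row->reads * row->op_bytes;
}
```

With 3 reads and an op_bytes of 8192, the helper returns 24576 bytes, which is the same arithmetic the pg_size_pretty() query performs per row.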

>
> > +        Number of write operations in units of <varname>op_bytes</varname>.
>
> > +        Number of relation extend operations in units of
> > +        <varname>op_bytes</varname>.
>
> same
>
> > +        In <varname>io_context</varname> <literal>normal</literal>, this counts
> > +        the number of times a block was evicted from a buffer and replaced with
> > +        another block. In <varname>io_context</varname>s
> > +        <literal>bulkwrite</literal>, <literal>bulkread</literal>, and
> > +        <literal>vacuum</literal>, this counts the number of times a block was
> > +        evicted from shared buffers in order to add the shared buffer to a
> > +        separate size-limited ring buffer.
>
> This never defines what "evicted" means.  Does it mean that a dirty
> buffer was written out ?

Thanks. I've updated this.

>
> > +        The number of times an existing buffer in a size-limited ring buffer
> > +        outside of shared buffers was reused as part of an I/O operation in the
> > +        <literal>bulkread</literal>, <literal>bulkwrite</literal>, or
> > +        <literal>vacuum</literal> <varname>io_context</varname>s.
>
> Maybe say "as part of a bulk I/O operation (bulkread, bulkwrite, or
> vacuum)."

I've changed this.

>
> > +  <para>
> > +   <structname>pg_stat_io</structname> can be used to inform database tuning.
>
> > +   For example:
> > +   <itemizedlist>
> > +    <listitem>
> > +     <para>
> > +      A high <varname>evicted</varname> count can indicate that shared buffers
> > +      should be increased.
> > +     </para>
> > +    </listitem>
> > +    <listitem>
> > +     <para>
> > +      Client backends rely on the checkpointer to ensure data is persisted to
> > +      permanent storage. Large numbers of <varname>files_synced</varname> by
> > +      <literal>client backend</literal>s could indicate a misconfiguration of
> > +      shared buffers or of checkpointer. More information on checkpointer
>
> of *the* checkpointer
>
> > +      Normally, client backends should be able to rely on auxiliary processes
> > +      like the checkpointer and background writer to write out dirty data as
>
> *the* bg writer
>
> > +      much as possible. Large numbers of writes by client backends could
> > +      indicate a misconfiguration of shared buffers or of checkpointer. More
>
> *the* ckpointer

I've made most of these changes.

> Should this link to various docs for checkpointer/bgwriter ?

I couldn't find docs related to tuning checkpointer outside of the WAL
configuration docs. There is the docs page for the CHECKPOINT command --
but I don't think that is very relevant here.

> Maybe the docs for ALTER/COPY/VACUUM/CREATE/etc should be updated to
> refer to some central description of ring buffers.  Maybe something
> should be included to the appendix.

I agree it would be nice to explain Buffer Access Strategies in the docs.

- Melanie

[1] https://www.postgresql.org/message-id/CA%2BhUKGLiY1e%2B1%3DpB7hXJOyGj1dJOfgde%2BHmiSnv3gDKayUFJMA%40mail.gmail.com


Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Justin Pryzby

Date:

January 13, 2023, 08:23:05

On Thu, Jan 12, 2023 at 09:19:36PM -0500, Melanie Plageman wrote:
> On Wed, Jan 11, 2023 at 4:58 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> >
> > > Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type
> >
> > The patch can/will fail with:
> >
> > CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
> > +WARNING:  tablespaces created by regression test cases should have names starting with "regress_"
> >
> > CREATE TABLESPACE test_stats LOCATION '';
> > +WARNING:  tablespaces created by regression test cases should have names starting with "regress_"
> >
> > (I already sent patches to address the omission in cirrus.yml)
> 
> Thanks. I've fixed this
> I make a tablespace in amcheck -- are there recommendations for naming
> tablespaces in contrib also?

That's the test_stats one I mentioned.

Check with -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

> > > +          <literal>bulkread</literal>: Qualifying large read I/O operations
> > > +          done outside of shared buffers, for example, a sequential scan of a
> > > +          large table.
> >
> > I don't think it's correct to say that it's "outside of" shared-buffers.
> 
> I suppose "outside of" gives the wrong idea. But I need to make clear
> that this I/O is to and from buffers which are not a part of shared
> buffers right now -- they may still be accessible from the same data
> structures which access shared buffers but they are currently being used
> in a different way.

This would be a good place to link to a description of the ringbuffer,
if we had one.

> > s/Qualifying/Certain/
> 
> I feel like qualifying is more specific than certain, but I would be open
> to changing it if there was a specific reason you don't like it.

I suggested to change it because at first I started to interpret it as
"The act of qualifying large I/O ops .." rather than "Large I/O ops that
qualify..".

+        Number of read operations of <varname>op_bytes</varname> size.

This is still a bit too easy to misinterpret as being in units of bytes.
I suggest: Number of read operations (which are each of the size
specified in >op_bytes<).

+ in order to add the shared buffer to a separate size-limited ring buffer

separate comma

+ More information on configuring checkpointer can be found in Section 30.5. 

*the* checkpointer (as in the following paragraph)

+   <varname>backend_type</varname> <literal>checkpointer</literal> and
+   <varname>io_object</varname> <literal>temp relation</literal>.
+  </para>

I still think it's a bit hard to understand the <varname>s adjacent to
<literal>s.

+ Some backend_types
+ in some io_contexts
+ on some io_objects
+ in certain io_contexts
+ on certain io_objects

Maybe these should not use underscores:  Some backend types never
perform I/O operations in some I/O contexts and/or on some i/o objects.

+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ for (IOContext io_context = IOCONTEXT_BULKREAD; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_obj = IOOBJECT_RELATION; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)

These look a bit fragile due to starting at some hardcoded "first"
value.  In other places you use "FIRST" symbols:

+       for (IOContext io_context = IOCONTEXT_FIRST; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+               for (IOObject io_object = IOOBJECT_FIRST; io_object < IOOBJECT_NUM_TYPES; io_object++)
+                       for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)

I think that's marginally better, but I think having to define both
FIRST and NUM is excessive and doesn't make it less fragile.  Not sure
what anyone else will say, but I'd prefer if it started at "0".

Thanks for working on this - I'm looking forward to updating my rrdtool
script for this soon.  It'll be nice to finally distinguish huge number
of "backend ringbuffer writes during ALTER" from other backend writes.
Currently, that makes it look like something is terribly wrong.

-- 
Justin

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

January 13, 2023, 21:38:15

Attached is v47.

On Fri, Jan 13, 2023 at 12:23 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Thu, Jan 12, 2023 at 09:19:36PM -0500, Melanie Plageman wrote:
> > On Wed, Jan 11, 2023 at 4:58 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > >
> > > > Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type
> > >
> > > The patch can/will fail with:
> > >
> > > CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
> > > +WARNING:  tablespaces created by regression test cases should have names starting with "regress_"
> > >
> > > CREATE TABLESPACE test_stats LOCATION '';
> > > +WARNING:  tablespaces created by regression test cases should have names starting with "regress_"
> > >
> > > (I already sent patches to address the omission in cirrus.yml)
> >
> > Thanks. I've fixed this
> > I make a tablespace in amcheck -- are there recommendations for naming
> > tablespaces in contrib also?
>
> That's the test_stats one I mentioned.
>
> Check with -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

Thanks. I have now changed both tablespace names and checked using that
macro.

> > > > +          <literal>bulkread</literal>: Qualifying large read I/O operations
> > > > +          done outside of shared buffers, for example, a sequential scan of a
> > > > +          large table.
> > >
> > > I don't think it's correct to say that it's "outside of" shared-buffers.
> >
> > I suppose "outside of" gives the wrong idea. But I need to make clear
> > that this I/O is to and from buffers which are not a part of shared
> > buffers right now -- they may still be accessible from the same data
> > structures which access shared buffers but they are currently being used
> > in a different way.
>
> This would be a good place to link to a description of the ringbuffer,
> if we had one.

Indeed.

> > > s/Qualifying/Certain/
> >
> > I feel like qualifying is more specific than certain, but I would be open
> > to changing it if there was a specific reason you don't like it.
>
> I suggested to change it because at first I started to interpret it as
> "The act of qualifying large I/O ops .." rather than "Large I/O ops that
> qualify..".

I have changed it to "certain".

> +        Number of read operations of <varname>op_bytes</varname> size.
>
> This is still a bit too easy to misinterpret as being in units of bytes.
> I suggest: Number of read operations (which are each of the size
> specified in >op_bytes<).

I have changed this.

> + in order to add the shared buffer to a separate size-limited ring buffer
>
> separate comma
>
> + More information on configuring checkpointer can be found in Section 30.5.
>
> *the* checkpointer (as in the following paragraph)

above items changed.

> +   <varname>backend_type</varname> <literal>checkpointer</literal> and
> +   <varname>io_object</varname> <literal>temp relation</literal>.
> +  </para>
>
> I still think it's a bit hard to understand the <varname>s adjacent to
> <literal>s.

I agree it isn't great -- is there a different XML tag you suggest
instead of literal?

> + Some backend_types
> + in some io_contexts
> + on some io_objects
> + in certain io_contexts
> + on certain io_objects
>
> Maybe these should not use underscores:  Some backend types never
> perform I/O operations in some I/O contexts and/or on some i/o objects.

I've changed this.

Also, taking another look, I forgot to update the docs' column name
tenses in the last version. That is now done.

> + for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
> + for (IOContext io_context = IOCONTEXT_BULKREAD; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> + for (IOObject io_obj = IOOBJECT_RELATION; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
> + for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
>
> These look a bit fragile due to starting at some hardcoded "first"
> value.  In other places you use "FIRST" symbols:
>
> +       for (IOContext io_context = IOCONTEXT_FIRST; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> +               for (IOObject io_object = IOOBJECT_FIRST; io_object < IOOBJECT_NUM_TYPES; io_object++)
> +                       for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
>
> I think that's marginally better, but I think having to define both
> FIRST and NUM is excessive and doesn't make it less fragile.  Not sure
> what anyone else will say, but I'd prefer if it started at "0".

Thanks for catching the discrepancy in pg_stat_get_io(). I have changed
those instances to use _FIRST.

I think that having the loop start from the first enum value (except
when that value is something special like _INVALID like with
BackendType) is confusing. I agree that having multiple macros to allow
iteration through all enum values introduces some fragility. I'm not
sure about using the number 0 with the enum as the loop variable
data type. Is that a common pattern?

In this version, I have updated the loops in pg_stat_get_io() to use
_FIRST.
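
To make the two loop styles under discussion concrete, here is a small self-contained C sketch. The enum below is a hypothetical stand-in for the patch's IOOp (the member list and order are illustrative, and the _FIRST and _NUM_TYPES macros are defined locally for this note). It shows that starting from a _FIRST macro and starting from a literal 0 visit the same values as long as the first enumerator is 0.

```c
#include <assert.h>

/* Hypothetical stand-in for the patch's IOOp enum; members are illustrative. */
typedef enum IOOp
{
    IOOP_EVICT,
    IOOP_EXTEND,
    IOOP_FSYNC,
    IOOP_READ,
    IOOP_REUSE,
    IOOP_WRITE
} IOOp;

#define IOOP_FIRST      IOOP_EVICT
#define IOOP_NUM_TYPES  (IOOP_WRITE + 1)

/* Iterate from the named first value, as the patch does with _FIRST. */
static int
count_ioops_from_first(void)
{
    int n = 0;

    for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
        n++;
    return n;
}

/* Iterate from a literal 0; valid C, since enum and int convert implicitly. */
static int
count_ioops_from_zero(void)
{
    int n = 0;

    for (IOOp io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
        n++;
    return n;
}
```

Both functions visit all six values. The _FIRST form keeps working if a special leading member such as an _INVALID value is ever added, at the cost of one more macro to keep in sync; the literal 0 form relies on the first real member staying at position zero.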

> Thanks for working on this - I'm looking forward to updating my rrdtool
> script for this soon.  It'll be nice to finally distinguish huge number
> of "backend ringbuffer writes during ALTER" from other backend writes.
> Currently, that makes it look like something is terribly wrong.

Cool! I'm glad to know you will use it.

- Melanie

v48 attached.

On Fri, Jan 13, 2023 at 6:36 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-13 13:38:15 -0500, Melanie Plageman wrote:
> > From f8c9077631169a778c893fd16b7a973ad5725f2a Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Fri, 9 Dec 2022 18:23:19 -0800
> > Subject: [PATCH v47 2/5] pgstat: Infrastructure to track IO operations
> > diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
> > index 0fa5370bcd..608c3b59da 100644
> > --- a/src/backend/utils/activity/pgstat.c
> > +++ b/src/backend/utils/activity/pgstat.c
>
> Reminder to self: Need to bump PGSTAT_FILE_FORMAT_ID before commit.
>
> Perhaps you could add a note about that to the commit message?
>

done

>
>
> > @@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
> >               .snapshot_cb = pgstat_checkpointer_snapshot_cb,
> >       },
> >
> > +     [PGSTAT_KIND_IO] = {
> > +             .name = "io_ops",
>
> That should be "io" now I think?
>

Oh no! I didn't notice this was broken. I've added pg_stat_have_stats()
to the IO stats tests now.

It would be nice if pgstat_get_kind_from_str() could be used in
pg_stat_reset_shared() to avoid having to remember to change both. It
doesn't really work because we want to be able to throw the error
message in pg_stat_reset_shared() when the user input is wrong -- not
the one in pgstat_get_kind_from_str().
Also:
- Since recovery_prefetch doesn't have a statistic kind, it doesn't fit
  well into this paradigm
- Only a subset of the statistics kinds are reset through this function
- bgwriter and checkpointer share a reset target
I added a comment -- perhaps that's all I can do?

On a separate note, should we be setting have_[io/slru/etc]stats to
false in the reset all functions?

>
> > +/*
> > + * Check that stats have not been counted for any combination of IOContext,
> > + * IOObject, and IOOp which are not tracked for the passed-in BackendType. The
> > + * passed-in PgStat_BackendIO must contain stats from the BackendType specified
> > + * by the second parameter. Caller is responsible for locking the passed-in
> > + * PgStat_BackendIO, if needed.
> > + */
>
> Other PgStat_Backend* structs are just for pending data. Perhaps we could
> rename it slightly to make that clearer? PgStat_BktypeIO?
> PgStat_IOForBackendType? or a similar variation?

I've done this.

>
> > +bool
> > +pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io,
> > +                                                      BackendType bktype)
> > +{
> > +     bool            bktype_tracked = pgstat_tracks_io_bktype(bktype);
> > +
> > +     for (IOContext io_context = IOCONTEXT_FIRST;
> > +              io_context < IOCONTEXT_NUM_TYPES; io_context++)
> > +     {
> > +             for (IOObject io_object = IOOBJECT_FIRST;
> > +                      io_object < IOOBJECT_NUM_TYPES; io_object++)
> > +             {
> > +                     /*
> > +                      * Don't bother trying to skip to the next loop iteration if
> > +                      * pgstat_tracks_io_object() would return false here. We still
> > +                      * need to validate that each counter is zero anyway.
> > +                      */
> > +                     for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
> > +                     {
> > +                             if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) &&
> > +                                     backend_io->data[io_context][io_object][io_op] != 0)
> > +                                     return false;
>
> Hm, perhaps this could be broken up into multiple lines? Something like
>
>     /* no stats, so nothing to validate */
>     if (backend_io->data[io_context][io_object][io_op] == 0)
>         continue;
>
>     /* something went wrong if have stats for something not tracked */
>     if (!bktype_tracked ||
>         !pgstat_tracks_io_op(bktype, io_context, io_object, io_op))
>         return false;

I've done this.
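
For readers following along without the patch applied, the restructured check can be sketched as a standalone C function. Everything here is a simplified assumption made for this note: the IoCounters struct, the dimension sizes, and the tracks_io_op() rule are hypothetical stand-ins for PgStat_BackendIO and pgstat_tracks_io_op(), whose real rules are more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { CTX_NORMAL, CTX_BULKREAD, N_CONTEXTS };
enum { OBJ_RELATION, OBJ_TEMP_RELATION, N_OBJECTS };
enum { OP_READ, OP_WRITE, OP_FSYNC, N_OPS };

/* Hypothetical stand-in for the per-backend-type counter array. */
typedef struct IoCounters
{
    int64_t data[N_CONTEXTS][N_OBJECTS][N_OPS];
} IoCounters;

/* Illustrative rule only: pretend temp relations are never fsynced. */
static bool
tracks_io_op(int ctx, int obj, int op)
{
    (void) ctx;
    return !(obj == OBJ_TEMP_RELATION && op == OP_FSYNC);
}

/* Return false if any counter is nonzero for an untracked combination. */
static bool
io_stats_valid(const IoCounters *counters, bool bktype_tracked)
{
    assert(counters != NULL);

    for (int ctx = 0; ctx < N_CONTEXTS; ctx++)
        for (int obj = 0; obj < N_OBJECTS; obj++)
            for (int op = 0; op < N_OPS; op++)
            {
                /* no stats, so nothing to validate */
                if (counters->data[ctx][obj][op] == 0)
                    continue;

                /* something went wrong if we have stats for something untracked */
                if (!bktype_tracked || !tracks_io_op(ctx, obj, op))
                    return false;
            }

    return true;
}
```

Splitting the condition in two keeps the common case (a zero counter) on a short early path and leaves the tracking test readable on its own.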

> > +typedef struct PgStat_BackendIO
> > +{
> > +     PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > +} PgStat_BackendIO;
>
> Would it bother you if we swapped the order of iocontext and ioobject here and
> related places? It makes more sense to me semantically, and should now be
> pretty easy, code wise.

So, thinking about this I started noticing inconsistencies in other
areas around this order:
For example: ordering of objects mentioned in commit messages and comments,
ordering of parameters (like in pgstat_count_io_op() [currently in
reverse order]).

I think we should make a final decision about this ordering and then
make everywhere consistent (including ordering in the view).

Currently the order is:
BackendType
  IOContext
    IOObject
      IOOp

You are suggesting this order:
BackendType
  IOObject
    IOContext
      IOOp

Could you explain what you find more natural about this ordering (as I
find the other more natural)?

This is one possible natural sentence with these objects:

During COPY, a client backend may read in data from a permanent
relation.
This order is:
IOContext
  BackendType
    IOOp
      IOObject

I think English sentences are often structured subject, verb, object --
but in our case, we have an extra thing that doesn't fit neatly
(IOContext). Also, IOOp in a sentence would be in the middle (as the
verb). I made it last because a) it feels like the smallest unit b) it
would make the code a lot more annoying if it wasn't last.

WRT IOObject and IOContext, is there a future case for which having
IOObject first will be better or lead to fewer mistakes?

I actually see loads of places where this needs to be made consistent.

>
> > +/* shared version of PgStat_IO */
> > +typedef struct PgStatShared_IO
> > +{
>
> Maybe /* PgStat_IO in shared memory */?
>

updated.

>
> > Subject: [PATCH v47 3/5] pgstat: Count IO for relations
>
> Nearly happy with this now. See one minor nit below.
>
> I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but
> I don't have a better idea, and it doesn't seem too horrible.

You don't like it because such things shouldn't be in md.c -- since we
went to the trouble of having function pointers and making it general?

>
> > @@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >
> >       UnlockBufHdr(buf, buf_state);
> >
> > +     if (oldFlags & BM_VALID)
> > +     {
> > +             /*
> > +              * When a BufferAccessStrategy is in use, blocks evicted from shared
> > +              * buffers are counted as IOOP_EVICT in the corresponding context
> > +              * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
> > +              * strategy in two cases: 1) while initially claiming buffers for the
> > +              * strategy ring 2) to replace an existing strategy ring buffer
> > +              * because it is pinned or in use and cannot be reused.
> > +              *
> > +              * Blocks evicted from buffers already in the strategy ring are
> > +              * counted as IOOP_REUSE in the corresponding strategy context.
> > +              *
> > +              * At this point, we can accurately count evictions and reuses,
> > +              * because we have successfully claimed the valid buffer. Previously,
> > +              * we may have been forced to release the buffer due to concurrent
> > +              * pinners or erroring out.
> > +              */
> > +             pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
> > +                                                IOOBJECT_RELATION, *io_context);
> > +     }
> > +
> >       if (oldPartitionLock != NULL)
> >       {
> >               BufTableDelete(&oldTag, oldHash);
>
> There's no reason to do this while we still hold the buffer partition lock,
> right? That's a highly contended lock, and we can just move the counting a few
> lines down.

Thanks, I've done this.

>
> > @@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path)
> >       if (need_to_close)
> >               FileClose(file);
> >
> > +     if (result >= 0)
> > +             pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
> > +
>
> I'd lean towards doing this unconditionally, it's still an fsync if it
> failed... Not that it matters.

Good point. We still incurred the costs even if we did not get the
benefits. I've updated this.

>
> > Subject: [PATCH v47 4/5] Add system view tracking IO ops per backend type
>
> Note to self + commit message: Remember the need to do a catversion bump.

Noted.

>
> > +-- pg_stat_io test:
> > +-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy.
>
> Maybe add that "whereas a sequential scan does not, see ..."?

Updated.

>
> > This allows
> > +-- us to reliably test that pg_stat_io BULKREAD reads are being captured
> > +-- without relying on the size of shared buffers or on an expensive operation
> > +-- like CREATE DATABASE.
>
> CREATE / DROP TABLESPACE is also pretty expensive, but I don't have a better
> idea.

I've added a comment.

>
> > +-- Create an alternative tablespace and move the heaptest table to it, causing
> > +-- it to be rewritten.
>
> IIRC the point of that is that it reliably evicts all the buffers from s_b,
> correct? If so, mention that?

Done.

>
> > +Datum
> > +pg_stat_get_io(PG_FUNCTION_ARGS)
> > +{
> > +     ReturnSetInfo *rsinfo;
> > +     PgStat_IO  *backends_io_stats;
> > +     Datum           reset_time;
> > +
> > +     InitMaterializedSRF(fcinfo, 0);
> > +     rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> > +
> > +     backends_io_stats = pgstat_fetch_stat_io();
> > +
> > +     reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
> > +
> > +     for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
> > +     {
> > +             bool            bktype_tracked;
> > +             Datum           bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
> > +             PgStat_BackendIO *bktype_stats = &backends_io_stats->stats[bktype];
> > +
> > +             /*
> > +              * For those BackendTypes without IO Operation stats, skip
> > +              * representing them in the view altogether. We still loop through
> > +              * their counters so that we can assert that all values are zero.
> > +              */
> > +             bktype_tracked = pgstat_tracks_io_bktype(bktype);
>
> How about instead just doing Assert(pgstat_bktype_io_stats_valid(...))? That
> deduplicates the logic for the asserts, and avoids doing the full loop when
> assertions aren't enabled anyway?
>

I've done this and added a comment.

>
>
> > +-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
> > +-- and fsyncs.
> > +-- The second checkpoint ensures that stats from the first checkpoint have been
> > +-- reported and protects against any potential races amongst the table
> > +-- creation, a possible timing-triggered checkpoint, and the explicit
> > +-- checkpoint in the test.
>
> There's a comment about the subsequent checkpoints earlier in the file, and I
> think the comment is slightly more precise. Maybe just reference the earlier comment?
>
>
> > +-- Change the tablespace so that the table is rewritten directly, then SELECT
> > +-- from it to cause it to be read back into shared buffers.
> > +SET allow_in_place_tablespaces = true;
> > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
>
> Perhaps worth doing this in tablespace.sql, to avoid the additional
> checkpoints done as part of CREATE/DROP TABLESPACE?
>
> Or, at least combine this with the CHECKPOINTs above?

I see a checkpoint is requested when dropping the tablespace if not all
the files in it are deleted. It seems like if the DROP TABLE for the
permanent table is before the explicit checkpoints in the test, then the
DROP TABLESPACE will not cause an additional checkpoint. Is this what
you are suggesting? Dropping the temporary table should not have an
effect on this.

>
> > +-- Drop the table so we can drop the tablespace later.
> > +DROP TABLE test_io_shared;
> > +-- Test that the follow IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
> > +-- - eviction of local buffers in order to reuse them
> > +-- - reads of temporary table blocks into local buffers
> > +-- - writes of local buffers to permanent storage
> > +-- - extends of temporary tables
> > +-- Set temp_buffers to a low value so that we can trigger writes with fewer
> > +-- inserted tuples. Do so in a new session in case temporary tables have been
> > +-- accessed by previous tests in this session.
> > +\c
> > +SET temp_buffers TO '1MB';
>
> I'd set it to the actual minimum '100' (in pages). Perhaps that'd allow to
> make test_io_local a bit smaller?

I've done this.

>
> > +CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
> > +SELECT sum(extends) AS io_sum_local_extends_before
> > +  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
> > +SELECT sum(evictions) AS io_sum_local_evictions_before
> > +  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
> > +SELECT sum(writes) AS io_sum_local_writes_before
> > +  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
> > +-- Insert tuples into the temporary table, generating extends in the stats.
> > +-- Insert enough values that we need to reuse and write out dirty local
> > +-- buffers, generating evictions and writes.
> > +INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
> > +SELECT sum(reads) AS io_sum_local_reads_before
> > +  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
>
> Maybe add something like
>
> SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
>
> Better toast compression or such could easily make test_io_local smaller than
> it is today. Seeing that it's too small would make it easier to understand the
> failure.

Good idea. So, I used pg_table_size() because it seems like
pg_relation_size() does not include the toast relations. However, I'm
not sure this is a good idea, because pg_table_size() includes FSM and
visibility map. Should I write a query to get the toast relation name
and add pg_relation_size() of that relation and the main relation?

>
> > +SELECT sum(evictions) AS io_sum_local_evictions_after
> > +  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation'  \gset
> > +SELECT sum(reads) AS io_sum_local_reads_after
> > +  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation'  \gset
> > +SELECT sum(writes) AS io_sum_local_writes_after
> > +  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation'  \gset
> > +SELECT sum(extends) AS io_sum_local_extends_after
> > +  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation'  \gset
>
> This could just be one select with multiple columns?
>
> I think if you use something like \gset io_sum_local_after_ you can also avoid
> the need to repeat "io_sum_local_" so many times.

Thanks. I didn't realize. I've fixed this throughout the test file.


On Mon, Jan 16, 2023 at 4:42 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
> I missed a couple of versions, but I think the docs are clearer now.
> I'm torn on losing some of the detail, but overall I do think it's a
> good trade-off. Moving some details out to after the table does keep
> the bulk of the view documentation more readable, and the "inform
> database tuning" part is great. I really like the idea of a separate
> Interpreting Statistics section, but for now this works.
>
> >+          <literal>vacuum</literal>: I/O operations performed outside of shared
> >+          buffers while vacuuming and analyzing permanent relations.
>
> Why only permanent relations? Are temporary relations treated
> differently? I imagine if someone has a temp-table-heavy workload that
> requires regularly vacuuming and analyzing those relations, this point
> may be confusing without some additional explanation.

Ah, yes. This is a bit confusing. We don't use buffer access strategies
when operating on temp relations, so vacuuming them is counted in IO
Context normal. I've added this information to the docs but now that
definition is a bit long. Perhaps it should be a note? That seems like
it would draw too much attention to this detail, though...

- Melanie


Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

January 17, 2023, 22:12:32

Hi,

On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote:
> > > @@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
> > >               .snapshot_cb = pgstat_checkpointer_snapshot_cb,
> > >       },
> > >
> > > +     [PGSTAT_KIND_IO] = {
> > > +             .name = "io_ops",
> >
> > That should be "io" now I think?
> >
> 
> Oh no! I didn't notice this was broken. I've added pg_stat_have_stats()
> to the IO stats tests now.
> 
> It would be nice if pgstat_get_kind_from_str() could be used in
> pg_stat_reset_shared() to avoid having to remember to change both.

It's hard to make that work, because of the historical behaviour of that
function :(


> Also:
> - Since recovery_prefetch doesn't have a statistic kind, it doesn't fit
>   well into this paradigm

I think that needs a rework anyway - it went in at about the same time as the
shared mem stats patch, so it doesn't quite cohere.


> On a separate note, should we be setting have_[io/slru/etc]stats to
> false in the reset all functions?

That'd not work reliably, because other backends won't do the same. I don't
see a benefit in doing it differently in the local connection than the other
connections.


> > > +typedef struct PgStat_BackendIO
> > > +{
> > > +     PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > > +} PgStat_BackendIO;
> >
> > Would it bother you if we swapped the order of iocontext and iobject here and
> > related places? It makes more sense to me semantically, and should now be
> > pretty easy, code wise.
> 
> So, thinking about this I started noticing inconsistencies in other
> areas around this order:
> For example: ordering of objects mentioned in commit messages and comments,
> ordering of parameters (like in pgstat_count_io_op() [currently in
> reverse order]).
> 
> I think we should make a final decision about this ordering and then
> make everywhere consistent (including ordering in the view).
> 
> Currently the order is:
> BackendType
>   IOContext
>     IOObject
>       IOOp
> 
> You are suggesting this order:
> BackendType
>   IOObject
>     IOContext
>       IOOp
> 
> Could you explain what you find more natural about this ordering (as I
> find the other more natural)?

The object we're performing IO on determines more things than the context. So
it just seems like the natural hierarchical fit. The context is a sub-category
of the object. Consider how it'll look like if we also have objects for 'wal',
'temp files'. It'll make sense to group by just the object, but it won't make
sense to group by just the context.

If it were trivial to do I'd use a different IOContext for each IOObject. But
it'd make it much harder. So there'll just be a bunch of values of IOContext
that'll only be used for one or a subset of the IOObjects.


The reason to put BackendType at the top is pragmatic - one backend is of a
single type, but can do IO for all kinds of objects/contexts. So any other
hierarchy would make the locking etc much harder.
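
The hierarchy described above (object, then context, then operation, all kept per backend type) can be sketched roughly as follows. This is only an illustration: the `Sk*` types, the enum members, and both helper functions are hypothetical simplifications, not the definitions from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the enums discussed in the thread. */
typedef enum
{
    SK_IOOBJECT_RELATION,
    SK_IOOBJECT_TEMP_RELATION,
    SK_IOOBJECT_NUM_TYPES
} SkIOObject;

typedef enum
{
    SK_IOCONTEXT_NORMAL,
    SK_IOCONTEXT_VACUUM,
    SK_IOCONTEXT_BULKREAD,
    SK_IOCONTEXT_BULKWRITE,
    SK_IOCONTEXT_NUM_TYPES
} SkIOContext;

typedef enum
{
    SK_IOOP_READ,
    SK_IOOP_WRITE,
    SK_IOOP_EXTEND,
    SK_IOOP_NUM_TYPES
} SkIOOp;

/* One such struct would exist per backend type; object is the outer index. */
typedef struct SkBackendIO
{
    int64_t data[SK_IOOBJECT_NUM_TYPES][SK_IOCONTEXT_NUM_TYPES][SK_IOOP_NUM_TYPES];
} SkBackendIO;

/* Count one operation, with parameters in the same order as the hierarchy. */
void
sk_count_io_op(SkBackendIO *io, SkIOObject obj, SkIOContext ctx, SkIOOp op)
{
    io->data[obj][ctx][op]++;
}

/* "Group by just the object": sum every context and op for one object. */
int64_t
sk_sum_for_object(const SkBackendIO *io, SkIOObject obj)
{
    int64_t total = 0;

    for (int ctx = 0; ctx < SK_IOCONTEXT_NUM_TYPES; ctx++)
        for (int op = 0; op < SK_IOOP_NUM_TYPES; op++)
            total += io->data[obj][ctx][op];
    return total;
}
```

With the object as the outer dimension, all counters for one object are contiguous, so summing over a single object is a plain walk over one sub-array.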


> This is one possible natural sentence with these objects:
> 
> During COPY, a client backend may read in data from a permanent
> relation.
> This order is:
> IOContext
>   BackendType
>     IOOp
>       IOObject
> 
> I think English sentences are often structured subject, verb, object --
> but in our case, we have an extra thing that doesn't fit neatly
> (IOContext).

"..., to avoid polluting the buffer cache it uses the bulk (read|write)
strategy".


> Also, IOOp in a sentence would be in the middle (as the
> verb). I made it last because a) it feels like the smallest unit b) it
> would make the code a lot more annoying if it wasn't last.

Yea, I think pragmatically that is the right choice.



> > > Subject: [PATCH v47 3/5] pgstat: Count IO for relations
> >
> > Nearly happy with this now. See one minor nit below.
> >
> > I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but
> > I don't have a better idea, and it doesn't seem too horrible.
> 
> You don't like it because such things shouldn't be in md.c -- since we
> went to the trouble of having function pointers and making it general?

It's more of a gut feeling than well reasoned ;)



> > > +-- Change the tablespace so that the table is rewritten directly, then SELECT
> > > +-- from it to cause it to be read back into shared buffers.
> > > +SET allow_in_place_tablespaces = true;
> > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
> >
> > Perhaps worth doing this in tablespace.sql, to avoid the additional
> > checkpoints done as part of CREATE/DROP TABLESPACE?
> >
> > Or, at least combine this with the CHECKPOINTs above?
> 
> I see a checkpoint is requested when dropping the tablespace if not all
> the files in it are deleted. It seems like if the DROP TABLE for the
> permanent table is before the explicit checkpoints in the test, then the
> DROP TABLESPACE will not cause an additional checkpoint.

Unfortunately, that's not how it works :(. See the comment above mdunlink():

> * For regular relations, we don't unlink the first segment file of the rel,
> * but just truncate it to zero length, and record a request to unlink it after
> * the next checkpoint.  Additional segments can be unlinked immediately,
> * however.  Leaving the empty file in place prevents that relfilenumber
> * from being reused.  The scenario this protects us from is:
> ...


> Is this what you are suggesting? Dropping the temporary table should not
> have an effect on this.

I was wondering about simply moving that portion of the test to
tablespace.sql, where we already created a tablespace.


An alternative would be to propose splitting tablespace.sql into one portion
running at the start of parallel_schedule, and one at the end. Historically,
we needed tablespace.sql to be optional due to causing problems when
replicating to another instance on the same machine, but now we have
allow_in_place_tablespaces.


> > SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
> >
> > Better toast compression or such could easily make test_io_local smaller than
> > it's today. Seeing that it's too small would make it easier to understand the
> > failure.
> 
> Good idea. So, I used pg_table_size() because it seems like
> pg_relation_size() does not include the toast relations. However, I'm
> not sure this is a good idea, because pg_table_size() includes FSM and
> visibility map. Should I write a query to get the toast relation name
> and add pg_relation_size() of that relation and the main relation?

I think it's the right thing to just include the relation size. Your queries
IIRC won't use the toast table or other forks. So I'd leave it at just
pg_relation_size().

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

January 18, 2023, 01:00:34

v49 attached

On Tue, Jan 17, 2023 at 2:12 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote:
>
> > > > +typedef struct PgStat_BackendIO
> > > > +{
> > > > +     PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > > > +} PgStat_BackendIO;
> > >
> > > Would it bother you if we swapped the order of iocontext and iobject here and
> > > related places? It makes more sense to me semantically, and should now be
> > > pretty easy, code wise.
> >
> > So, thinking about this I started noticing inconsistencies in other
> > areas around this order:
> > For example: ordering of objects mentioned in commit messages and comments,
> > ordering of parameters (like in pgstat_count_io_op() [currently in
> > reverse order]).
> >
> > I think we should make a final decision about this ordering and then
> > make everywhere consistent (including ordering in the view).
> >
> > Currently the order is:
> > BackendType
> >   IOContext
> >     IOObject
> >       IOOp
> >
> > You are suggesting this order:
> > BackendType
> >   IOObject
> >     IOContext
> >       IOOp
> >
> > Could you explain what you find more natural about this ordering (as I
> > find the other more natural)?
>
> The object we're performing IO on determines more things than the context. So
> it just seems like the natural hierarchical fit. The context is a sub-category
> of the object. Consider how it'll look like if we also have objects for 'wal',
> 'temp files'. It'll make sense to group by just the object, but it won't make
> sense to group by just the context.
>
> If it were trivial to do I'd use a different IOContext for each IOObject. But
> it'd make it much harder. So there'll just be a bunch of values of IOContext
> that'll only be used for one or a subset of the IOObjects.
>
>
> The reason to put BackendType at the top is pragmatic - one backend is of a
> single type, but can do IO for all kinds of objects/contexts. So any other
> hierarchy would make the locking etc much harder.
>
>
> > This is one possible natural sentence with these objects:
> >
> > During COPY, a client backend may read in data from a permanent
> > relation.
> > This order is:
> > IOContext
> >   BackendType
> >     IOOp
> >       IOObject
> >
> > I think English sentences are often structured subject, verb, object --
> > but in our case, we have an extra thing that doesn't fit neatly
> > (IOContext).
>
> "..., to avoid polluting the buffer cache it uses the bulk (read|write)
> strategy".
>
>
> > Also, IOOp in a sentence would be in the middle (as the
> > verb). I made it last because a) it feels like the smallest unit b) it
> > would make the code a lot more annoying if it wasn't last.
>
> Yea, I think pragmatically that is the right choice.

I have changed the order and updated all the places using
PgStat_BktypeIO as well as in all locations in which it should be
ordered for consistency (that I could find in the pass I did) -- e.g.
the view definition, function signatures, comments, commit messages,
etc.

> > > > +-- Change the tablespace so that the table is rewritten directly, then SELECT
> > > > +-- from it to cause it to be read back into shared buffers.
> > > > +SET allow_in_place_tablespaces = true;
> > > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
> > >
> > > Perhaps worth doing this in tablespace.sql, to avoid the additional
> > > checkpoints done as part of CREATE/DROP TABLESPACE?
> > >
> > > Or, at least combine this with the CHECKPOINTs above?
> >
> > I see a checkpoint is requested when dropping the tablespace if not all
> > the files in it are deleted. It seems like if the DROP TABLE for the
> > permanent table is before the explicit checkpoints in the test, then the
> > DROP TABLESPACE will not cause an additional checkpoint.
>
> Unfortunately, that's not how it works :(. See the comment above mdunlink():
>
> > * For regular relations, we don't unlink the first segment file of the rel,
> > * but just truncate it to zero length, and record a request to unlink it after
> > * the next checkpoint.  Additional segments can be unlinked immediately,
> > * however.  Leaving the empty file in place prevents that relfilenumber
> > * from being reused.  The scenario this protects us from is:
> > ...
>
>
> > Is this what you are suggesting? Dropping the temporary table should not
> > have an effect on this.
>
> I was wondering about simply moving that portion of the test to
> tablespace.sql, where we already created a tablespace.
>
>
> An alternative would be to propose splitting tablespace.sql into one portion
> running at the start of parallel_schedule, and one at the end. Historically,
> we needed tablespace.sql to be optional due to causing problems when
> replicating to another instance on the same machine, but now we have
> allow_in_place_tablespaces.

It seems like the best way would be to split up the tablespace test file
as you suggested and drop the tablespace at the end of the regression
test suite. There could be other tests that could use a tablespace.
Though what I wrote is kind of tablespace test coverage, if this
rewriting behavior no longer happened when doing alter table set
tablespace, we would want to come up with a new test which exercised
that code to count those IO stats, not simply delete it from the
tablespace tests.

> > > SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
> > >
> > > Better toast compression or such could easily make test_io_local smaller than
> > > it's today. Seeing that it's too small would make it easier to understand the
> > > failure.
> >
> > Good idea. So, I used pg_table_size() because it seems like
> > pg_relation_size() does not include the toast relations. However, I'm
> > not sure this is a good idea, because pg_table_size() includes FSM and
> > visibility map. Should I write a query to get the toast relation name
> > and add pg_relation_size() of that relation and the main relation?
>
> I think it's the right thing to just include the relation size. Your queries
> IIRC won't use the toast table or other forks. So I'd leave it at just
> pg_relation_size().

I did notice that this test wasn't using the toast table for the
toastable column -- but you mentioned better toast compression affecting
the future test stability, so I'm confused.

- Melanie

On Thu, Jan 19, 2023 at 6:18 AM vignesh C <vignesh21@gmail.com> wrote:
> The patch does not apply on top of HEAD as in [1], please post a rebased patch:
> === Applying patches on top of PostgreSQL commit ID
> 4f74f5641d53559ec44e74d5bf552e167fdd5d20 ===
> === applying patch
> ./v49-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch
> ....
> patching file src/test/regress/expected/rules.out
> Hunk #1 FAILED at 1876.
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/test/regress/expected/rules.out.rej
>
> [1] - http://cfbot.cputube.org/patch_41_3272.log

Yes, it conflicted with 47bb9db75996232. rebased v50 is attached.

On Tue, Jan 17, 2023 at 5:00 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> > > > > +-- Change the tablespace so that the table is rewritten directly, then SELECT
> > > > > +-- from it to cause it to be read back into shared buffers.
> > > > > +SET allow_in_place_tablespaces = true;
> > > > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
> > > >
> > > > Perhaps worth doing this in tablespace.sql, to avoid the additional
> > > > checkpoints done as part of CREATE/DROP TABLESPACE?
> > > >
> > > > Or, at least combine this with the CHECKPOINTs above?
> > >
> > > I see a checkpoint is requested when dropping the tablespace if not all
> > > the files in it are deleted. It seems like if the DROP TABLE for the
> > > permanent table is before the explicit checkpoints in the test, then the
> > > DROP TABLESPACE will not cause an additional checkpoint.
> >
> > Unfortunately, that's not how it works :(. See the comment above mdunlink():
> >
> > > * For regular relations, we don't unlink the first segment file of the rel,
> > > * but just truncate it to zero length, and record a request to unlink it after
> > > * the next checkpoint.  Additional segments can be unlinked immediately,
> > > * however.  Leaving the empty file in place prevents that relfilenumber
> > > * from being reused.  The scenario this protects us from is:
> > > ...
> >
> >
> > > Is this what you are suggesting? Dropping the temporary table should not
> > > have an effect on this.
> >
> > I was wondering about simply moving that portion of the test to
> > tablespace.sql, where we already created a tablespace.
> >
> >
> > An alternative would be to propose splitting tablespace.sql into one portion
> > running at the start of parallel_schedule, and one at the end. Historically,
> > we needed tablespace.sql to be optional due to causing problems when
> > replicating to another instance on the same machine, but now we have
> > allow_in_place_tablespaces.
>
> It seems like the best way would be to split up the tablespace test file
> as you suggested and drop the tablespace at the end of the regression
> test suite. There could be other tests that could use a tablespace.
> Though what I wrote is kind of tablespace test coverage, if this
> rewriting behavior no longer happened when doing alter table set
> tablespace, we would want to come up with a new test which exercised
> that code to count those IO stats, not simply delete it from the
> tablespace tests.

I have added a patch to the set which creates the regress_tblspace
(formerly created in tablespace.sql) in test_setup.sql. I then moved the
tablespace test to the end of the parallel schedule so that my test (and
others) could use the regress_tblspace.

I modified some of the tablespace.sql tests to be more specific in terms
of the objects they are looking for so that tests using the tablespace
are not forced to drop all of the objects they make in the tablespace.

Note that I did not proactively change all tests in tablespace.sql that
may fail in this way -- only those that failed because of the tables I
created (and did not drop) from regress_tblspace.

- Melanie

Attachments

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

January 20, 2023, 05:15:34

On Thu, Jan 19, 2023 at 4:28 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Jan 19, 2023 at 6:18 AM vignesh C <vignesh21@gmail.com> wrote:
> > The patch does not apply on top of HEAD as in [1], please post a rebased patch:
> > === Applying patches on top of PostgreSQL commit ID
> > 4f74f5641d53559ec44e74d5bf552e167fdd5d20 ===
> > === applying patch
> > ./v49-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch
> > ....
> > patching file src/test/regress/expected/rules.out
> > Hunk #1 FAILED at 1876.
> > 1 out of 1 hunk FAILED -- saving rejects to file
> > src/test/regress/expected/rules.out.rej
> >
> > [1] - http://cfbot.cputube.org/patch_41_3272.log
>
> Yes, it conflicted with 47bb9db75996232. rebased v50 is attached.

Oh dear-- an extra FlushBuffer() snuck in there somehow.
Removed it in attached v51.
Also, I fixed an issue in my tablespace.sql updates.

- Melanie

On Sun, Feb 26, 2023 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > The issue seems to be that code like this:
> > ...
> > is far too cute for its own good.
>
> Oh, there's another thing here that qualifies as too-cute: loops like
>
>     for (IOObject io_object = IOOBJECT_FIRST;
>          io_object < IOOBJECT_NUM_TYPES; io_object++)
>
> make it look like we could define these enums as 1-based rather
> than 0-based, but if we did this code would fail, because it's
> confusing "the number of values" with "1 more than the last value".
>
> Again, we could fix that with tests like "io_context <= IOCONTEXT_LAST",
> but I don't see the point of adding more macros rather than removing
> some.  We do need IOOBJECT_NUM_TYPES to declare array sizes with,
> so I think we should nuke the "xxx_FIRST" macros as being not worth
> the electrons they're written on, and write these loops like
>
>     for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
>
> which is not actually adding any assumptions that you don't already
> make by using io_object as a C array subscript.
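
To illustrate the point quoted above about confusing "the number of values" with "1 more than the last value", here is a minimal sketch. Both enums and both functions are hypothetical and exist only for this example; they are not taken from the PostgreSQL sources.

```c
#include <assert.h>

/* Hypothetical 0-based enum: NUM_TYPES is both the count and last value + 1. */
typedef enum
{
    ZB_A,
    ZB_B,
    ZB_C,
    ZB_NUM_TYPES
} ZeroBased;

/* Hypothetical 1-based enum: there are still three values, but the last is 3. */
typedef enum
{
    OB_A = 1,
    OB_B,
    OB_C
} OneBased;

#define OB_NUM_TYPES 3

/* Plain int loop from 0 to NUM_TYPES - 1: visits every 0-based value. */
int
count_zero_based(void)
{
    int visited = 0;

    for (int v = 0; v < ZB_NUM_TYPES; v++)
        visited++;
    return visited;
}

/*
 * "first .. < number of values" applied to the 1-based enum: the loop stops
 * before OB_C, because the count (3) is not one more than the last value.
 */
int
count_one_based_wrong(void)
{
    int visited = 0;

    for (int v = OB_A; v < OB_NUM_TYPES; v++)
        visited++;
    return visited;
}
```

The second loop silently skips the last value, which is why a `_FIRST` macro suggests a flexibility the loops do not actually have.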

Attached is a patch to remove the *_FIRST macros.
I was going to add in code to change

    for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
    to
    for (IOObject io_object = 0; (int) io_object < IOOBJECT_NUM_TYPES; io_object++)

but then I couldn't remember why we didn't just do

    for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)

I recall that when passing that loop variable into a function I was
getting a compiler warning that required me to cast the value back to an
enum to silence it:

            pgstat_tracks_io_op(bktype, (IOObject) io_object, io_context, io_op))

However, I am now unable to reproduce that warning.
Moreover, I see in cases like table_block_relation_size() with
ForkNumber, the variable i is passed with no cast to smgrnblocks().

- Melanie

Attachments

v1-0001-Remove-potentially-misleading-_FIRST-macros.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Tom Lane

Date:

February 27, 2023, 18:30:42

Melanie Plageman <melanieplageman@gmail.com> writes:
> Attached is a patch to remove the *_FIRST macros.
> I was going to add in code to change

>     for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
>     to
>     for (IOObject io_object = 0; (int) io_object < IOOBJECT_NUM_TYPES; io_object++)

I don't really like that proposal.  ISTM it's just silencing the
messenger rather than addressing the underlying problem, namely that
there's no guarantee that an IOObject variable can hold the value
IOOBJECT_NUM_TYPES, which it had better do if you want the loop to
terminate.  Admittedly it's quite unlikely that these three enums would
grow to the point that that becomes an actual hazard for them --- but
IMO it's still bad practice and a bad precedent for future code.

> but then I couldn't remember why we didn't just do

>     for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)

> I recall that when passing that loop variable into a function I was
> getting a compiler warning that required me to cast the value back to an
> enum to silence it:

>             pgstat_tracks_io_op(bktype, (IOObject) io_object, io_context, io_op))

> However, I am now unable to reproduce that warning.
> Moreover, I see in cases like table_block_relation_size() with
> ForkNumber, the variable i is passed with no cast to smgrnblocks().

Yeah, my druthers would be to just do it the way we do comparable
things with ForkNumber.  I don't feel like we need to invent a
better way here.

The risk of needing to cast when using the "int" loop variable
as an enum is obviously the downside of that approach, but we have
not seen any indication that any compilers actually do warn.
It's interesting that you did see such a warning ... I wonder which
compiler you were using at the time?

            regards, tom lane

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

February 27, 2023, 22:03:16

On Mon, Feb 27, 2023 at 10:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Melanie Plageman <melanieplageman@gmail.com> writes:
> > Attached is a patch to remove the *_FIRST macros.
> > I was going to add in code to change
>
> >     for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
> >     to
> >     for (IOObject io_object = 0; (int) io_object < IOOBJECT_NUM_TYPES; io_object++)
>
> I don't really like that proposal.  ISTM it's just silencing the
> messenger rather than addressing the underlying problem, namely that
> there's no guarantee that an IOObject variable can hold the value
> IOOBJECT_NUM_TYPES, which it had better do if you want the loop to
> terminate.  Admittedly it's quite unlikely that these three enums would
> grow to the point that that becomes an actual hazard for them --- but
> IMO it's still bad practice and a bad precedent for future code.

That's fair. Patch attached.

> > but then I couldn't remember why we didn't just do
>
> >     for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
>
> > I recall that when passing that loop variable into a function I was
> > getting a compiler warning that required me to cast the value back to an
> > enum to silence it:
>
> > >             pgstat_tracks_io_op(bktype, (IOObject) io_object, io_context, io_op))
>
> > However, I am now unable to reproduce that warning.
> > Moreover, I see in cases like table_block_relation_size() with
> > ForkNumber, the variable i is passed with no cast to smgrnblocks().
>
> Yeah, my druthers would be to just do it the way we do comparable
> things with ForkNumber.  I don't feel like we need to invent a
> better way here.
>
> The risk of needing to cast when using the "int" loop variable
> as an enum is obviously the downside of that approach, but we have
> not seen any indication that any compilers actually do warn.
> It's interesting that you did see such a warning ... I wonder which
> compiler you were using at the time?

So, pretty much any version of clang I tried with
-Wsign-conversion produces a warning.

<source>:35:32: warning: implicit conversion changes signedness: 'int' to 'IOOp' (aka 'enum IOOp') [-Wsign-conversion]

I didn't do the casts in the attached patch since they aren't done elsewhere.

- Melanie

Attachments

Change-IO-stats-enum-loop-variables-to-ints.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Tom Lane

Date:

February 27, 2023, 22:58:30

Melanie Plageman <melanieplageman@gmail.com> writes:
> On Mon, Feb 27, 2023 at 10:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The risk of needing to cast when using the "int" loop variable
>> as an enum is obviously the downside of that approach, but we have
>> not seen any indication that any compilers actually do warn.
>> It's interesting that you did see such a warning ... I wonder which
>> compiler you were using at the time?

> so, pretty much any version of clang I tried with
> -Wsign-conversion produces a warning.

> <source>:35:32: warning: implicit conversion changes signedness: 'int' to 'IOOp' (aka 'enum IOOp') [-Wsign-conversion]

Oh, interesting --- so it's not about the implicit conversion to enum
but just about signedness.  I bet we could silence that by making the
loop variables be "unsigned int".  I doubt it's worth any extra keystrokes
though, because we are not at all clean about sign-conversion warnings.
I tried enabling -Wsign-conversion on Apple's clang 14.0.0 just now,
and counted 13462 such warnings just in the core build :-(.  I don't
foresee anybody trying to clean that up.
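
A small sketch of the two loop styles under discussion. The enum and the functions are hypothetical stand-ins (`SkOp` is not the real `IOOp`, and `sk_tracks_op()` is not `pgstat_tracks_io_op()`). The sketch only shows that both forms compile as C and behave identically; whether a given compiler emits a sign-conversion warning for the first form depends on the compiler and its flags, and the idea that the second form silences it is the guess made above, not something verified here.

```c
#include <assert.h>

/* Hypothetical stand-in for an enum such as IOOp. */
typedef enum
{
    SK_OP_READ,
    SK_OP_WRITE,
    SK_OP_EXTEND,
    SK_OP_NUM_TYPES
} SkOp;

/* Hypothetical predicate taking the enum type as its parameter. */
int
sk_tracks_op(SkOp op)
{
    return op != SK_OP_EXTEND;
}

/* "int" loop variable, passed to the enum parameter without a cast. */
int
sk_count_tracked_int(void)
{
    int n = 0;

    for (int op = 0; op < SK_OP_NUM_TYPES; op++)
        n += sk_tracks_op(op);  /* implicit int -> enum conversion */
    return n;
}

/* Same loop with an "unsigned int" variable, as floated in the thread. */
int
sk_count_tracked_unsigned(void)
{
    int n = 0;

    for (unsigned int op = 0; op < SK_OP_NUM_TYPES; op++)
        n += sk_tracks_op(op);  /* implicit unsigned int -> enum conversion */
    return n;
}
```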

> I didn't do the casts in the attached patch since they aren't done elsewhere.

Agreed.  I'll push this along with the earlier patch if there are
not objections.

            regards, tom lane

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

February 28, 2023, 02:18:30

On 2023-02-27 14:58:30 -0500, Tom Lane wrote:
> Agreed.  I'll push this along with the earlier patch if there are
> not objections.

None here.

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Tom Lane

Date:

March 5, 2023, 02:21:09

Andres Freund <andres@anarazel.de> writes:
> Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
> and the pg_stat_io tests.

One of the test cases is flapping a bit:

diff -U3 /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out
--- /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out    2023-03-04 21:30:05.891579466+0100
+++ /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out    2023-03-04 21:34:26.745552661+0100
@@ -1201,7 +1201,7 @@
 SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
  ?column?
 ----------
- t
+ f
 (1 row)

 DROP TABLE test_io_shared;

There are two instances of this today [1][2], and I've seen it before
but failed to note down where.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2023-03-04%2021%3A19%3A39
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2023-03-04%2020%3A30%3A05

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Kyotaro Horiguchi

Date:

March 6, 2023, 09:24:25

At Sat, 04 Mar 2023 18:21:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Andres Freund <andres@anarazel.de> writes:
> > Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
> > and the pg_stat_io tests.
> 
> One of the test cases is flapping a bit:
> 
> diff -U3 /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out
> --- /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out    2023-03-04 21:30:05.891579466+0100
> +++ /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out    2023-03-04 21:34:26.745552661+0100
> @@ -1201,7 +1201,7 @@
>  SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
>   ?column? 
>  ----------
> - t
> + f
>  (1 row)
>  
>  DROP TABLE test_io_shared;
> 
> There are two instances of this today [1][2], and I've seen it before
> but failed to note down where.

The concurrent autoanalyze below is logged as performing at least one
page read from the table. It is unclear, however, how that analyze
operation resulted in 19 hits and 2 reads on the (I think) single-page
relation.

In any case, I think we need to avoid such concurrent autovacuum/analyze.


https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2023-03-04%2021%3A19%3A39

2023-03-04 22:36:27.781 CET [4073:106] pg_regress/stats LOG:  statement: ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace;
2023-03-04 22:36:27.838 CET [4073:107] pg_regress/stats LOG:  statement: SELECT COUNT(*) FROM test_io_shared;
2023-03-04 22:36:27.864 CET [4255:5] LOG:  automatic analyze of table "regression.public.test_io_shared"
    avg read rate: 5.208 MB/s, avg write rate: 5.208 MB/s
    buffer usage: 17 hits, 2 misses, 2 dirtied
2023-03-04 22:36:28.024 CET [4073:108] pg_regress/stats LOG:  statement: SELECT pg_stat_force_next_flush();
2023-03-04 22:36:28.024 CET [4073:108] pg_regress/stats LOG:  statement: SELECT pg_stat_force_next_flush();
2023-03-04 22:36:28.027 CET [4073:109] pg_regress/stats LOG:  statement: SELECT sum(reads) AS io_sum_shared_after_reads
      FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation'



> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2023-03-04%2021%3A19%3A39
> [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2023-03-04%2020%3A30%3A05

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Kyotaro Horiguchi

Date:

March 6, 2023, 09:48:43

At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> In any case, I think we need to avoid such concurrent autovacuum/analyze.

If it is correct, I believe the attached fix works.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

fix_stats_test.diff

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

March 6, 2023, 18:09:24

On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > In any case, I think we need to avoid such concurrent autovacuum/analyze.
>
> If it is correct, I believe the attached fix works.

Thanks for investigating this!

Yes, this fix looks correct and makes sense to me.

On Mon, Mar 6, 2023 at 1:24 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Sat, 04 Mar 2023 18:21:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> > Andres Freund <andres@anarazel.de> writes:
> > > Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
> > > and the pg_stat_io tests.
> >
> > One of the test cases is flapping a bit:
> >
> > diff -U3 /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out
> > --- /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out     2023-03-04 21:30:05.891579466+0100
> > +++ /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out      2023-03-04 21:34:26.745552661+0100
> > @@ -1201,7 +1201,7 @@
> >  SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
> >   ?column?
> >  ----------
> > - t
> > + f
> >  (1 row)
> >
> >  DROP TABLE test_io_shared;
> >
> > There are two instances of this today [1][2], and I've seen it before
> > but failed to note down where.
>
> The concurrent autoanalyze below is logged as performing at least one
> page read from the table. It is unclear, however, how that analyze
> operation resulted in 19 hits and 2 reads on the (I think) single-page
> relation.

Yes, it is a single page.
I think there could be a few different reasons why it is 2 misses/2
dirtied, but the one that seems most likely is that I/O of other
relations done during this autovac/analyze of this relation is counted
in the same global variables (like catalog tables).

- Melanie

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

06 March 2023, 22:09:19

Hi,

On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
> On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > > In any case, I think we need to avoid such concurrent autovacuum/analyze.
> >
> > If it is correct, I believe the attached fix works.
> 
> Thanks for investigating this!
> 
> Yes, this fix looks correct and makes sense to me.

Wouldn't it be better to just perform the section from the ALTER TABLE till
the DROP TABLE in a transaction? Then there couldn't be any other accesses in
just that section. I'm not convinced it's good to disallow all concurrent
activity in other parts of the test.

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

06 March 2023, 22:24:09

On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote:
> Hi,
> 
> On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
> > On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > >
> > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > > > In any case, I think we need to avoid such concurrent autovacuum/analyze.
> > >
> > > If it is correct, I believe the attached fix works.
> > 
> > Thanks for investigating this!
> > 
> > Yes, this fix looks correct and makes sense to me.
> 
> Wouldn't it be better to just perform the section from the ALTER TABLE till
> the DROP TABLE in a transaction? Then there couldn't be any other accesses in
> just that section. I'm not convinced it's good to disallow all concurrent
> activity in other parts of the test.

You mean for test coverage reasons? Because the table in question only
exists for a few operations in this test file.

- Melanie

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

06 March 2023, 22:34:04

Hi,

On 2023-03-06 14:24:09 -0500, Melanie Plageman wrote:
> On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote:
> > On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
> > > On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > > >
> > > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > > > > In any case, I think we need to avoid such concurrent autovacuum/analyze.
> > > >
> > > > If it is correct, I believe the attached fix works.
> > > 
> > > Thanks for investigating this!
> > > 
> > > Yes, this fix looks correct and makes sense to me.
> > 
> > Wouldn't it be better to just perform the section from the ALTER TABLE till
> > the DROP TABLE in a transaction? Then there couldn't be any other accesses in
> > just that section. I'm not convinced it's good to disallow all concurrent
> > activity in other parts of the test.
> 
> You mean for test coverage reasons? Because the table in question only
> exists for a few operations in this test file.

That, but also because it's simply more reliable. autovacuum=off doesn't
protect against an anti-wraparound vacuum or such. Or a concurrent test somehow
triggering a read. Or ...

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

06 March 2023, 23:21:14

On Mon, Mar 6, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-03-06 14:24:09 -0500, Melanie Plageman wrote:
> > On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote:
> > > On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
> > > > On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
> > > > <horikyota.ntt@gmail.com> wrote:
> > > > >
> > > > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > > > > > In any case, I think we need to avoid such concurrent autovacuum/analyze.
> > > > >
> > > > > If it is correct, I believe the attached fix works.
> > > >
> > > > Thanks for investigating this!
> > > >
> > > > Yes, this fix looks correct and makes sense to me.
> > >
> > > Wouldn't it be better to just perform the section from the ALTER TABLE till
> > > the DROP TABLE in a transaction? Then there couldn't be any other accesses in
> > > just that section. I'm not convinced it's good to disallow all concurrent
> > > activity in other parts of the test.
> >
> > You mean for test coverage reasons? Because the table in question only
> > exists for a few operations in this test file.
>
> That, but also because it's simply more reliable. autovacuum=off doesn't
> protect against an anti-wraparound vacuum or such. Or a concurrent test somehow
> triggering a read. Or ...

Good point. Attached is what you suggested. I committed the transaction
before the drop table so that the statistics would be visible when we
queried pg_stat_io.

- Melanie

Attachments

v1-0001-Fix-flakey-pg_stat_io-test.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Kyotaro Horiguchi

Date:

07 March 2023, 05:55:10

At Mon, 6 Mar 2023 15:21:14 -0500, Melanie Plageman <melanieplageman@gmail.com> wrote in 
> On Mon, Mar 6, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> > That, but also because it's simply more reliable. autovacuum=off doesn't
> > protect against an anti-wraparound vacuum or such. Or a concurrent test somehow
> > triggering a read. Or ...
>
> Good point. Attached is what you suggested. I committed the transaction
> before the drop table so that the statistics would be visible when we
> queried pg_stat_io.

While I don't believe an anti-wraparound vacuum can occur during testing,
Melanie's solution (moving the commit by a few lines) seems to work
(verified by manual testing).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

07 March 2023, 21:18:44

Hi,

On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote:
> Good point. Attached is what you suggested. I committed the transaction
> before the drop table so that the statistics would be visible when we
> queried pg_stat_io.

Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie.

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Justin Pryzby

Date:

09 March 2023, 15:51:31

On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote:
> Hi,
> 
> On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote:
> > Good point. Attached is what you suggested. I committed the transaction
> > before the drop table so that the statistics would be visible when we
> > queried pg_stat_io.
> 
> Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie.

There's a 2nd portion of the test that's still flapping, at least on
cirrusci.

The issue that Tom mentioned is at:
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;

But what I've seen on cirrusci is at:
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;

https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5355168397524992/log/src/test/recovery/tmp_check/regression.diffs

https://api.cirrus-ci.com/v1/artifact/task/6142435751886848/testrun/build/testrun/recovery/027_stream_regress/log/regress_log_027_stream_regress

It'd be neat if cfbot could show a histogram of test failures, although
I'm not entirely sure what granularity would be most useful: the test
that failed (027_regress) or the way it failed (:after_write >
:before_writes).  Maybe it's enough to show the test, with links to its
recent failures.

-- 
Justin

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Andres Freund

Date:

09 March 2023, 22:43:01

Hi,

On 2023-03-09 06:51:31 -0600, Justin Pryzby wrote:
> On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote:
> > > Good point. Attached is what you suggested. I committed the transaction
> > > before the drop table so that the statistics would be visible when we
> > > queried pg_stat_io.
> > 
> > Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie.
> 
> There's a 2nd portion of the test that's still flapping, at least on
> cirrusci.
> 
> The issue that Tom mentioned is at:
>  SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
> 
> But what I've seen on cirrusci is at:
>  SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;

Seems you meant to copy a different line for Tom's (s/writes/reads/)?


> https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs

Hm. I guess the explanation here is that the buffers were already all written
out by another backend. Which is made more likely by your patch.


I found a few more occurrences and chatted with Melanie. Melanie will come up
with a fix I think.

Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

10 March 2023, 22:51:13

On Thu, Mar 9, 2023 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-03-09 06:51:31 -0600, Justin Pryzby wrote:
> > On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote:
> > There's a 2nd portion of the test that's still flapping, at least on
> > cirrusci.
> >
> > The issue that Tom mentioned is at:
> >  SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
> >
> > But what I've seen on cirrusci is at:
> >  SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
>
> Seems you meant to copy a different line for Tom's (s/writes/reads/)?
>
>
> > https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs
>
> Hm. I guess the explanation here is that the buffers were already all written
> out by another backend. Which is made more likely by your patch.
>
>
> I found a few more occurrences and chatted with Melanie. Melanie will come up
> with a fix I think.

So, what this test is relying on is that either the checkpointer or
another backend will flush the pages of test_io_shared which we dirtied
above in the test. The test specifically checks for IOCONTEXT_NORMAL
writes. It could fail if some other backend is doing a bulkread or
bulkwrite and flushes these buffers first in a strategy context.
This will happen more often when shared buffers is small.

I tried to come up with a reliable test which was limited to
IOCONTEXT_NORMAL. I thought if we could guarantee a dirty buffer would
be pinned using a cursor, that we could then issue a checkpoint and
guarantee a flush that way. However, I don't see a way to guarantee that
no one flushes the buffer between dirtying it and pinning it with the
cursor.

So, I think our best bet is to just change the test to pass if there are
any writes in any contexts. By moving the sum(writes) before the INSERT
and keeping the checkpoint, we can guarantee that one way or another,
some buffers will be flushed. This essentially covers the same code anyway.
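
To illustrate why counting only 'normal' writes is fragile, here is a
minimal C sketch. It is a toy model, not PostgreSQL code: the enum
io_ctx, the struct io_counters and the helper functions are hypothetical
names that merely mimic per-context write counters like those exposed
in pg_stat_io.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the io_context values shown in pg_stat_io. */
typedef enum
{
    CTX_NORMAL,
    CTX_BULKREAD,
    CTX_BULKWRITE,
    CTX_VACUUM,
    CTX_COUNT                   /* number of contexts, not a real context */
} io_ctx;

/* One write counter per context (assumed layout, for illustration only). */
typedef struct
{
    long writes[CTX_COUNT];
} io_counters;

/* Account for one buffer write performed in the given context. */
static void
record_write(io_counters *c, io_ctx ctx)
{
    c->writes[ctx]++;
}

/* What the original test summed: writes in the normal context only. */
static long
sum_normal_writes(const io_counters *c)
{
    return c->writes[CTX_NORMAL];
}

/* What the revised test sums: writes in any context. */
static long
sum_all_writes(const io_counters *c)
{
    long total = 0;

    for (size_t i = 0; i < CTX_COUNT; i++)
        total += c->writes[i];
    return total;
}
```

If another backend flushes the dirtied buffers in a bulkwrite context,
sum_normal_writes() stays unchanged while sum_all_writes() still grows,
which is the property the revised test relies on.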

Patch attached.

- Melanie

Attachments

Stabilize-pg_stat_io-writes-test.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Justin Pryzby

Date:

10 March 2023, 23:19:04

On Thu, Mar 09, 2023 at 11:43:01AM -0800, Andres Freund wrote:
> On 2023-03-09 06:51:31 -0600, Justin Pryzby wrote:
> > On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote:
> > > > Good point. Attached is what you suggested. I committed the transaction
> > > > before the drop table so that the statistics would be visible when we
> > > > queried pg_stat_io.
> > > 
> > > Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie.
> > 
> > There's a 2nd portion of the test that's still flapping, at least on
> > cirrusci.
> > 
> > The issue that Tom mentioned is at:
> >  SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
> > 
> > But what I've seen on cirrusci is at:
> >  SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
> 
> Seems you meant to copy a different line for Tom's (s/writes/reads/)?

Seems so

> > https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs
> 
> Hm. I guess the explanation here is that the buffers were already all written
> out by another backend. Which is made more likely by your patch.

FYI: that patch would've made it more likely for each backend to write
out its *own* dirty pages of TOAST ... but the two other failures that I
mentioned were for patches which wouldn't have affected this at all.

-- 
Justin

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

10 March 2023, 23:33:44

On Fri, Mar 10, 2023 at 3:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Thu, Mar 09, 2023 at 11:43:01AM -0800, Andres Freund wrote:
> > > https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs
> >
> > Hm. I guess the explanation here is that the buffers were already all written
> > out by another backend. Which is made more likely by your patch.
>
> FYI: that patch would've made it more likely for each backend to write
> out its *own* dirty pages of TOAST ... but the two other failures that I
> mentioned were for patches which wouldn't have affected this at all.

I think your patch made it more likely that a backend needing to flush a
buffer in order to fit its own data would be doing so in a buffer access
strategy IO context.

Your patch makes it so those toast table writes are using a
BAS_BULKWRITE (see GetBulkInsertState()) and when they are looking for
buffers to put their data in, they have to evict other data (theirs and
others) but all of this is tracked in io_context = 'bulkwrite' -- and
the test only counted writes done in io_context 'normal'. But it is good
that your patch did that! It helped us to see that this test is not
reliable.
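
The relationship between a buffer access strategy and the io_context a
write is attributed to can be sketched as follows. This is a simplified
model under stated assumptions: strategy_type and both functions are
hypothetical names, and only the io_context strings ('normal',
'bulkread', 'bulkwrite', 'vacuum') correspond to values that appear in
pg_stat_io.

```c
#include <string.h>

/* Hypothetical enum modelling the kinds of buffer access strategy. */
typedef enum
{
    STRAT_NORMAL,               /* no strategy ring in use */
    STRAT_BULKREAD,
    STRAT_BULKWRITE,
    STRAT_VACUUM
} strategy_type;

/* Map a strategy to the io_context label its I/O is counted under. */
static const char *
io_context_for_strategy(strategy_type s)
{
    switch (s)
    {
        case STRAT_BULKREAD:
            return "bulkread";
        case STRAT_BULKWRITE:
            return "bulkwrite";
        case STRAT_VACUUM:
            return "vacuum";
        case STRAT_NORMAL:
        default:
            return "normal";
    }
}

/* Would a test that only sums io_context = 'normal' see this write? */
static int
counted_by_normal_only_test(strategy_type s)
{
    return strcmp(io_context_for_strategy(s), "normal") == 0;
}
```

In this model, a flush performed while using a bulkwrite strategy is
invisible to a check restricted to 'normal', which matches the failure
described above.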

The other times this test failed in cfbot were for a patch that had many
failures and might have something wrong with its code, IIRC.

Thanks again for the report!

- Melanie

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Pavel Luzanov

Date:

03 April 2023, 07:13:26

Hello,

I found that the 'standalone backend' backend type is not documented 
right now.
Adding something like the following (from the commit message) would be helpful:

Both the bootstrap backend and single user mode backends will have 
backend_type STANDALONE_BACKEND.

-- 
Pavel Luzanov
Postgres Professional: https://postgrespro.com

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

03 April 2023, 23:50:43

On Mon, Apr 3, 2023 at 12:13 AM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
>
> Hello,
>
> I found that the 'standalone backend' backend type is not documented
> right now.
> Adding something like the following (from the commit message) would be helpful:
>
> Both the bootstrap backend and single user mode backends will have
> backend_type STANDALONE_BACKEND.

Thanks for the report.

Attached is a tiny patch to add standalone backend type to
pg_stat_activity documentation (referenced by pg_stat_io).

I mentioned both the bootstrap process and single user mode process in
the docs, though I can't imagine that the bootstrap process is relevant
for pg_stat_activity.

I also noticed that the pg_stat_activity docs call background workers
"parallel workers" (though it also mentions that extensions could have
other background workers registered), but this seems a bit weird because
pg_stat_activity uses GetBackendTypeDesc() and this prints "background
worker" for type B_BG_WORKER. Background workers doing parallelism tasks
is what users will most often see in pg_stat_activity, but I feel like
it is confusing to have it documented as something different than what
would appear in the view. Unless I am misunderstanding something...

- Melanie

Attachments

v1-0001-Document-standalone-backend-type-in-pg_stat_activ.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Pavel Luzanov

Date:

04 April 2023, 23:35:13

On 03.04.2023 23:50, Melanie Plageman wrote:
> Attached is a tiny patch to add standalone backend type to
> pg_stat_activity documentation (referenced by pg_stat_io).
>
> I mentioned both the bootstrap process and single user mode process in
> the docs, though I can't imagine that the bootstrap process is relevant
> for pg_stat_activity.

After a little thought... I'm not sure about the term 'bootstrap 
process'. I can't find this term in the documentation.
Do I understand correctly that this is a postmaster? If so, then the 
postmaster process is not shown in pg_stat_activity.

Perhaps it may be worth adding a description of the standalone backend 
to pg_stat_io, not to pg_stat_activity.
Something like: backend_type is all types from pg_stat_activity plus 
'standalone backend',
which is used for the postmaster process and in a single user mode.

> I also noticed that the pg_stat_activity docs call background workers
> "parallel workers" (though it also mentions that extensions could have
> other background workers registered), but this seems a bit weird because
> pg_stat_activity uses GetBackendTypeDesc() and this prints "background
> worker" for type B_BG_WORKER. Background workers doing parallelism tasks
> is what users will most often see in pg_stat_activity, but I feel like
> it is confusing to have it documented as something different than what
> would appear in the view. Unless I am misunderstanding something...

'parallel worker' does appear in pg_stat_activity for parallel queries,
so I think the docs are right here.

-- 
Pavel Luzanov
Postgres Professional: https://postgrespro.com

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

05 April 2023, 03:41:09

On Tue, Apr 4, 2023 at 4:35 PM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
>
> On 03.04.2023 23:50, Melanie Plageman wrote:
> > Attached is a tiny patch to add standalone backend type to
> > pg_stat_activity documentation (referenced by pg_stat_io).
> >
> > I mentioned both the bootstrap process and single user mode process in
> > the docs, though I can't imagine that the bootstrap process is relevant
> > for pg_stat_activity.
>
> After a little thought... I'm not sure about the term 'bootstrap
> process'. I can't find this term in the documentation.

There are various mentions of "bootstrap" peppered throughout the docs
but no concise summary of what it is. For example, initdb docs mention
the "bootstrap backend" [1].

Interestingly, 910cab820d0 added "Bootstrap superuser" in November. This
doesn't really cover what bootstrapping is itself, but I wonder if that
is useful? If so, you could propose a glossary entry for it?
(preferably in a new thread)

> Do I understand correctly that this is a postmaster? If so, then the
> postmaster process is not shown in pg_stat_activity.

No, bootstrap process is for initializing the template database. You
will not be able to see pg_stat_activity when it is running.

> Perhaps it may be worth adding a description of the standalone backend
> to pg_stat_io, not to pg_stat_activity.
> Something like: backend_type is all types from pg_stat_activity plus
> 'standalone backend',
> which is used for the postmaster process and in a single user mode.

You can query pg_stat_activity from single user mode, so it is relevant
to pg_stat_activity also. I take your point that bootstrap mode isn't
relevant for pg_stat_activity, but I am hesitant to add that distinction
to the pg_stat_io docs since the reason you won't see it in
pg_stat_activity is because it is ephemeral and before a user can access
the database and not because stats are not tracked for it.

Can you think of a way to convey this?

> > I also noticed that the pg_stat_activity docs call background workers
> > "parallel workers" (though it also mentions that extensions could have
> > other background workers registered), but this seems a bit weird because
> > pg_stat_activity uses GetBackendTypeDesc() and this prints "background
> > worker" for type B_BG_WORKER. Background workers doing parallelism tasks
> > is what users will most often see in pg_stat_activity, but I feel like
> > it is confusing to have it documented as something different than what
> > would appear in the view. Unless I am misunderstanding something...
>
> 'parallel worker' does appear in pg_stat_activity for parallel queries,
> so I think the docs are right here.

Ah, I didn't read the code in pg_stat_get_activity() closely enough.
Even though GetBackendTypeDesc() never returns a BackendType description
called "parallel worker", we go out of our way to be specific by using
GetBackgroundWorkerTypeByPid():

  /* Add backend type */
  if (beentry->st_backendType == B_BG_WORKER)
  {
    const char *bgw_type;

    bgw_type = GetBackgroundWorkerTypeByPid(beentry->st_procpid);
    if (bgw_type)
      values[17] = CStringGetTextDatum(bgw_type);
    else
      nulls[17] = true;
  }
  else
    values[17] =
      CStringGetTextDatum(GetBackendTypeDesc(beentry->st_backendType));
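
The branch above can be modelled in a standalone way. The following is
only a sketch under stated assumptions: backend_type, backend_type_desc()
and displayed_backend_type() are hypothetical names, and the bgw_type
argument stands in for whatever GetBackgroundWorkerTypeByPid() would
return (possibly NULL).

```c
#include <stddef.h>

/* Hypothetical subset of backend types, for illustration only. */
typedef enum
{
    BT_CLIENT_BACKEND,
    BT_BG_WORKER,
    BT_STANDALONE_BACKEND
} backend_type;

/* Generic description, playing the role of GetBackendTypeDesc(). */
static const char *
backend_type_desc(backend_type t)
{
    switch (t)
    {
        case BT_CLIENT_BACKEND:
            return "client backend";
        case BT_BG_WORKER:
            return "background worker";
        case BT_STANDALONE_BACKEND:
            return "standalone backend";
    }
    return "unknown";
}

/*
 * Text shown in the backend_type column. For background workers the
 * worker's own type string wins (e.g. "parallel worker"); a NULL
 * return models the column being NULL. All other backends fall back
 * to the generic description.
 */
static const char *
displayed_backend_type(backend_type t, const char *bgw_type)
{
    if (t == BT_BG_WORKER)
        return bgw_type;        /* may be NULL */
    return backend_type_desc(t);
}
```

With this model, a background worker whose registered type is
"parallel worker" is displayed under that name rather than as
"background worker", which is why the documentation mentions it.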

- Melanie

[1] https://www.postgresql.org/docs/current/app-initdb.html

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Pavel Luzanov

Date:

10 April 2023, 10:41:38

On 05.04.2023 03:41, Melanie Plageman wrote:
> On Tue, Apr 4, 2023 at 4:35 PM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
>
>> After a little thought... I'm not sure about the term 'bootstrap
>> process'. I can't find this term in the documentation.
> There are various mentions of "bootstrap" peppered throughout the docs
> but no concise summary of what it is. For example, initdb docs mention
> the "bootstrap backend" [1].
>
> Interestingly, 910cab820d0 added "Bootstrap superuser" in November. This
> doesn't really cover what bootstrapping is itself, but I wonder if that
> is useful? If so, you could propose a glossary entry for it?
> (preferably in a new thread)

I'm not sure this is reason enough to add a new entry to the glossary.

>> Do I understand correctly that this is a postmaster? If so, then the
>> postmaster process is not shown in pg_stat_activity.
> No, bootstrap process is for initializing the template database. You
> will not be able to see pg_stat_activity when it is running.

Oh, it's clear to me now. Thank you for the explanation.

> You can query pg_stat_activity from single user mode, so it is relevant
> to pg_stat_activity also. I take your point that bootstrap mode isn't
> relevant for pg_stat_activity, but I am hesitant to add that distinction
> to the pg_stat_io docs since the reason you won't see it in
> pg_stat_activity is because it is ephemeral and before a user can access
> the database and not because stats are not tracked for it.
>
> Can you think of a way to convey this?

See my attempt attached.
I'm not sure about the wording. But I think we can avoid the term 
'bootstrap process'
by replacing it with "database cluster initialization", which should be 
clear to everyone.

-- 
Pavel Luzanov
Postgres Professional: https://postgrespro.com

Attachments

v2-0001-PATCH-v2-Document-standalone-backend-type-in-pg_s.patch

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Melanie Plageman

Date:

24 April 2023, 23:53:25

On Mon, Apr 10, 2023 at 3:41 AM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
>
> On 05.04.2023 03:41, Melanie Plageman wrote:
> > On Tue, Apr 4, 2023 at 4:35 PM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
> >
> >> After a little thought... I'm not sure about the term 'bootstrap
> >> process'. I can't find this term in the documentation.
> > There are various mentions of "bootstrap" peppered throughout the docs
> > but no concise summary of what it is. For example, initdb docs mention
> > the "bootstrap backend" [1].
> >
> > Interestingly, 910cab820d0 added "Bootstrap superuser" in November. This
> > doesn't really cover what bootstrapping is itself, but I wonder if that
> > is useful? If so, you could propose a glossary entry for it?
> > (preferably in a new thread)
>
> I'm not sure this is reason enough to add a new entry to the glossary.
>
> >> Do I understand correctly that this is a postmaster? If so, then the
> >> postmaster process is not shown in pg_stat_activity.
> > No, bootstrap process is for initializing the template database. You
> > will not be able to see pg_stat_activity when it is running.
>
> Oh, it's clear to me now. Thank you for the explanation.
>
> > You can query pg_stat_activity from single user mode, so it is relevant
> > to pg_stat_activity also. I take your point that bootstrap mode isn't
> > relevant for pg_stat_activity, but I am hesitant to add that distinction
> > to the pg_stat_io docs since the reason you won't see it in
> > pg_stat_activity is because it is ephemeral and before a user can access
> > the database and not because stats are not tracked for it.
> >
> > Can you think of a way to convey this?
>
> See my attempt attached.
> I'm not sure about the wording. But I think we can avoid the term
> 'bootstrap process'
> by replacing it with "database cluster initialization", which should be
> clear to everyone.

I like that idea.

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f33a1c56c..45e20efbfb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -991,6 +991,9 @@ postgres   27093  0.0  0.0  30096  2752 ? Ss   11:34   0:00 postgres: ser
        <literal>archiver</literal>,
        <literal>startup</literal>, <literal>walreceiver</literal>,
        <literal>walsender</literal> and <literal>walwriter</literal>.
+       The special type <literal>standalone backend</literal> is used

I think referring to it as a "special type" is a bit confusing. I think
you can just start the sentence with "standalone backend". You could
even include it in the main list of backend_types since it is possible
to see it in pg_stat_activity when in single user mode.

+       when initializing a database cluster by <xref linkend="app-initdb"/>
+       and when running in the <xref linkend="app-postgres-single-user"/>.
        In addition, background workers registered by extensions may have
        additional types.
       </para></entry>

I like the rest of this.

I copied the committer who most recently touched pg_stat_io (Michael
Paquier) to see if we could get someone interested in committing this
docs update.

- Melanie

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From

Pavel Luzanov

Date:

25 April 2023, 10:16:05

On 24.04.2023 23:53, Melanie Plageman wrote:
> I copied the committer who most recently touched pg_stat_io (Michael
> Paquier) to see if we could get someone interested in committing this
> docs update.

Let me explain my motivation for suggesting this update.

pg_stat_io is a very impressive feature, so I decided to try it.
I see 4 rows for some 'standalone backend' out of 30 total rows of the
view.

My attempt to find a description of 'standalone backend' in the docs
turned up nothing. The pg_stat_io page references pg_stat_activity
for backend types, but the pg_stat_activity page doesn't say anything
about 'standalone backend'.

I think this question will come up often unless the docs clarify it.

-- 
Pavel Luzanov
Postgres Professional: https://postgrespro.com
