Обсуждение: Track IO times in pg_stat_io
Hi, As suggested in [1], the attached patch adds IO times to pg_stat_io; I added docs but haven't added any tests. The timings will only be non-zero when track_io_timing is on, and I only see tests with track IO timing on in explain.sql and the IO timings I added to pg_stat_io would not be visible there. I didn't split it up into two patches (one with the changes to track IO timing and 1 with the view additions and docs), because I figured the overall diff is pretty small. There is one minor question (in the code as a TODO) which is whether or not it is worth cross-checking that IO counts and times are either both zero or neither zero in the validation function pgstat_bktype_io_stats_valid(). - Melanie [1] https://www.postgresql.org/message-id/20230209050319.chyyup4vtq4jzobq%40awork3.anarazel.de
Вложения
Hi,
On 2/26/23 5:03 PM, Melanie Plageman wrote:
> Hi,
> 
> As suggested in [1], the attached patch adds IO times to pg_stat_io;
Thanks for the patch!
I started to have a look at it and figured out that a tiny rebase was needed (due to
728560db7d and b9f0e54bc9), so please find the rebase (aka V2) attached.
> The timings will only be non-zero when track_io_timing is on
That could lead to incorrect interpretation if one wants to divide the timing per operations, say:
- track_io_timing is set to on while there is already operations
- or set to off while it was on (and the number of operations keeps growing)
Might be worth to warn/highlight in the "track_io_timing" doc?
+               if (track_io_timing)
+               {
+                       INSTR_TIME_SET_CURRENT(io_time);
+                       INSTR_TIME_SUBTRACT(io_time, io_start);
+                       pgstat_count_io_time(io_object, io_context, IOOP_EXTEND, io_time);
+               }
+
+
                 pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
vs
@@ -1042,6 +1059,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
                                 INSTR_TIME_SUBTRACT(io_time, io_start);
                                 pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
                                 INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+                               pgstat_count_io_time(io_object, io_context, IOOP_READ, io_time);
                         }
That leads to pgstat_count_io_time() to be called before pgstat_count_io_op() (for the IOOP_EXTEND case) and
after pgstat_count_io_op() (for the IOOP_READ case).
What about calling them in the same order and so that pgstat_count_io_time() is called before pgstat_count_io_op()?
If so, the ordering would also need to be changed in:
- FlushRelationBuffers()
- register_dirty_segment()
> 
> There is one minor question (in the code as a TODO) which is whether or
> not it is worth cross-checking that IO counts and times are either both
> zero or neither zero in the validation function
> pgstat_bktype_io_stats_valid().
> 
As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
to also check that if counts are not Zero then times are not Zero.
Regards,
-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
			
		Вложения
Thanks for the review!
On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> On 2/26/23 5:03 PM, Melanie Plageman wrote:
> > As suggested in [1], the attached patch adds IO times to pg_stat_io;
>
> Thanks for the patch!
>
> I started to have a look at it and figured out that a tiny rebase was needed (due to
> 728560db7d and b9f0e54bc9), so please find the rebase (aka V2) attached.
Thanks for doing that!
> > The timings will only be non-zero when track_io_timing is on
>
> That could lead to incorrect interpretation if one wants to divide the timing per operations, say:
>
> - track_io_timing is set to on while there is already operations
> - or set to off while it was on (and the number of operations keeps growing)
>
> Might be worth to warn/highlight in the "track_io_timing" doc?
This is a good point. I've added a note to the docs for pg_stat_io.
> +               if (track_io_timing)
> +               {
> +                       INSTR_TIME_SET_CURRENT(io_time);
> +                       INSTR_TIME_SUBTRACT(io_time, io_start);
> +                       pgstat_count_io_time(io_object, io_context, IOOP_EXTEND, io_time);
> +               }
> +
> +
>                  pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
>
> vs
>
> @@ -1042,6 +1059,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>                                  INSTR_TIME_SUBTRACT(io_time, io_start);
>                                  pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
>                                  INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
> +                               pgstat_count_io_time(io_object, io_context, IOOP_READ, io_time);
>                          }
>
> That leads to pgstat_count_io_time() to be called before pgstat_count_io_op() (for the IOOP_EXTEND case) and
> after pgstat_count_io_op() (for the IOOP_READ case).
>
> What about calling them in the same order and so that pgstat_count_io_time() is called before pgstat_count_io_op()?
>
> If so, the ordering would also need to be changed in:
>
> - FlushRelationBuffers()
> - register_dirty_segment()
Yes, good point. I've updated the code to use this suggested ordering in
attached v3.
> > There is one minor question (in the code as a TODO) which is whether or
> > not it is worth cross-checking that IO counts and times are either both
> > zero or neither zero in the validation function
> > pgstat_bktype_io_stats_valid().
> >
>
> As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
> to also check that if counts are not Zero then times are not Zero.
Yes, I think adding some validation around the relationship between
counts and timing should help prevent developers from forgetting to call
pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
However, I think that we cannot check that if IO counts are non-zero
that IO times are non-zero, because the user may not have
track_io_timing enabled. We can check that if IO times are not zero, IO
counts are not zero. I've done this in the attached v3.
- Melanie
			
		Вложения
Hi, On 3/6/23 5:30 PM, Melanie Plageman wrote: > Thanks for the review! > > On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> On 2/26/23 5:03 PM, Melanie Plageman wrote: >>> The timings will only be non-zero when track_io_timing is on >> >> That could lead to incorrect interpretation if one wants to divide the timing per operations, say: >> >> - track_io_timing is set to on while there is already operations >> - or set to off while it was on (and the number of operations keeps growing) >> >> Might be worth to warn/highlight in the "track_io_timing" doc? > > This is a good point. I've added a note to the docs for pg_stat_io. Thanks! Now I've a second thought: what do you think about resetting the related number of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc). That way it'd prevent bad interpretation (at least as far the time per operation metrics are concerned). Thinking that way as we'd loose some (most?) benefits of the new *_time columns if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard the samples where the track_io_timing changes occur). But well, resetting the operations could also lead to bad interpretation about the operations... Not sure about which approach I like the most yet, what do you think? >> That leads to pgstat_count_io_time() to be called before pgstat_count_io_op() (for the IOOP_EXTEND case) and >> after pgstat_count_io_op() (for the IOOP_READ case). >> >> What about calling them in the same order and so that pgstat_count_io_time() is called before pgstat_count_io_op()? >> >> If so, the ordering would also need to be changed in: >> >> - FlushRelationBuffers() >> - register_dirty_segment() > > Yes, good point. I've updated the code to use this suggested ordering in > attached v3. > Thanks, this looks good to me. >> As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea >> to also check that if counts are not Zero then times are not Zero. > > Yes, I think adding some validation around the relationship between > counts and timing should help prevent developers from forgetting to call > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant). > > However, I think that we cannot check that if IO counts are non-zero > that IO times are non-zero, because the user may not have > track_io_timing enabled. Yeah, right. > We can check that if IO times are not zero, IO > counts are not zero. I've done this in the attached v3. > Thanks, looks good to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi,
On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
> > to also check that if counts are not Zero then times are not Zero.
> 
> Yes, I think adding some validation around the relationship between
> counts and timing should help prevent developers from forgetting to call
> pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> 
> However, I think that we cannot check that if IO counts are non-zero
> that IO times are non-zero, because the user may not have
> track_io_timing enabled. We can check that if IO times are not zero, IO
> counts are not zero. I've done this in the attached v3.
And even if track_io_timing is enabled, the timer granularity might be so low
that we *still* get zeroes.
I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
> @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>  
>      if (isExtend)
>      {
> +        instr_time    io_start,
> +                    io_time;
> +
>          /* new buffers are zero-filled */
>          MemSet((char *) bufBlock, 0, BLCKSZ);
> +
> +        if (track_io_timing)
> +            INSTR_TIME_SET_CURRENT(io_start);
> +        else
> +            INSTR_TIME_SET_ZERO(io_start);
> +
I wonder if there's an argument for tracking this in the existing IO stats as
well. But I guess we've lived with this for a long time...
> @@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
>       * When a strategy is not in use, the write can only be a "regular" write
>       * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
>       */
> -    pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
> -
>      if (track_io_timing)
>      {
>          INSTR_TIME_SET_CURRENT(io_time);
>          INSTR_TIME_SUBTRACT(io_time, io_start);
>          pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
>          INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> +        pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, io_time);
>      }
I think this needs a bit of cleanup - pgstat_count_buffer_write_time(),
pgBufferUsage.blk_write_time++, pgstat_count_io_time() is a bit excessive. We
might not be able to reduce the whole duplication at this point, but at least
it should be a bit more centralized.
> +    pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
>      pgBufferUsage.shared_blks_written++;
>  
>      /*
> @@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel)
>  
>      if (RelationUsesLocalBuffers(rel))
>      {
> +        instr_time    io_start,
> +                    io_time;
> +
>          for (i = 0; i < NLocBuffer; i++)
>          {
>              uint32        buf_state;
> @@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel)
>  
>                  PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
>  
> +                if (track_io_timing)
> +                    INSTR_TIME_SET_CURRENT(io_start);
> +                else
> +                    INSTR_TIME_SET_ZERO(io_start);
> +
>                  smgrwrite(RelationGetSmgr(rel),
>                            BufTagGetForkNum(&bufHdr->tag),
>                            bufHdr->tag.blockNum,
I don't think you need the INSTR_TIME_SET_ZERO() in the body of the loop, to
silence the compiler warnings you can do it one level up.
> @@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>  
>          PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
>  
> +        if (track_io_timing)
> +            INSTR_TIME_SET_CURRENT(io_start);
> +        else
> +            INSTR_TIME_SET_ZERO(io_start);
> +
>          /* And write... */
>          smgrwrite(oreln,
>                    BufTagGetForkNum(&bufHdr->tag),
> @@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>          buf_state &= ~BM_DIRTY;
>          pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
>  
> +        if (track_io_timing)
> +        {
> +            INSTR_TIME_SET_CURRENT(io_time);
> +            INSTR_TIME_SUBTRACT(io_time, io_start);
> +            pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
> +        }
> +
>          pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
>          pgBufferUsage.local_blks_written++;
>      }
Perhaps we can instead introduce a FlushLocalBuffer()? Then we don't need this
in multiple write paths.
> diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
> index 352958e1fe..052875d86a 100644
> --- a/src/backend/storage/smgr/md.c
> +++ b/src/backend/storage/smgr/md.c
> @@ -1030,6 +1030,30 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
>  
>      if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
>      {
> +        instr_time    io_start,
> +                    io_time;
> +
> +        if (track_io_timing)
> +            INSTR_TIME_SET_CURRENT(io_start);
> +        else
> +            INSTR_TIME_SET_ZERO(io_start);
> +
> +        ereport(DEBUG1,
> +                (errmsg_internal("could not forward fsync request because request queue is full")));
> +
> +        if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
> +            ereport(data_sync_elevel(ERROR),
> +                    (errcode_for_file_access(),
> +                     errmsg("could not fsync file \"%s\": %m",
> +                            FilePathName(seg->mdfd_vfd))));
> +
> +        if (track_io_timing)
> +        {
> +            INSTR_TIME_SET_CURRENT(io_time);
> +            INSTR_TIME_SUBTRACT(io_time, io_start);
> +            pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
> +        }
> +
>          /*
>           * We have no way of knowing if the current IOContext is
>           * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
> @@ -1042,15 +1066,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
>           * backend fsyncs.
>           */
>          pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
> -
> -        ereport(DEBUG1,
> -                (errmsg_internal("could not forward fsync request because request queue is full")));
> -
> -        if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
> -            ereport(data_sync_elevel(ERROR),
> -                    (errcode_for_file_access(),
> -                     errmsg("could not fsync file \"%s\": %m",
> -                            FilePathName(seg->mdfd_vfd))));
>      }
>  }
>  
> @@ -1399,6 +1414,8 @@ int
>  mdsyncfiletag(const FileTag *ftag, char *path)
>  {
>      SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
> +    instr_time    io_start,
> +                io_time;
>      File        file;
>      bool        need_to_close;
>      int            result,
> @@ -1425,10 +1442,22 @@ mdsyncfiletag(const FileTag *ftag, char *path)
>          need_to_close = true;
>      }
>  
> +    if (track_io_timing)
> +        INSTR_TIME_SET_CURRENT(io_start);
> +    else
> +        INSTR_TIME_SET_ZERO(io_start);
> +
>      /* Sync the file. */
>      result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
>      save_errno = errno;
>  
> +    if (track_io_timing)
> +    {
> +        INSTR_TIME_SET_CURRENT(io_time);
> +        INSTR_TIME_SUBTRACT(io_time, io_start);
> +        pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
> +    }
> +
>      if (need_to_close)
>          FileClose(file);
Perhaps we could have mdsyncfd(), used by both mdsyncfiletag() and
register_dirty_segment()?
> @@ -1359,20 +1378,31 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
>  
>                  for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
>                  {
> -                    int            col_idx = pgstat_get_io_op_index(io_op);
> +                    int            i = pgstat_get_io_op_index(io_op);
>  
>                      /*
>                       * Some combinations of BackendType and IOOp, of IOContext
>                       * and IOOp, and of IOObject and IOOp are not tracked. Set
>                       * these cells in the view NULL.
>                       */
> -                    nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
> +                    if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
> +                        values[i] = Int64GetDatum(bktype_stats->counts[io_obj][io_context][io_op]);
> +                    else
> +                        nulls[i] = true;
> +                }
These lines were already too long, and it's getting worse with this change.
>  typedef struct PgStat_BktypeIO
>  {
> -    PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> +    PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> +    instr_time    times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
>  } PgStat_BktypeIO;
Ah, you're going to hate me. We can't store instr_time on disk. There's
another patch that gets substantial peformance gains by varying the frequency
at which instr_time keeps track of time based on the CPU frequency... It also
just doesn't have enough range to keep track of system wide time on a larger
system. A single backend won't run for 293 years, but with a few thousand
backends that's a whole different story.
I think we need to accumulate in instr_time, but convert to floating point
when flushing stats.
Greetings,
Andres Freund
			
		Thanks for taking another look! On Tue, Mar 7, 2023 at 10:52 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > On 3/6/23 5:30 PM, Melanie Plageman wrote: > > Thanks for the review! > > > > On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> On 2/26/23 5:03 PM, Melanie Plageman wrote: > >>> The timings will only be non-zero when track_io_timing is on > >> > >> That could lead to incorrect interpretation if one wants to divide the timing per operations, say: > >> > >> - track_io_timing is set to on while there is already operations > >> - or set to off while it was on (and the number of operations keeps growing) > >> > >> Might be worth to warn/highlight in the "track_io_timing" doc? > > > > This is a good point. I've added a note to the docs for pg_stat_io. > > Thanks! > > Now I've a second thought: what do you think about resetting the related number > of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc). > > That way it'd prevent bad interpretation (at least as far the time per operation metrics are concerned). > > Thinking that way as we'd loose some (most?) benefits of the new *_time columns > if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard the samples > where the track_io_timing changes occur). > > But well, resetting the operations could also lead to bad interpretation about the operations... > > Not sure about which approach I like the most yet, what do you think? Oh, this is an interesting idea. I think you are right about the synchronization issues making the statistics untrustworthy and, thus, unuseable. Building on your idea, what if we had the times be NULL instead of zero when track_io_timing is disabled? Then as you suggested, when you enable track_io_timing, it resets the IOOp counts and starts the times off at zero. However, disabling track_io_timing would only NULL out the times and not zero out the counts. We could also, as you say, log these events. - Melanie
On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote: > > Now I've a second thought: what do you think about resetting the related number > > of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc). > > > > That way it'd prevent bad interpretation (at least as far the time per operation metrics are concerned). > > > > Thinking that way as we'd loose some (most?) benefits of the new *_time columns > > if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard thesamples > > where the track_io_timing changes occur). > > > > But well, resetting the operations could also lead to bad interpretation about the operations... > > > > Not sure about which approach I like the most yet, what do you think? > > Oh, this is an interesting idea. I think you are right about the > synchronization issues making the statistics untrustworthy and, thus, > unuseable. No, I don't think we can do that. It can be enabled on a per-session basis. I think we simply shouldn't do anything here. This is a pre-existing issue. I also think that loosing stats when turning track_io_timing on/off would not be helpful. Greetings, Andres Freund
Hi, On 3/7/23 7:47 PM, Andres Freund wrote: > On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote: >>> Now I've a second thought: what do you think about resetting the related number >>> of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc). >>> >>> That way it'd prevent bad interpretation (at least as far the time per operation metrics are concerned). >>> >>> Thinking that way as we'd loose some (most?) benefits of the new *_time columns >>> if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard thesamples >>> where the track_io_timing changes occur). >>> >>> But well, resetting the operations could also lead to bad interpretation about the operations... >>> >>> Not sure about which approach I like the most yet, what do you think? >> >> Oh, this is an interesting idea. I think you are right about the >> synchronization issues making the statistics untrustworthy and, thus, >> unuseable. > > No, I don't think we can do that. It can be enabled on a per-session basis. Oh right. So it's even less clear to me to get how one would make use of those new *_time fields, given that: - pg_stat_io is "global" across all sessions. So, even if one session is doing some "testing" and needs to turn track_io_timingon, then it is even not sure it's only reflecting its own testing (as other sessions may have turned it on too). - There is the risk mentioned above of bad interpretations for the "time per operation" metrics. - Even if there is frequent enough sampling of it pg_stat_io, one does not know which samples contain track_io_timing changes(at the cluster or session level). > I think we simply shouldn't do anything here. This is a pre-existing issue. Oh, never thought about it. You mean like for pg_stat_database.blks_read and pg_stat_database.blk_read_time for example? > I also think that loosing stats when turning track_io_timing on/off would not be > helpful. > Yeah not 100% sure too as that would lead to other possible bad interpretations. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote: > On 3/7/23 7:47 PM, Andres Freund wrote: > > On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote: > > > > Now I've a second thought: what do you think about resetting the related number > > > > of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc). > > > > > > > > That way it'd prevent bad interpretation (at least as far the time per operation metrics are concerned). > > > > > > > > Thinking that way as we'd loose some (most?) benefits of the new *_time columns > > > > if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discardthe samples > > > > where the track_io_timing changes occur). > > > > > > > > But well, resetting the operations could also lead to bad interpretation about the operations... > > > > > > > > Not sure about which approach I like the most yet, what do you think? > > > > > > Oh, this is an interesting idea. I think you are right about the > > > synchronization issues making the statistics untrustworthy and, thus, > > > unuseable. > > > > No, I don't think we can do that. It can be enabled on a per-session basis. > > Oh right. So it's even less clear to me to get how one would make use of those new *_time fields, given that: > > - pg_stat_io is "global" across all sessions. So, even if one session is doing some "testing" and needs to turn track_io_timingon, then it > is even not sure it's only reflecting its own testing (as other sessions may have turned it on too). I think for 17 we should provide access to per-existing-connection pg_stat_io stats, and also provide a database aggregated version. Neither should be particularly hard. > - There is the risk mentioned above of bad interpretations for the "time per operation" metrics. > > - Even if there is frequent enough sampling of it pg_stat_io, one does not know which samples contain track_io_timing changes(at the cluster or session level). You'd just make the same use of them you do with pg_stat_database.blks_read etc today. I don't think it's particularly useful to use the time to calculate "per IO" costs - they can vary *drastically* due to kernel level buffering. The point of having the time available is that it provides information that the number of operations doesn't provide. > > I think we simply shouldn't do anything here. This is a pre-existing issue. > > Oh, never thought about it. You mean like for pg_stat_database.blks_read and pg_stat_database.blk_read_time for example? Yes. Greetings, Andres Freund
Hi, On 3/9/23 1:34 AM, Andres Freund wrote: > Hi, > > On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote: >> On 3/7/23 7:47 PM, Andres Freund wrote: >>> On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote: >>> No, I don't think we can do that. It can be enabled on a per-session basis. >> >> Oh right. So it's even less clear to me to get how one would make use of those new *_time fields, given that: >> >> - pg_stat_io is "global" across all sessions. So, even if one session is doing some "testing" and needs to turn track_io_timingon, then it >> is even not sure it's only reflecting its own testing (as other sessions may have turned it on too). > > I think for 17 we should provide access to per-existing-connection pg_stat_io > stats, and also provide a database aggregated version. Neither should be > particularly hard. > +1 that would be great. > > I don't think it's particularly useful to use the time to calculate "per IO" > costs - they can vary *drastically* due to kernel level buffering. Exactly and I think that's the reason why it could be useful. I think that could help (with frequent enough sampling) to try to identify when the IOs are served by the page cache or not (if one knows his infra well enough). One could say (for example, depending on his environment) that if the read_time > 4ms then the IO is served by spindle disks(if any) and if <<< ms then by the page cache. What I mean is that one could try to characterized their IOs based on threshold that they could define. Adding/reporting histograms in the game would be even better: something we could look for for 17? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> >>> Now I've a second thought: what do you think about resetting the related number > >>> of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc). > >>> > >>> That way it'd prevent bad interpretation (at least as far the time per operation metrics are concerned). > >>> > >>> Thinking that way as we'd loose some (most?) benefits of the new *_time columns > >>> if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard thesamples > >>> where the track_io_timing changes occur). > >>> > >>> But well, resetting the operations could also lead to bad interpretation about the operations... > >>> > >>> Not sure about which approach I like the most yet, what do you think? > >> > >> Oh, this is an interesting idea. I think you are right about the > >> synchronization issues making the statistics untrustworthy and, thus, > >> unuseable. > > > > No, I don't think we can do that. It can be enabled on a per-session basis. > Oh right. So it's even less clear to me to get how one would make use of those new *_time fields, given that: > - pg_stat_io is "global" across all sessions. So, even if one session is doing some "testing" and needs to turn track_io_timingon, then it > is even not sure it's only reflecting its own testing (as other sessions may have turned it on too). > - There is the risk mentioned above of bad interpretations for the "time per operation" metrics. > - Even if there is frequent enough sampling of it pg_stat_io, one does not know which samples contain track_io_timing changes(at the cluster or session level). As long as track_io_timing can be toggled, blk_write_time could lead to wrong conclusions. I think it may be helpful to track the blks_read when track_io_timing is enabled Separately. blks_read will be as is and give the overall blks_read, while a new column blks_read_with_timing will only report on blks_read with track_io_timing enabled. blks_read_with_timing should never be larger than blks_read. This will then make the blks_read_time valuable if it's looked at with the blks_read_with_timing column. Regards, -- Sami Imseih Amazon Web Services (AWS)
Hi, v4 attached addresses these review comments.
On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
> > > to also check that if counts are not Zero then times are not Zero.
> >
> > Yes, I think adding some validation around the relationship between
> > counts and timing should help prevent developers from forgetting to call
> > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> >
> > However, I think that we cannot check that if IO counts are non-zero
> > that IO times are non-zero, because the user may not have
> > track_io_timing enabled. We can check that if IO times are not zero, IO
> > counts are not zero. I've done this in the attached v3.
>
> And even if track_io_timing is enabled, the timer granularity might be so low
> that we *still* get zeroes.
>
> I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
And then have pg_stat_reset_shared('io') reset pg_stat_database IO
stats?
> > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >
> >       if (isExtend)
> >       {
> > +             instr_time      io_start,
> > +                                     io_time;
> > +
> >               /* new buffers are zero-filled */
> >               MemSet((char *) bufBlock, 0, BLCKSZ);
> > +
> > +             if (track_io_timing)
> > +                     INSTR_TIME_SET_CURRENT(io_start);
> > +             else
> > +                     INSTR_TIME_SET_ZERO(io_start);
> > +
>
> I wonder if there's an argument for tracking this in the existing IO stats as
> well. But I guess we've lived with this for a long time...
Not sure I want to include that in this patchset.
> > @@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
> >        * When a strategy is not in use, the write can only be a "regular" write
> >        * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
> >        */
> > -     pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
> > -
> >       if (track_io_timing)
> >       {
> >               INSTR_TIME_SET_CURRENT(io_time);
> >               INSTR_TIME_SUBTRACT(io_time, io_start);
> >               pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> >               INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> > +             pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, io_time);
> >       }
>
> I think this needs a bit of cleanup - pgstat_count_buffer_write_time(),
> pgBufferUsage.blk_write_time++, pgstat_count_io_time() is a bit excessive. We
> might not be able to reduce the whole duplication at this point, but at least
> it should be a bit more centralized.
So, in the attached v4, I've introduced pgstat_io_start() and
pgstat_io_end(...). The end IO function takes the IOObject, IOOp, and
IOContext, in addition to the start_time, so that we know which
pgBufferUsage field to increment and which pgstat_count_buffer_*_time()
to call.
I will note that calling this function now causes pgBufferUsage and
pgStatBlock*Time to be incremented in a couple of places that they were
not before. I think those might have been accidental omissions, so I
think it is okay.
The exception is pgstat_count_write_time() being only called for
relations in shared buffers and not temporary relations while
pgstat_count_buffer_read_time() is called for temporary relations and
relations in shared buffers. I left that behavior as is, though it seems
like it is wrong.
I added pgstat_io_start() to pgstat.c -- not sure if it is best there.
I could separate it into a commit that does this refactoring of the
existing counting (without adding pgstat_count_io_time()) and then
another that adds pgstat_count_io_time(). I hesitated to do that until I
knew that the new functions were viable.
> > +     pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
> >       pgBufferUsage.shared_blks_written++;
> >
> >       /*
> > @@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel)
> >
> >       if (RelationUsesLocalBuffers(rel))
> >       {
> > +             instr_time      io_start,
> > +                                     io_time;
> > +
> >               for (i = 0; i < NLocBuffer; i++)
> >               {
> >                       uint32          buf_state;
> > @@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel)
> >
> >                               PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
> >
> > +                             if (track_io_timing)
> > +                                     INSTR_TIME_SET_CURRENT(io_start);
> > +                             else
> > +                                     INSTR_TIME_SET_ZERO(io_start);
> > +
> >                               smgrwrite(RelationGetSmgr(rel),
> >                                                 BufTagGetForkNum(&bufHdr->tag),
> >                                                 bufHdr->tag.blockNum,
>
> I don't think you need the INSTR_TIME_SET_ZERO() in the body of the loop, to
> silence the compiler warnings you can do it one level up.
So, I didn't move it out because I am using pgstat_io_start() which does
set zero. However, I could eschew the pgstat_io_start() helper function
and just do what is in the function inline. Do you think the overhead of
set zero is worth it?
> > @@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> >
> >               PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
> >
> > +             if (track_io_timing)
> > +                     INSTR_TIME_SET_CURRENT(io_start);
> > +             else
> > +                     INSTR_TIME_SET_ZERO(io_start);
> > +
> >               /* And write... */
> >               smgrwrite(oreln,
> >                                 BufTagGetForkNum(&bufHdr->tag),
> > @@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> >               buf_state &= ~BM_DIRTY;
> >               pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
> >
> > +             if (track_io_timing)
> > +             {
> > +                     INSTR_TIME_SET_CURRENT(io_time);
> > +                     INSTR_TIME_SUBTRACT(io_time, io_start);
> > +                     pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
> > +             }
> > +
> >               pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
> >               pgBufferUsage.local_blks_written++;
> >       }
>
> Perhaps we can instead introduce a FlushLocalBuffer()? Then we don't need this
> in multiple write paths.
FlushLocalBuffer() is a good idea. It would be nice to have it contain
more than just
  pgstat_io_start()
  smgrwrite()
  pgstat_io_end()
e.g. to have it include checksumming and marking dirty (more like
normal FlushBuffer()). I noticed that LocalBufferAlloc() does not set up
error traceback support for ereport and FlushRelationBuffers() does. Is
this intentional?
> > diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
> > index 352958e1fe..052875d86a 100644
> > --- a/src/backend/storage/smgr/md.c
> > +++ b/src/backend/storage/smgr/md.c
> > @@ -1030,6 +1030,30 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
> >
> >       if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
> >       {
> > +             instr_time      io_start,
> > +                                     io_time;
> > +
> > +             if (track_io_timing)
> > +                     INSTR_TIME_SET_CURRENT(io_start);
> > +             else
> > +                     INSTR_TIME_SET_ZERO(io_start);
> > +
> > +             ereport(DEBUG1,
> > +                             (errmsg_internal("could not forward fsync request because request queue is full")));
> > +
> > +             if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
> > +                     ereport(data_sync_elevel(ERROR),
> > +                                     (errcode_for_file_access(),
> > +                                      errmsg("could not fsync file \"%s\": %m",
> > +                                                     FilePathName(seg->mdfd_vfd))));
> > +
> > +             if (track_io_timing)
> > +             {
> > +                     INSTR_TIME_SET_CURRENT(io_time);
> > +                     INSTR_TIME_SUBTRACT(io_time, io_start);
> > +                     pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
> > +             }
> > +
> >               /*
> >                * We have no way of knowing if the current IOContext is
> >                * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
> > @@ -1042,15 +1066,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
> >                * backend fsyncs.
> >                */
> >               pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
> > -
> > -             ereport(DEBUG1,
> > -                             (errmsg_internal("could not forward fsync request because request queue is full")));
> > -
> > -             if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
> > -                     ereport(data_sync_elevel(ERROR),
> > -                                     (errcode_for_file_access(),
> > -                                      errmsg("could not fsync file \"%s\": %m",
> > -                                                     FilePathName(seg->mdfd_vfd))));
> >       }
> >  }
> >
> > @@ -1399,6 +1414,8 @@ int
> >  mdsyncfiletag(const FileTag *ftag, char *path)
> >  {
> >       SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
> > +     instr_time      io_start,
> > +                             io_time;
> >       File            file;
> >       bool            need_to_close;
> >       int                     result,
> > @@ -1425,10 +1442,22 @@ mdsyncfiletag(const FileTag *ftag, char *path)
> >               need_to_close = true;
> >       }
> >
> > +     if (track_io_timing)
> > +             INSTR_TIME_SET_CURRENT(io_start);
> > +     else
> > +             INSTR_TIME_SET_ZERO(io_start);
> > +
> >       /* Sync the file. */
> >       result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
> >       save_errno = errno;
> >
> > +     if (track_io_timing)
> > +     {
> > +             INSTR_TIME_SET_CURRENT(io_time);
> > +             INSTR_TIME_SUBTRACT(io_time, io_start);
> > +             pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
> > +     }
> > +
> >       if (need_to_close)
> >               FileClose(file);
>
> Perhaps we could have mdsyncfd(), used by both mdsyncfiletag() and
> register_dirty_segment()?
I agree it would be nice, but it seems like it would take a little bit
of work and might not be worth doing that in this patchset.
> > @@ -1359,20 +1378,31 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
> >
> >                               for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> >                               {
> > -                                     int                     col_idx = pgstat_get_io_op_index(io_op);
> > +                                     int                     i = pgstat_get_io_op_index(io_op);
> >
> >                                       /*
> >                                        * Some combinations of BackendType and IOOp, of IOContext
> >                                        * and IOOp, and of IOObject and IOOp are not tracked. Set
> >                                        * these cells in the view NULL.
> >                                        */
> > -                                     nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
> > +                                     if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
> > +                                             values[i] =
Int64GetDatum(bktype_stats->counts[io_obj][io_context][io_op]);
> > +                                     else
> > +                                             nulls[i] = true;
> > +                             }
>
> These lines were already too long, and it's getting worse with this change.
I've started using local variables.
> >  typedef struct PgStat_BktypeIO
> >  {
> > -     PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > +     PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > +     instr_time      times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> >  } PgStat_BktypeIO;
>
> Ah, you're going to hate me. We can't store instr_time on disk. There's
> another patch that gets substantial peformance gains by varying the frequency
> at which instr_time keeps track of time based on the CPU frequency...
What does that have to do with what we can store on disk?
If so, would it not be enough to do this when reading/writing the stats
file?
void
instr_time_deserialize(instr_time *dest, int64 *src, int length)
{
    for (size_t i = 0; i < length; i++)
    {
        INSTR_TIME_SET_ZERO(dest[i]);
        dest[i].ticks = src[i];
    }
}
void
instr_time_serialize(int64 *dest, instr_time *src, int length)
{
    for (size_t i = 0; i < length; i++)
        dest[i] = INSTR_TIME_GET_NANOSEC(src[i]);
}
> It also just doesn't have enough range to keep track of system wide
> time on a larger system. A single backend won't run for 293 years, but
> with a few thousand backends that's a whole different story.
>
> I think we need to accumulate in instr_time, but convert to floating point
> when flushing stats.
Hmmm. So, are you saying that we need to read from disk when we query
the view and add that to what is in shared memory? That we only store
the delta since the last restart in the instr_time array?
But, I don't see how that avoids the problem of backend total runtime
being 293 years. We would have to reset and write out the delta whenever
we thought the times would overflow.
But, maybe I am misunderstanding something.
- Melanie
			
		Вложения
Hi,
On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote:
> On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
> > > > to also check that if counts are not Zero then times are not Zero.
> > >
> > > Yes, I think adding some validation around the relationship between
> > > counts and timing should help prevent developers from forgetting to call
> > > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> > >
> > > However, I think that we cannot check that if IO counts are non-zero
> > > that IO times are non-zero, because the user may not have
> > > track_io_timing enabled. We can check that if IO times are not zero, IO
> > > counts are not zero. I've done this in the attached v3.
> >
> > And even if track_io_timing is enabled, the timer granularity might be so low
> > that we *still* get zeroes.
> >
> > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
>
> And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> stats?
Yes.
> > > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> > >
> > >       if (isExtend)
> > >       {
> > > +             instr_time      io_start,
> > > +                                     io_time;
> > > +
> > >               /* new buffers are zero-filled */
> > >               MemSet((char *) bufBlock, 0, BLCKSZ);
> > > +
> > > +             if (track_io_timing)
> > > +                     INSTR_TIME_SET_CURRENT(io_start);
> > > +             else
> > > +                     INSTR_TIME_SET_ZERO(io_start);
> > > +
> >
> > I wonder if there's an argument for tracking this in the existing IO stats as
> > well. But I guess we've lived with this for a long time...
>
> Not sure I want to include that in this patchset.
No, probably not.
> > >  typedef struct PgStat_BktypeIO
> > >  {
> > > -     PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > +     PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > +     instr_time      times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > >  } PgStat_BktypeIO;
> >
> > Ah, you're going to hate me. We can't store instr_time on disk. There's
> > another patch that gets substantial peformance gains by varying the frequency
> > at which instr_time keeps track of time based on the CPU frequency...
>
> What does that have to do with what we can store on disk?
The frequency can change.
> If so, would it not be enough to do this when reading/writing the stats
> file?
Theoretically yes. But to me it seems cleaner to do it when flushing to shared
stats. See also the overflow issue below.
> void
> instr_time_deserialize(instr_time *dest, int64 *src, int length)
> {
>     for (size_t i = 0; i < length; i++)
>     {
>         INSTR_TIME_SET_ZERO(dest[i]);
>         dest[i].ticks = src[i];
>     }
> }
That wouldn't be correct, because what ticks means will at some point change
between postgres stopping and starting.
> > It also just doesn't have enough range to keep track of system wide
> > time on a larger system. A single backend won't run for 293 years, but
> > with a few thousand backends that's a whole different story.
> >
> > I think we need to accumulate in instr_time, but convert to floating point
> > when flushing stats.
>
> Hmmm. So, are you saying that we need to read from disk when we query
> the view and add that to what is in shared memory? That we only store
> the delta since the last restart in the instr_time array?
No, I don't think I am suggesting that. What I am trying to suggest is that
PendingIOStats should contain instr_time, but that PgStat_IO should contain
PgStat_Counter as microseconds, as before.
> But, I don't see how that avoids the problem of backend total runtime
> being 293 years. We would have to reset and write out the delta whenever
> we thought the times would overflow.
The overflow risk is due to storing nanoseconds (which is what instr_time
stores internally on linux now) - which we don't need to do once
accumulatated. Right now we store them as microseconds.
nanosecond range:
((2**63) - 1)/(10**9*60*60*24*365) -> 292 years
microsecond range:
((2**63) - 1)/(10**6*60*60*24*365) -> 292471 years
If you assume 5k connections continually doing IO, a range of 292 years would
last 21 days at nanosecond resolution. At microsecond resolution it's 58
years.
Greetings,
Andres Freund
			
		v5 attached mostly addresses instr_time persistence issues.
On Tue, Mar 14, 2023 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote:
> > On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:
> > > On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
> > > > > to also check that if counts are not Zero then times are not Zero.
> > > >
> > > > Yes, I think adding some validation around the relationship between
> > > > counts and timing should help prevent developers from forgetting to call
> > > > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> > > >
> > > > However, I think that we cannot check that if IO counts are non-zero
> > > > that IO times are non-zero, because the user may not have
> > > > track_io_timing enabled. We can check that if IO times are not zero, IO
> > > > counts are not zero. I've done this in the attached v3.
> > >
> > > And even if track_io_timing is enabled, the timer granularity might be so low
> > > that we *still* get zeroes.
> > >
> > > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
> >
> > And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> > stats?
>
> Yes.
I think this makes sense but I am hesitant to do it in this patchset,
because it feels a bit hidden...maybe?
But, if you feel strongly, I will make the change.
> > > >  typedef struct PgStat_BktypeIO
> > > >  {
> > > > -     PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > > +     PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > > +     instr_time      times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > >  } PgStat_BktypeIO;
> > >
> > > Ah, you're going to hate me. We can't store instr_time on disk. There's
> > > another patch that gets substantial peformance gains by varying the frequency
> > > at which instr_time keeps track of time based on the CPU frequency...
> >
> > What does that have to do with what we can store on disk?
>
> The frequency can change.
Ah, I see.
> > If so, would it not be enough to do this when reading/writing the stats
> > file?
>
> Theoretically yes. But to me it seems cleaner to do it when flushing to shared
> stats. See also the overflow issue below.
>
> > > It also just doesn't have enough range to keep track of system wide
> > > time on a larger system. A single backend won't run for 293 years, but
> > > with a few thousand backends that's a whole different story.
> > >
> > > I think we need to accumulate in instr_time, but convert to floating point
> > > when flushing stats.
> >
> > Hmmm. So, are you saying that we need to read from disk when we query
> > the view and add that to what is in shared memory? That we only store
> > the delta since the last restart in the instr_time array?
>
> No, I don't think I am suggesting that. What I am trying to suggest is that
> PendingIOStats should contain instr_time, but that PgStat_IO should contain
> PgStat_Counter as microseconds, as before.
So, I've modified the code to make a union of instr_time and
PgStat_Counter in PgStat_BktypeIO. I am not quite sure if this is okay.
I store in microsec and then in pg_stat_io, I multiply to get
milliseconds for display.
I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF()
like [1], but, in the end I actually think I would end up with more
operations because of the various different counters needing to be
updated. As it is now, I do a single subtract and a few adds (one for
each of the different statistics objects tracking IO times
(pgBufferUsage, pgStatBlockWrite/ReadTime). Whereas, I would need to do
an accum diff for every one of those.
- Melanie
[1] https://www.postgresql.org/message-id/1feedb83-7aa9-cb4b-5086-598349d3f555%40gmail.com
			
		Вложения
Hi,
On 2023-03-16 17:19:16 -0400, Melanie Plageman wrote:
> > > > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
> > >
> > > And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> > > stats?
> >
> > Yes.
> 
> I think this makes sense but I am hesitant to do it in this patchset,
> because it feels a bit hidden...maybe?
I'd not do it in the same commit, but I don't see a problem with doing it in
the same patchset.
Now that I think about it again, this wouldn't make pg_stat_reset_shared('io')
affect pg_stat_database - I was thinking we should use pgstat_io.c stats to
provide the information for pgstat_database.c, using its own pending counter.
> > No, I don't think I am suggesting that. What I am trying to suggest is that
> > PendingIOStats should contain instr_time, but that PgStat_IO should contain
> > PgStat_Counter as microseconds, as before.
> 
> So, I've modified the code to make a union of instr_time and
> PgStat_Counter in PgStat_BktypeIO. I am not quite sure if this is okay.
> I store in microsec and then in pg_stat_io, I multiply to get
> milliseconds for display.
Not a fan - what do we gain by having this union? It seems considerably
cleaner to have a struct local to pgstat_io.c that uses instr_time and have a
clean type in PgStat_BktypeIO.  In fact, the code worked after just changing
that.
I don't think it makes sense to have pgstat_io_start()/end() as well as
pgstat_count_io*. For one, the name seems in a "too general namespace" - why
not a pgstat_count*?
> I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF()
> like [1], but, in the end I actually think I would end up with more
> operations because of the various different counters needing to be
> updated. As it is now, I do a single subtract and a few adds (one for
> each of the different statistics objects tracking IO times
> (pgBufferUsage, pgStatBlockWrite/ReadTime). Whereas, I would need to do
> an accum diff for every one of those.
Right - that only INSTR_TIME_ACCUM_DIFF() only makes sense if there's just a
single counter to update.
WRT:
                /* TODO: AFAICT, pgstat_count_buffer_write_time is only called */
                /* for shared buffers whereas pgstat_count_buffer_read_time is */
                /* called for temp relations and shared buffers. */
                /*
                 * is this intentional and should I match current behavior or
                 * not?
                 */
It's hard to see how that behaviour could be intentional.  Probably worth
fixing in a separate patch. I don't think we're going to backpatch, but it
would make this clearer nonetheless.
Incremental patch with some of the above changed attached.
Btw, it's quite nice how one now can attribute time more easily:
20 connections copying an 8MB file 50 times each:
SELECT reuses, evictions, writes, write_time, extends, extend_time FROM pg_stat_io WHERE backend_type = 'client
backend'AND io_object = 'relation' AND io_context='bulkwrite';
 
┌────────┬───────────┬────────┬────────────┬─────────┬─────────────┐
│ reuses │ evictions │ writes │ write_time │ extends │ extend_time │
├────────┼───────────┼────────┼────────────┼─────────┼─────────────┤
│  36112 │         0 │  36112 │        141 │ 1523176 │        8676 │
└────────┴───────────┴────────┴────────────┴─────────┴─────────────┘
20 connections copying an 80MB file 5 times each:
┌─────────┬───────────┬─────────┬────────────┬─────────┬─────────────┐
│ reuses  │ evictions │ writes  │ write_time │ extends │ extend_time │
├─────────┼───────────┼─────────┼────────────┼─────────┼─────────────┤
│ 1318539 │         0 │ 1318539 │       5013 │ 1523339 │        7873 │
└─────────┴───────────┴─────────┴────────────┴─────────┴─────────────┘
(1 row)
Greetings,
Andres
			
		Вложения
On Mon, Mar 20, 2023 at 10:34 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-03-16 17:19:16 -0400, Melanie Plageman wrote:
> > > > > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
> > > >
> > > > And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> > > > stats?
> > >
> > > Yes.
> >
> > I think this makes sense but I am hesitant to do it in this patchset,
> > because it feels a bit hidden...maybe?
>
> I'd not do it in the same commit, but I don't see a problem with doing it in
> the same patchset.
>
> Now that I think about it again, this wouldn't make pg_stat_reset_shared('io')
> affect pg_stat_database - I was thinking we should use pgstat_io.c stats to
> provide the information for pgstat_database.c, using its own pending counter.
So, I've done this in the attached. But, won't resetting pgstat_database
be a bit weird if you have built up some IO timing in pending counters
and right after you reset a flush happens and then suddenly the values
are way above 0 again?
> > I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF()
> > like [1], but, in the end I actually think I would end up with more
> > operations because of the various different counters needing to be
> > updated. As it is now, I do a single subtract and a few adds (one for
> > each of the different statistics objects tracking IO times
> > (pgBufferUsage, pgStatBlockWrite/ReadTime). Whereas, I would need to do
> > an accum diff for every one of those.
>
> Right - that only INSTR_TIME_ACCUM_DIFF() only makes sense if there's just a
> single counter to update.
>
>
> WRT:
>                                 /* TODO: AFAICT, pgstat_count_buffer_write_time is only called */
>                                 /* for shared buffers whereas pgstat_count_buffer_read_time is */
>                                 /* called for temp relations and shared buffers. */
>                                 /*
>                                  * is this intentional and should I match current behavior or
>                                  * not?
>                                  */
>
> It's hard to see how that behaviour could be intentional.  Probably worth
> fixing in a separate patch. I don't think we're going to backpatch, but it
> would make this clearer nonetheless.
Attached v7 does this in separate commits.
Remaining feedback is about FlushLocalBuffers(). Is the idea simply to
get it into bufmgr.c because that is cleaner from an API perspective?
- Melanie
			
		Вложения
Attached is a rebased version in light of 8aaa04b32d - Melanie
Вложения
Hi,
On 2023-03-31 15:44:58 -0400, Melanie Plageman wrote:
> From 789d4bf1fb749a26523dbcd2c69795916b711c68 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 21 Mar 2023 16:00:55 -0400
> Subject: [PATCH v8 1/4] Count IO time for temp relation writes
>
> Both pgstat_database and pgBufferUsage write times failed to count
> timing for flushes of dirty local buffers when acquiring a new local
> buffer for a temporary relation block.
I think it'd be worth mentioning here that we do count read time? Otherwise
it'd not be as clear that adding tracking increases consistency...
> From f4e0db5c833f33b30d4c0b4bebec1096a1745d81 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 21 Mar 2023 18:20:44 -0400
> Subject: [PATCH v8 2/4] FlushRelationBuffers() counts temp relation IO timing
>
> Add pgstat_database and pgBufferUsage IO timing counting to
> FlushRelationBuffers() for writes of temporary relations.
> ---
>  src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index b3adbbe7d2..05e98d5994 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -3571,6 +3571,8 @@ FlushRelationBuffers(Relation rel)
>  {
>      int            i;
>      BufferDesc *bufHdr;
> +    instr_time    io_start,
> +                io_time;
>
>      if (RelationUsesLocalBuffers(rel))
>      {
> @@ -3596,17 +3598,33 @@ FlushRelationBuffers(Relation rel)
>
>                  PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
>
> +                if (track_io_timing)
> +                    INSTR_TIME_SET_CURRENT(io_start);
> +                else
> +                    INSTR_TIME_SET_ZERO(io_start);
> +
>                  smgrwrite(RelationGetSmgr(rel),
>                            BufTagGetForkNum(&bufHdr->tag),
>                            bufHdr->tag.blockNum,
>                            localpage,
>                            false);
>
> +
Spurious newline.
>                  buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
>                  pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
>
>                  pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
>
> +                if (track_io_timing)
> +                {
> +                    INSTR_TIME_SET_CURRENT(io_time);
> +                    INSTR_TIME_SUBTRACT(io_time, io_start);
> +                    pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> +                    INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> +                }
> +
> +                pgBufferUsage.local_blks_written++;
> +
>                  /* Pop the error context stack */
>                  error_context_stack = errcallback.previous;
>              }
> --
> 2.37.2
>
> From 2bdad725133395ded199ecc726096e052d6e654b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 31 Mar 2023 15:32:36 -0400
> Subject: [PATCH v8 3/4] Track IO times in pg_stat_io
>
> Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.
>
> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
> Reviewed-by: Andres Freund <andres@anarazel.de>
> Discussion:
https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
> ---
> -static PgStat_BktypeIO PendingIOStats;
> +typedef struct PgStat_PendingIO
> +{
> +    PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> +    instr_time    pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> +}            PgStat_PendingIO;
Probably will look less awful after adding the typedef to typedefs.list.
> +                /* we do track it */
> +                if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
> +                {
> +                    /* ensure that if IO times are non-zero, counts are > 0 */
> +                    if (backend_io->times[io_object][io_context][io_op] != 0 &&
> +                        backend_io->counts[io_object][io_context][io_op] <= 0)
> +                        return false;
> +
>                      continue;
> +                }
>
> -                /* There are stats and there shouldn't be */
> -                if (!bktype_tracked ||
> -                    !pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
> +                /* we don't track it, and it is not 0 */
> +                if (backend_io->counts[io_object][io_context][io_op] != 0)
>                      return false;
> +
> +                /* we don't track this IOOp, so make sure its IO time is zero */
> +                if (pgstat_tracks_io_time(io_op) > -1)
> +                {
> +                    if (backend_io->times[io_object][io_context][io_op] != 0)
> +                        return false;
> +                }
I'm somehow doubtful it's worth having pgstat_tracks_io_time, what kind of
error would be caught by this check?
> +/*
> + * Get the number of the column containing IO times for the specified IOOp. If
> + * the specified IOOp is one for which IO time is not tracked, return -1. Note
> + * that this function assumes that IO time for an IOOp is displayed in the view
> + * in the column directly after the IOOp counts.
> + */
> +static io_stat_col
> +pgstat_get_io_time_index(IOOp io_op)
> +{
> +    if (pgstat_tracks_io_time(io_op) == -1)
> +        return -1;
That seems dangerous - won't it just lead to accessing something from before
the start of the array? Probably should just assert.
> @@ -1363,20 +1389,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
>
>                  for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
>                  {
> -                    int            col_idx = pgstat_get_io_op_index(io_op);
> +                    PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
> +                    int            i = pgstat_get_io_op_index(io_op);
>
>                      /*
>                       * Some combinations of BackendType and IOOp, of IOContext
>                       * and IOOp, and of IOObject and IOOp are not tracked. Set
>                       * these cells in the view NULL.
>                       */
> -                    nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
> +                    if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
> +                        values[i] = Int64GetDatum(count);
> +                    else
> +                        nulls[i] = true;
> +                }
>
> -                    if (nulls[col_idx])
> +                for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> +                {
> +                    PgStat_Counter time = bktype_stats->times[io_obj][io_context][io_op];
> +                    int            i = pgstat_get_io_time_index(io_op);
> +
> +                    if (i == -1)
>                          continue;
>
> -                    values[col_idx] =
> -                        Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
> +                    if (!nulls[pgstat_get_io_op_index(io_op)])
> +                        values[i] = Float8GetDatum(pg_stat_micro_to_millisecs(time));
> +                    else
> +                        nulls[i] = true;
>                  }
Why two loops?
Greetings,
Andres Freund
			
		Attached v9 addresses review feedback as well as resolving merge
conflicts with recent relation extension patchset.
I've changed pgstat_count_io_op_time() to take a count and call
pgstat_count_io_op_n() so it can be used with smgrzeroextend(). I do
wish that the parameter to pgstat_count_io_op_n() was called "count" and
not "cnt"...
I've also reordered the call site of pgstat_count_io_op_time() in a few
locations, but I have some questions about this.
Before, I didn't think it mattered much that we didn't finish counting
IO time until after setting BM_VALID or BM_DIRTY and unsetting
BM_IO_IN_PROGRESS. With the relation extension code doing this for many
buffers at once, though, I wondered if this will make the IO timing too
inaccurate.
As such, I've moved pgstat_count_io_op_time() to before we set those
flags in all locations. I did wonder if it is bad to prolong having the
buffer pinned and not having those flags set, though.
On Tue, Apr 4, 2023 at 8:59 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-03-31 15:44:58 -0400, Melanie Plageman wrote:
> > From 789d4bf1fb749a26523dbcd2c69795916b711c68 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Tue, 21 Mar 2023 16:00:55 -0400
> > Subject: [PATCH v8 1/4] Count IO time for temp relation writes
> >
> > Both pgstat_database and pgBufferUsage write times failed to count
> > timing for flushes of dirty local buffers when acquiring a new local
> > buffer for a temporary relation block.
>
> I think it'd be worth mentioning here that we do count read time? Otherwise
> it'd not be as clear that adding tracking increases consistency...
Done
> > From f4e0db5c833f33b30d4c0b4bebec1096a1745d81 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Tue, 21 Mar 2023 18:20:44 -0400
> > Subject: [PATCH v8 2/4] FlushRelationBuffers() counts temp relation IO timing
> >
> > Add pgstat_database and pgBufferUsage IO timing counting to
> > FlushRelationBuffers() for writes of temporary relations.
> > ---
> >  src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> > index b3adbbe7d2..05e98d5994 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -3571,6 +3571,8 @@ FlushRelationBuffers(Relation rel)
> >  {
> >       int                     i;
> >       BufferDesc *bufHdr;
> > +     instr_time      io_start,
> > +                             io_time;
> >
> >       if (RelationUsesLocalBuffers(rel))
> >       {
> > @@ -3596,17 +3598,33 @@ FlushRelationBuffers(Relation rel)
> >
> >                               PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
> >
> > +                             if (track_io_timing)
> > +                                     INSTR_TIME_SET_CURRENT(io_start);
> > +                             else
> > +                                     INSTR_TIME_SET_ZERO(io_start);
> > +
> >                               smgrwrite(RelationGetSmgr(rel),
> >                                                 BufTagGetForkNum(&bufHdr->tag),
> >                                                 bufHdr->tag.blockNum,
> >                                                 localpage,
> >                                                 false);
> >
> > +
>
> Spurious newline.
Fixed.
> > From 2bdad725133395ded199ecc726096e052d6e654b Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Fri, 31 Mar 2023 15:32:36 -0400
> > Subject: [PATCH v8 3/4] Track IO times in pg_stat_io
> >
> > Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.
> >
> > Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
> > Reviewed-by: Andres Freund <andres@anarazel.de>
> > Discussion:
https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
> > ---
>
> > -static PgStat_BktypeIO PendingIOStats;
> > +typedef struct PgStat_PendingIO
> > +{
> > +     PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > +     instr_time      pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > +}                    PgStat_PendingIO;
>
> Probably will look less awful after adding the typedef to typedefs.list.
Done.
One day I will remember to add things to typedefs.list.
> > +                             /* we do track it */
> > +                             if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
> > +                             {
> > +                                     /* ensure that if IO times are non-zero, counts are > 0 */
> > +                                     if (backend_io->times[io_object][io_context][io_op] != 0 &&
> > +                                             backend_io->counts[io_object][io_context][io_op] <= 0)
> > +                                             return false;
> > +
> >                                       continue;
> > +                             }
> >
> > -                             /* There are stats and there shouldn't be */
> > -                             if (!bktype_tracked ||
> > -                                     !pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
> > +                             /* we don't track it, and it is not 0 */
> > +                             if (backend_io->counts[io_object][io_context][io_op] != 0)
> >                                       return false;
> > +
> > +                             /* we don't track this IOOp, so make sure its IO time is zero */
> > +                             if (pgstat_tracks_io_time(io_op) > -1)
> > +                             {
> > +                                     if (backend_io->times[io_object][io_context][io_op] != 0)
> > +                                             return false;
> > +                             }
>
> I'm somehow doubtful it's worth having pgstat_tracks_io_time, what kind of
> error would be caught by this check?
Yea, now that the function to count IO timing also increments the count,
I don't think this can happen.
However, pgstat_tracks_io_time() is useful in its other call site in
pgstatfuncs which lets us continue in the loop if we don't need to fill
in that IO time. Perhaps it could be replaced with a if (io_op ==
IOOP_EVICT || io_op == IOOP_REUSE ... but I kind of like the function?
But, maybe it is overkill...
For now, I've moved pgstat_tracks_io_time() into pgstatfuncs.c as a
helper.
> > +/*
> > + * Get the number of the column containing IO times for the specified IOOp. If
> > + * the specified IOOp is one for which IO time is not tracked, return -1. Note
> > + * that this function assumes that IO time for an IOOp is displayed in the view
> > + * in the column directly after the IOOp counts.
> > + */
> > +static io_stat_col
> > +pgstat_get_io_time_index(IOOp io_op)
> > +{
> > +     if (pgstat_tracks_io_time(io_op) == -1)
> > +             return -1;
>
> That seems dangerous - won't it just lead to accessing something from before
> the start of the array? Probably should just assert.
Yea. I've removed it entirely as the passed in io_op can't be negative
(unless we change the enum values) and we add one to it before
returning.
> > @@ -1363,20 +1389,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
> >
> >                               for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> >                               {
> > -                                     int                     col_idx = pgstat_get_io_op_index(io_op);
> > +                                     PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
> > +                                     int                     i = pgstat_get_io_op_index(io_op);
> >
> >                                       /*
> >                                        * Some combinations of BackendType and IOOp, of IOContext
> >                                        * and IOOp, and of IOObject and IOOp are not tracked. Set
> >                                        * these cells in the view NULL.
> >                                        */
> > -                                     nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
> > +                                     if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
> > +                                             values[i] = Int64GetDatum(count);
> > +                                     else
> > +                                             nulls[i] = true;
> > +                             }
> >
> > -                                     if (nulls[col_idx])
> > +                             for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> > +                             {
> > +                                     PgStat_Counter time = bktype_stats->times[io_obj][io_context][io_op];
> > +                                     int                     i = pgstat_get_io_time_index(io_op);
> > +
> > +                                     if (i == -1)
> >                                               continue;
> >
> > -                                     values[col_idx] =
> > -                                             Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
> > +                                     if (!nulls[pgstat_get_io_op_index(io_op)])
> > +                                             values[i] = Float8GetDatum(pg_stat_micro_to_millisecs(time));
> > +                                     else
> > +                                             nulls[i] = true;
> >                               }
>
> Why two loops?
Well, it was a stylistic choice that I now realize is actually
confusing.
I consolidated them.
- Melanie
			
		Вложения
Hi, On 2023-04-07 12:17:38 -0400, Melanie Plageman wrote: > Attached v9 addresses review feedback as well as resolving merge > conflicts with recent relation extension patchset. I've edited it a bit more: - removed pgstat_tracks_io_time() and replaced it by returning the new IO_COL_INVALID = -1 from pgstat_get_io_time_index() when there's no time - moved PgStat_Counter count, time into the respective branches. It feels somewhat wrong to access the time when we then decide there is no time. - s/io_object/io_obj/ in pgstat_count_io_op_time(), combined with added linebreaks, got the code to under 80 chars - renamed pg_stat_microseconds_to_milliseconds to pg_stat_us_to_ms - removed a spurious newline - the times reported by pg_stat_io had their fractional part removed, due to pg_stat_us_to_ms returning an integer Verifying this, I saw that the write time visible in pg_stat_io didn't quite match what I saw in log_checkpoints. But not always. Eventually I figured out that that's not pg_stat_io's fault - log_checkpoint's write includes a lot of things, including several other CheckPoint* routines, flushing WAL, asking the kernel to flush things to disk... The biggest portion in my case were the smgrwriteback() calls - which pg_stat_io doesn't track - oops. Pushed up to and including 0003. > I've changed pgstat_count_io_op_time() to take a count and call > pgstat_count_io_op_n() so it can be used with smgrzeroextend(). I do > wish that the parameter to pgstat_count_io_op_n() was called "count" and > not "cnt"... Heh. > I've also reordered the call site of pgstat_count_io_op_time() in a few > locations, but I have some questions about this. > > Before, I didn't think it mattered much that we didn't finish counting > IO time until after setting BM_VALID or BM_DIRTY and unsetting > BM_IO_IN_PROGRESS. With the relation extension code doing this for many > buffers at once, though, I wondered if this will make the IO timing too > inaccurate. > As such, I've moved pgstat_count_io_op_time() to before we set those > flags in all locations. I did wonder if it is bad to prolong having the > buffer pinned and not having those flags set, though. I went back and forth about this before. I think it's ok the way you did it. I think 0004 needs a bit more work. At the very least we would have to swap the order of pgstat_flush_pending_entries() and pgstat_flush_io() - entirely doable. Unlike 0003, this doesn't make pg_stat_io more complete, or such, so I'm inclined to leave it for 17. I think there might be some more opportunities for having counts "flow down", like the patch does. Greetings, Andres Freund
Hi, On Wed, Mar 08, 2023 at 04:34:38PM -0800, Andres Freund wrote: > On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote: > > - pg_stat_io is "global" across all sessions. So, even if one session is doing some "testing" and needs to turn track_io_timingon, then it > > is even not sure it's only reflecting its own testing (as other sessions may have turned it on too). > > I think for 17 we should provide access to per-existing-connection pg_stat_io > stats, and also provide a database aggregated version. Neither should be > particularly hard. FWIW, I think that would be great and plan to have a look at this (unless someone beats me to it). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Fri, Aug 23, 2024 at 07:32:16AM +0000, Bertrand Drouvot wrote: > Hi, > > On Wed, Mar 08, 2023 at 04:34:38PM -0800, Andres Freund wrote: > > On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote: > > > - pg_stat_io is "global" across all sessions. So, even if one session is doing some "testing" and needs to turn track_io_timingon, then it > > > is even not sure it's only reflecting its own testing (as other sessions may have turned it on too). > > > > I think for 17 we should provide access to per-existing-connection pg_stat_io > > stats, and also provide a database aggregated version. Neither should be > > particularly hard. > > FWIW, I think that would be great and plan to have a look at this (unless someone > beats me to it). FWIW, here is the patch proposal for per backend I/O statistics [1]. [1]: https://www.postgresql.org/message-id/ZtXR%2BCtkEVVE/LHF%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com