Thread: Report checkpoint progress in server logs


Report checkpoint progress in server logs

From: Bharath Rupireddy
Date:
Hi,

At times, some of the checkpoint operations, such as removing old WAL
files or dealing with replication snapshot or mapping files, may take
a while, during which the server doesn't emit any logs or information;
the only messages emitted are from LogCheckpointStart and
LogCheckpointEnd. Usually this isn't a problem if the checkpoint is
quick, but there can be extreme situations that require users to know
what's going on with the current checkpoint.

Given that the commit 9ce346ea [1] introduced a nice mechanism to
report the long running operations of the startup process in the
server logs, I'm thinking we can have a similar progress mechanism for
the checkpoint as well. There's another idea suggested in a couple of
other threads to have a pg_stat_progress_checkpoint similar to
pg_stat_progress_analyze/vacuum/etc. But the problem with this idea is
during the end-of-recovery or shutdown checkpoints, the
pg_stat_progress_checkpoint view isn't accessible as it requires a
connection to the server which isn't allowed.

Therefore, reporting the checkpoint progress in the server logs, much
like [1], seems to be the best way IMO. We can either 1) make
ereport_startup_progress and log_startup_progress_interval more
generic (something like ereport_log_progress and
log_progress_interval), move the code to elog.c, and use it for
checkpoint progress and, if required, for other time-consuming
operations, or 2) have an entirely different GUC and API for
checkpoint progress.

IMO, option (1), i.e. ereport_log_progress and log_progress_interval
(better names are welcome), seems the better idea.
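
For illustration, a rough sketch of what option (1) could look like,
modeled on the ereport_startup_progress mechanism from [1]. The names
ereport_log_progress, log_progress_interval and
has_log_progress_timeout_expired, and the call site shown, are only
assumptions for the proposal, not existing code:

    /*
     * Sketch of a generalized progress-logging helper (proposed names).
     * It emits a LOG message only when log_progress_interval has elapsed
     * since the last report, so a long-running operation produces periodic
     * progress lines instead of silence.
     */
    #define ereport_log_progress(msg, ...) \
        do { \
            long        secs; \
            int         usecs; \
            \
            if (has_log_progress_timeout_expired(&secs, &usecs)) \
                ereport(LOG, errmsg(msg, secs, (usecs / 10000), __VA_ARGS__)); \
        } while (0)

    /* Possible (hypothetical) call site while recycling old WAL files: */
    ereport_log_progress("checkpoint: recycling old WAL files, elapsed time: %ld.%02d s, current file: %s",
                         xlde->d_name);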

Thoughts?

[1]
commit 9ce346eabf350a130bba46be3f8c50ba28506969
Author: Robert Haas <rhaas@postgresql.org>
Date:   Mon Oct 25 11:51:57 2021 -0400

    Report progress of startup operations that take a long time.

Regards,
Bharath Rupireddy.



Re: Report checkpoint progress in server logs

From: Magnus Hagander
Date:
On Wed, Dec 29, 2021 at 3:31 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> At times, some of the checkpoint operations such as removing old WAL
> files, dealing with replication snapshot or mapping files etc. may
> take a while during which the server doesn't emit any logs or
> information, the only logs emitted are LogCheckpointStart and
> LogCheckpointEnd. Many times this isn't a problem if the checkpoint is
> quicker, but there can be extreme situations which require the users
> to know what's going on with the current checkpoint.
>
> Given that the commit 9ce346ea [1] introduced a nice mechanism to
> report the long running operations of the startup process in the
> server logs, I'm thinking we can have a similar progress mechanism for
> the checkpoint as well. There's another idea suggested in a couple of
> other threads to have a pg_stat_progress_checkpoint similar to
> pg_stat_progress_analyze/vacuum/etc. But the problem with this idea is
> during the end-of-recovery or shutdown checkpoints, the
> pg_stat_progress_checkpoint view isn't accessible as it requires a
> connection to the server which isn't allowed.
>
> Therefore, reporting the checkpoint progress in the server logs, much
> like [1], seems to be the best way IMO. We can 1) either make
> ereport_startup_progress and log_startup_progress_interval more
> generic (something like ereport_log_progress and
> log_progress_interval),  move the code to elog.c, use it for
> checkpoint progress and if required for other time-consuming
> operations 2) or have an entirely different GUC and API for checkpoint
> progress.
>
> IMO, option (1) i.e. ereport_log_progress and log_progress_interval
> (better names are welcome) seems a better idea.
>
> Thoughts?

I find progress reporting in the logfile to generally be a terrible
way of doing things, and the fact that we do it for the startup
process is/should be only because we have no other choice, not because
it's the right choice.

I think the right choice to solve the *general* problem is the
mentioned pg_stat_progress_checkpoints.

We may want to *additionally* have the ability to log the progress
specifically for the special cases when we're not able to use that
view. And in those cases, we can perhaps just use the existing
log_startup_progress_interval parameter for this as well -- at least
for the startup checkpoint.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Report checkpoint progress in server logs

From: Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
>> Therefore, reporting the checkpoint progress in the server logs, much
>> like [1], seems to be the best way IMO.

> I find progress reporting in the logfile to generally be a terrible
> way of doing things, and the fact that we do it for the startup
> process is/should be only because we have no other choice, not because
> it's the right choice.

I'm already pretty seriously unhappy about the log-spamming effects of
64da07c41 (default to log_checkpoints=on), and am willing to lay a side
bet that that gets reverted after we have some field experience with it.
This proposal seems far worse from that standpoint.  Keep in mind that
our out-of-the-box logging configuration still doesn't have any log
rotation ability, which means that the noisier the server is in normal
operation, the sooner you fill your disk.

> I think the right choice to solve the *general* problem is the
> mentioned pg_stat_progress_checkpoints.

+1

            regards, tom lane



Re: Report checkpoint progress in server logs

From: SATYANARAYANA NARLAPURAM
Date:
Coincidentally, I was thinking about the same thing yesterday after getting tired of waiting for a checkpoint to complete on a server.

On Wed, Dec 29, 2021 at 7:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> >> Therefore, reporting the checkpoint progress in the server logs, much
> >> like [1], seems to be the best way IMO.
>
> > I find progress reporting in the logfile to generally be a terrible
> > way of doing things, and the fact that we do it for the startup
> > process is/should be only because we have no other choice, not because
> > it's the right choice.
>
> I'm already pretty seriously unhappy about the log-spamming effects of
> 64da07c41 (default to log_checkpoints=on), and am willing to lay a side
> bet that that gets reverted after we have some field experience with it.
> This proposal seems far worse from that standpoint.  Keep in mind that
> our out-of-the-box logging configuration still doesn't have any log
> rotation ability, which means that the noisier the server is in normal
> operation, the sooner you fill your disk.

The server is not open for queries while running the end-of-recovery checkpoint, so a catalog view may not help there, but a process title change or logging would be helpful in such cases. While the server is running recovery, anxious customers repeatedly ask for the ETA of recovery completion, and not having visibility into these operations makes life difficult for the DBA/operations teams.
 

> > I think the right choice to solve the *general* problem is the
> > mentioned pg_stat_progress_checkpoints.
>
> +1
 
+1 to this. We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just emitting the stats at the end.


Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep it is running, whether it is on target for completion, checkpoint_Reason (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some refactoring here.
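
For illustration, a rough sketch of how BufferSync() could expose such
counters through the existing backend-progress API.
pgstat_progress_update_param() exists today; the PROGRESS_CHECKPOINT_*
indexes are hypothetical:

    /* Hypothetical progress-parameter indexes for a checkpoint progress view. */
    #define PROGRESS_CHECKPOINT_BUFFERS_TOTAL       2
    #define PROGRESS_CHECKPOINT_BUFFERS_WRITTEN     3

    /* In BufferSync(), once num_to_scan has been computed: */
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_TOTAL, num_to_scan);

    /* ... and inside the write loop, after num_written is incremented: */
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN, num_written);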
 

>                         regards, tom lane


Re: Report checkpoint progress in server logs

From: Michael Paquier
Date:
On Wed, Dec 29, 2021 at 10:40:59AM -0500, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> I think the right choice to solve the *general* problem is the
>> mentioned pg_stat_progress_checkpoints.
>
> +1

Agreed.  I don't see why this would not work as there are
PgBackendStatus entries for each auxiliary process.
--
Michael

Attachments

Re: Report checkpoint progress in server logs

From: Bruce Momjian
Date:
On Wed, Dec 29, 2021 at 10:40:59AM -0500, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> >> Therefore, reporting the checkpoint progress in the server logs, much
> >> like [1], seems to be the best way IMO.
> 
> > I find progress reporting in the logfile to generally be a terrible
> > way of doing things, and the fact that we do it for the startup
> > process is/should be only because we have no other choice, not because
> > it's the right choice.
> 
> I'm already pretty seriously unhappy about the log-spamming effects of
> 64da07c41 (default to log_checkpoints=on), and am willing to lay a side
> bet that that gets reverted after we have some field experience with it.
> This proposal seems far worse from that standpoint.  Keep in mind that
> our out-of-the-box logging configuration still doesn't have any log
> rotation ability, which means that the noisier the server is in normal
> operation, the sooner you fill your disk.

I think we are looking at three potential observable behaviors people
might care about:

*  the current activity/progress of checkpoints
*  the historical reporting of checkpoint completion, mixed in with other
   log messages for later analysis
*  the aggregate behavior of checkpoint operation

I think it is clear that checkpoint progress activity isn't useful for
the server logs because that information has little historical value,
but does fit for a progress view.  As Tom already expressed, we will
have to wait to see if non-progress checkpoint information in the logs
has sufficient historical value.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Report checkpoint progress in server logs

From: Nitin Jadhav
Date:
> I think the right choice to solve the *general* problem is the
> mentioned pg_stat_progress_checkpoints.
>
> We may want to *additionally* have the ability to log the progress
> specifically for the special cases when we're not able to use that
> view. And in those case, we can perhaps just use the existing
> log_startup_progress_interval parameter for this as well -- at least
> for the startup checkpoint.

+1

> We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just
> emitting the stats at the end.
>
> Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep
> it is running, whether it is on target for completion, checkpoint_Reason
> (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some
> refactoring here.

I agree to provide above mentioned information as part of showing the
progress of current checkpoint operation. I am currently looking into
the code to know if any other information can be added.

Thanks & Regards,
Nitin Jadhav

On Thu, Jan 6, 2022 at 5:12 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Wed, Dec 29, 2021 at 10:40:59AM -0500, Tom Lane wrote:
> > Magnus Hagander <magnus@hagander.net> writes:
> > >> Therefore, reporting the checkpoint progress in the server logs, much
> > >> like [1], seems to be the best way IMO.
> >
> > > I find progress reporting in the logfile to generally be a terrible
> > > way of doing things, and the fact that we do it for the startup
> > > process is/should be only because we have no other choice, not because
> > > it's the right choice.
> >
> > I'm already pretty seriously unhappy about the log-spamming effects of
> > 64da07c41 (default to log_checkpoints=on), and am willing to lay a side
> > bet that that gets reverted after we have some field experience with it.
> > This proposal seems far worse from that standpoint.  Keep in mind that
> > our out-of-the-box logging configuration still doesn't have any log
> > rotation ability, which means that the noisier the server is in normal
> > operation, the sooner you fill your disk.
>
> I think we are looking at three potential observable behaviors people
> might care about:
>
> *  the current activity/progress of checkpoints
> *  the historical reporting of checkpoint completion, mixed in with other
>    log messages for later analysis
> *  the aggregate behavior of checkpoint operation
>
> I think it is clear that checkpoint progress activity isn't useful for
> the server logs because that information has little historical value,
> but does fit for a progress view.  As Tom already expressed, we will
> have to wait to see if non-progress checkpoint information in the logs
> has sufficient historical value.
>
> --
>   Bruce Momjian  <bruce@momjian.us>        https://momjian.us
>   EDB                                      https://enterprisedb.com
>
>   If only the physical world exists, free will is an illusion.
>
>
>



Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Bharath Rupireddy
Date:
On Fri, Jan 21, 2022 at 11:07 AM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > I think the right choice to solve the *general* problem is the
> > mentioned pg_stat_progress_checkpoints.
> >
> > We may want to *additionally* have the ability to log the progress
> > specifically for the special cases when we're not able to use that
> > view. And in those case, we can perhaps just use the existing
> > log_startup_progress_interval parameter for this as well -- at least
> > for the startup checkpoint.
>
> +1
>
> > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of
> > just emitting the stats at the end.
> >
> > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid,
> > substep it is running, whether it is on target for completion, checkpoint_Reason
> > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some
> > refactoring here.
>
> I agree to provide above mentioned information as part of showing the
> progress of current checkpoint operation. I am currently looking into
> the code to know if any other information can be added.

As suggested in the other thread by Julien, I'm changing the subject
of this thread to reflect the discussion.

Regards,
Bharath Rupireddy.

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Nitin Jadhav
Date:
> > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of
> > just emitting the stats at the end.
> >
> > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid,
> > substep it is running, whether it is on target for completion, checkpoint_Reason
> > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some
> > refactoring here.
>
> I agree to provide above mentioned information as part of showing the
> progress of current checkpoint operation. I am currently looking into
> the code to know if any other information can be added.

Here is the initial patch to show the progress of a checkpoint through
the pg_stat_progress_checkpoint view. Please find the attachment.

The information added to this view is: pid - the process ID of the
checkpointer process; kind - the kind of checkpoint, indicating the
reason for the checkpoint (values can be wal, time or force); phase -
the current phase of the checkpoint operation; total_buffer_writes -
the total number of buffers to be written; buffers_processed - the
number of buffers processed; buffers_written - the number of buffers
written; total_file_syncs - the total number of files to be synced;
files_synced - the number of files synced.

Many operations happen as part of a checkpoint, and for each of them I
update the phase field of the pg_stat_progress_checkpoint view. The
values supported for this field are initializing, checkpointing
replication slots, checkpointing snapshots, checkpointing logical
rewrite mappings, checkpointing CLOG pages, checkpointing CommitTs
pages, checkpointing SUBTRANS pages, checkpointing MULTIXACT pages,
checkpointing SLRU pages, checkpointing buffers, performing sync
requests, performing two phase checkpoint, recycling old XLOG files
and finalizing. In the checkpointing buffers phase, the fields
total_buffer_writes, buffers_processed and buffers_written show the
detailed progress of writing buffers. In the performing sync requests
phase, the fields total_file_syncs and files_synced show the detailed
progress of syncing files. In the other phases, only the phase field
is updated; it is difficult to show progress there because we cannot
get the total file count without traversing the directory, and it is
not worth calculating that, as it would affect the performance of the
checkpoint. I also considered reporting just the number of files
processed, but that would not give meaningful progress information (it
is more of a statistic), so in those scenarios only the phase field is
updated.
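
For illustration, a rough sketch of how the phase updates described
above could be wired in; the PROGRESS_CHECKPOINT_PHASE* constants
mirror the patch's idea and are assumptions, not committed
definitions:

    /* Hypothetical phase codes reported in the "phase" column of the view. */
    #define PROGRESS_CHECKPOINT_PHASE                   1
    #define PROGRESS_CHECKPOINT_PHASE_INIT              0
    #define PROGRESS_CHECKPOINT_PHASE_BUFFERS           9
    #define PROGRESS_CHECKPOINT_PHASE_SYNC_FILES       10

    /* Before each step of the checkpoint, e.g. before writing buffers: */
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
                                 PROGRESS_CHECKPOINT_PHASE_BUFFERS);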

Apart from the above fields, I am planning to add a few more fields to
the view in the next patch: the process ID of the backend that
triggered a CHECKPOINT command, the checkpoint start location, a field
to indicate whether it is a checkpoint or a restartpoint, and the
elapsed time of the checkpoint operation. Please share your thoughts.
I would be happy to add any other information that contributes to
showing the progress of a checkpoint.

As per the discussion in this thread, there should also be some
mechanism to show the progress of a checkpoint during the shutdown and
end-of-recovery cases, as we cannot access pg_stat_progress_checkpoint
then. I am working on using the log_startup_progress_interval
mechanism to log the progress in the server logs for those cases.

Kindly review the patch and share your thoughts.


On Fri, Jan 28, 2022 at 12:24 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Jan 21, 2022 at 11:07 AM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > I think the right choice to solve the *general* problem is the
> > > mentioned pg_stat_progress_checkpoints.
> > >
> > > We may want to *additionally* have the ability to log the progress
> > > specifically for the special cases when we're not able to use that
> > > view. And in those case, we can perhaps just use the existing
> > > log_startup_progress_interval parameter for this as well -- at least
> > > for the startup checkpoint.
> >
> > +1
> >
> > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of
> > > just emitting the stats at the end.
> > >
> > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid,
> > > substep it is running, whether it is on target for completion, checkpoint_Reason
> > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some
> > > refactoring here.
> >
> > I agree to provide above mentioned information as part of showing the
> > progress of current checkpoint operation. I am currently looking into
> > the code to know if any other information can be added.
>
> As suggested in the other thread by Julien, I'm changing the subject
> of this thread to reflect the discussion.
>
> Regards,
> Bharath Rupireddy.

Attachments

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Nitin Jadhav
Date:
> Apart from above fields, I am planning to add few more fields to the
> view in the next patch. That is, process ID of the backend process
> which triggered a CHECKPOINT command, checkpoint start location, filed
> to indicate whether it is a checkpoint or restartpoint and elapsed
> time of the checkpoint operation. Please share your thoughts. I would
> be happy to add any other information that contributes to showing the
> progress of checkpoint.

The progress reporting mechanism of Postgres uses the
'st_progress_param' array of the 'PgBackendStatus' structure to hold
the progress information. The function
'pgstat_progress_update_param()' takes 'index' and 'val' as arguments
and stores 'val' at the corresponding 'index' of the
'st_progress_param' array. This mechanism works fine when all the
progress information is of integer type, as 'st_progress_param' is an
array of integers. If the progress data is of a type other than
integer, there is no easy way to report it. As I understand it, we
would have to define a new structure with additional fields, add it to
the 'PgBackendStatus' structure, and provide the necessary functions
to update and fetch the data from it. That becomes very ugly, as it
does not match the existing progress reporting mechanism. Kindly let
me know if there is a better way to handle this. If the existing
mechanism needs changes to make it generic enough to support basic
data types, I would like to discuss that in a new thread.

On Thu, Feb 10, 2022 at 12:22 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of
> > > just emitting the stats at the end.
> > >
> > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid,
> > > substep it is running, whether it is on target for completion, checkpoint_Reason
> > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some
> > > refactoring here.
> >
> > I agree to provide above mentioned information as part of showing the
> > progress of current checkpoint operation. I am currently looking into
> > the code to know if any other information can be added.
>
> Here is the initial patch to show the progress of checkpoint through
> pg_stat_progress_checkpoint view. Please find the attachment.
>
> The information added to this view are pid - process ID of a
> CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> for checkpoint (values can be wal, time or force), phase - indicates
> the current phase of checkpoint operation, total_buffer_writes - total
> number of buffers to be written, buffers_processed - number of buffers
> processed, buffers_written - number of buffers written,
> total_file_syncs - total number of files to be synced, files_synced -
> number of files synced.
>
> There are many operations happen as part of checkpoint. For each of
> the operation I am updating the phase field of
> pg_stat_progress_checkpoint view. The values supported for this field
> are initializing, checkpointing replication slots, checkpointing
> snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> buffers, performing sync requests, performing two phase checkpoint,
> recycling old XLOG files and Finalizing. In case of checkpointing
> buffers phase, the fields total_buffer_writes, buffers_processed and
> buffers_written shows the detailed progress of writing buffers. In
> case of performing sync requests phase, the fields total_file_syncs
> and files_synced shows the detailed progress of syncing files. In
> other phases, only the phase field is getting updated and it is
> difficult to show the progress because we do not get the total number
> of files count without traversing the directory. It is not worth to
> calculate that as it affects the performance of the checkpoint. I also
> gave a thought to just mention the number of files processed, but this
> wont give a meaningful progress information (It can be treated as
> statistics). Hence just updating the phase field in those scenarios.
>
> Apart from above fields, I am planning to add few more fields to the
> view in the next patch. That is, process ID of the backend process
> which triggered a CHECKPOINT command, checkpoint start location, filed
> to indicate whether it is a checkpoint or restartpoint and elapsed
> time of the checkpoint operation. Please share your thoughts. I would
> be happy to add any other information that contributes to showing the
> progress of checkpoint.
>
> As per the discussion in this thread, there should be some mechanism
> to show the progress of checkpoint during shutdown and end-of-recovery
> cases as we cannot access pg_stat_progress_checkpoint in those cases.
> I am working on this to use log_startup_progress_interval mechanism to
> log the progress in the server logs.
>
> Kindly review the patch and share your thoughts.
>
>
> On Fri, Jan 28, 2022 at 12:24 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Fri, Jan 21, 2022 at 11:07 AM Nitin Jadhav
> > <nitinjadhavpostgres@gmail.com> wrote:
> > >
> > > > I think the right choice to solve the *general* problem is the
> > > > mentioned pg_stat_progress_checkpoints.
> > > >
> > > > We may want to *additionally* have the ability to log the progress
> > > > specifically for the special cases when we're not able to use that
> > > > view. And in those case, we can perhaps just use the existing
> > > > log_startup_progress_interval parameter for this as well -- at least
> > > > for the startup checkpoint.
> > >
> > > +1
> > >
> > > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of
> > > > just emitting the stats at the end.
> > > >
> > > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid,
> > > > substep it is running, whether it is on target for completion, checkpoint_Reason
> > > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need
> > > > some refactoring here.
> > >
> > > I agree to provide above mentioned information as part of showing the
> > > progress of current checkpoint operation. I am currently looking into
> > > the code to know if any other information can be added.
> >
> > As suggested in the other thread by Julien, I'm changing the subject
> > of this thread to reflect the discussion.
> >
> > Regards,
> > Bharath Rupireddy.



Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Matthias van de Meent
Date:
On Tue, 15 Feb 2022 at 13:16, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > Apart from above fields, I am planning to add few more fields to the
> > view in the next patch. That is, process ID of the backend process
> > which triggered a CHECKPOINT command, checkpoint start location, filed
> > to indicate whether it is a checkpoint or restartpoint and elapsed
> > time of the checkpoint operation. Please share your thoughts. I would
> > be happy to add any other information that contributes to showing the
> > progress of checkpoint.
>
> The progress reporting mechanism of postgres uses the
> 'st_progress_param' array of 'PgBackendStatus' structure to hold the
> information related to the progress. There is a function
> 'pgstat_progress_update_param()' which takes 'index' and 'val' as
> arguments and updates the 'val' to corresponding 'index' in the
> 'st_progress_param' array. This mechanism works fine when all the
> progress information is of type integer as the data type of
> 'st_progress_param' is of type integer. If the progress data is of
> different type than integer, then there is no easy way to do so.

Progress parameters are int64, so all of the new 'checkpoint start
location' (lsn = uint64), 'triggering backend PID' (int), 'elapsed
time' (store as start time in stat_progress, timestamp fits in 64
bits) and 'checkpoint or restartpoint?' (boolean) would each fit in a
current stat_progress parameter. Some processing would be required at
the view, but that's not impossible to overcome.
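
For illustration, a rough sketch of how each of those values could be
packed into the existing int64 progress slots; the PROGRESS_CHECKPOINT_*
indexes and variable names are assumptions:

    /* All of these fit into the int64 st_progress_param slots. */
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN,
                                 (int64) CheckPointStartLSN);   /* XLogRecPtr (uint64), bit-cast */
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_START_TIME,
                                 (int64) CheckPointStartTime);  /* TimestampTz is already int64 */
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_BACKEND_PID,
                                 (int64) triggering_pid);
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_IS_RESTARTPOINT,
                                 is_restartpoint ? 1 : 0);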

Kind regards,

Matthias van de Meent



Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Matthias van de Meent
Date:
On Thu, 10 Feb 2022 at 07:53, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of
> > > just emitting the stats at the end.
> > >
> > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid,
> > > substep it is running, whether it is on target for completion, checkpoint_Reason
> > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some
> > > refactoring here.
> >
> > I agree to provide above mentioned information as part of showing the
> > progress of current checkpoint operation. I am currently looking into
> > the code to know if any other information can be added.
>
> Here is the initial patch to show the progress of checkpoint through
> pg_stat_progress_checkpoint view. Please find the attachment.
>
> The information added to this view are pid - process ID of a
> CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> for checkpoint (values can be wal, time or force), phase - indicates
> the current phase of checkpoint operation, total_buffer_writes - total
> number of buffers to be written, buffers_processed - number of buffers
> processed, buffers_written - number of buffers written,
> total_file_syncs - total number of files to be synced, files_synced -
> number of files synced.
>
> There are many operations happen as part of checkpoint. For each of
> the operation I am updating the phase field of
> pg_stat_progress_checkpoint view. The values supported for this field
> are initializing, checkpointing replication slots, checkpointing
> snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> buffers, performing sync requests, performing two phase checkpoint,
> recycling old XLOG files and Finalizing. In case of checkpointing
> buffers phase, the fields total_buffer_writes, buffers_processed and
> buffers_written shows the detailed progress of writing buffers. In
> case of performing sync requests phase, the fields total_file_syncs
> and files_synced shows the detailed progress of syncing files. In
> other phases, only the phase field is getting updated and it is
> difficult to show the progress because we do not get the total number
> of files count without traversing the directory. It is not worth to
> calculate that as it affects the performance of the checkpoint. I also
> gave a thought to just mention the number of files processed, but this
> wont give a meaningful progress information (It can be treated as
> statistics). Hence just updating the phase field in those scenarios.
>
> Apart from above fields, I am planning to add few more fields to the
> view in the next patch. That is, process ID of the backend process
> which triggered a CHECKPOINT command, checkpoint start location, filed
> to indicate whether it is a checkpoint or restartpoint and elapsed
> time of the checkpoint operation. Please share your thoughts. I would
> be happy to add any other information that contributes to showing the
> progress of checkpoint.
>
> As per the discussion in this thread, there should be some mechanism
> to show the progress of checkpoint during shutdown and end-of-recovery
> cases as we cannot access pg_stat_progress_checkpoint in those cases.
> I am working on this to use log_startup_progress_interval mechanism to
> log the progress in the server logs.
>
> Kindly review the patch and share your thoughts.

Interesting idea, and overall a nice addition to the
pg_stat_progress_* reporting infrastructure.

Could you add your patch to the current commitfest at
https://commitfest.postgresql.org/37/?

See below for some comments on the patch:

> xlog.c @ checkpoint_progress_start, checkpoint_progress_update_param, checkpoint_progress_end
> +    /* In bootstrap mode, we don't actually record anything. */
> +    if (IsBootstrapProcessingMode())
> +        return;

Why do you check against the state of the system?
pgstat_progress_update_* already provides protections against updating
the progress tables if the progress infrastructure is not loaded; and
otherwise (in the happy path) the cost of updating the progress fields
will be quite a bit higher than normal. Updating stat_progress isn't
very expensive (quite cheap, really), so I don't quite get why you
guard against reporting stats when you expect no other client to be
listening.

I think you can simplify this a lot by directly using
pgstat_progress_update_param() instead.

> xlog.c @ checkpoint_progress_start
> +        pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> +        checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
> +                                         PROGRESS_CHECKPOINT_PHASE_INIT);
> +        if (flags & CHECKPOINT_CAUSE_XLOG)
> +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> +                                             PROGRESS_CHECKPOINT_KIND_WAL);
> +        else if (flags & CHECKPOINT_CAUSE_TIME)
> +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> +                                             PROGRESS_CHECKPOINT_KIND_TIME);
> + [...]

Could you assign the kind of checkpoint to a local variable, and then
update the "phase" and "kind" parameters at the same time through
pgstat_progress_update_multi_param(2, ...)? See
BuildRelationExtStatistics in extended_stats.c for an example usage.
Note that regardless of whether checkpoint_progress_update* will
remain, the checks done in that function already have been checked in
this function as well, so you can use the pgstat_* functions directly.
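
For illustration, a rough sketch of that multi-param update;
pgstat_progress_update_multi_param() is the existing API, while the
PROGRESS_CHECKPOINT_* names follow the quoted patch snippet and are
assumptions:

    /* Set "phase" and "kind" in a single atomic update. */
    {
        const int   index[] = {PROGRESS_CHECKPOINT_PHASE, PROGRESS_CHECKPOINT_KIND};
        int64       val[2];

        val[0] = PROGRESS_CHECKPOINT_PHASE_INIT;
        if (flags & CHECKPOINT_CAUSE_XLOG)
            val[1] = PROGRESS_CHECKPOINT_KIND_WAL;
        else if (flags & CHECKPOINT_CAUSE_TIME)
            val[1] = PROGRESS_CHECKPOINT_KIND_TIME;
        else
            val[1] = PROGRESS_CHECKPOINT_KIND_FORCE;

        pgstat_progress_update_multi_param(2, index, val);
    }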

> monitoring.sgml
> +   <structname>pg_stat_progress_checkpoint</structname> view will contain a
> +   single row indicating the progress of checkpoint operation.

... add "if a checkpoint is currently active".

> +       <structfield>total_buffer_writes</structfield> <type>bigint</type>
> +       <structfield>total_file_syncs</structfield> <type>bigint</type>

The other progress tables use [type]_total as column names for counter
targets (e.g. backup_total for backup_streamed, heap_blks_total for
heap_blks_scanned, etc.). I think that `buffers_total` and
`files_total` would be better column names.

> +       The checkpoint operation is requested due to XLOG filling.

+ The checkpoint was started because >max_wal_size< of WAL was written.

> +       The checkpoint operation is requested due to timeout.

+ The checkpoint was started due to the expiration of a
>checkpoint_timeout< interval

> +       The checkpoint operation is forced even if no XLOG activity has occurred
> +       since the last one.

+ Some operation forced a checkpoint.

> +      <entry><literal>checkpointing CommitTs pages</literal></entry>

CommitTs -> Commit time stamp

Thanks for working on this.

Kind regards,

Matthias van de Meent

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Nitin Jadhav
Date:
> > The progress reporting mechanism of postgres uses the
> > 'st_progress_param' array of 'PgBackendStatus' structure to hold the
> > information related to the progress. There is a function
> > 'pgstat_progress_update_param()' which takes 'index' and 'val' as
> > arguments and updates the 'val' to corresponding 'index' in the
> > 'st_progress_param' array. This mechanism works fine when all the
> > progress information is of type integer as the data type of
> > 'st_progress_param' is of type integer. If the progress data is of
> > different type than integer, then there is no easy way to do so.
>
> Progress parameters are int64, so all of the new 'checkpoint start
> location' (lsn = uint64), 'triggering backend PID' (int), 'elapsed
> time' (store as start time in stat_progress, timestamp fits in 64
> bits) and 'checkpoint or restartpoint?' (boolean) would each fit in a
> current stat_progress parameter. Some processing would be required at
> the view, but that's not impossible to overcome.

Thank you for sharing the information.  'triggering backend PID' (int)
- can be stored without any problem. 'checkpoint or restartpoint?'
(boolean) - can be stored as an integer value, like
PROGRESS_CHECKPOINT_TYPE_CHECKPOINT (0) and
PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT (1). 'elapsed time' (store as
start time in stat_progress, timestamp fits in 64 bits) - since
TimestampTz is internally an int64, we can store the timestamp value
in the progress parameter and then expose a function like
'pg_stat_get_progress_checkpoint_elapsed', which takes an int64 (not
TimestampTz) as argument and returns a string representing the elapsed
time. This function can be called in the view. Is it safe/advisable to
use the int64 type here rather than TimestampTz for this purpose?
'checkpoint start location' (lsn = uint64) - I feel we cannot use
progress parameters for this case, as assigning a uint64 to an int64
would be an issue for larger values and can lead to hidden bugs.

Thoughts?

Thanks & Regards,
Nitin Jadhav


On Thu, Feb 17, 2022 at 1:33 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Thu, 10 Feb 2022 at 07:53, Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of
> > > > just emitting the stats at the end.
> > > >
> > > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid,
> > > > substep it is running, whether it is on target for completion, checkpoint_Reason
> > > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need
> > > > some refactoring here.
> > >
> > > I agree to provide above mentioned information as part of showing the
> > > progress of current checkpoint operation. I am currently looking into
> > > the code to know if any other information can be added.
> >
> > Here is the initial patch to show the progress of checkpoint through
> > pg_stat_progress_checkpoint view. Please find the attachment.
> >
> > The information added to this view are pid - process ID of a
> > CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> > for checkpoint (values can be wal, time or force), phase - indicates
> > the current phase of checkpoint operation, total_buffer_writes - total
> > number of buffers to be written, buffers_processed - number of buffers
> > processed, buffers_written - number of buffers written,
> > total_file_syncs - total number of files to be synced, files_synced -
> > number of files synced.
> >
> > There are many operations happen as part of checkpoint. For each of
> > the operation I am updating the phase field of
> > pg_stat_progress_checkpoint view. The values supported for this field
> > are initializing, checkpointing replication slots, checkpointing
> > snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> > pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> > checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> > buffers, performing sync requests, performing two phase checkpoint,
> > recycling old XLOG files and Finalizing. In case of checkpointing
> > buffers phase, the fields total_buffer_writes, buffers_processed and
> > buffers_written shows the detailed progress of writing buffers. In
> > case of performing sync requests phase, the fields total_file_syncs
> > and files_synced shows the detailed progress of syncing files. In
> > other phases, only the phase field is getting updated and it is
> > difficult to show the progress because we do not get the total number
> > of files count without traversing the directory. It is not worth to
> > calculate that as it affects the performance of the checkpoint. I also
> > gave a thought to just mention the number of files processed, but this
> > wont give a meaningful progress information (It can be treated as
> > statistics). Hence just updating the phase field in those scenarios.
> >
> > Apart from above fields, I am planning to add few more fields to the
> > view in the next patch. That is, process ID of the backend process
> > which triggered a CHECKPOINT command, checkpoint start location, filed
> > to indicate whether it is a checkpoint or restartpoint and elapsed
> > time of the checkpoint operation. Please share your thoughts. I would
> > be happy to add any other information that contributes to showing the
> > progress of checkpoint.
> >
> > As per the discussion in this thread, there should be some mechanism
> > to show the progress of checkpoint during shutdown and end-of-recovery
> > cases as we cannot access pg_stat_progress_checkpoint in those cases.
> > I am working on this to use log_startup_progress_interval mechanism to
> > log the progress in the server logs.
> >
> > Kindly review the patch and share your thoughts.
>
> Interesting idea, and overall a nice addition to the
> pg_stat_progress_* reporting infrastructure.
>
> Could you add your patch to the current commitfest at
> https://commitfest.postgresql.org/37/?
>
> See below for some comments on the patch:
>
> > xlog.c @ checkpoint_progress_start, checkpoint_progress_update_param, checkpoint_progress_end
> > +    /* In bootstrap mode, we don't actually record anything. */
> > +    if (IsBootstrapProcessingMode())
> > +        return;
>
> Why do you check against the state of the system?
> pgstat_progress_update_* already provides protections against updating
> the progress tables if the progress infrastructure is not loaded; and
> otherwise (in the happy path) the cost of updating the progress fields
> will be quite a bit higher than normal. Updating stat_progress isn't
> very expensive (quite cheap, really), so I don't quite get why you
> guard against reporting stats when you expect no other client to be
> listening.
>
> I think you can simplify this a lot by directly using
> pgstat_progress_update_param() instead.
>
> > xlog.c @ checkpoint_progress_start
> > +        pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> > +        checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
> > +                                         PROGRESS_CHECKPOINT_PHASE_INIT);
> > +        if (flags & CHECKPOINT_CAUSE_XLOG)
> > +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> > +                                             PROGRESS_CHECKPOINT_KIND_WAL);
> > +        else if (flags & CHECKPOINT_CAUSE_TIME)
> > +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> > +                                             PROGRESS_CHECKPOINT_KIND_TIME);
> > + [...]
>
> Could you assign the kind of checkpoint to a local variable, and then
> update the "phase" and "kind" parameters at the same time through
> pgstat_progress_update_multi_param(2, ...)? See
> BuildRelationExtStatistics in extended_stats.c for an example usage.
> Note that regardless of whether checkpoint_progress_update* will
> remain, the checks done in that function already have been checked in
> this function as well, so you can use the pgstat_* functions directly.
>
> > monitoring.sgml
> > +   <structname>pg_stat_progress_checkpoint</structname> view will contain a
> > +   single row indicating the progress of checkpoint operation.
>
> ... add "if a checkpoint is currently active".
>
> > +       <structfield>total_buffer_writes</structfield> <type>bigint</type>
> > +       <structfield>total_file_syncs</structfield> <type>bigint</type>
>
> The other progress tables use [type]_total as column names for counter
> targets (e.g. backup_total for backup_streamed, heap_blks_total for
> heap_blks_scanned, etc.). I think that `buffers_total` and
> `files_total` would be better column names.
>
> > +       The checkpoint operation is requested due to XLOG filling.
>
> + The checkpoint was started because >max_wal_size< of WAL was written.
>
> > +       The checkpoint operation is requested due to timeout.
>
> + The checkpoint was started due to the expiration of a
> >checkpoint_timeout< interval
>
> > +       The checkpoint operation is forced even if no XLOG activity has occurred
> > +       since the last one.
>
> + Some operation forced a checkpoint.
>
> > +      <entry><literal>checkpointing CommitTs pages</literal></entry>
>
> CommitTs -> Commit time stamp
>
> Thanks for working on this.
>
> Kind regards,
>
> Matthias van de Meent

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Julien Rouhaud
Date:
Hi,

On Thu, Feb 17, 2022 at 12:26:07PM +0530, Nitin Jadhav wrote:
> 
> Thank you for sharing the information.  'triggering backend PID' (int)
> - can be stored without any problem.

There can be multiple processes triggering a checkpoint, or at least wanting it
to happen or happen faster.

> 'checkpoint or restartpoint?'

Do you actually need to store that?  Can't it be inferred from
pg_is_in_recovery()?



Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Matthias van de Meent
Date:
On Thu, 17 Feb 2022 at 07:56, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > Progress parameters are int64, so all of the new 'checkpoint start
> > location' (lsn = uint64), 'triggering backend PID' (int), 'elapsed
> > time' (store as start time in stat_progress, timestamp fits in 64
> > bits) and 'checkpoint or restartpoint?' (boolean) would each fit in a
> > current stat_progress parameter. Some processing would be required at
> > the view, but that's not impossible to overcome.
>
> Thank you for sharing the information.  'triggering backend PID' (int)
> - can be stored without any problem. 'checkpoint or restartpoint?'
> (boolean) - can be stored as a integer value like
> PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
> PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
> start time in stat_progress, timestamp fits in 64 bits) - As
> Timestamptz is of type int64 internally, so we can store the timestamp
> value in the progres parameter and then expose a function like
> 'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
> Timestamptz) as argument and then returns string representing the
> elapsed time.

No need to use a string there; I think exposing the checkpoint start
time is good enough. The conversion of int64 to timestamp[tz] can be
done in SQL (although I'm not sure that the internal bitwise
representation of Interval should be exposed to that extent) [0].
Users can then extract the duration interval using now() - start_time,
which also allows the user to use their own preferred formatting.

> This function can be called in the view. Is it
> safe/advisable to use int64 type here rather than Timestamptz for this
> purpose?

Yes, this must be exposed through int64, as the sql-callable
pg_stat_get_progress_info only exposes bigint columns. Any
transformation function may return other types (see
pg_indexam_progress_phasename for an example of that).

>  'checkpoint start location' (lsn = uint64) - I feel we
> cannot use progress parameters for this case. As assigning uint64 to
> int64 type would be an issue for larger values and can lead to hidden
> bugs.

Not necessarily - we can (without much trouble) do a bitwise cast from
uint64 to int64, and then (in SQL) cast it back to a pg_lsn [1]. Not
very elegant, but it works quite well.

Kind regards,

Matthias van de Meent

[0] Assuming we don't care about the years past 294246 CE (2942467 is
when int64 overflows into negatives), the following works without any
precision losses: SELECT
to_timestamp((stat.my_int64::bigint/1000000)::float8) +
make_interval(0, 0, 0, 0, 0, 0, MOD(stat.my_int64, 1000000)::float8 /
1000000::float8) FROM (SELECT 1::bigint) AS stat(my_int64);
[1] SELECT '0/0'::pg_lsn + ((CASE WHEN stat.my_int64 < 0 THEN
pow(2::numeric, 64::numeric)::numeric ELSE 0::numeric END) +
stat.my_int64::numeric) FROM (SELECT -2::bigint /* 0xFFFFFFFF/FFFFFFFE
*/ AS my_bigint_lsn) AS stat(my_int64);

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Nitin Jadhav
Date:
> > Thank you for sharing the information.  'triggering backend PID' (int)
> > - can be stored without any problem.
>
> There can be multiple processes triggering a checkpoint, or at least wanting it
> to happen or happen faster.

Yes. There can be multiple processes but there will be one checkpoint
operation at a time. So the backend PID corresponds to the current
checkpoint operation. Let me know if I am missing something.

> > 'checkpoint or restartpoint?'
>
> Do you actually need to store that?  Can't it be inferred from
> pg_is_in_recovery()?

AFAIK we cannot use pg_is_in_recovery() to determine whether it is a
checkpoint or a restartpoint, because if the system exits recovery
mode during a restartpoint, then a query of the
pg_stat_progress_checkpoint view will report it as a checkpoint, which
is not correct. Please correct me if I am wrong.

Thanks & Regards,
Nitin Jadhav

On Thu, Feb 17, 2022 at 4:35 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> On Thu, Feb 17, 2022 at 12:26:07PM +0530, Nitin Jadhav wrote:
> >
> > Thank you for sharing the information.  'triggering backend PID' (int)
> > - can be stored without any problem.
>
> There can be multiple processes triggering a checkpoint, or at least wanting it
> to happen or happen faster.
>
> > 'checkpoint or restartpoint?'
>
> Do you actually need to store that?  Can't it be inferred from
> pg_is_in_recovery()?


Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Julien Rouhaud
Date:
Hi,

On Thu, Feb 17, 2022 at 10:39:02PM +0530, Nitin Jadhav wrote:
> > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > - can be stored without any problem.
> >
> > There can be multiple processes triggering a checkpoint, or at least wanting it
> > to happen or happen faster.
> 
> Yes. There can be multiple processes but there will be one checkpoint
> operation at a time. So the backend PID corresponds to the current
> checkpoint operation. Let me know if I am missing something.

If there's a timed checkpoint triggered and then someone calls
pg_start_backup(), which then waits for the end of the current
checkpoint (possibly after changing the flags), I think the view
should reflect that in some way.  Maybe storing an array of (pid,
flags) is too much, but at least a counter with the number of
processes actively waiting for the end of the checkpoint would be
useful.

> > > 'checkpoint or restartpoint?'
> >
> > Do you actually need to store that?  Can't it be inferred from
> > pg_is_in_recovery()?
> 
> AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
> checkpoint or restartpoint because if the system exits from recovery
> mode during restartpoint then any query to pg_stat_progress_checkpoint
> view will return it as a checkpoint which is ideally not correct. Please
> correct me if I am wrong.

Recovery ends with an end-of-recovery checkpoint that has to finish before the
promotion can happen, so I don't think that a restartpoint can still be in
progress if pg_is_in_recovery() returns false.


Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Nitin Jadhav
Date:
> Interesting idea, and overall a nice addition to the
> pg_stat_progress_* reporting infrastructure.
>
> Could you add your patch to the current commitfest at
> https://commitfest.postgresql.org/37/?
>
> See below for some comments on the patch:

Thank you for reviewing.
I have added it to the commitfest - https://commitfest.postgresql.org/37/3545/

> > xlog.c @ checkpoint_progress_start, checkpoint_progress_update_param, checkpoint_progress_end
> > +    /* In bootstrap mode, we don't actually record anything. */
> > +    if (IsBootstrapProcessingMode())
> > +        return;
>
> Why do you check against the state of the system?
> pgstat_progress_update_* already provides protections against updating
> the progress tables if the progress infrastructure is not loaded; and
> otherwise (in the happy path) the cost of updating the progress fields
> will be quite a bit higher than normal. Updating stat_progress isn't
> very expensive (quite cheap, really), so I don't quite get why you
> guard against reporting stats when you expect no other client to be
> listening.

Nice point. I agree that the extra guards (IsBootstrapProcessingMode()
and (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) ==
0) are not needed for the view, as the progress reporting mechanism
handles that internally (it only takes effect when there is access to
the progress view). However, I am planning to report the progress of
the checkpoint during the shutdown and end-of-recovery cases in the
server logs, since we don't have access to the view then, and in that
case these guards are necessary. checkpoint_progress_update_param() is
a generic function that reports progress to the view or to the server
logs. Thoughts?
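
For illustration, a rough sketch of what such a wrapper could look
like. Everything here is an assumption about the not-yet-posted
version of the patch, and ereport_log_progress is the generalized
logging helper discussed earlier in the thread, not an existing API:

    /*
     * Sketch only: report progress to the view always, and additionally to
     * the server log for shutdown and end-of-recovery checkpoints, where
     * the view cannot be queried.
     */
    static void
    checkpoint_progress_update_param(int flags, int index, int64 val)
    {
        /* In bootstrap mode, we don't report anything. */
        if (IsBootstrapProcessingMode())
            return;

        pgstat_progress_update_param(index, val);

        if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
            ereport_log_progress("checkpoint: elapsed time: %ld.%02d s, phase %d, value %lld",
                                 index, (long long) val);
    }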

> I think you can simplify this a lot by directly using
> pgstat_progress_update_param() instead.
>
> > xlog.c @ checkpoint_progress_start
> > +        pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> > +        checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
> > +                                         PROGRESS_CHECKPOINT_PHASE_INIT);
> > +        if (flags & CHECKPOINT_CAUSE_XLOG)
> > +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> > +                                             PROGRESS_CHECKPOINT_KIND_WAL);
> > +        else if (flags & CHECKPOINT_CAUSE_TIME)
> > +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> > +                                             PROGRESS_CHECKPOINT_KIND_TIME);
> > + [...]
>
> Could you assign the kind of checkpoint to a local variable, and then
> update the "phase" and "kind" parameters at the same time through
> pgstat_progress_update_multi_param(2, ...)? See
> BuildRelationExtStatistics in extended_stats.c for an example usage.

I will make use of pgstat_progress_update_multi_param() in the next
patch to replace multiple calls to checkpoint_progress_update_param().

> Note that regardless of whether checkpoint_progress_update* will
> remain, the checks done in that function already have been checked in
> this function as well, so you can use the pgstat_* functions directly.

As I mentioned before, since I am planning to add progress reporting
in the server logs, checkpoint_progress_update_param() is required and
it makes the job easier.

> > monitoring.sgml
> > +   <structname>pg_stat_progress_checkpoint</structname> view will contain a
> > +   single row indicating the progress of checkpoint operation.
>
>... add "if a checkpoint is currently active".

I feel that adding extra words here to indicate "if a checkpoint is
currently active" is not necessary, as the view description provides
that information, and the current wording also aligns with the
documentation of the existing progress views.

> > +       <structfield>total_buffer_writes</structfield> <type>bigint</type>
> > +       <structfield>total_file_syncs</structfield> <type>bigint</type>
>
> The other progress tables use [type]_total as column names for counter
> targets (e.g. backup_total for backup_streamed, heap_blks_total for
> heap_blks_scanned, etc.). I think that `buffers_total` and
> `files_total` would be better column names.

I agree and I will update this in the next patch.

> > +       The checkpoint operation is requested due to XLOG filling.
>
> + The checkpoint was started because >max_wal_size< of WAL was written.

How about this "The checkpoint is started because max_wal_size is reached".

> > +       The checkpoint operation is requested due to timeout.
>
> + The checkpoint was started due to the expiration of a
> >checkpoint_timeout< interval

"The checkpoint is started because checkpoint_timeout expired".

> > +       The checkpoint operation is forced even if no XLOG activity has occurred
> > +       since the last one.
>
> + Some operation forced a checkpoint.

"The checkpoint is started because some operation forced a checkpoint".

> > +      <entry><literal>checkpointing CommitTs pages</literal></entry>
>
> CommitTs -> Commit time stamp

I will handle this in the next patch.

Thanks & Regards,
Nitin Jadhav
> On Thu, 10 Feb 2022 at 07:53, Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of
> > > > just emitting the stats at the end.
> > > >
> > > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid,
> > > > substep it is running, whether it is on target for completion, checkpoint_Reason
> > > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need
> > > > some refactoring here.
> > >
> > > I agree to provide above mentioned information as part of showing the
> > > progress of current checkpoint operation. I am currently looking into
> > > the code to know if any other information can be added.
> >
> > Here is the initial patch to show the progress of checkpoint through
> > pg_stat_progress_checkpoint view. Please find the attachment.
> >
> > The information added to this view are pid - process ID of a
> > CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> > for checkpoint (values can be wal, time or force), phase - indicates
> > the current phase of checkpoint operation, total_buffer_writes - total
> > number of buffers to be written, buffers_processed - number of buffers
> > processed, buffers_written - number of buffers written,
> > total_file_syncs - total number of files to be synced, files_synced -
> > number of files synced.
> >
> > There are many operations happen as part of checkpoint. For each of
> > the operation I am updating the phase field of
> > pg_stat_progress_checkpoint view. The values supported for this field
> > are initializing, checkpointing replication slots, checkpointing
> > snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> > pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> > checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> > buffers, performing sync requests, performing two phase checkpoint,
> > recycling old XLOG files and Finalizing. In case of checkpointing
> > buffers phase, the fields total_buffer_writes, buffers_processed and
> > buffers_written shows the detailed progress of writing buffers. In
> > case of performing sync requests phase, the fields total_file_syncs
> > and files_synced shows the detailed progress of syncing files. In
> > other phases, only the phase field is getting updated and it is
> > difficult to show the progress because we do not get the total number
> > of files count without traversing the directory. It is not worth to
> > calculate that as it affects the performance of the checkpoint. I also
> > gave a thought to just mention the number of files processed, but this
> > wont give a meaningful progress information (It can be treated as
> > statistics). Hence just updating the phase field in those scenarios.
> >
> > Apart from above fields, I am planning to add few more fields to the
> > view in the next patch. That is, process ID of the backend process
> > which triggered a CHECKPOINT command, checkpoint start location, filed
> > to indicate whether it is a checkpoint or restartpoint and elapsed
> > time of the checkpoint operation. Please share your thoughts. I would
> > be happy to add any other information that contributes to showing the
> > progress of checkpoint.
> >
> > As per the discussion in this thread, there should be some mechanism
> > to show the progress of checkpoint during shutdown and end-of-recovery
> > cases as we cannot access pg_stat_progress_checkpoint in those cases.
> > I am working on this to use log_startup_progress_interval mechanism to
> > log the progress in the server logs.
> >
> > Kindly review the patch and share your thoughts.
>
> Interesting idea, and overall a nice addition to the
> pg_stat_progress_* reporting infrastructure.
>
> Could you add your patch to the current commitfest at
> https://commitfest.postgresql.org/37/?
>
> See below for some comments on the patch:
>
> > xlog.c @ checkpoint_progress_start, checkpoint_progress_update_param, checkpoint_progress_end
> > +    /* In bootstrap mode, we don't actually record anything. */
> > +    if (IsBootstrapProcessingMode())
> > +        return;
>
> Why do you check against the state of the system?
> pgstat_progress_update_* already provides protections against updating
> the progress tables if the progress infrastructure is not loaded; and
> otherwise (in the happy path) the cost of updating the progress fields
> will be quite a bit higher than normal. Updating stat_progress isn't
> very expensive (quite cheap, really), so I don't quite get why you
> guard against reporting stats when you expect no other client to be
> listening.
>
> I think you can simplify this a lot by directly using
> pgstat_progress_update_param() instead.
>
> > xlog.c @ checkpoint_progress_start
> > +        pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> > +        checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
> > +                                         PROGRESS_CHECKPOINT_PHASE_INIT);
> > +        if (flags & CHECKPOINT_CAUSE_XLOG)
> > +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> > +                                             PROGRESS_CHECKPOINT_KIND_WAL);
> > +        else if (flags & CHECKPOINT_CAUSE_TIME)
> > +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> > +                                             PROGRESS_CHECKPOINT_KIND_TIME);
> > + [...]
>
> Could you assign the kind of checkpoint to a local variable, and then
> update the "phase" and "kind" parameters at the same time through
> pgstat_progress_update_multi_param(2, ...)? See
> BuildRelationExtStatistics in extended_stats.c for an example usage.
> Note that regardless of whether checkpoint_progress_update* will
> remain, the checks done in that function already have been checked in
> this function as well, so you can use the pgstat_* functions directly.
>
> > monitoring.sgml
> > +   <structname>pg_stat_progress_checkpoint</structname> view will contain a
> > +   single row indicating the progress of checkpoint operation.
>
> ... add "if a checkpoint is currently active".
>
> > +       <structfield>total_buffer_writes</structfield> <type>bigint</type>
> > +       <structfield>total_file_syncs</structfield> <type>bigint</type>
>
> The other progress tables use [type]_total as column names for counter
> targets (e.g. backup_total for backup_streamed, heap_blks_total for
> heap_blks_scanned, etc.). I think that `buffers_total` and
> `files_total` would be better column names.
>
> > +       The checkpoint operation is requested due to XLOG filling.
>
> + The checkpoint was started because >max_wal_size< of WAL was written.
>
> > +       The checkpoint operation is requested due to timeout.
>
> + The checkpoint was started due to the expiration of a
> >checkpoint_timeout< interval
>
> > +       The checkpoint operation is forced even if no XLOG activity has occurred
> > +       since the last one.
>
> + Some operation forced a checkpoint.
>
> > +      <entry><literal>checkpointing CommitTs pages</literal></entry>
>
> CommitTs -> Commit time stamp
>
> Thanks for working on this.
>
> Kind regards,
>
> Matthias van de Meent



> > > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > > - can be stored without any problem.
> > >
> > > There can be multiple processes triggering a checkpoint, or at least wanting it
> > > to happen or happen faster.
> >
> > Yes. There can be multiple processes but there will be one checkpoint
> > operation at a time. So the backend PID corresponds to the current
> > checkpoint operation. Let me know if I am missing something.
>
> If there's a checkpoint timed triggered and then someone calls
> pg_start_backup() which then wait for the end of the current checkpoint
> (possibly after changing the flags), I think the view should reflect that in
> some way.  Maybe storing an array of (pid, flags) is too much, but at least a
> counter with the number of processes actively waiting for the end of the
> checkpoint.

Okay. I feel this can be added as additional field but it will not
replace backend_pid field as this represents the pid of the backend
which triggered the current checkpoint. Probably a new field named
'processes_wiating' or 'events_waiting' can be added for this purpose.
Thoughts?

> > > > 'checkpoint or restartpoint?'
> > >
> > > Do you actually need to store that?  Can't it be inferred from
> > > pg_is_in_recovery()?
> >
> > AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
> > checkpoint or restartpoint because if the system exits from recovery
> > mode during restartpoint then any query to pg_stat_progress_checkpoint
> > view will return it as a checkpoint which is ideally not correct. Please
> > correct me if I am wrong.
>
> Recovery ends with an end-of-recovery checkpoint that has to finish before the
> promotion can happen, so I don't think that a restart can still be in progress
> if pg_is_in_recovery() returns false.

Writing buffers or syncing files may well complete before
pg_is_in_recovery() returns false, but some cleanup operations also
happen as part of the checkpoint. During that window we may get a
false value from pg_is_in_recovery(). Please refer to the following
piece of code in CreateRestartpoint():

if (!RecoveryInProgress())
        replayTLI = XLogCtl->InsertTimeLineID;

Thanks & Regards,
Nitin Jadhav

On Thu, Feb 17, 2022 at 10:57 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> On Thu, Feb 17, 2022 at 10:39:02PM +0530, Nitin Jadhav wrote:
> > > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > > - can be stored without any problem.
> > >
> > > There can be multiple processes triggering a checkpoint, or at least wanting it
> > > to happen or happen faster.
> >
> > Yes. There can be multiple processes but there will be one checkpoint
> > operation at a time. So the backend PID corresponds to the current
> > checkpoint operation. Let me know if I am missing something.
>
> If there's a checkpoint timed triggered and then someone calls
> pg_start_backup() which then wait for the end of the current checkpoint
> (possibly after changing the flags), I think the view should reflect that in
> some way.  Maybe storing an array of (pid, flags) is too much, but at least a
> counter with the number of processes actively waiting for the end of the
> checkpoint.
>
> > > > 'checkpoint or restartpoint?'
> > >
> > > Do you actually need to store that?  Can't it be inferred from
> > > pg_is_in_recovery()?
> >
> > AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
> > checkpoint or restartpoint because if the system exits from recovery
> > mode during restartpoint then any query to pg_stat_progress_checkpoint
> > view will return it as a checkpoint which is ideally not correct. Please
> > correct me if I am wrong.
>
> Recovery ends with an end-of-recovery checkpoint that has to finish before the
> promotion can happen, so I don't think that a restart can still be in progress
> if pg_is_in_recovery() returns false.



Hi,

On Fri, Feb 18, 2022 at 12:20:26PM +0530, Nitin Jadhav wrote:
> >
> > If there's a checkpoint timed triggered and then someone calls
> > pg_start_backup() which then wait for the end of the current checkpoint
> > (possibly after changing the flags), I think the view should reflect that in
> > some way.  Maybe storing an array of (pid, flags) is too much, but at least a
> > counter with the number of processes actively waiting for the end of the
> > checkpoint.
> 
> Okay. I feel this can be added as additional field but it will not
> replace backend_pid field as this represents the pid of the backend
> which triggered the current checkpoint.

I don't think that's true.  Requesting a checkpoint means telling the
checkpointer that it should wake up and start a checkpoint (or restore point)
if it's not already doing so, so the pid will always be the checkpointer pid.
The only exception is a standalone backend, but in that case you won't be able
to query that view anyway.

And also while looking at the patch I see there's the same problem that I
mentioned in the previous thread, which is that the effective flags can be
updated once the checkpoint started, and as-is the view won't reflect that.  It
also means that you can't simply display one of wal, time or force but a
possible combination of the flags (including the one not handled in v1).

> Probably a new field named 'processes_wiating' or 'events_waiting' can be
> added for this purpose.

Maybe num_process_waiting?

> > > > > 'checkpoint or restartpoint?'
> > > >
> > > > Do you actually need to store that?  Can't it be inferred from
> > > > pg_is_in_recovery()?
> > >
> > > AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
> > > checkpoint or restartpoint because if the system exits from recovery
> > > mode during restartpoint then any query to pg_stat_progress_checkpoint
> > > view will return it as a checkpoint which is ideally not correct. Please
> > > correct me if I am wrong.
> >
> > Recovery ends with an end-of-recovery checkpoint that has to finish before the
> > promotion can happen, so I don't think that a restart can still be in progress
> > if pg_is_in_recovery() returns false.
> 
> Probably writing of buffers or syncing files may complete before
> pg_is_in_recovery() returns false. But there are some cleanup
> operations happen as part of the checkpoint. During this scenario, we
> may get false value for pg_is_in_recovery(). Please refer following
> piece of code which is present in CreateRestartpoint().
> 
> if (!RecoveryInProgress())
>         replayTLI = XLogCtl->InsertTimeLineID;

Then maybe we could store the timeline rather than the kind of checkpoint?
You should still be able to compute the information while giving a bit more
information for the same memory usage.



> > Okay. I feel this can be added as additional field but it will not
> > replace backend_pid field as this represents the pid of the backend
> > which triggered the current checkpoint.
>
> I don't think that's true.  Requesting a checkpoint means telling the
> checkpointer that it should wake up and start a checkpoint (or restore point)
> if it's not already doing so, so the pid will always be the checkpointer pid.
> The only exception is a standalone backend, but in that case you won't be able
> to query that view anyway.

Yes, I agree that the checkpoint will always be performed by the
checkpointer process, so the pid in the pg_stat_progress_checkpoint
view will always correspond to the checkpointer's pid. However,
checkpoints get triggered in many scenarios, and one of them is the
CHECKPOINT command issued explicitly by a backend. In that scenario I
would like to know the pid of the backend which triggered the
checkpoint, hence the proposed backend_pid field. The
pg_stat_progress_checkpoint view would then contain both a pid field
and a backend_pid field. The backend_pid would contain a valid value
only for a CHECKPOINT command issued explicitly by a backend;
otherwise the value would be 0. We may have to add an additional field
to 'CheckpointerShmemStruct' to hold the backend pid, and the backend
requesting the checkpoint would write its pid into that structure.
Kindly let me know if you still feel the backend_pid field is not
necessary.


> And also while looking at the patch I see there's the same problem that I
> mentioned in the previous thread, which is that the effective flags can be
> updated once the checkpoint started, and as-is the view won't reflect that.  It
> also means that you can't simply display one of wal, time or force but a
> possible combination of the flags (including the one not handled in v1).

If I understand the above comment properly, it has two points. The
first is to display the combination of flags rather than just wal,
time or force. The idea behind the current behaviour is simply to let
the user know the reason for the checkpoint: it was started because
max_wal_size was reached, checkpoint_timeout expired, or a CHECKPOINT
command was issued explicitly. The other flags like
CHECKPOINT_IMMEDIATE, CHECKPOINT_WAIT or CHECKPOINT_FLUSH_ALL indicate
how the checkpoint has to be performed, hence I have not included
those in the view. If it is really required, I would like to modify
the code to include the other flags and display the combination. The
second point is to reflect updated flags in the view. AFAIK, the flags
can get updated during an ongoing checkpoint, but the reason for the
checkpoint (wal, time or force) will remain the same for the current
checkpoint. What may change is how the checkpoint has to be performed,
for instance if the CHECKPOINT_IMMEDIATE flag is set later. If we go
with displaying the combination of flags in the view, then we probably
have to reflect such updates in the view as well.

> > Probably a new field named 'processes_wiating' or 'events_waiting' can be
> > added for this purpose.
>
> Maybe num_process_waiting?

I feel 'processes_wiating' aligns more with the naming conventions of
the fields of the existing progress views.

> > Probably writing of buffers or syncing files may complete before
> > pg_is_in_recovery() returns false. But there are some cleanup
> > operations happen as part of the checkpoint. During this scenario, we
> > may get false value for pg_is_in_recovery(). Please refer following
> > piece of code which is present in CreateRestartpoint().
> >
> > if (!RecoveryInProgress())
> >         replayTLI = XLogCtl->InsertTimeLineID;
>
> Then maybe we could store the timeline rather then then kind of checkpoint?
> You should still be able to compute the information while giving a bit more
> information for the same memory usage.

Can you please describe in more detail how a checkpoint/restartpoint
can be identified using the timeline id?

Thanks & Regards,
Nitin Jadhav

On Fri, Feb 18, 2022 at 1:13 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> On Fri, Feb 18, 2022 at 12:20:26PM +0530, Nitin Jadhav wrote:
> > >
> > > If there's a checkpoint timed triggered and then someone calls
> > > pg_start_backup() which then wait for the end of the current checkpoint
> > > (possibly after changing the flags), I think the view should reflect that in
> > > some way.  Maybe storing an array of (pid, flags) is too much, but at least a
> > > counter with the number of processes actively waiting for the end of the
> > > checkpoint.
> >
> > Okay. I feel this can be added as additional field but it will not
> > replace backend_pid field as this represents the pid of the backend
> > which triggered the current checkpoint.
>
> I don't think that's true.  Requesting a checkpoint means telling the
> checkpointer that it should wake up and start a checkpoint (or restore point)
> if it's not already doing so, so the pid will always be the checkpointer pid.
> The only exception is a standalone backend, but in that case you won't be able
> to query that view anyway.
>
> And also while looking at the patch I see there's the same problem that I
> mentioned in the previous thread, which is that the effective flags can be
> updated once the checkpoint started, and as-is the view won't reflect that.  It
> also means that you can't simply display one of wal, time or force but a
> possible combination of the flags (including the one not handled in v1).
>
> > Probably a new field named 'processes_wiating' or 'events_waiting' can be
> > added for this purpose.
>
> Maybe num_process_waiting?
>
> > > > > > 'checkpoint or restartpoint?'
> > > > >
> > > > > Do you actually need to store that?  Can't it be inferred from
> > > > > pg_is_in_recovery()?
> > > >
> > > > AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
> > > > checkpoint or restartpoint because if the system exits from recovery
> > > > mode during restartpoint then any query to pg_stat_progress_checkpoint
> > > > view will return it as a checkpoint which is ideally not correct. Please
> > > > correct me if I am wrong.
> > >
> > > Recovery ends with an end-of-recovery checkpoint that has to finish before the
> > > promotion can happen, so I don't think that a restart can still be in progress
> > > if pg_is_in_recovery() returns false.
> >
> > Probably writing of buffers or syncing files may complete before
> > pg_is_in_recovery() returns false. But there are some cleanup
> > operations happen as part of the checkpoint. During this scenario, we
> > may get false value for pg_is_in_recovery(). Please refer following
> > piece of code which is present in CreateRestartpoint().
> >
> > if (!RecoveryInProgress())
> >         replayTLI = XLogCtl->InsertTimeLineID;
>
> Then maybe we could store the timeline rather then then kind of checkpoint?
> You should still be able to compute the information while giving a bit more
> information for the same memory usage.



Hi,

On Fri, Feb 18, 2022 at 08:07:05PM +0530, Nitin Jadhav wrote:
> 
> The backend_pid contains a valid value only during
> the CHECKPOINT command issued by the backend explicitly, otherwise the
> value will be 0.  We may have to add an additional field to
> 'CheckpointerShmemStruct' to hold the backend pid. The backend
> requesting the checkpoint will update its pid to this structure.
> Kindly let me know if you still feel the backend_pid field is not
> necessary.

There are more scenarios where you can have a backend requesting a checkpoint
and waiting for its completion, and there may be more than one backend
concerned, so I don't think that storing only one / the first backend pid is
ok.

> > And also while looking at the patch I see there's the same problem that I
> > mentioned in the previous thread, which is that the effective flags can be
> > updated once the checkpoint started, and as-is the view won't reflect that.  It
> > also means that you can't simply display one of wal, time or force but a
> > possible combination of the flags (including the one not handled in v1).
> 
> If I understand the above comment properly, it has 2 points. First is
> to display the combination of flags rather than just displaying wal,
> time or force - The idea behind this is to just let the user know the
> reason for checkpointing. That is, the checkpoint is started because
> max_wal_size is reached or checkpoint_timeout expired or explicitly
> issued CHECKPOINT command. The other flags like CHECKPOINT_IMMEDIATE,
> CHECKPOINT_WAIT or CHECKPOINT_FLUSH_ALL indicate how the checkpoint
> has to be performed. Hence I have not included those in the view.  If
> it is really required, I would like to modify the code to include
> other flags and display the combination.

I think all the information should be exposed.  Only knowing why the current
checkpoint has been triggered without any further information seems a bit
useless.  Think for instance for cases like [1].

> Second point is to reflect
> the updated flags in the view. AFAIK, there is a possibility that the
> flags get updated during the on-going checkpoint but the reason for
> checkpoint (wal, time or force) will remain same for the current
> checkpoint. There might be a change in how checkpoint has to be
> performed if CHECKPOINT_IMMEDIATE flag is set. If we go with
> displaying the combination of flags in the view, then probably we may
> have to reflect this in the view.

You can only "upgrade" a checkpoint, but not "downgrade" it.  So if for
instance you find both CHECKPOINT_CAUSE_TIME and CHECKPOINT_FORCE (which is
possible) you can easily know which one was the one that triggered the
checkpoint and which one was added later.

> > > Probably a new field named 'processes_wiating' or 'events_waiting' can be
> > > added for this purpose.
> >
> > Maybe num_process_waiting?
> 
> I feel 'processes_wiating' aligns more with the naming conventions of
> the fields of the existing progres views.

There's at least pg_stat_progress_vacuum.num_dead_tuples.  Anyway I don't have
a strong opinion on it, just make sure to correct the typo.

> > > Probably writing of buffers or syncing files may complete before
> > > pg_is_in_recovery() returns false. But there are some cleanup
> > > operations happen as part of the checkpoint. During this scenario, we
> > > may get false value for pg_is_in_recovery(). Please refer following
> > > piece of code which is present in CreateRestartpoint().
> > >
> > > if (!RecoveryInProgress())
> > >         replayTLI = XLogCtl->InsertTimeLineID;
> >
> > Then maybe we could store the timeline rather then then kind of checkpoint?
> > You should still be able to compute the information while giving a bit more
> > information for the same memory usage.
> 
> Can you please describe more about how checkpoint/restartpoint can be
> confirmed using the timeline id.

If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
restartpoint if the checkpoint's timeline is different from the current
timeline?

[1] https://www.postgresql.org/message-id/1486805889.24568.96.camel%40credativ.de



> > Thank you for sharing the information.  'triggering backend PID' (int)
> > - can be stored without any problem. 'checkpoint or restartpoint?'
> > (boolean) - can be stored as a integer value like
> > PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
> > PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
> > start time in stat_progress, timestamp fits in 64 bits) - As
> > Timestamptz is of type int64 internally, so we can store the timestamp
> > value in the progres parameter and then expose a function like
> > 'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
> > Timestamptz) as argument and then returns string representing the
> > elapsed time.
>
> No need to use a string there; I think exposing the checkpoint start
> time is good enough. The conversion of int64 to timestamp[tz] can be
> done in SQL (although I'm not sure that exposing the internal bitwise
> representation of Interval should be exposed to that extent) [0].
> Users can then extract the duration interval using now() - start_time,
> which also allows the user to use their own preferred formatting.

The reason for showing the elapsed time rather than exposing the
timestamp directly is that, for checkpoints during shutdown and
end-of-recovery, I am planning to log a message in the server logs
using the 'log_startup_progress_interval' infrastructure, which
displays elapsed time. So, to keep both behaviours consistent, I am
displaying elapsed time here as well. I also feel that elapsed time
gives a quicker sense of the progress. Kindly let me know if you still
feel that just exposing the timestamp is better than showing the
elapsed time.

> >  'checkpoint start location' (lsn = uint64) - I feel we
> > cannot use progress parameters for this case. As assigning uint64 to
> > int64 type would be an issue for larger values and can lead to hidden
> > bugs.
>
> Not necessarily - we can (without much trouble) do a bitwise cast from
> uint64 to int64, and then (in SQL) cast it back to a pg_lsn [1]. Not
> very elegant, but it works quite well.
>
> [1] SELECT '0/0'::pg_lsn + ((CASE WHEN stat.my_int64 < 0 THEN
> pow(2::numeric, 64::numeric)::numeric ELSE 0::numeric END) +
> stat.my_int64::numeric) FROM (SELECT -2::bigint /* 0xFFFFFFFF/FFFFFFFE
> */ AS my_bigint_lsn) AS stat(my_int64);

Thanks for sharing. It works. I will include this in the next patch.
On Sat, Feb 19, 2022 at 11:02 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> On Fri, Feb 18, 2022 at 08:07:05PM +0530, Nitin Jadhav wrote:
> >
> > The backend_pid contains a valid value only during
> > the CHECKPOINT command issued by the backend explicitly, otherwise the
> > value will be 0.  We may have to add an additional field to
> > 'CheckpointerShmemStruct' to hold the backend pid. The backend
> > requesting the checkpoint will update its pid to this structure.
> > Kindly let me know if you still feel the backend_pid field is not
> > necessary.
>
> There are more scenarios where you can have a baackend requesting a checkpoint
> and waiting for its completion, and there may be more than one backend
> concerned, so I don't think that storing only one / the first backend pid is
> ok.
>
> > > And also while looking at the patch I see there's the same problem that I
> > > mentioned in the previous thread, which is that the effective flags can be
> > > updated once the checkpoint started, and as-is the view won't reflect that.  It
> > > also means that you can't simply display one of wal, time or force but a
> > > possible combination of the flags (including the one not handled in v1).
> >
> > If I understand the above comment properly, it has 2 points. First is
> > to display the combination of flags rather than just displaying wal,
> > time or force - The idea behind this is to just let the user know the
> > reason for checkpointing. That is, the checkpoint is started because
> > max_wal_size is reached or checkpoint_timeout expired or explicitly
> > issued CHECKPOINT command. The other flags like CHECKPOINT_IMMEDIATE,
> > CHECKPOINT_WAIT or CHECKPOINT_FLUSH_ALL indicate how the checkpoint
> > has to be performed. Hence I have not included those in the view.  If
> > it is really required, I would like to modify the code to include
> > other flags and display the combination.
>
> I think all the information should be exposed.  Only knowing why the current
> checkpoint has been triggered without any further information seems a bit
> useless.  Think for instance for cases like [1].
>
> > Second point is to reflect
> > the updated flags in the view. AFAIK, there is a possibility that the
> > flags get updated during the on-going checkpoint but the reason for
> > checkpoint (wal, time or force) will remain same for the current
> > checkpoint. There might be a change in how checkpoint has to be
> > performed if CHECKPOINT_IMMEDIATE flag is set. If we go with
> > displaying the combination of flags in the view, then probably we may
> > have to reflect this in the view.
>
> You can only "upgrade" a checkpoint, but not "downgrade" it.  So if for
> instance you find both CHECKPOINT_CAUSE_TIME and CHECKPOINT_FORCE (which is
> possible) you can easily know which one was the one that triggered the
> checkpoint and which one was added later.
>
> > > > Probably a new field named 'processes_wiating' or 'events_waiting' can be
> > > > added for this purpose.
> > >
> > > Maybe num_process_waiting?
> >
> > I feel 'processes_wiating' aligns more with the naming conventions of
> > the fields of the existing progres views.
>
> There's at least pg_stat_progress_vacuum.num_dead_tuples.  Anyway I don't have
> a strong opinion on it, just make sure to correct the typo.
>
> > > > Probably writing of buffers or syncing files may complete before
> > > > pg_is_in_recovery() returns false. But there are some cleanup
> > > > operations happen as part of the checkpoint. During this scenario, we
> > > > may get false value for pg_is_in_recovery(). Please refer following
> > > > piece of code which is present in CreateRestartpoint().
> > > >
> > > > if (!RecoveryInProgress())
> > > >         replayTLI = XLogCtl->InsertTimeLineID;
> > >
> > > Then maybe we could store the timeline rather then then kind of checkpoint?
> > > You should still be able to compute the information while giving a bit more
> > > information for the same memory usage.
> >
> > Can you please describe more about how checkpoint/restartpoint can be
> > confirmed using the timeline id.
>
> If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
> restartpoint if the checkpoint's timeline is different from the current
> timeline?
>
> [1] https://www.postgresql.org/message-id/1486805889.24568.96.camel%40credativ.de



+/* Kinds of checkpoint (as advertised via PROGRESS_CHECKPOINT_KIND) */
+#define PROGRESS_CHECKPOINT_KIND_WAL                0
+#define PROGRESS_CHECKPOINT_KIND_TIME               1
+#define PROGRESS_CHECKPOINT_KIND_FORCE              2
+#define PROGRESS_CHECKPOINT_KIND_UNKNOWN            3

On what basis have you classified the above into the various types of
checkpoints? AFAIK, the first two types are based on what triggered
the checkpoint (whether it was the checkpoint_timeout or max_wal_size
setting), while the third type indicates the forced checkpoint that
can happen when a checkpoint is triggered for various other reasons,
e.g. during createdb or dropdb. It is quite possible that both the
PROGRESS_CHECKPOINT_KIND_TIME and PROGRESS_CHECKPOINT_KIND_FORCE flags
are set for the checkpoint because multiple checkpoint requests are
processed in one go, so what type of checkpoint would that be?

+     */
+    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+    {
+        pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+        checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+                                         PROGRESS_CHECKPOINT_PHASE_INIT);
+        if (flags & CHECKPOINT_CAUSE_XLOG)
+            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                             PROGRESS_CHECKPOINT_KIND_WAL);
+        else if (flags & CHECKPOINT_CAUSE_TIME)
+            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                             PROGRESS_CHECKPOINT_KIND_TIME);
+        else if (flags & CHECKPOINT_FORCE)
+            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                             PROGRESS_CHECKPOINT_KIND_FORCE);
+        else
+            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                             PROGRESS_CHECKPOINT_KIND_UNKNOWN);
+    }
+}

--
With Regards,
Ashutosh Sharma.

On Thu, Feb 10, 2022 at 12:23 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just emitting the stats at the end.
> > >
> > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep it is running, whether it is on target for completion, checkpoint_Reason
> > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some refactoring here.
> >
> > I agree to provide above mentioned information as part of showing the
> > progress of current checkpoint operation. I am currently looking into
> > the code to know if any other information can be added.
>
> Here is the initial patch to show the progress of checkpoint through
> pg_stat_progress_checkpoint view. Please find the attachment.
>
> The information added to this view are pid - process ID of a
> CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> for checkpoint (values can be wal, time or force), phase - indicates
> the current phase of checkpoint operation, total_buffer_writes - total
> number of buffers to be written, buffers_processed - number of buffers
> processed, buffers_written - number of buffers written,
> total_file_syncs - total number of files to be synced, files_synced -
> number of files synced.
>
> There are many operations happen as part of checkpoint. For each of
> the operation I am updating the phase field of
> pg_stat_progress_checkpoint view. The values supported for this field
> are initializing, checkpointing replication slots, checkpointing
> snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> buffers, performing sync requests, performing two phase checkpoint,
> recycling old XLOG files and Finalizing. In case of checkpointing
> buffers phase, the fields total_buffer_writes, buffers_processed and
> buffers_written shows the detailed progress of writing buffers. In
> case of performing sync requests phase, the fields total_file_syncs
> and files_synced shows the detailed progress of syncing files. In
> other phases, only the phase field is getting updated and it is
> difficult to show the progress because we do not get the total number
> of files count without traversing the directory. It is not worth to
> calculate that as it affects the performance of the checkpoint. I also
> gave a thought to just mention the number of files processed, but this
> wont give a meaningful progress information (It can be treated as
> statistics). Hence just updating the phase field in those scenarios.
>
> Apart from above fields, I am planning to add few more fields to the
> view in the next patch. That is, process ID of the backend process
> which triggered a CHECKPOINT command, checkpoint start location, filed
> to indicate whether it is a checkpoint or restartpoint and elapsed
> time of the checkpoint operation. Please share your thoughts. I would
> be happy to add any other information that contributes to showing the
> progress of checkpoint.
>
> As per the discussion in this thread, there should be some mechanism
> to show the progress of checkpoint during shutdown and end-of-recovery
> cases as we cannot access pg_stat_progress_checkpoint in those cases.
> I am working on this to use log_startup_progress_interval mechanism to
> log the progress in the server logs.
>
> Kindly review the patch and share your thoughts.
>
>
> On Fri, Jan 28, 2022 at 12:24 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Fri, Jan 21, 2022 at 11:07 AM Nitin Jadhav
> > <nitinjadhavpostgres@gmail.com> wrote:
> > >
> > > > I think the right choice to solve the *general* problem is the
> > > > mentioned pg_stat_progress_checkpoints.
> > > >
> > > > We may want to *additionally* have the ability to log the progress
> > > > specifically for the special cases when we're not able to use that
> > > > view. And in those case, we can perhaps just use the existing
> > > > log_startup_progress_interval parameter for this as well -- at least
> > > > for the startup checkpoint.
> > >
> > > +1
> > >
> > > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just emitting the stats at the end.
> > > >
> > > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep it is running, whether it is on target for completion, checkpoint_Reason
> > > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some refactoring here.
> > >
> > > I agree to provide above mentioned information as part of showing the
> > > progress of current checkpoint operation. I am currently looking into
> > > the code to know if any other information can be added.
> >
> > As suggested in the other thread by Julien, I'm changing the subject
> > of this thread to reflect the discussion.
> >
> > Regards,
> > Bharath Rupireddy.



Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From
Matthias van de Meent
Date:
On Tue, 22 Feb 2022 at 07:39, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > - can be stored without any problem. 'checkpoint or restartpoint?'
> > > (boolean) - can be stored as a integer value like
> > > PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
> > > PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
> > > start time in stat_progress, timestamp fits in 64 bits) - As
> > > Timestamptz is of type int64 internally, so we can store the timestamp
> > > value in the progres parameter and then expose a function like
> > > 'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
> > > Timestamptz) as argument and then returns string representing the
> > > elapsed time.
> >
> > No need to use a string there; I think exposing the checkpoint start
> > time is good enough. The conversion of int64 to timestamp[tz] can be
> > done in SQL (although I'm not sure that exposing the internal bitwise
> > representation of Interval should be exposed to that extent) [0].
> > Users can then extract the duration interval using now() - start_time,
> > which also allows the user to use their own preferred formatting.
>
> The reason for showing the elapsed time rather than exposing the
> timestamp directly is in case of checkpoint during shutdown and
> end-of-recovery, I am planning to log a message in server logs using
> 'log_startup_progress_interval' infrastructure which displays elapsed
> time. So just to match both of the behaviour I am displaying elapsed
> time here. I feel that elapsed time gives a quicker feel of the
> progress. Kindly let me know if you still feel just exposing the
> timestamp is better than showing the elapsed time.

At least for pg_stat_progress_checkpoint, storing only a timestamp in
the pg_stat storage (instead of repeatedly updating the field as a
duration) seems to provide much more precise measures of 'time
elapsed' for other sessions if one step of the checkpoint is taking a
long time.

I understand the want to integrate the log-based reporting in the same
API, but I don't think that is necessarily the right approach:
pg_stat_progress_* has low-overhead infrastructure specifically to
ensure that most tasks will not run much slower while reporting, never
waiting for locks. Logging, however, needs to take locks (if only to
prevent concurrent writes to the output file at a kernel level) and
thus has a not insignificant overhead and thus is not very useful for
precise and very frequent statistics updates.

So, although similar in nature, I don't think it is smart to use the
exact same infrastructure between pgstat_progress*-based reporting and
log-based progress reporting, especially if your logging-based
progress reporting is not intended to be a debugging-only
configuration option similar to log_min_messages=DEBUG[1..5].

- Matthias



> I will make use of pgstat_progress_update_multi_param() in the next
> patch to replace multiple calls to checkpoint_progress_update_param().

Fixed.
---

> > The other progress tables use [type]_total as column names for counter
> > targets (e.g. backup_total for backup_streamed, heap_blks_total for
> > heap_blks_scanned, etc.). I think that `buffers_total` and
> > `files_total` would be better column names.
>
> I agree and I will update this in the next patch.

Fixed.
---

> How about this "The checkpoint is started because max_wal_size is reached".
>
> "The checkpoint is started because checkpoint_timeout expired".
>
> "The checkpoint is started because some operation forced a checkpoint".

I have used the above description. Kindly let me know if any changes
are required.
---

> > > +      <entry><literal>checkpointing CommitTs pages</literal></entry>
> >
> > CommitTs -> Commit time stamp
>
> I will handle this in the next patch.

Fixed.
---

> There are more scenarios where you can have a baackend requesting a checkpoint
> and waiting for its completion, and there may be more than one backend
> concerned, so I don't think that storing only one / the first backend pid is
> ok.

Thanks for this information. I am no longer considering a backend_pid field.
---

> I think all the information should be exposed.  Only knowing why the current
> checkpoint has been triggered without any further information seems a bit
> useless.  Think for instance for cases like [1].

I have now supported all possible checkpoint kinds. I added
pg_stat_get_progress_checkpoint_kind() to convert the flags (int) into
a string representing the combination of flags, and I also check for
flag updates in ImmediateCheckpointRequested(), which tests whether
the CHECKPOINT_IMMEDIATE flag is set. I did not find any other case
where the flags change during the checkpoint in a way that alters the
current checkpoint's behaviour. Kindly let me know if I am missing
something.
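
As an illustration only, the core of that conversion could look something
like the sketch below (the SQL-callable wrapper and the exact labels are the
patch's; the helper name here is hypothetical, and it relies only on the
existing StringInfo API and the CHECKPOINT_* flag bits from access/xlog.h):

    /* hypothetical helper: append a label for each flag bit that is set */
    static void
    describe_checkpoint_flags(int flags, StringInfo buf)
    {
        if (flags & CHECKPOINT_IS_SHUTDOWN)
            appendStringInfoString(buf, "shutdown ");
        if (flags & CHECKPOINT_END_OF_RECOVERY)
            appendStringInfoString(buf, "end-of-recovery ");
        if (flags & CHECKPOINT_IMMEDIATE)
            appendStringInfoString(buf, "immediate ");
        if (flags & CHECKPOINT_FORCE)
            appendStringInfoString(buf, "force ");
        if (flags & CHECKPOINT_WAIT)
            appendStringInfoString(buf, "wait ");
        if (flags & CHECKPOINT_CAUSE_XLOG)
            appendStringInfoString(buf, "wal ");
        if (flags & CHECKPOINT_CAUSE_TIME)
            appendStringInfoString(buf, "time ");
        if (flags & CHECKPOINT_FLUSH_ALL)
            appendStringInfoString(buf, "flush-all ");
    }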
---

> > I feel 'processes_wiating' aligns more with the naming conventions of
> > the fields of the existing progres views.
>
> There's at least pg_stat_progress_vacuum.num_dead_tuples.  Anyway I don't have
> a strong opinion on it, just make sure to correct the typo.

More analysis is required to support this. I am planning to take care
of it in the next patch.
---

> If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
> restartpoint if the checkpoint's timeline is different from the current
> timeline?

Fixed.

Sharing the v2 patch. Kindly have a look and share your comments.

Thanks & Regards,
Nitin Jadhav




On Tue, Feb 22, 2022 at 12:08 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > - can be stored without any problem. 'checkpoint or restartpoint?'
> > > (boolean) - can be stored as a integer value like
> > > PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
> > > PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
> > > start time in stat_progress, timestamp fits in 64 bits) - As
> > > Timestamptz is of type int64 internally, so we can store the timestamp
> > > value in the progres parameter and then expose a function like
> > > 'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
> > > Timestamptz) as argument and then returns string representing the
> > > elapsed time.
> >
> > No need to use a string there; I think exposing the checkpoint start
> > time is good enough. The conversion of int64 to timestamp[tz] can be
> > done in SQL (although I'm not sure that exposing the internal bitwise
> > representation of Interval should be exposed to that extent) [0].
> > Users can then extract the duration interval using now() - start_time,
> > which also allows the user to use their own preferred formatting.
>
> The reason for showing the elapsed time rather than exposing the
> timestamp directly is in case of checkpoint during shutdown and
> end-of-recovery, I am planning to log a message in server logs using
> 'log_startup_progress_interval' infrastructure which displays elapsed
> time. So just to match both of the behaviour I am displaying elapsed
> time here. I feel that elapsed time gives a quicker feel of the
> progress. Kindly let me know if you still feel just exposing the
> timestamp is better than showing the elapsed time.
>
> > >  'checkpoint start location' (lsn = uint64) - I feel we
> > > cannot use progress parameters for this case. As assigning uint64 to
> > > int64 type would be an issue for larger values and can lead to hidden
> > > bugs.
> >
> > Not necessarily - we can (without much trouble) do a bitwise cast from
> > uint64 to int64, and then (in SQL) cast it back to a pg_lsn [1]. Not
> > very elegant, but it works quite well.
> >
> > [1] SELECT '0/0'::pg_lsn + ((CASE WHEN stat.my_int64 < 0 THEN
> > pow(2::numeric, 64::numeric)::numeric ELSE 0::numeric END) +
> > stat.my_int64::numeric) FROM (SELECT -2::bigint /* 0xFFFFFFFF/FFFFFFFE
> > */ AS my_bigint_lsn) AS stat(my_int64);
>
> Thanks for sharing. It works. I will include this in the next patch.
> On Sat, Feb 19, 2022 at 11:02 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > Hi,
> >
> > On Fri, Feb 18, 2022 at 08:07:05PM +0530, Nitin Jadhav wrote:
> > >
> > > The backend_pid contains a valid value only during
> > > the CHECKPOINT command issued by the backend explicitly, otherwise the
> > > value will be 0.  We may have to add an additional field to
> > > 'CheckpointerShmemStruct' to hold the backend pid. The backend
> > > requesting the checkpoint will update its pid to this structure.
> > > Kindly let me know if you still feel the backend_pid field is not
> > > necessary.
> >
> > There are more scenarios where you can have a baackend requesting a checkpoint
> > and waiting for its completion, and there may be more than one backend
> > concerned, so I don't think that storing only one / the first backend pid is
> > ok.
> >
> > > > And also while looking at the patch I see there's the same problem that I
> > > > mentioned in the previous thread, which is that the effective flags can be
> > > > updated once the checkpoint started, and as-is the view won't reflect that.  It
> > > > also means that you can't simply display one of wal, time or force but a
> > > > possible combination of the flags (including the one not handled in v1).
> > >
> > > If I understand the above comment properly, it has 2 points. First is
> > > to display the combination of flags rather than just displaying wal,
> > > time or force - The idea behind this is to just let the user know the
> > > reason for checkpointing. That is, the checkpoint is started because
> > > max_wal_size is reached or checkpoint_timeout expired or explicitly
> > > issued CHECKPOINT command. The other flags like CHECKPOINT_IMMEDIATE,
> > > CHECKPOINT_WAIT or CHECKPOINT_FLUSH_ALL indicate how the checkpoint
> > > has to be performed. Hence I have not included those in the view.  If
> > > it is really required, I would like to modify the code to include
> > > other flags and display the combination.
> >
> > I think all the information should be exposed.  Only knowing why the current
> > checkpoint has been triggered without any further information seems a bit
> > useless.  Think for instance for cases like [1].
> >
> > > Second point is to reflect
> > > the updated flags in the view. AFAIK, there is a possibility that the
> > > flags get updated during the on-going checkpoint but the reason for
> > > checkpoint (wal, time or force) will remain same for the current
> > > checkpoint. There might be a change in how checkpoint has to be
> > > performed if CHECKPOINT_IMMEDIATE flag is set. If we go with
> > > displaying the combination of flags in the view, then probably we may
> > > have to reflect this in the view.
> >
> > You can only "upgrade" a checkpoint, but not "downgrade" it.  So if for
> > instance you find both CHECKPOINT_CAUSE_TIME and CHECKPOINT_FORCE (which is
> > possible) you can easily know which one was the one that triggered the
> > checkpoint and which one was added later.
> >
> > > > > Probably a new field named 'processes_wiating' or 'events_waiting' can be
> > > > > added for this purpose.
> > > >
> > > > Maybe num_process_waiting?
> > >
> > > I feel 'processes_wiating' aligns more with the naming conventions of
> > > the fields of the existing progres views.
> >
> > There's at least pg_stat_progress_vacuum.num_dead_tuples.  Anyway I don't have
> > a strong opinion on it, just make sure to correct the typo.
> >
> > > > > Probably writing of buffers or syncing files may complete before
> > > > > pg_is_in_recovery() returns false. But there are some cleanup
> > > > > operations happen as part of the checkpoint. During this scenario, we
> > > > > may get false value for pg_is_in_recovery(). Please refer following
> > > > > piece of code which is present in CreateRestartpoint().
> > > > >
> > > > > if (!RecoveryInProgress())
> > > > >         replayTLI = XLogCtl->InsertTimeLineID;
> > > >
> > > > Then maybe we could store the timeline rather then then kind of checkpoint?
> > > > You should still be able to compute the information while giving a bit more
> > > > information for the same memory usage.
> > >
> > > Can you please describe more about how checkpoint/restartpoint can be
> > > confirmed using the timeline id.
> >
> > If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
> > restartpoint if the checkpoint's timeline is different from the current
> > timeline?
> >
> > [1] https://www.postgresql.org/message-id/1486805889.24568.96.camel%40credativ.de

Attachments
> On what basis have you classified the above into the various types of
> checkpoints? AFAIK, the first two types are based on what triggered
> the checkpoint (whether it was the checkpoint_timeout or maz_wal_size
> settings) while the third type indicates the force checkpoint that can
> happen when the checkpoint is triggered for various reasons e.g. .
> during createb or dropdb etc. This is quite possible that both the
> PROGRESS_CHECKPOINT_KIND_TIME and PROGRESS_CHECKPOINT_KIND_FORCE flags
> are set for the checkpoint because multiple checkpoint requests are
> processed at one go, so what type of checkpoint would that be?

My initial understanding was wrong. In the v2 patch I have supported
all values for the checkpoint kind and display a string in the
pg_stat_progress_checkpoint view that describes all the bits set in
the checkpoint flags.

On Tue, Feb 22, 2022 at 8:10 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> +/* Kinds of checkpoint (as advertised via PROGRESS_CHECKPOINT_KIND) */
> +#define PROGRESS_CHECKPOINT_KIND_WAL                0
> +#define PROGRESS_CHECKPOINT_KIND_TIME               1
> +#define PROGRESS_CHECKPOINT_KIND_FORCE              2
> +#define PROGRESS_CHECKPOINT_KIND_UNKNOWN            3
>
> On what basis have you classified the above into the various types of
> checkpoints? AFAIK, the first two types are based on what triggered
> the checkpoint (whether it was the checkpoint_timeout or maz_wal_size
> settings) while the third type indicates the force checkpoint that can
> happen when the checkpoint is triggered for various reasons e.g. .
> during createb or dropdb etc. This is quite possible that both the
> PROGRESS_CHECKPOINT_KIND_TIME and PROGRESS_CHECKPOINT_KIND_FORCE flags
> are set for the checkpoint because multiple checkpoint requests are
> processed at one go, so what type of checkpoint would that be?
>
> +     */
> +    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
> +    {
> +        pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> +        checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
> +                                         PROGRESS_CHECKPOINT_PHASE_INIT);
> +        if (flags & CHECKPOINT_CAUSE_XLOG)
> +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> +                                             PROGRESS_CHECKPOINT_KIND_WAL);
> +        else if (flags & CHECKPOINT_CAUSE_TIME)
> +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> +                                             PROGRESS_CHECKPOINT_KIND_TIME);
> +        else if (flags & CHECKPOINT_FORCE)
> +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> +                                             PROGRESS_CHECKPOINT_KIND_FORCE);
> +        else
> +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> +                                             PROGRESS_CHECKPOINT_KIND_UNKNOWN);
> +    }
> +}
>
> --
> With Regards,
> Ashutosh Sharma.
>
> On Thu, Feb 10, 2022 at 12:23 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just emitting the stats at the end.
> > > >
> > > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep it is running, whether it is on target for completion, checkpoint_Reason
> > > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some refactoring here.
> > >
> > > I agree to provide above mentioned information as part of showing the
> > > progress of current checkpoint operation. I am currently looking into
> > > the code to know if any other information can be added.
> >
> > Here is the initial patch to show the progress of checkpoint through
> > pg_stat_progress_checkpoint view. Please find the attachment.
> >
> > The information added to this view are pid - process ID of a
> > CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> > for checkpoint (values can be wal, time or force), phase - indicates
> > the current phase of checkpoint operation, total_buffer_writes - total
> > number of buffers to be written, buffers_processed - number of buffers
> > processed, buffers_written - number of buffers written,
> > total_file_syncs - total number of files to be synced, files_synced -
> > number of files synced.
> >
> > There are many operations that happen as part of a checkpoint. For
> > each of these operations I am updating the phase field of the
> > pg_stat_progress_checkpoint view. The values supported for this field
> > are initializing, checkpointing replication slots, checkpointing
> > snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> > pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> > checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> > buffers, performing sync requests, performing two phase checkpoint,
> > recycling old XLOG files and finalizing. In the checkpointing buffers
> > phase, the fields total_buffer_writes, buffers_processed and
> > buffers_written show the detailed progress of writing buffers. In the
> > performing sync requests phase, the fields total_file_syncs and
> > files_synced show the detailed progress of syncing files. In the other
> > phases, only the phase field is updated; it is difficult to show
> > detailed progress there because we cannot get the total file count
> > without traversing the directory, and it is not worth calculating that
> > as it affects the performance of the checkpoint. I also gave a thought
> > to just mentioning the number of files processed, but this won't give
> > meaningful progress information (it can be treated as statistics).
> > Hence I am just updating the phase field in those scenarios.
> >
> > Apart from above fields, I am planning to add few more fields to the
> > view in the next patch. That is, process ID of the backend process
> > which triggered a CHECKPOINT command, checkpoint start location, a field
> > to indicate whether it is a checkpoint or restartpoint and elapsed
> > time of the checkpoint operation. Please share your thoughts. I would
> > be happy to add any other information that contributes to showing the
> > progress of checkpoint.
> >
> > As per the discussion in this thread, there should be some mechanism
> > to show the progress of checkpoint during shutdown and end-of-recovery
> > cases as we cannot access pg_stat_progress_checkpoint in those cases.
> > I am working on this to use log_startup_progress_interval mechanism to
> > log the progress in the server logs.
> >
> > Kindly review the patch and share your thoughts.
> >
> >
> > On Fri, Jan 28, 2022 at 12:24 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > On Fri, Jan 21, 2022 at 11:07 AM Nitin Jadhav
> > > <nitinjadhavpostgres@gmail.com> wrote:
> > > >
> > > > > I think the right choice to solve the *general* problem is the
> > > > > mentioned pg_stat_progress_checkpoints.
> > > > >
> > > > > We may want to *additionally* have the ability to log the progress
> > > > > specifically for the special cases when we're not able to use that
> > > > > view. And in those case, we can perhaps just use the existing
> > > > > log_startup_progress_interval parameter for this as well -- at least
> > > > > for the startup checkpoint.
> > > >
> > > > +1
> > > >
> > > > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just emitting the stats at the end.
> > > > >
> > > > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep it is running, whether it is on target for completion, checkpoint_Reason
> > > > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some refactoring here.
> > > >
> > > > I agree to provide above mentioned information as part of showing the
> > > > progress of current checkpoint operation. I am currently looking into
> > > > the code to know if any other information can be added.
> > >
> > > As suggested in the other thread by Julien, I'm changing the subject
> > > of this thread to reflect the discussion.
> > >
> > > Regards,
> > > Bharath Rupireddy.



> At least for pg_stat_progress_checkpoint, storing only a timestamp in
> the pg_stat storage (instead of repeatedly updating the field as a
> duration) seems to provide much more precise measures of 'time
> elapsed' for other sessions if one step of the checkpoint is taking a
> long time.

I am storing the checkpoint start timestamp in the st_progress_param[]
and this gets set only once during the checkpoint (at the start of the
checkpoint). I have added function
pg_stat_get_progress_checkpoint_elapsed() which calculates the elapsed
time and returns a string. This function gets called whenever
pg_stat_progress_checkpoint view is queried. Kindly refer v2 patch and
share your thoughts.
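
(For illustration only: a minimal sketch of how the v2 design described above
would be consumed, assuming the view exposes the pre-formatted text as a
column named "elapsed"; that column name is illustrative, not taken from the
patch.)

-- v2 sketch: elapsed is text computed from the stored start timestamp at
-- query time, so clients cannot easily do arithmetic on it.
SELECT pid, phase, elapsed
FROM pg_stat_progress_checkpoint;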

> I understand the want to integrate the log-based reporting in the same
> API, but I don't think that is necessarily the right approach:
> pg_stat_progress_* has low-overhead infrastructure specifically to
> ensure that most tasks will not run much slower while reporting, never
> waiting for locks. Logging, however, needs to take locks (if only to
> prevent concurrent writes to the output file at a kernel level) and
> thus has a not insignificant overhead and thus is not very useful for
> precise and very frequent statistics updates.

I understand that the log based reporting is very costly and very
frequent updates are not advisable.  I am planning to use the existing
infrastructure of 'log_startup_progress_interval' which provides an
option for the user to configure the interval between each progress
update. Hence it avoids frequent updates to server logs. This approach
is used only during shutdown and end-of-recovery cases because we
cannot access pg_stat_progress_checkpoint view during those scenarios.
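
(log_startup_progress_interval is an existing GUC from the startup-progress
work; the sketch below only shows how a user would tune that interval today,
on the assumption that the proposal reuses the same setting to throttle the
shutdown and end-of-recovery checkpoint log messages.)

-- Existing GUC; its reuse for checkpoint progress logging is what is being
-- proposed in this thread, not current behaviour.
ALTER SYSTEM SET log_startup_progress_interval = '10s';
SELECT pg_reload_conf();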

> So, although similar in nature, I don't think it is smart to use the
> exact same infrastructure between pgstat_progress*-based reporting and
> log-based progress reporting, especially if your logging-based
> progress reporting is not intended to be a debugging-only
> configuration option similar to log_min_messages=DEBUG[1..5].

Yes. I agree that we cannot use the same infrastructure for both.
Progress views and server logs have different APIs to report the
progress information. But since both of these are required for the same
purpose, I am planning to use a common function, which improves code
readability compared to calling it separately in all the scenarios. I am
planning to include log based reporting in the next patch. Even after
that, if using the same function is not recommended, I am happy to
change.

Thanks & Regards,
Nitin Jadhav

On Wed, Feb 23, 2022 at 12:13 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Tue, 22 Feb 2022 at 07:39, Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > > - can be stored without any problem. 'checkpoint or restartpoint?'
> > > > (boolean) - can be stored as a integer value like
> > > > PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
> > > > PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
> > > > start time in stat_progress, timestamp fits in 64 bits) - As
> > > > Timestamptz is of type int64 internally, so we can store the timestamp
> > > > value in the progres parameter and then expose a function like
> > > > 'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
> > > > Timestamptz) as argument and then returns string representing the
> > > > elapsed time.
> > >
> > > No need to use a string there; I think exposing the checkpoint start
> > > time is good enough. The conversion of int64 to timestamp[tz] can be
> > > done in SQL (although I'm not sure that exposing the internal bitwise
> > > representation of Interval should be exposed to that extent) [0].
> > > Users can then extract the duration interval using now() - start_time,
> > > which also allows the user to use their own preferred formatting.
> >
> > The reason for showing the elapsed time rather than exposing the
> > timestamp directly is in case of checkpoint during shutdown and
> > end-of-recovery, I am planning to log a message in server logs using
> > 'log_startup_progress_interval' infrastructure which displays elapsed
> > time. So just to match both of the behaviour I am displaying elapsed
> > time here. I feel that elapsed time gives a quicker feel of the
> > progress. Kindly let me know if you still feel just exposing the
> > timestamp is better than showing the elapsed time.
>
> At least for pg_stat_progress_checkpoint, storing only a timestamp in
> the pg_stat storage (instead of repeatedly updating the field as a
> duration) seems to provide much more precise measures of 'time
> elapsed' for other sessions if one step of the checkpoint is taking a
> long time.
>
> I understand the want to integrate the log-based reporting in the same
> API, but I don't think that is necessarily the right approach:
> pg_stat_progress_* has low-overhead infrastructure specifically to
> ensure that most tasks will not run much slower while reporting, never
> waiting for locks. Logging, however, needs to take locks (if only to
> prevent concurrent writes to the output file at a kernel level) and
> thus has a not insignificant overhead and thus is not very useful for
> precise and very frequent statistics updates.
>
> So, although similar in nature, I don't think it is smart to use the
> exact same infrastructure between pgstat_progress*-based reporting and
> log-based progress reporting, especially if your logging-based
> progress reporting is not intended to be a debugging-only
> configuration option similar to log_min_messages=DEBUG[1..5].
>
> - Matthias



+       if ((ckpt_flags &
+                (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+       {

This code (present at multiple places) looks a little ugly to me; what
we can do instead is add a macro, probably named IsShutdownCheckpoint(),
which does the above check and use it in all the functions that have
this check. See below:

#define IsShutdownCheckpoint(flags) \
  (((flags) & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0)

And then you may use this macro like:

if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
    return;

This change can be done in all these functions:

+void
+checkpoint_progress_start(int flags)

--

+ */
+void
+checkpoint_progress_update_param(int index, int64 val)

--

+ * Stop reporting progress of the checkpoint.
+ */
+void
+checkpoint_progress_end(void)

==

+
pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+
+               val[0] = XLogCtl->InsertTimeLineID;
+               val[1] = flags;
+               val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
+               val[3] = CheckpointStats.ckpt_start_t;
+
+               pgstat_progress_update_multi_param(4, index, val);
+       }

Any specific reason for recording the timelineID in checkpoint stats
table? Will this ever change in our case?

--
With Regards,
Ashutosh Sharma.

On Wed, Feb 23, 2022 at 6:59 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > I will make use of pgstat_progress_update_multi_param() in the next
> > patch to replace multiple calls to checkpoint_progress_update_param().
>
> Fixed.
> ---
>
> > > The other progress tables use [type]_total as column names for counter
> > > targets (e.g. backup_total for backup_streamed, heap_blks_total for
> > > heap_blks_scanned, etc.). I think that `buffers_total` and
> > > `files_total` would be better column names.
> >
> > I agree and I will update this in the next patch.
>
> Fixed.
> ---
>
> > How about this "The checkpoint is started because max_wal_size is reached".
> >
> > "The checkpoint is started because checkpoint_timeout expired".
> >
> > "The checkpoint is started because some operation forced a checkpoint".
>
> I have used the above description. Kindly let me know if any changes
> are required.
> ---
>
> > > > +      <entry><literal>checkpointing CommitTs pages</literal></entry>
> > >
> > > CommitTs -> Commit time stamp
> >
> > I will handle this in the next patch.
>
> Fixed.
> ---
>
> > There are more scenarios where you can have a backend requesting a checkpoint
> > and waiting for its completion, and there may be more than one backend
> > concerned, so I don't think that storing only one / the first backend pid is
> > ok.
>
> Thanks for this information. I am not considering backend_pid.
> ---
>
> > I think all the information should be exposed.  Only knowing why the current
> > checkpoint has been triggered without any further information seems a bit
> > useless.  Think for instance for cases like [1].
>
> I have supported all possible checkpoint kinds. Added
> pg_stat_get_progress_checkpoint_kind() to convert the flags (int) to a
> string representing a combination of flags and also checking for the
> flag update in ImmediateCheckpointRequested() which checks whether
> CHECKPOINT_IMMEDIATE flag is set or not. I did not find any other
> cases where the flags get changed (which changes the current
> checkpoint behaviour) during the checkpoint. Kindly let me know if I
> am missing something.
> ---
>
> > > I feel 'processes_wiating' aligns more with the naming conventions of
> > > the fields of the existing progres views.
> >
> > There's at least pg_stat_progress_vacuum.num_dead_tuples.  Anyway I don't have
> > a strong opinion on it, just make sure to correct the typo.
>
> More analysis is required to support this. I am planning to take care
> in the next patch.
> ---
>
> > If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
> > restartpoint if the checkpoint's timeline is different from the current
> > timeline?
>
> Fixed.
>
> Sharing the v2 patch. Kindly have a look and share your comments.
>
> Thanks & Regards,
> Nitin Jadhav
>
>
>
>
> On Tue, Feb 22, 2022 at 12:08 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > > - can be stored without any problem. 'checkpoint or restartpoint?'
> > > > (boolean) - can be stored as a integer value like
> > > > PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
> > > > PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
> > > > start time in stat_progress, timestamp fits in 64 bits) - As
> > > > Timestamptz is of type int64 internally, so we can store the timestamp
> > > > value in the progres parameter and then expose a function like
> > > > 'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
> > > > Timestamptz) as argument and then returns string representing the
> > > > elapsed time.
> > >
> > > No need to use a string there; I think exposing the checkpoint start
> > > time is good enough. The conversion of int64 to timestamp[tz] can be
> > > done in SQL (although I'm not sure that exposing the internal bitwise
> > > representation of Interval should be exposed to that extent) [0].
> > > Users can then extract the duration interval using now() - start_time,
> > > which also allows the user to use their own preferred formatting.
> >
> > The reason for showing the elapsed time rather than exposing the
> > timestamp directly is in case of checkpoint during shutdown and
> > end-of-recovery, I am planning to log a message in server logs using
> > 'log_startup_progress_interval' infrastructure which displays elapsed
> > time. So just to match both of the behaviour I am displaying elapsed
> > time here. I feel that elapsed time gives a quicker feel of the
> > progress. Kindly let me know if you still feel just exposing the
> > timestamp is better than showing the elapsed time.
> >
> > > >  'checkpoint start location' (lsn = uint64) - I feel we
> > > > cannot use progress parameters for this case. As assigning uint64 to
> > > > int64 type would be an issue for larger values and can lead to hidden
> > > > bugs.
> > >
> > > Not necessarily - we can (without much trouble) do a bitwise cast from
> > > uint64 to int64, and then (in SQL) cast it back to a pg_lsn [1]. Not
> > > very elegant, but it works quite well.
> > >
> > > [1] SELECT '0/0'::pg_lsn + ((CASE WHEN stat.my_int64 < 0 THEN
> > > pow(2::numeric, 64::numeric)::numeric ELSE 0::numeric END) +
> > > stat.my_int64::numeric) FROM (SELECT -2::bigint /* 0xFFFFFFFF/FFFFFFFE
> > > */ AS my_bigint_lsn) AS stat(my_int64);
> >
> > Thanks for sharing. It works. I will include this in the next patch.
> > On Sat, Feb 19, 2022 at 11:02 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Fri, Feb 18, 2022 at 08:07:05PM +0530, Nitin Jadhav wrote:
> > > >
> > > > The backend_pid contains a valid value only during
> > > > the CHECKPOINT command issued by the backend explicitly, otherwise the
> > > > value will be 0.  We may have to add an additional field to
> > > > 'CheckpointerShmemStruct' to hold the backend pid. The backend
> > > > requesting the checkpoint will update its pid to this structure.
> > > > Kindly let me know if you still feel the backend_pid field is not
> > > > necessary.
> > >
> > > There are more scenarios where you can have a backend requesting a checkpoint
> > > and waiting for its completion, and there may be more than one backend
> > > concerned, so I don't think that storing only one / the first backend pid is
> > > ok.
> > >
> > > > > And also while looking at the patch I see there's the same problem that I
> > > > > mentioned in the previous thread, which is that the effective flags can be
> > > > > updated once the checkpoint started, and as-is the view won't reflect that.  It
> > > > > also means that you can't simply display one of wal, time or force but a
> > > > > possible combination of the flags (including the one not handled in v1).
> > > >
> > > > If I understand the above comment properly, it has 2 points. First is
> > > > to display the combination of flags rather than just displaying wal,
> > > > time or force - The idea behind this is to just let the user know the
> > > > reason for checkpointing. That is, the checkpoint is started because
> > > > max_wal_size is reached or checkpoint_timeout expired or explicitly
> > > > issued CHECKPOINT command. The other flags like CHECKPOINT_IMMEDIATE,
> > > > CHECKPOINT_WAIT or CHECKPOINT_FLUSH_ALL indicate how the checkpoint
> > > > has to be performed. Hence I have not included those in the view.  If
> > > > it is really required, I would like to modify the code to include
> > > > other flags and display the combination.
> > >
> > > I think all the information should be exposed.  Only knowing why the current
> > > checkpoint has been triggered without any further information seems a bit
> > > useless.  Think for instance for cases like [1].
> > >
> > > > Second point is to reflect
> > > > the updated flags in the view. AFAIK, there is a possibility that the
> > > > flags get updated during the on-going checkpoint but the reason for
> > > > checkpoint (wal, time or force) will remain same for the current
> > > > checkpoint. There might be a change in how checkpoint has to be
> > > > performed if CHECKPOINT_IMMEDIATE flag is set. If we go with
> > > > displaying the combination of flags in the view, then probably we may
> > > > have to reflect this in the view.
> > >
> > > You can only "upgrade" a checkpoint, but not "downgrade" it.  So if for
> > > instance you find both CHECKPOINT_CAUSE_TIME and CHECKPOINT_FORCE (which is
> > > possible) you can easily know which one was the one that triggered the
> > > checkpoint and which one was added later.
> > >
> > > > > > Probably a new field named 'processes_wiating' or 'events_waiting' can be
> > > > > > added for this purpose.
> > > > >
> > > > > Maybe num_process_waiting?
> > > >
> > > > I feel 'processes_wiating' aligns more with the naming conventions of
> > > > the fields of the existing progres views.
> > >
> > > There's at least pg_stat_progress_vacuum.num_dead_tuples.  Anyway I don't have
> > > a strong opinion on it, just make sure to correct the typo.
> > >
> > > > > > Probably writing of buffers or syncing files may complete before
> > > > > > pg_is_in_recovery() returns false. But there are some cleanup
> > > > > > operations happen as part of the checkpoint. During this scenario, we
> > > > > > may get false value for pg_is_in_recovery(). Please refer following
> > > > > > piece of code which is present in CreateRestartpoint().
> > > > > >
> > > > > > if (!RecoveryInProgress())
> > > > > >         replayTLI = XLogCtl->InsertTimeLineID;
> > > > >
> > > > > Then maybe we could store the timeline rather then then kind of checkpoint?
> > > > > You should still be able to compute the information while giving a bit more
> > > > > information for the same memory usage.
> > > >
> > > > Can you please describe more about how checkpoint/restartpoint can be
> > > > confirmed using the timeline id.
> > >
> > > If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
> > > restartpoint if the checkpoint's timeline is different from the current
> > > timeline?
> > >
> > > [1] https://www.postgresql.org/message-id/1486805889.24568.96.camel%40credativ.de



I think the change to ImmediateCheckpointRequested() makes no sense.
Before this patch, that function merely inquires whether there's an
immediate checkpoint queued.  After this patch, it ... changes a
progress-reporting flag?  I think it would make more sense to make the
progress-report flag change in whatever is the place that *requests* an
immediate checkpoint rather than here.

I think the use of capitals in CHECKPOINT and CHECKPOINTER in the
documentation is excessive.  (Same for terms such as MULTIXACT and
others in those docs; we typically use those in lowercase when
user-facing; and do we really use term CLOG anymore? Don't we call it
"commit log" nowadays?)

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)



+   Whenever the checkpoint operation is running, the
+   <structname>pg_stat_progress_checkpoint</structname> view will contain a
+   single row indicating the progress of the checkpoint. The tables below

Maybe it should show a single row, unless the checkpointer isn't running at
all (like in single user mode).

+       Process ID of a CHECKPOINTER process.

It's *the* checkpointer process.

pgstatfuncs.c has a whitespace issue (tab-space).

I suppose the functions should set provolatile.

-- 
Justin



Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

От
Matthias van de Meent
Дата:
On Wed, 23 Feb 2022 at 15:24, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > At least for pg_stat_progress_checkpoint, storing only a timestamp in
> > the pg_stat storage (instead of repeatedly updating the field as a
> > duration) seems to provide much more precise measures of 'time
> > elapsed' for other sessions if one step of the checkpoint is taking a
> > long time.
>
> I am storing the checkpoint start timestamp in the st_progress_param[]
> and this gets set only once during the checkpoint (at the start of the
> checkpoint). I have added function
> pg_stat_get_progress_checkpoint_elapsed() which calculates the elapsed
> time and returns a string. This function gets called whenever
> pg_stat_progress_checkpoint view is queried. Kindly refer v2 patch and
> share your thoughts.

I dislike the lack of access to the actual value of the checkpoint
start / checkpoint elapsed field.

As a user, if I query the pg_stat_progress_* views, my terminal or
application can easily interpret an `interval` value and cast it to
string, but the opposite is not true: the current implementation for
pg_stat_get_progress_checkpoint_elapsed loses precision. This is why
we use typed numeric fields in effectively all other places instead of
stringified versions of the values: oid fields, counters, etc are all
rendered as bigint in the view, so that no information is lost and
interpretation is trivial.
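
(As a concrete illustration of that point, a raw start-time column lets the
client do the arithmetic and formatting itself. This is only a sketch,
assuming the view ends up exposing the start timestamp as a column named
"start_time"; the column name is not final.)

-- With a typed timestamp column, precision is preserved and the client
-- chooses its own formatting.
SELECT pid, phase, now() - start_time AS elapsed
FROM pg_stat_progress_checkpoint;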

> > I understand the want to integrate the log-based reporting in the same
> > API, but I don't think that is necessarily the right approach:
> > pg_stat_progress_* has low-overhead infrastructure specifically to
> > ensure that most tasks will not run much slower while reporting, never
> > waiting for locks. Logging, however, needs to take locks (if only to
> > prevent concurrent writes to the output file at a kernel level) and
> > thus has a not insignificant overhead and thus is not very useful for
> > precise and very frequent statistics updates.
>
> I understand that the log based reporting is very costly and very
> frequent updates are not advisable.  I am planning to use the existing
> infrastructure of 'log_startup_progress_interval' which provides an
> option for the user to configure the interval between each progress
> update. Hence it avoids frequent updates to server logs. This approach
> is used only during shutdown and end-of-recovery cases because we
> cannot access pg_stat_progress_checkpoint view during those scenarios.

I see; but log_startup_progress_interval seems to be exclusively
consumed through the ereport_startup_progress macro. Why put
startup/shutdown logging on the same path as the happy flow of normal
checkpoints?

> > So, although similar in nature, I don't think it is smart to use the
> > exact same infrastructure between pgstat_progress*-based reporting and
> > log-based progress reporting, especially if your logging-based
> > progress reporting is not intended to be a debugging-only
> > configuration option similar to log_min_messages=DEBUG[1..5].
>
> Yes. I agree that we cannot use the same infrastructure for both.
> Progress views and servers logs have different APIs to report the
> progress information. But since both of this are required for the same
> purpose, I am planning to use a common function which increases the
> code readability than calling it separately in all the scenarios. I am
> planning to include log based reporting in the next patch. Even after
> that if using the same function is not recommended, I am happy to
> change.

I don't think that checkpoint_progress_update_param(int, uint64) fits
well with the construction of progress log messages, requiring
special-casing / matching the offset numbers to actual fields inside
that single function, which adds unnecessary overhead when compared
against normal and direct calls to the related infrastructure.

I think that, instead of looking to what might at some point be added,
it is better to use the currently available functions instead, and
move to new functions if and when the log-based reporting requires it.


- Matthias



Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

От
Matthias van de Meent
Дата:
On Wed, 23 Feb 2022 at 14:28, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> Sharing the v2 patch. Kindly have a look and share your comments.

Thanks for updating.

> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml

With the new pg_stat_progress_checkpoint, you should also add a
backreference to this progress reporting in the CHECKPOINT sql command
documentation located in checkpoint.sgml, and maybe in wal.sgml and/or
backup.sgml too. See e.g. cluster.sgml around line 195 for an example.

> diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> +ImmediateCheckpointRequested(int flags)
>      if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
> +    {
> +        updated_flags |= CHECKPOINT_IMMEDIATE;

I don't think that these changes are expected behaviour. Under this
condition, the currently running checkpoint is still not 'immediate',
but it is going to hurry up for a new, actually immediate checkpoint.
Those are different kinds of checkpoint handling, and I don't think
you should modify the reported flags to show that we're going to do
stuff faster than usual. Maybe maintain a separate 'upcoming
checkpoint flags' field instead?

> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> +        ( SELECT '0/0'::pg_lsn +
> +                 ((CASE
> +                     WHEN stat.lsn_int64 < 0 THEN pow(2::numeric, 64::numeric)::numeric
> +                     ELSE 0::numeric
> +                  END) +
> +                  stat.lsn_int64::numeric)
> +          FROM (SELECT s.param3::bigint) AS stat(lsn_int64)
> +        ) AS start_lsn,

My LSN select statement was an example that could be run directly in
psql, so that you didn't have to embed the SELECT into the view query.
The following should be sufficient (and save the planner a few cycles
otherwise spent in inlining):

+        ('0/0'::pg_lsn +
+                 ((CASE
+                     WHEN s.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+                     ELSE 0::numeric
+                  END) +
+                  s.param3::numeric)
+        ) AS start_lsn,


> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> +checkpoint_progress_start(int flags)
> [...]
> +checkpoint_progress_update_param(int index, int64 val)
> [...]
> +checkpoint_progress_end(void)
> +{
> +    /* In bootstrap mode, we don't actually record anything. */
> +    if (IsBootstrapProcessingMode())
> +        return;

Disabling pgstat progress reporting when in bootstrap processing mode
/ startup/end-of-recovery makes very little sense (see upthread) and
should be removed, regardless of whether seperate functions stay.

> diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
> +#define PROGRESS_CHECKPOINT_PHASE_INIT                          0

Generally, enum-like values in a stat_progress field are 1-indexed, to
differentiate between empty/uninitialized (0) and states that have
been set by the progress reporting infrastructure.



Kind regards,

Matthias van de Meent



> I think the change to ImmediateCheckpointRequested() makes no sense.
> Before this patch, that function merely inquires whether there's an
> immediate checkpoint queued.  After this patch, it ... changes a
> progress-reporting flag?  I think it would make more sense to make the
> progress-report flag change in whatever is the place that *requests* an
> immediate checkpoint rather than here.
>
> > diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> > +ImmediateCheckpointRequested(int flags)
> >      if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
> > +    {
> > +        updated_flags |= CHECKPOINT_IMMEDIATE;
>
> I don't think that these changes are expected behaviour. Under in this
> condition; the currently running checkpoint is still not 'immediate',
> but it is going to hurry up for a new, actually immediate checkpoint.
> Those are different kinds of checkpoint handling; and I don't think
> you should modify the reported flags to show that we're going to do
> stuff faster than usual. Maybe maintiain a seperate 'upcoming
> checkpoint flags' field instead?

Thank you Alvaro and Matthias for your views. I understand your point
about not updating the progress-report flag here, as this function just
checks whether CHECKPOINT_IMMEDIATE is set or not and takes an action
based on that, but doesn't change the checkpoint flags. I will
modify the code, but I am a bit confused here. As per Alvaro, we need
to make the progress-report flag change in whatever place
*requests* an immediate checkpoint. I feel this gives information
about the upcoming checkpoint, not the current one, so updating there
provides wrong details in the view. The flags available during
CreateCheckPoint() will remain the same for the entire checkpoint
operation, and we should show the same information in the view till it
completes. So just removing the above piece of code (modified in
ImmediateCheckpointRequested()) from the patch will make it correct. My
opinion about maintaining a separate field to show upcoming checkpoint
flags is that it makes the view complex. Please share your thoughts.

Thanks & Regards,

On Thu, Feb 24, 2022 at 10:45 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Wed, 23 Feb 2022 at 14:28, Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > Sharing the v2 patch. Kindly have a look and share your comments.
>
> Thanks for updating.
>
> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
>
> With the new pg_stat_progress_checkpoint, you should also add a
> backreference to this progress reporting in the CHECKPOINT sql command
> documentation located in checkpoint.sgml, and maybe in wal.sgml and/or
> backup.sgml too. See e.g. cluster.sgml around line 195 for an example.
>
> > diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> > +ImmediateCheckpointRequested(int flags)
> >      if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
> > +    {
> > +        updated_flags |= CHECKPOINT_IMMEDIATE;
>
> I don't think that these changes are expected behaviour. Under in this
> condition; the currently running checkpoint is still not 'immediate',
> but it is going to hurry up for a new, actually immediate checkpoint.
> Those are different kinds of checkpoint handling; and I don't think
> you should modify the reported flags to show that we're going to do
> stuff faster than usual. Maybe maintiain a seperate 'upcoming
> checkpoint flags' field instead?
>
> > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> > +        ( SELECT '0/0'::pg_lsn +
> > +                 ((CASE
> > +                     WHEN stat.lsn_int64 < 0 THEN pow(2::numeric, 64::numeric)::numeric
> > +                     ELSE 0::numeric
> > +                  END) +
> > +                  stat.lsn_int64::numeric)
> > +          FROM (SELECT s.param3::bigint) AS stat(lsn_int64)
> > +        ) AS start_lsn,
>
> My LSN select statement was an example that could be run directly in
> psql; the so you didn't have to embed the SELECT into the view query.
> The following should be sufficient (and save the planner a few cycles
> otherwise spent in inlining):
>
> +        ('0/0'::pg_lsn +
> +                 ((CASE
> +                     WHEN s.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
> +                     ELSE 0::numeric
> +                  END) +
> +                  s.param3::numeric)
> +        ) AS start_lsn,
>
>
> > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> > +checkpoint_progress_start(int flags)
> > [...]
> > +checkpoint_progress_update_param(int index, int64 val)
> > [...]
> > +checkpoint_progress_end(void)
> > +{
> > +    /* In bootstrap mode, we don't actually record anything. */
> > +    if (IsBootstrapProcessingMode())
> > +        return;
>
> Disabling pgstat progress reporting when in bootstrap processing mode
> / startup/end-of-recovery makes very little sense (see upthread) and
> should be removed, regardless of whether seperate functions stay.
>
> > diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
> > +#define PROGRESS_CHECKPOINT_PHASE_INIT                          0
>
> Generally, enum-like values in a stat_progress field are 1-indexed, to
> differentiate between empty/uninitialized (0) and states that have
> been set by the progress reporting infrastructure.
>
>
>
> Kind regards,
>
> Matthias van de Meent



Hi,

On Fri, Feb 25, 2022 at 12:23:27AM +0530, Nitin Jadhav wrote:
> > I think the change to ImmediateCheckpointRequested() makes no sense.
> > Before this patch, that function merely inquires whether there's an
> > immediate checkpoint queued.  After this patch, it ... changes a
> > progress-reporting flag?  I think it would make more sense to make the
> > progress-report flag change in whatever is the place that *requests* an
> > immediate checkpoint rather than here.
> >
> > > diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> > > +ImmediateCheckpointRequested(int flags)
> > >      if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
> > > +    {
> > > +        updated_flags |= CHECKPOINT_IMMEDIATE;
> >
> > I don't think that these changes are expected behaviour. Under in this
> > condition; the currently running checkpoint is still not 'immediate',
> > but it is going to hurry up for a new, actually immediate checkpoint.
> > Those are different kinds of checkpoint handling; and I don't think
> > you should modify the reported flags to show that we're going to do
> > stuff faster than usual. Maybe maintiain a seperate 'upcoming
> > checkpoint flags' field instead?
> 
> Thank you Alvaro and Matthias for your views. I understand your point
> of not updating the progress-report flag here as it just checks
> whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
> based on that but it doesn't change the checkpoint flags. I will
> modify the code but I am a bit confused here. As per Alvaro, we need
> to make the progress-report flag change in whatever is the place that
> *requests* an immediate checkpoint. I feel this gives information
> about the upcoming checkpoint not the current one. So updating here
> provides wrong details in the view. The flags available during
> CreateCheckPoint() will remain same for the entire checkpoint
> operation and we should show the same information in the view till it
> completes.

I'm not sure what Matthias meant, but as far as I know there's no fundamental
difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
and there's also no scheduling for multiple checkpoints.

Yes, the flags will remain the same but checkpoint.c will test both the passed
flags and the shmem flags to see whether a delay should be added or not, which
is the only difference in checkpoint processing for this flag.  See the call to
ImmediateCheckpointRequested() which will look at the value in shmem:

    /*
     * Perform the usual duties and take a nap, unless we're behind schedule,
     * in which case we just try to catch up as quickly as possible.
     */
    if (!(flags & CHECKPOINT_IMMEDIATE) &&
        !ShutdownRequestPending &&
        !ImmediateCheckpointRequested() &&
        IsCheckpointOnSchedule(progress))
[...]



> Thank you Alvaro and Matthias for your views. I understand your point
> of not updating the progress-report flag here as it just checks
> whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
> based on that but it doesn't change the checkpoint flags. I will
> modify the code but I am a bit confused here. As per Alvaro, we need
> to make the progress-report flag change in whatever is the place that
> *requests* an immediate checkpoint. I feel this gives information
> about the upcoming checkpoint not the current one. So updating here
> provides wrong details in the view. The flags available during
> CreateCheckPoint() will remain same for the entire checkpoint
> operation and we should show the same information in the view till it
> completes. So just removing the above piece of code (modified in
> ImmediateCheckpointRequested()) in the patch will make it correct. My
> opinion about maintaining a separate field to show upcoming checkpoint
> flags is it makes the view complex. Please share your thoughts.

I have modified the code accordingly.
---

> I think the use of capitals in CHECKPOINT and CHECKPOINTER in the
> documentation is excessive.

Fixed. Here the word CHECKPOINT refers to either the command or the
checkpoint operation. If we treat it as the checkpoint operation, I
agree to use lowercase, but if we treat it as the command, then I think
uppercase is recommended (refer to
https://www.postgresql.org/docs/14/sql-checkpoint.html). Is it ok to
always use lowercase here?
---

> (Same for terms such as MULTIXACT and
> others in those docs; we typically use those in lowercase when
> user-facing; and do we really use term CLOG anymore? Don't we call it
> "commit log" nowadays?)

I have observed the CLOG term in the existing documentation. Anyway, I
have changed MULTIXACT to multixact, SUBTRANS to subtransaction and
CLOG to commit log.
---

> +   Whenever the checkpoint operation is running, the
> +   <structname>pg_stat_progress_checkpoint</structname> view will contain a
> +   single row indicating the progress of the checkpoint. The tables below
>
> Maybe it should show a single row , unless the checkpointer isn't running at
> all (like in single user mode).

Nice thought. Can we add an additional checkpoint phase like 'Idle'?
'Idle' would be set whenever the checkpointer process is running and
there is no ongoing checkpoint. Thoughts?
---

> +       Process ID of a CHECKPOINTER process.
>
> It's *the* checkpointer process.

Fixed.
---

> pgstatfuncs.c has a whitespace issue (tab-space).

I have verified with 'git diff --check' and also manually. I did not
find any issue. Kindly mention the specific code which has an issue.
---

> I suppose the functions should set provolatile.

Fixed.
---

> > I am storing the checkpoint start timestamp in the st_progress_param[]
> > and this gets set only once during the checkpoint (at the start of the
> > checkpoint). I have added function
> > pg_stat_get_progress_checkpoint_elapsed() which calculates the elapsed
> > time and returns a string. This function gets called whenever
> > pg_stat_progress_checkpoint view is queried. Kindly refer v2 patch and
> > share your thoughts.
>
> I dislike the lack of access to the actual value of the checkpoint
> start / checkpoint elapsed field.
>
> As a user, if I query the pg_stat_progress_* views, my terminal or
> application can easily interpret an `interval` value and cast it to
> string, but the opposite is not true: the current implementation for
> pg_stat_get_progress_checkpoint_elapsed loses precision. This is why
> we use typed numeric fields in effectively all other places instead of
> stringified versions of the values: oid fields, counters, etc are all
> rendered as bigint in the view, so that no information is lost and
> interpretation is trivial.

I am now displaying the start time of the checkpoint instead.
---

> > I understand that the log based reporting is very costly and very
> > frequent updates are not advisable.  I am planning to use the existing
> > infrastructure of 'log_startup_progress_interval' which provides an
> > option for the user to configure the interval between each progress
> > update. Hence it avoids frequent updates to server logs. This approach
> > is used only during shutdown and end-of-recovery cases because we
> > cannot access pg_stat_progress_checkpoint view during those scenarios.
>
> I see; but log_startup_progress_interval seems to be exclusively
> consumed through the ereport_startup_progress macro. Why put
> startup/shutdown logging on the same path as the happy flow of normal
> checkpoints?

Do you mean to say that while updating the progress of the checkpoint,
we should call pgstat_progress_update_param() and then call
ereport_startup_progress()?

> I think that, instead of looking to what might at some point be added,
> it is better to use the currently available functions instead, and
> move to new functions if and when the log-based reporting requires it.

Makes sense. I am removing checkpoint_progress_update_param() and
checkpoint_progress_end(). I would like to concentrate on the
pg_stat_progress_checkpoint view for now and will consider log-based
reporting later.

> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
>
> With the new pg_stat_progress_checkpoint, you should also add a
> backreference to this progress reporting in the CHECKPOINT sql command
> documentation located in checkpoint.sgml, and maybe in wal.sgml and/or
> backup.sgml too. See e.g. cluster.sgml around line 195 for an example.

I have updated checkpoint.sgml and wal.sgml.

> > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> > +        ( SELECT '0/0'::pg_lsn +
> > +                 ((CASE
> > +                     WHEN stat.lsn_int64 < 0 THEN pow(2::numeric, 64::numeric)::numeric
> > +                     ELSE 0::numeric
> > +                  END) +
> > +                  stat.lsn_int64::numeric)
> > +          FROM (SELECT s.param3::bigint) AS stat(lsn_int64)
> > +        ) AS start_lsn,
>
> My LSN select statement was an example that could be run directly in
> psql; the so you didn't have to embed the SELECT into the view query.
> The following should be sufficient (and save the planner a few cycles
> otherwise spent in inlining):
>
> +        ('0/0'::pg_lsn +
> +                 ((CASE
> +                     WHEN s.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
> +                     ELSE 0::numeric
> +                  END) +
> +                  s.param3::numeric)
> +        ) AS start_lsn,

Thanks for the suggestion. Fixed.

> > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> > +checkpoint_progress_start(int flags)
> > [...]
> > +checkpoint_progress_update_param(int index, int64 val)
> > [...]
> > +checkpoint_progress_end(void)
> > +{
> > +    /* In bootstrap mode, we don't actually record anything. */
> > +    if (IsBootstrapProcessingMode())
> > +        return;
>
> Disabling pgstat progress reporting when in bootstrap processing mode
> / startup/end-of-recovery makes very little sense (see upthread) and
> should be removed, regardless of whether seperate functions stay.

Removed since log based reporting is not part of the current patch.

> > diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
> > +#define PROGRESS_CHECKPOINT_PHASE_INIT                          0
>
> Generally, enum-like values in a stat_progress field are 1-indexed, to
> differentiate between empty/uninitialized (0) and states that have
> been set by the progress reporting infrastructure.

Fixed.

Please find the v3 patch attached and share your thoughts.

Thanks & Regards,
Nitin Jadhav
On Fri, Feb 25, 2022 at 12:23 AM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > I think the change to ImmediateCheckpointRequested() makes no sense.
> > Before this patch, that function merely inquires whether there's an
> > immediate checkpoint queued.  After this patch, it ... changes a
> > progress-reporting flag?  I think it would make more sense to make the
> > progress-report flag change in whatever is the place that *requests* an
> > immediate checkpoint rather than here.
> >
> > > diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> > > +ImmediateCheckpointRequested(int flags)
> > >      if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
> > > +    {
> > > +        updated_flags |= CHECKPOINT_IMMEDIATE;
> >
> > I don't think that these changes are expected behaviour. Under in this
> > condition; the currently running checkpoint is still not 'immediate',
> > but it is going to hurry up for a new, actually immediate checkpoint.
> > Those are different kinds of checkpoint handling; and I don't think
> > you should modify the reported flags to show that we're going to do
> > stuff faster than usual. Maybe maintiain a seperate 'upcoming
> > checkpoint flags' field instead?
>
> Thank you Alvaro and Matthias for your views. I understand your point
> of not updating the progress-report flag here as it just checks
> whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
> based on that but it doesn't change the checkpoint flags. I will
> modify the code but I am a bit confused here. As per Alvaro, we need
> to make the progress-report flag change in whatever is the place that
> *requests* an immediate checkpoint. I feel this gives information
> about the upcoming checkpoint not the current one. So updating here
> provides wrong details in the view. The flags available during
> CreateCheckPoint() will remain same for the entire checkpoint
> operation and we should show the same information in the view till it
> completes. So just removing the above piece of code (modified in
> ImmediateCheckpointRequested()) in the patch will make it correct. My
> opinion about maintaining a separate field to show upcoming checkpoint
> flags is it makes the view complex. Please share your thoughts.
>
> Thanks & Regards,
>
> On Thu, Feb 24, 2022 at 10:45 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > On Wed, 23 Feb 2022 at 14:28, Nitin Jadhav
> > <nitinjadhavpostgres@gmail.com> wrote:
> > >
> > > Sharing the v2 patch. Kindly have a look and share your comments.
> >
> > Thanks for updating.
> >
> > > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> >
> > With the new pg_stat_progress_checkpoint, you should also add a
> > backreference to this progress reporting in the CHECKPOINT sql command
> > documentation located in checkpoint.sgml, and maybe in wal.sgml and/or
> > backup.sgml too. See e.g. cluster.sgml around line 195 for an example.
> >
> > > diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> > > +ImmediateCheckpointRequested(int flags)
> > >      if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
> > > +    {
> > > +        updated_flags |= CHECKPOINT_IMMEDIATE;
> >
> > I don't think that these changes are expected behaviour. Under in this
> > condition; the currently running checkpoint is still not 'immediate',
> > but it is going to hurry up for a new, actually immediate checkpoint.
> > Those are different kinds of checkpoint handling; and I don't think
> > you should modify the reported flags to show that we're going to do
> > stuff faster than usual. Maybe maintiain a seperate 'upcoming
> > checkpoint flags' field instead?
> >
> > > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> > > +        ( SELECT '0/0'::pg_lsn +
> > > +                 ((CASE
> > > +                     WHEN stat.lsn_int64 < 0 THEN pow(2::numeric, 64::numeric)::numeric
> > > +                     ELSE 0::numeric
> > > +                  END) +
> > > +                  stat.lsn_int64::numeric)
> > > +          FROM (SELECT s.param3::bigint) AS stat(lsn_int64)
> > > +        ) AS start_lsn,
> >
> > My LSN select statement was an example that could be run directly in
> > psql; the so you didn't have to embed the SELECT into the view query.
> > The following should be sufficient (and save the planner a few cycles
> > otherwise spent in inlining):
> >
> > +        ('0/0'::pg_lsn +
> > +                 ((CASE
> > +                     WHEN s.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
> > +                     ELSE 0::numeric
> > +                  END) +
> > +                  s.param3::numeric)
> > +        ) AS start_lsn,
> >
> >
> > > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> > > +checkpoint_progress_start(int flags)
> > > [...]
> > > +checkpoint_progress_update_param(int index, int64 val)
> > > [...]
> > > +checkpoint_progress_end(void)
> > > +{
> > > +    /* In bootstrap mode, we don't actually record anything. */
> > > +    if (IsBootstrapProcessingMode())
> > > +        return;
> >
> > Disabling pgstat progress reporting when in bootstrap processing mode
> > / startup/end-of-recovery makes very little sense (see upthread) and
> > should be removed, regardless of whether seperate functions stay.
> >
> > > diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
> > > +#define PROGRESS_CHECKPOINT_PHASE_INIT                          0
> >
> > Generally, enum-like values in a stat_progress field are 1-indexed, to
> > differentiate between empty/uninitialized (0) and states that have
> > been set by the progress reporting infrastructure.
> >
> >
> >
> > Kind regards,
> >
> > Matthias van de Meent

Attachments
> +       if ((ckpt_flags &
> +                (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
> +       {
>
> This code (present at multiple places) looks a little ugly to me, what
> we can do instead is add a macro probably named IsShutdownCheckpoint()
> which does the above check and use it in all the functions that have
> this check. See below:
>
> #define IsShutdownCheckpoint(flags) \
>  (((flags) & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0)
>
> And then you may use this macro like:
>
> if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
>    return;

Good suggestion. In the v3 patch, I have removed the corresponding
code as these checks are not required. Hence this suggestion is not
applicable now.
---

> pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> +
> +               val[0] = XLogCtl->InsertTimeLineID;
> +               val[1] = flags;
> +               val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
> +               val[3] = CheckpointStats.ckpt_start_t;
> +
> +               pgstat_progress_update_multi_param(4, index, val);
> +       }
>
> Any specific reason for recording the timelineID in checkpoint stats
> table? Will this ever change in our case?

The timelineID is used to decide whether the current operation is a
checkpoint or a restartpoint. There is a field in the view to display
this information.
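
(A rough SQL sketch of how the stored timeline could be turned into that
field, purely illustrative: the view name comes from this patch, the stored
timeline is assumed to be exposed as a column named "timeline", and
pg_control_checkpoint() is used here only as one possible source of the
current timeline.)

-- Restartpoint if we are in recovery, or if the stored timeline differs
-- from the current one; otherwise a regular checkpoint.
SELECT CASE
           WHEN pg_is_in_recovery()         THEN 'restartpoint'
           WHEN p.timeline <> c.timeline_id THEN 'restartpoint'
           ELSE 'checkpoint'
       END AS checkpoint_or_restartpoint
FROM pg_stat_progress_checkpoint AS p,
     pg_control_checkpoint() AS c;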

Thanks & Regards,
Nitin Jadhav

On Wed, Feb 23, 2022 at 9:46 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> +       if ((ckpt_flags &
> +                (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
> +       {
>
> This code (present at multiple places) looks a little ugly to me, what
> we can do instead is add a macro probably named IsShutdownCheckpoint()
> which does the above check and use it in all the functions that have
> this check. See below:
>
> #define IsShutdownCheckpoint(flags) \
>   (((flags) & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0)
>
> And then you may use this macro like:
>
> if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
>     return;
>
> This change can be done in all these functions:
>
> +void
> +checkpoint_progress_start(int flags)
>
> --
>
> + */
> +void
> +checkpoint_progress_update_param(int index, int64 val)
>
> --
>
> + * Stop reporting progress of the checkpoint.
> + */
> +void
> +checkpoint_progress_end(void)
>
> ==
>
> +
> pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> +
> +               val[0] = XLogCtl->InsertTimeLineID;
> +               val[1] = flags;
> +               val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
> +               val[3] = CheckpointStats.ckpt_start_t;
> +
> +               pgstat_progress_update_multi_param(4, index, val);
> +       }
>
> Any specific reason for recording the timelineID in checkpoint stats
> table? Will this ever change in our case?
>
> --
> With Regards,
> Ashutosh Sharma.
>
> On Wed, Feb 23, 2022 at 6:59 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > I will make use of pgstat_progress_update_multi_param() in the next
> > > patch to replace multiple calls to checkpoint_progress_update_param().
> >
> > Fixed.
> > ---
> >
> > > > The other progress tables use [type]_total as column names for counter
> > > > targets (e.g. backup_total for backup_streamed, heap_blks_total for
> > > > heap_blks_scanned, etc.). I think that `buffers_total` and
> > > > `files_total` would be better column names.
> > >
> > > I agree and I will update this in the next patch.
> >
> > Fixed.
> > ---
> >
> > > How about this "The checkpoint is started because max_wal_size is reached".
> > >
> > > "The checkpoint is started because checkpoint_timeout expired".
> > >
> > > "The checkpoint is started because some operation forced a checkpoint".
> >
> > I have used the above description. Kindly let me know if any changes
> > are required.
> > ---
> >
> > > > > +      <entry><literal>checkpointing CommitTs pages</literal></entry>
> > > >
> > > > CommitTs -> Commit time stamp
> > >
> > > I will handle this in the next patch.
> >
> > Fixed.
> > ---
> >
> > > There are more scenarios where you can have a backend requesting a checkpoint
> > > and waiting for its completion, and there may be more than one backend
> > > concerned, so I don't think that storing only one / the first backend pid is
> > > ok.
> >
> > Thanks for this information. I am not considering backend_pid.
> > ---
> >
> > > I think all the information should be exposed.  Only knowing why the current
> > > checkpoint has been triggered without any further information seems a bit
> > > useless.  Think for instance for cases like [1].
> >
> > I have supported all possible checkpoint kinds. Added
> > pg_stat_get_progress_checkpoint_kind() to convert the flags (int) to a
> > string representing a combination of flags and also checking for the
> > flag update in ImmediateCheckpointRequested() which checks whether
> > CHECKPOINT_IMMEDIATE flag is set or not. I did not find any other
> > cases where the flags get changed (which changes the current
> > checkpoint behaviour) during the checkpoint. Kindly let me know if I
> > am missing something.
> > ---
> >
> > > > I feel 'processes_wiating' aligns more with the naming conventions of
> > > > the fields of the existing progres views.
> > >
> > > There's at least pg_stat_progress_vacuum.num_dead_tuples.  Anyway I don't have
> > > a strong opinion on it, just make sure to correct the typo.
> >
> > More analysis is required to support this. I am planning to take care
> > in the next patch.
> > ---
> >
> > > If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
> > > restartpoint if the checkpoint's timeline is different from the current
> > > timeline?
> >
> > Fixed.
> >
> > Sharing the v2 patch. Kindly have a look and share your comments.
> >
> > Thanks & Regards,
> > Nitin Jadhav
> >
> >
> >
> >
> > On Tue, Feb 22, 2022 at 12:08 PM Nitin Jadhav
> > <nitinjadhavpostgres@gmail.com> wrote:
> > >
> > > > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > > > - can be stored without any problem. 'checkpoint or restartpoint?'
> > > > > (boolean) - can be stored as a integer value like
> > > > > PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
> > > > > PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
> > > > > start time in stat_progress, timestamp fits in 64 bits) - As
> > > > > Timestamptz is of type int64 internally, we can store the timestamp
> > > > > value in the progress parameter and then expose a function like
> > > > > 'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
> > > > > Timestamptz) as argument and then returns string representing the
> > > > > elapsed time.
> > > >
> > > > No need to use a string there; I think exposing the checkpoint start
> > > > time is good enough. The conversion of int64 to timestamp[tz] can be
> > > > done in SQL (although I'm not sure that exposing the internal bitwise
> > > > representation of Interval should be exposed to that extent) [0].
> > > > Users can then extract the duration interval using now() - start_time,
> > > > which also allows the user to use their own preferred formatting.
> > >
> > > The reason for showing the elapsed time rather than exposing the
> > > timestamp directly is that, for a checkpoint during shutdown or
> > > end-of-recovery, I am planning to log a message in the server logs using
> > > the 'log_startup_progress_interval' infrastructure, which displays elapsed
> > > time. So, just to keep both behaviours consistent, I am displaying elapsed
> > > time here. I feel that elapsed time gives a quicker feel of the
> > > progress. Kindly let me know if you still feel just exposing the
> > > timestamp is better than showing the elapsed time.
> > >
> > > > >  'checkpoint start location' (lsn = uint64) - I feel we
> > > > > cannot use progress parameters for this case. As assigning uint64 to
> > > > > int64 type would be an issue for larger values and can lead to hidden
> > > > > bugs.
> > > >
> > > > Not necessarily - we can (without much trouble) do a bitwise cast from
> > > > uint64 to int64, and then (in SQL) cast it back to a pg_lsn [1]. Not
> > > > very elegant, but it works quite well.
> > > >
> > > > [1] SELECT '0/0'::pg_lsn + ((CASE WHEN stat.my_int64 < 0 THEN
> > > > pow(2::numeric, 64::numeric)::numeric ELSE 0::numeric END) +
> > > > stat.my_int64::numeric) FROM (SELECT -2::bigint /* 0xFFFFFFFF/FFFFFFFE
> > > > */ AS my_bigint_lsn) AS stat(my_int64);
> > >
> > > Thanks for sharing. It works. I will include this in the next patch.
> > > On Sat, Feb 19, 2022 at 11:02 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Fri, Feb 18, 2022 at 08:07:05PM +0530, Nitin Jadhav wrote:
> > > > >
> > > > > The backend_pid contains a valid value only during
> > > > > the CHECKPOINT command issued by the backend explicitly, otherwise the
> > > > > value will be 0.  We may have to add an additional field to
> > > > > 'CheckpointerShmemStruct' to hold the backend pid. The backend
> > > > > requesting the checkpoint will update its pid to this structure.
> > > > > Kindly let me know if you still feel the backend_pid field is not
> > > > > necessary.
> > > >
> > > > There are more scenarios where you can have a backend requesting a checkpoint
> > > > and waiting for its completion, and there may be more than one backend
> > > > concerned, so I don't think that storing only one / the first backend pid is
> > > > ok.
> > > >
> > > > > > And also while looking at the patch I see there's the same problem that I
> > > > > > mentioned in the previous thread, which is that the effective flags can be
> > > > > > updated once the checkpoint started, and as-is the view won't reflect that.  It
> > > > > > also means that you can't simply display one of wal, time or force but a
> > > > > > possible combination of the flags (including the one not handled in v1).
> > > > >
> > > > > If I understand the above comment properly, it has 2 points. First is
> > > > > to display the combination of flags rather than just displaying wal,
> > > > > time or force - The idea behind this is to just let the user know the
> > > > > reason for checkpointing. That is, the checkpoint is started because
> > > > > max_wal_size is reached or checkpoint_timeout expired or explicitly
> > > > > issued CHECKPOINT command. The other flags like CHECKPOINT_IMMEDIATE,
> > > > > CHECKPOINT_WAIT or CHECKPOINT_FLUSH_ALL indicate how the checkpoint
> > > > > has to be performed. Hence I have not included those in the view.  If
> > > > > it is really required, I would like to modify the code to include
> > > > > other flags and display the combination.
> > > >
> > > > I think all the information should be exposed.  Only knowing why the current
> > > > checkpoint has been triggered without any further information seems a bit
> > > > useless.  Think for instance for cases like [1].
> > > >
> > > > > Second point is to reflect
> > > > > the updated flags in the view. AFAIK, there is a possibility that the
> > > > > flags get updated during the ongoing checkpoint, but the reason for the
> > > > > checkpoint (wal, time or force) will remain the same for the current
> > > > > checkpoint. There might be a change in how checkpoint has to be
> > > > > performed if CHECKPOINT_IMMEDIATE flag is set. If we go with
> > > > > displaying the combination of flags in the view, then probably we may
> > > > > have to reflect this in the view.
> > > >
> > > > You can only "upgrade" a checkpoint, but not "downgrade" it.  So if for
> > > > instance you find both CHECKPOINT_CAUSE_TIME and CHECKPOINT_FORCE (which is
> > > > possible) you can easily know which one was the one that triggered the
> > > > checkpoint and which one was added later.
> > > >
> > > > > > > Probably a new field named 'processes_wiating' or 'events_waiting' can be
> > > > > > > added for this purpose.
> > > > > >
> > > > > > Maybe num_process_waiting?
> > > > >
> > > > > I feel 'processes_wiating' aligns more with the naming conventions of
> > > > > the fields of the existing progres views.
> > > >
> > > > There's at least pg_stat_progress_vacuum.num_dead_tuples.  Anyway I don't have
> > > > a strong opinion on it, just make sure to correct the typo.
> > > >
> > > > > > > Probably writing of buffers or syncing files may complete before
> > > > > > > pg_is_in_recovery() returns false. But there are some cleanup
> > > > > > > operations that happen as part of the checkpoint. During this scenario, we
> > > > > > > may get false value for pg_is_in_recovery(). Please refer following
> > > > > > > piece of code which is present in CreateRestartpoint().
> > > > > > >
> > > > > > > if (!RecoveryInProgress())
> > > > > > >         replayTLI = XLogCtl->InsertTimeLineID;
> > > > > >
> > > > > > Then maybe we could store the timeline rather than the kind of checkpoint?
> > > > > > You should still be able to compute the information while giving a bit more
> > > > > > information for the same memory usage.
> > > > >
> > > > > Can you please describe more about how checkpoint/restartpoint can be
> > > > > confirmed using the timeline id.
> > > >
> > > > If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
> > > > restartpoint if the checkpoint's timeline is different from the current
> > > > timeline?
> > > >
> > > > [1] https://www.postgresql.org/message-id/1486805889.24568.96.camel%40credativ.de



> > Thank you Alvaro and Matthias for your views. I understand your point
> > of not updating the progress-report flag here as it just checks
> > whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
> > based on that but it doesn't change the checkpoint flags. I will
> > modify the code but I am a bit confused here. As per Alvaro, we need
> > to make the progress-report flag change in whatever is the place that
> > *requests* an immediate checkpoint. I feel this gives information
> > about the upcoming checkpoint not the current one. So updating here
> > provides wrong details in the view. The flags available during
> > CreateCheckPoint() will remain the same for the entire checkpoint
> > operation and we should show the same information in the view till it
> > completes.
>
> I'm not sure what Matthias meant, but as far as I know there's no fundamental
> difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
> and there's also no scheduling for multiple checkpoints.
>
> Yes, the flags will remain the same but checkpoint.c will test both the passed
> flags and the shmem flags to see whether a delay should be added or not, which
> is the only difference in checkpoint processing for this flag.  See the call to
> ImmediateCheckpointRequested() which will look at the value in shmem:
>
>        /*
>         * Perform the usual duties and take a nap, unless we're behind schedule,
>         * in which case we just try to catch up as quickly as possible.
>         */
>        if (!(flags & CHECKPOINT_IMMEDIATE) &&
>                !ShutdownRequestPending &&
>                !ImmediateCheckpointRequested() &&
>                IsCheckpointOnSchedule(progress))

I understand that the checkpointer considers the passed flags as well as the
shmem flags, and if the CHECKPOINT_IMMEDIATE flag is set, it affects the
current checkpoint operation (no further delay) but does not change
the current flag value. Should we display this change in the kind
field of the view or not? Please share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Fri, Feb 25, 2022 at 12:33 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> On Fri, Feb 25, 2022 at 12:23:27AM +0530, Nitin Jadhav wrote:
> > > I think the change to ImmediateCheckpointRequested() makes no sense.
> > > Before this patch, that function merely inquires whether there's an
> > > immediate checkpoint queued.  After this patch, it ... changes a
> > > progress-reporting flag?  I think it would make more sense to make the
> > > progress-report flag change in whatever is the place that *requests* an
> > > immediate checkpoint rather than here.
> > >
> > > > diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> > > > +ImmediateCheckpointRequested(int flags)
> > > >      if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
> > > > +    {
> > > > +        updated_flags |= CHECKPOINT_IMMEDIATE;
> > >
> > > I don't think that these changes are expected behaviour. Under this
> > > condition, the currently running checkpoint is still not 'immediate',
> > > but it is going to hurry up for a new, actually immediate checkpoint.
> > > Those are different kinds of checkpoint handling; and I don't think
> > > you should modify the reported flags to show that we're going to do
> > > stuff faster than usual. Maybe maintain a separate 'upcoming
> > > checkpoint flags' field instead?
> >
> > Thank you Alvaro and Matthias for your views. I understand your point
> > of not updating the progress-report flag here as it just checks
> > whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
> > based on that but it doesn't change the checkpoint flags. I will
> > modify the code but I am a bit confused here. As per Alvaro, we need
> > to make the progress-report flag change in whatever is the place that
> > *requests* an immediate checkpoint. I feel this gives information
> > about the upcoming checkpoint not the current one. So updating here
> > provides wrong details in the view. The flags available during
> > CreateCheckPoint() will remain the same for the entire checkpoint
> > operation and we should show the same information in the view till it
> > completes.
>
> I'm not sure what Matthias meant, but as far as I know there's no fundamental
> difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
> and there's also no scheduling for multiple checkpoints.
>
> Yes, the flags will remain the same but checkpoint.c will test both the passed
> flags and the shmem flags to see whether a delay should be added or not, which
> is the only difference in checkpoint processing for this flag.  See the call to
> ImmediateCheckpointRequested() which will look at the value in shmem:
>
>         /*
>          * Perform the usual duties and take a nap, unless we're behind schedule,
>          * in which case we just try to catch up as quickly as possible.
>          */
>         if (!(flags & CHECKPOINT_IMMEDIATE) &&
>                 !ShutdownRequestPending &&
>                 !ImmediateCheckpointRequested() &&
>                 IsCheckpointOnSchedule(progress))
> [...]



On Fri, Feb 25, 2022 at 08:53:50PM +0530, Nitin Jadhav wrote:
> >
> > I'm not sure what Matthias meant, but as far as I know there's no fundamental
> > difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
> > and there's also no scheduling for multiple checkpoints.
> >
> > Yes, the flags will remain the same but checkpoint.c will test both the passed
> > flags and the shmem flags to see whether a delay should be added or not, which
> > is the only difference in checkpoint processing for this flag.  See the call to
> > ImmediateCheckpointRequested() which will look at the value in shmem:
> >
> >        /*
> >         * Perform the usual duties and take a nap, unless we're behind schedule,
> >         * in which case we just try to catch up as quickly as possible.
> >         */
> >        if (!(flags & CHECKPOINT_IMMEDIATE) &&
> >                !ShutdownRequestPending &&
> >                !ImmediateCheckpointRequested() &&
> >                IsCheckpointOnSchedule(progress))
> 
> I understand that the checkpointer considers flags as well as the
> shmem flags and if CHECKPOINT_IMMEDIATE flag is set, it affects the
> current checkpoint operation (No further delay) but does not change
> the current flag value. Should we display this change in the kind
> field of the view or not? Please share your thoughts.

I think the fields should be added.  It's good to know that a checkpoint was
triggered due to normal activity and should be spread out, and then something
upgraded it to an immediate checkpoint.  If you're desperately waiting for the
end of a checkpoint for some reason and ask for an immediate checkpoint, you'll
certainly be happy to see that the checkpointer is aware of it.

But maybe I missed something in the code, so let's wait for Matthias input
about it.



Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

От
Matthias van de Meent
Дата:
On Fri, 25 Feb 2022 at 17:35, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Fri, Feb 25, 2022 at 08:53:50PM +0530, Nitin Jadhav wrote:
> > >
> > > I'm not sure what Matthias meant, but as far as I know there's no fundamental
> > > difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
> > > and there's also no scheduling for multiple checkpoints.
> > >
> > > Yes, the flags will remain the same but checkpoint.c will test both the passed
> > > flags and the shmem flags to see whether a delay should be added or not, which
> > > is the only difference in checkpoint processing for this flag.  See the call to
> > > ImmediateCheckpointRequested() which will look at the value in shmem:
> > >
> > >        /*
> > >         * Perform the usual duties and take a nap, unless we're behind schedule,
> > >         * in which case we just try to catch up as quickly as possible.
> > >         */
> > >        if (!(flags & CHECKPOINT_IMMEDIATE) &&
> > >                !ShutdownRequestPending &&
> > >                !ImmediateCheckpointRequested() &&
> > >                IsCheckpointOnSchedule(progress))
> >
> > I understand that the checkpointer considers flags as well as the
> > shmem flags and if CHECKPOINT_IMMEDIATE flag is set, it affects the
> > current checkpoint operation (No further delay) but does not change
> > the current flag value. Should we display this change in the kind
> > field of the view or not? Please share your thoughts.
>
> I think the fields should be added.  It's good to know that a checkpoint was
> triggered due to normal activity and should be spread out, and then something
> upgraded it to an immediate checkpoint.  If you're desperately waiting for the
> end of a checkpoint for some reason and ask for an immediate checkpoint, you'll
> certainly be happy to see that the checkpointer is aware of it.
>
> But maybe I missed something in the code, so let's wait for Matthias input
> about it.

The point I was trying to make was "If cps->ckpt_flags is
CHECKPOINT_IMMEDIATE, we hurry up to start the new checkpoint that is
actually immediate". That doesn't mean that this checkpoint was
created with IMMEDIATE or running using IMMEDIATE, only that optional
delays are now being skipped instead.

To let the user detect _why_ the optional delays are now being
skipped, I propose not to report this currently running checkpoint's
"flags | CHECKPOINT_IMMEDIATE", but to add reporting of the next
checkpoint's flags, which would allow the detection and display of the
CHECKPOINT_IMMEDIATE we're actually hurrying for (plus some more
interesting information flags).

-Matthias

PS. I just noticed that the checkpoint flags are also being parsed and
stringified twice in LogCheckpointStart; and adding another duplicate
in the current code would put that at 3 copies of effectively the same
code. Do we maybe want to deduplicate that into macros, similar to
LSN_FORMAT_ARGS?



On Fri, Feb 25, 2022 at 06:49:42PM +0100, Matthias van de Meent wrote:
>
> The point I was trying to make was "If cps->ckpt_flags is
> CHECKPOINT_IMMEDIATE, we hurry up to start the new checkpoint that is
> actually immediate". That doesn't mean that this checkpoint was
> created with IMMEDIATE or running using IMMEDIATE, only that optional
> delays are now being skipped instead.

Ah, I now see what you mean.

> To let the user detect _why_ the optional delays are now being
> skipped, I propose not to report this currently running checkpoint's
> "flags | CHECKPOINT_IMMEDIATE", but to add reporting of the next
> checkpoint's flags, which would allow the detection and display of the
> CHECKPOINT_IMMEDIATE we're actually hurrying for (plus some more
> interesting information flags).

I'm still not convinced that's a sensible approach.  The next checkpoint will
be displayed in the view as CHECKPOINT_IMMEDIATE, so you will then know about
it.  I'm not sure that having that specific information in the view is
going to help, especially if users have to understand "a slow checkpoint is
actually fast even if it's displayed as slow if the next checkpoint is going to
be fast".  Saying "it's timed" (which imply slow) and "it's fast" is maybe
still counter intuitive, but at least have a better chance to see there's
something going on and refer to the doc if you don't get it.



On Sat, Feb 26, 2022 at 02:30:36AM +0800, Julien Rouhaud wrote:
> On Fri, Feb 25, 2022 at 06:49:42PM +0100, Matthias van de Meent wrote:
> >
> > The point I was trying to make was "If cps->ckpt_flags is
> > CHECKPOINT_IMMEDIATE, we hurry up to start the new checkpoint that is
> > actually immediate". That doesn't mean that this checkpoint was
> > created with IMMEDIATE or running using IMMEDIATE, only that optional
> > delays are now being skipped instead.
> 
> Ah, I now see what you mean.
> 
> > To let the user detect _why_ the optional delays are now being
> > skipped, I propose not to report this currently running checkpoint's
> > "flags | CHECKPOINT_IMMEDIATE", but to add reporting of the next
> > checkpoint's flags, which would allow the detection and display of the
> > CHECKPOINT_IMMEDIATE we're actually hurrying for (plus some more
> > interesting information flags).
> 
> I'm still not convinced that's a sensible approach.  The next checkpoint will
> be displayed in the view as CHECKPOINT_IMMEDIATE, so you will then know about
> it.  I'm not sure that having that specific information in the view is
> going to help, especially if users have to understand "a slow checkpoint is
> actually fast even if it's displayed as slow if the next checkpoint is going to
> be fast".  Saying "it's timed" (which imply slow) and "it's fast" is maybe
> still counter intuitive, but at least have a better chance to see there's
> something going on and refer to the doc if you don't get it.

Just to be clear, I do think that it's worthwhile to add some information that
some backends are waiting for that next checkpoint.  As discussed before, an
int for the number of backends looks like enough information to me.



On Fri, Feb 25, 2022 at 8:38 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

Had a quick look over the v3 patch. I'm not sure if it's the best way
to have pg_stat_get_progress_checkpoint_type,
pg_stat_get_progress_checkpoint_kind and
pg_stat_get_progress_checkpoint_start_time just for printing info in
readable format in pg_stat_progress_checkpoint. I don't think these
functions will ever be useful for the users.

1) Can't we use pg_is_in_recovery to determine if it's a restartpoint
or checkpoint instead of having a new function
pg_stat_get_progress_checkpoint_type?

2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
directly instead of new function pg_stat_get_progress_checkpoint_kind?
+ snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
+ (flags == 0) ? "unknown" : "",
+ (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
+ (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+ (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
+ (flags & CHECKPOINT_FORCE) ? "force " : "",
+ (flags & CHECKPOINT_WAIT) ? "wait " : "",
+ (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
+ (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
+ (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
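
For what it's worth, here is a rough sketch of how that could look at the SQL
level. This is only an illustration: the column name s.param2 and the numeric
bit values (taken to match the CHECKPOINT_* constants in xlog.h) are
assumptions for the example, not necessarily what the patch stores:

    SELECT concat_ws(' ',
               CASE WHEN (s.param2 & 1)   > 0 THEN 'shutdown' END,
               CASE WHEN (s.param2 & 2)   > 0 THEN 'end-of-recovery' END,
               CASE WHEN (s.param2 & 4)   > 0 THEN 'immediate' END,
               CASE WHEN (s.param2 & 8)   > 0 THEN 'force' END,
               CASE WHEN (s.param2 & 32)  > 0 THEN 'wait' END,
               CASE WHEN (s.param2 & 128) > 0 THEN 'wal' END,
               CASE WHEN (s.param2 & 256) > 0 THEN 'time' END,
               CASE WHEN (s.param2 & 16)  > 0 THEN 'flush-all' END) AS kind
    FROM (VALUES (4 | 8 | 256)) AS s(param2);   -- gives 'immediate force time'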

3) Why do we need this extra calculation for start_lsn? Do you ever
see a negative LSN or something here?
+    ('0/0'::pg_lsn + (
+        CASE
+            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+            ELSE (0)::numeric
+        END + (s.param3)::numeric)) AS start_lsn,

4) Can't you use timestamptz_in(to_char(s.param4))  instead of
pg_stat_get_progress_checkpoint_start_time? I don't quite understand
the reasoning for having this function and it's named as *checkpoint*
when it doesn't do anything specific to the checkpoint at all?

Having 3 unnecessary functions that aren't useful to the users at all
in proc.dat will simply eat up the function OIDs IMO. Hence, I suggest
let's try to do without extra functions.

Regards,
Bharath Rupireddy.



On Sun, Feb 27, 2022 at 8:44 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Feb 25, 2022 at 8:38 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
>
> Had a quick look over the v3 patch. I'm not sure if it's the best way
> to have pg_stat_get_progress_checkpoint_type,
> pg_stat_get_progress_checkpoint_kind and
> pg_stat_get_progress_checkpoint_start_time just for printing info in
> readable format in pg_stat_progress_checkpoint. I don't think these
> functions will ever be useful for the users.
>
> 1) Can't we use pg_is_in_recovery to determine if it's a restartpoint
> or checkpoint instead of having a new function
> pg_stat_get_progress_checkpoint_type?
>
> 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
> directly instead of new function pg_stat_get_progress_checkpoint_kind?
> + snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
> + (flags == 0) ? "unknown" : "",
> + (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
> + (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
> + (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
> + (flags & CHECKPOINT_FORCE) ? "force " : "",
> + (flags & CHECKPOINT_WAIT) ? "wait " : "",
> + (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
> + (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
> + (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
>
> 3) Why do we need this extra calculation for start_lsn? Do you ever
> see a negative LSN or something here?
> +    ('0/0'::pg_lsn + (
> +        CASE
> +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> +            ELSE (0)::numeric
> +        END + (s.param3)::numeric)) AS start_lsn,
>
> 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> the reasoning for having this function and it's named as *checkpoint*
> when it doesn't do anything specific to the checkpoint at all?
>
> Having 3 unnecessary functions that aren't useful to the users at all
> in proc.dat will simply eatup the function oids IMO. Hence, I suggest
> let's try to do without extra functions.

Another thought for my review comment:
> 1) Can't we use pg_is_in_recovery to determine if it's a restartpoint
> or checkpoint instead of having a new function
> pg_stat_get_progress_checkpoint_type?

I don't think using pg_is_in_recovery works here as it is taken after
the checkpoint has started. So, I think the right way here is to send
1 in CreateCheckPoint and 2 in CreateRestartPoint and use
CASE-WHEN-ELSE-END to show "1" as "checkpoint" and "2" as "restartpoint".
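
To illustrate (just a sketch; the 1/2 encoding and the s.param1 column name
are assumptions used for the example, not what the patch currently stores):

    SELECT CASE s.param1
               WHEN 1 THEN 'checkpoint'
               WHEN 2 THEN 'restartpoint'
               ELSE 'unknown'
           END AS type
    FROM (VALUES (1::bigint)) AS s(param1);   -- gives 'checkpoint'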

Continuing my review:

5) Do we need a special phase for this checkpoint operation? I'm not
sure in which cases it will take a long time, but it looks like
there's a wait loop here.
vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
if (nvxids > 0)
{
    do
    {
        pg_usleep(10000L);        /* wait for 10 msec */
    } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
}

Also, how about special phases for SyncPostCheckpoint(),
SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
it might increase in future (?)), TruncateSUBTRANS()?

6) SLRU (Simple LRU) isn't a phase here, you can just say
PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
  CheckPointPredicate();

And :s/checkpointing SLRU pages/checkpointing predicate lock pages
+                      WHEN 9 THEN 'checkpointing SLRU pages'


7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS

And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
'processing file sync requests'

8) :s/Finalizing/finalizing
+                      WHEN 14 THEN 'Finalizing'

9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
+                      WHEN 3 THEN 'checkpointing snapshots'
:s/checkpointing logical rewrite mappings/checkpointing logical
replication rewrite mapping files
+                      WHEN 4 THEN 'checkpointing logical rewrite mappings'

10) I'm not sure if it's been discussed, but how about adding the number of
snapshot/mapping files the checkpoint has processed so far in the
file-processing while loops of
CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
be many logical snapshot or mapping files and users may be interested
in knowing the so-far-processed-file-count.

11) I think it was discussed: are we going to add the pid of the
checkpoint requestor?

Regards,
Bharath Rupireddy.



Hi,

On Mon, Feb 28, 2022 at 10:21:23AM +0530, Bharath Rupireddy wrote:
> 
> Another thought for my review comment:
> > 1) Can't we use pg_is_in_recovery to determine if it's a restartpoint
> > or checkpoint instead of having a new function
> > pg_stat_get_progress_checkpoint_type?
> 
> I don't think using pg_is_in_recovery work here as it is taken after
> the checkpoint has started. So, I think the right way here is to send
> 1 in CreateCheckPoint  and 2 in CreateRestartPoint and use
> CASE-WHEN-ELSE-END to show "1": "checkpoint" "2":"restartpoint".

I suggested upthread to store the starting timeline instead.  This way you can
deduce whether it's a restartpoint or a checkpoint, but you can also deduce
other information, like what was the starting WAL.

> 11) I think it's discussed, are we going to add the pid of the
> checkpoint requestor?

As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store the number
of backends that are waiting for a new checkpoint.



On Mon, Feb 28, 2022 at 12:02 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> On Mon, Feb 28, 2022 at 10:21:23AM +0530, Bharath Rupireddy wrote:
> >
> > Another thought for my review comment:
> > > 1) Can't we use pg_is_in_recovery to determine if it's a restartpoint
> > > or checkpoint instead of having a new function
> > > pg_stat_get_progress_checkpoint_type?
> >
> > I don't think using pg_is_in_recovery work here as it is taken after
> > the checkpoint has started. So, I think the right way here is to send
> > 1 in CreateCheckPoint  and 2 in CreateRestartPoint and use
> > CASE-WHEN-ELSE-END to show "1": "checkpoint" "2":"restartpoint".
>
> I suggested upthread to store the starting timeline instead.  This way you can
> deduce whether it's a restartpoint or a checkpoint, but you can also deduce
> other information, like what was the starting WAL.

I don't understand why we need the timeline here to just determine
whether it's a restartpoint or checkpoint. I know that the
InsertTimeLineID is 0 during recovery. IMO, emitting 1 for checkpoint
and 2 for restartpoint in CreateCheckPoint and CreateRestartPoint
respectively and using CASE-WHEN-ELSE-END to show it in readable
format is the easiest way.

Can't the checkpoint start LSN be deduced from
PROGRESS_CHECKPOINT_LSN, checkPoint.redo?

I'm completely against these pg_stat_get_progress_checkpoint_{type,
kind, start_time} functions unless there's a strong case. IMO, we can
achieve what we want without these functions as well.

> > 11) I think it's discussed, are we going to add the pid of the
> > checkpoint requestor?
>
> As mentioned upthread, there can be multiple backends that request a
> checkpoint, so unless we want to store an array of pid we should store a number
> of backend that are waiting for a new checkpoint.

Yeah, you are right. Let's not go that path and store an array of
pids. I don't see a strong use-case with the pid of the process
requesting checkpoint. If required, we can add it later once the
pg_stat_progress_checkpoint view gets in.

Regards,
Bharath Rupireddy.



On Mon, Feb 28, 2022 at 06:03:54PM +0530, Bharath Rupireddy wrote:
> On Mon, Feb 28, 2022 at 12:02 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > I suggested upthread to store the starting timeline instead.  This way you can
> > deduce whether it's a restartpoint or a checkpoint, but you can also deduce
> > other information, like what was the starting WAL.
> 
> I don't understand why we need the timeline here to just determine
> whether it's a restartpoint or checkpoint.

I'm not saying it's necessary, I'm saying that for the same space usage we can
store something a bit more useful.  If no one cares about having the starting
timeline available for no extra cost then sure, let's just store the kind
directly.

> Can't the checkpoint start LSN be deduced from
> PROGRESS_CHECKPOINT_LSN, checkPoint.redo?

I'm not sure I'm following, isn't checkPoint.redo the checkpoint start LSN?

> > As mentioned upthread, there can be multiple backends that request a
> > checkpoint, so unless we want to store an array of pid we should store a number
> > of backend that are waiting for a new checkpoint.
> 
> Yeah, you are right. Let's not go that path and store an array of
> pids. I don't see a strong use-case with the pid of the process
> requesting checkpoint. If required, we can add it later once the
> pg_stat_progress_checkpoint view gets in.

I don't think it's really necessary to give the pid list.

If you requested a new checkpoint, it doesn't matter if it's only your backend
that triggered it, another backend or a few other dozen, the result will be the
same and you have the information that the request has been seen.  We could
store just a bool for that but having a number instead also gives a bit more
information and may allow you to detect some broken logic in your client code
if it keeps increasing.



Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

От
Matthias van de Meent
Дата:
On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> 3) Why do we need this extra calculation for start_lsn?
> Do you ever see a negative LSN or something here?
> +    ('0/0'::pg_lsn + (
> +        CASE
> +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> +            ELSE (0)::numeric
> +        END + (s.param3)::numeric)) AS start_lsn,

Yes: LSN can take up all of an uint64; whereas the pgstat column is a
bigint type; thus the signed int64. This cast is OK as it wraps
around, but that means we have to take care to correctly display the
LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
the special-casing for negative values.

As to whether it is reasonable: Generating 16GB of wal every second
(2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
database runtime; or about 17 years. Seeing that a cluster can be
`pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
least version PG 8.4.0 (2009) (and, through pg_migrator, from 8.3.0),
we can assume that clusters hitting LSN=2^63 will be a reasonable
possibility within the next few years. As the lifespan of a PG release
is about 5 years, it doesn't seem impossible that there will be actual
clusters that are going to hit this naturally in the lifespan of PG15.
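(For reference, the arithmetic behind that estimate: 2^63 bytes of WAL at
2^34 bytes per second is 2^63 / 2^34 = 2^29 seconds, which is roughly
5.4 * 10^8 seconds, or about 17 years.)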

It is also possible that someone fat-fingers pg_resetwal; and creates
a cluster with LSN >= 2^63; resulting in negative values in the
s.param3 field. Not likely, but we can force such situations; and as
such we should handle that gracefully.

> 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> the reasoning for having this function and it's named as *checkpoint*
> when it doesn't do anything specific to the checkpoint at all?

I hadn't thought of using the types' inout functions, but it looks
like timestamp IO functions use a formatted timestring, which won't
work with the epoch-based timestamp stored in the view.

> Having 3 unnecessary functions that aren't useful to the users at all
> in proc.dat will simply eatup the function oids IMO. Hence, I suggest
> let's try to do without extra functions.

I agree that (1) could be simplified, or at least fully expressed in
SQL without exposing too many internals. If we're fine with exposing
internals like flags and type layouts, then (2), and arguably (4), can
be expressed in SQL as well.

-Matthias



> > 3) Why do we need this extra calculation for start_lsn?
> > Do you ever see a negative LSN or something here?
> > +    ('0/0'::pg_lsn + (
> > +        CASE
> > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > +            ELSE (0)::numeric
> > +        END + (s.param3)::numeric)) AS start_lsn,
>
> Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> bigint type; thus the signed int64. This cast is OK as it wraps
> around, but that means we have to take care to correctly display the
> LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> the special-casing for negative values.

Yes. The extra calculation is required here as we are storing a uint64
value in a variable of type int64. When we convert uint64 to int64,
the bit pattern is preserved (so no data is lost). The high-order
bit becomes the sign bit, and if the sign bit is set, both the sign and
magnitude of the value change. To safely get back the actual uint64 value
that was assigned, we need the above calculation.
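
For instance, the decoding can be sanity-checked like this (a minimal sketch
mirroring the snippet posted earlier in the thread; the literal -2 just stands
in for a stored bigint whose uint64 LSN has the high bit set):

    SELECT '0/0'::pg_lsn +
           (CASE WHEN s.param3 < 0 THEN pow(2::numeric, 64::numeric)
                 ELSE 0::numeric
            END + s.param3::numeric) AS start_lsn
    FROM (VALUES (-2::bigint)) AS s(param3);   -- gives FFFFFFFF/FFFFFFFE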

> > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > the reasoning for having this function and it's named as *checkpoint*
> > when it doesn't do anything specific to the checkpoint at all?
>
> I hadn't thought of using the types' inout functions, but it looks
> like timestamp IO functions use a formatted timestring, which won't
> work with the epoch-based timestamp stored in the view.

There is a variation of to_timestamp() which takes UNIX epoch (float8)
as an argument and converts it to timestamptz but we cannot directly
call this function with S.param4.

TimestampTz
GetCurrentTimestamp(void)
{
    TimestampTz result;
    struct timeval tp;

    gettimeofday(&tp, NULL);

    result = (TimestampTz) tp.tv_sec -
        ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
    result = (result * USECS_PER_SEC) + tp.tv_usec;

    return result;
}

S.param4 contains the output of the above function
(GetCurrentTimestamp()), which returns a Postgres-epoch-based value, but
to_timestamp() expects a UNIX epoch as input. So some calculation is
required here. I feel the SQL 'to_timestamp(946684800 +
(S.param4::float / 1000000)) AS start_time' works fine. The value
'946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
SECS_PER_DAY). I am not sure whether it is good practice to do it this
way. Kindly share your thoughts.
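
One way to avoid the hard-coded constant is to let the server compute it (a
sketch only; it assumes S.param4 really holds a TimestampTz, i.e. microseconds
since 2000-01-01 00:00:00+00):

    SELECT to_timestamp(extract(epoch FROM timestamptz '2000-01-01 00:00:00+00')::float8
                        + s.param4::float8 / 1000000) AS start_time
    FROM (VALUES (0::bigint)) AS s(param4);   -- gives 2000-01-01 00:00:00 UTC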

Thanks & Regards,
Nitin Jadhav

On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > 3) Why do we need this extra calculation for start_lsn?
> > Do you ever see a negative LSN or something here?
> > +    ('0/0'::pg_lsn + (
> > +        CASE
> > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > +            ELSE (0)::numeric
> > +        END + (s.param3)::numeric)) AS start_lsn,
>
> Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> bigint type; thus the signed int64. This cast is OK as it wraps
> around, but that means we have to take care to correctly display the
> LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> the special-casing for negative values.
>
> As to whether it is reasonable: Generating 16GB of wal every second
> (2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
> has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
> database runtime; or about 17 years. Seeing that a cluster can be
> `pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
> least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0)),
> we can assume that clusters hitting LSN=2^63 will be a reasonable
> possibility within the next few years. As the lifespan of a PG release
> is about 5 years, it doesn't seem impossible that there will be actual
> clusters that are going to hit this naturally in the lifespan of PG15.
>
> It is also possible that someone fat-fingers pg_resetwal; and creates
> a cluster with LSN >= 2^63; resulting in negative values in the
> s.param3 field. Not likely, but we can force such situations; and as
> such we should handle that gracefully.
>
> > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > the reasoning for having this function and it's named as *checkpoint*
> > when it doesn't do anything specific to the checkpoint at all?
>
> I hadn't thought of using the types' inout functions, but it looks
> like timestamp IO functions use a formatted timestring, which won't
> work with the epoch-based timestamp stored in the view.
>
> > Having 3 unnecessary functions that aren't useful to the users at all
> > in proc.dat will simply eatup the function oids IMO. Hence, I suggest
> > let's try to do without extra functions.
>
> I agree that (1) could be simplified, or at least fully expressed in
> SQL without exposing too many internals. If we're fine with exposing
> internals like flags and type layouts, then (2), and arguably (4), can
> be expressed in SQL as well.
>
> -Matthias



Thanks for reviewing.

> > > I suggested upthread to store the starting timeline instead.  This way you can
> > > deduce whether it's a restartpoint or a checkpoint, but you can also deduce
> > > other information, like what was the starting WAL.
> >
> > I don't understand why we need the timeline here to just determine
> > whether it's a restartpoint or checkpoint.
>
> I'm not saying it's necessary, I'm saying that for the same space usage we can
> store something a bit more useful.  If no one cares about having the starting
> timeline available for no extra cost then sure, let's just store the kind
> directly.

Fixed.

> 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
> directly instead of new function pg_stat_get_progress_checkpoint_kind?
> + snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
> + (flags == 0) ? "unknown" : "",
> + (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
> + (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
> + (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
> + (flags & CHECKPOINT_FORCE) ? "force " : "",
> + (flags & CHECKPOINT_WAIT) ? "wait " : "",
> + (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
> + (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
> + (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");

Fixed.
---

> 5) Do we need a special phase for this checkpoint operation? I'm not
> sure in which cases it will take a long time, but it looks like
> there's a wait loop here.
> vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
> if (nvxids > 0)
> {
>     do
>     {
>         pg_usleep(10000L);        /* wait for 10 msec */
>     } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
> }

Yes. It is better to add a separate phase here.
---

> Also, how about special phases for SyncPostCheckpoint(),
> SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> it might be increase in future (?)), TruncateSUBTRANS()?

SyncPreCheckpoint() is just incrementing a counter and
PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
there is no need to add any phases for these as of now. We can add in
the future if necessary. Added phases for SyncPostCheckpoint(),
InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
---

> 6) SLRU (Simple LRU) isn't a phase here, you can just say
> PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> + PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
>  CheckPointPredicate();
>
> And :s/checkpointing SLRU pages/checkpointing predicate lock pages
>+                      WHEN 9 THEN 'checkpointing SLRU pages'

Fixed.
---

> 7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS

I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
it describes the purpose in fewer words.

> And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
> 'processing file sync requests'

Fixed.
---

> 8) :s/Finalizing/finalizing
> +                      WHEN 14 THEN 'Finalizing'

Fixed.
---

> 9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
> +                      WHEN 3 THEN 'checkpointing snapshots'
> :s/checkpointing logical rewrite mappings/checkpointing logical
> replication rewrite mapping files
> +                      WHEN 4 THEN 'checkpointing logical rewrite mappings'

Fixed.
---

> 10) I'm not sure if it's discussed, how about adding the number of
> snapshot/mapping files so far the checkpoint has processed in file
> processing while loops of
> CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> be many logical snapshot or mapping files and users may be interested
> in knowing the so-far-processed-file-count.

I had thought about this while sharing the v1 patch and mentioned my
views upthread. I feel it won't give meaningful progress information
(It can be treated as statistics). Hence not included. Thoughts?

> > > As mentioned upthread, there can be multiple backends that request a
> > > checkpoint, so unless we want to store an array of pid we should store a number
> > > of backend that are waiting for a new checkpoint.
> >
> > Yeah, you are right. Let's not go that path and store an array of
> > pids. I don't see a strong use-case with the pid of the process
> > requesting checkpoint. If required, we can add it later once the
> > pg_stat_progress_checkpoint view gets in.
>
> I don't think that's really necessary to give the pid list.
>
> If you requested a new checkpoint, it doesn't matter if it's only your backend
> that triggered it, another backend or a few other dozen, the result will be the
> same and you have the information that the request has been seen.  We could
> store just a bool for that but having a number instead also gives a bit more
> information and may allow you to detect some broken logic on your client code
> if it keeps increasing.

It's a good metric to show in the view but the information is not
readily available. Additional code is required to calculate the number
of requests. Is it worth doing that? I feel this can be added later if
required.

Please find the v4 patch attached and share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > > 3) Why do we need this extra calculation for start_lsn?
> > > Do you ever see a negative LSN or something here?
> > > +    ('0/0'::pg_lsn + (
> > > +        CASE
> > > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > +            ELSE (0)::numeric
> > > +        END + (s.param3)::numeric)) AS start_lsn,
> >
> > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > bigint type; thus the signed int64. This cast is OK as it wraps
> > around, but that means we have to take care to correctly display the
> > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > the special-casing for negative values.
>
> Yes. The extra calculation is required here as we are storing a uint64
> value in a variable of type int64. When we convert uint64 to int64
> then the bit pattern is preserved (so no data is lost). The high-order
> bit becomes the sign bit and if the sign bit is set, both the sign and
> magnitude of the value changes. To safely get the actual uint64 value
> whatever was assigned, we need the above calculations.
>
> > > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > the reasoning for having this function and it's named as *checkpoint*
> > > when it doesn't do anything specific to the checkpoint at all?
> >
> > I hadn't thought of using the types' inout functions, but it looks
> > like timestamp IO functions use a formatted timestring, which won't
> > work with the epoch-based timestamp stored in the view.
>
> There is a variation of to_timestamp() which takes UNIX epoch (float8)
> as an argument and converts it to timestamptz but we cannot directly
> call this function with S.param4.
>
> TimestampTz
> GetCurrentTimestamp(void)
> {
>     TimestampTz result;
>     struct timeval tp;
>
>     gettimeofday(&tp, NULL);
>
>     result = (TimestampTz) tp.tv_sec -
>         ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
>     result = (result * USECS_PER_SEC) + tp.tv_usec;
>
>     return result;
> }
>
> S.param4 contains the output of the above function
> (GetCurrentTimestamp()) which returns Postgres epoch but the
> to_timestamp() expects UNIX epoch as input. So some calculation is
> required here. I feel the SQL 'to_timestamp(946684800 +
> (S.param4::float / 1000000)) AS start_time' works fine. The value
> '946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
> SECS_PER_DAY). I am not sure whether it is good practice to use this
> way. Kindly share your thoughts.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > 3) Why do we need this extra calculation for start_lsn?
> > > Do you ever see a negative LSN or something here?
> > > +    ('0/0'::pg_lsn + (
> > > +        CASE
> > > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > +            ELSE (0)::numeric
> > > +        END + (s.param3)::numeric)) AS start_lsn,
> >
> > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > bigint type; thus the signed int64. This cast is OK as it wraps
> > around, but that means we have to take care to correctly display the
> > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > the special-casing for negative values.
> >
> > As to whether it is reasonable: Generating 16GB of wal every second
> > (2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
> > has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
> > database runtime; or about 17 years. Seeing that a cluster can be
> > `pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
> > least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0)),
> > we can assume that clusters hitting LSN=2^63 will be a reasonable
> > possibility within the next few years. As the lifespan of a PG release
> > is about 5 years, it doesn't seem impossible that there will be actual
> > clusters that are going to hit this naturally in the lifespan of PG15.
> >
> > It is also possible that someone fat-fingers pg_resetwal; and creates
> > a cluster with LSN >= 2^63; resulting in negative values in the
> > s.param3 field. Not likely, but we can force such situations; and as
> > such we should handle that gracefully.
> >
> > > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > the reasoning for having this function and it's named as *checkpoint*
> > > when it doesn't do anything specific to the checkpoint at all?
> >
> > I hadn't thought of using the types' inout functions, but it looks
> > like timestamp IO functions use a formatted timestring, which won't
> > work with the epoch-based timestamp stored in the view.
> >
> > > Having 3 unnecessary functions that aren't useful to the users at all
> > > in proc.dat will simply eatup the function oids IMO. Hence, I suggest
> > > let's try to do without extra functions.
> >
> > I agree that (1) could be simplified, or at least fully expressed in
> > SQL without exposing too many internals. If we're fine with exposing
> > internals like flags and type layouts, then (2), and arguably (4), can
> > be expressed in SQL as well.
> >
> > -Matthias

Attachments


On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
> > Also, how about special phases for SyncPostCheckpoint(),
> > SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> > PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> > it might be increase in future (?)), TruncateSUBTRANS()?
>
> SyncPreCheckpoint() is just incrementing a counter and
> PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
> there is no need to add any phases for these as of now. We can add in
> the future if necessary. Added phases for SyncPostCheckpoint(),
> InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().

Okay.

> > 10) I'm not sure if it's discussed, how about adding the number of
> > snapshot/mapping files so far the checkpoint has processed in file
> > processing while loops of
> > CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> > be many logical snapshot or mapping files and users may be interested
> > in knowing the so-far-processed-file-count.
>
> I had thought about this while sharing the v1 patch and mentioned my
> views upthread. I feel it won't give meaningful progress information
> (It can be treated as statistics). Hence not included. Thoughts?

Okay. If there are any complaints about it we can always add them later.

> > > > As mentioned upthread, there can be multiple backends that request a
> > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > of backend that are waiting for a new checkpoint.
> > >
> > > Yeah, you are right. Let's not go that path and store an array of
> > > pids. I don't see a strong use-case with the pid of the process
> > > requesting checkpoint. If required, we can add it later once the
> > > pg_stat_progress_checkpoint view gets in.
> >
> > I don't think that's really necessary to give the pid list.
> >
> > If you requested a new checkpoint, it doesn't matter if it's only your backend
> > that triggered it, another backend or a few other dozen, the result will be the
> > same and you have the information that the request has been seen.  We could
> > store just a bool for that but having a number instead also gives a bit more
> > information and may allow you to detect some broken logic on your client code
> > if it keeps increasing.
>
> It's a good metric to show in the view but the information is not
> readily available. Additional code is required to calculate the number
> of requests. Is it worth doing that? I feel this can be added later if
> required.

Yes, we can always add it later if required.

> Please find the v4 patch attached and share your thoughts.

I reviewed the v4 patch; here are my comments:

1) Can we convert below into pgstat_progress_update_multi_param, just
to avoid function calls?
pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,

2) Why are we not having a special phase for CheckPointReplicationOrigin,
as it does a good bunch of work (writing to disk, XLogFlush,
durable_rename), especially when max_replication_slots is large?

3) I don't think "requested" is necessary here as it doesn't add any
value or it's not a checkpoint kind or such, you can remove it.

4) s:/'recycling old XLOG files'/'recycling old WAL files'
+                      WHEN 16 THEN 'recycling old XLOG files'

5) Can we place CREATE VIEW pg_stat_progress_checkpoint AS definition
next to pg_stat_progress_copy in system_view.sql? It looks like all
the progress reporting views are next to each other.

6) How about shutdown and end-of-recovery checkpoint? Are you planning
to have an ereport_startup_progress mechanism as 0002?

7) I think you don't need to call checkpoint_progress_start,
pgstat_progress_update_param, or any other progress reporting function
for shutdown and end-of-recovery checkpoints, right?

8) Not for all kinds of checkpoints, right? pg_stat_progress_checkpoint
can't show a progress report for shutdown and end-of-recovery
checkpoints; I think you need to specify that here in wal.sgml and
checkpoint.sgml.
+   command <command>CHECKPOINT</command>. The checkpointer process running the
+   checkpoint will report its progress in the
+   <structname>pg_stat_progress_checkpoint</structname> view. See
+   <xref linkend="checkpoint-progress-reporting"/> for details.

9) Can you add a test case for the pg_stat_progress_checkpoint view? I
think it's good to add one. See below for reference:
-- Add a trigger to catch and print the contents of the catalog view
-- pg_stat_progress_copy during data insertion.  This allows to test
-- the validation of some progress reports for COPY FROM where the trigger
-- would fire.
create function notice_after_tab_progress_reporting() returns trigger AS
$$
declare report record;

10) Typo: it's not "is happens"
+       The checkpoint is happens without delays.

11) Can you be specific about what those "some operations" that forced a
checkpoint are? Maybe basebackup, createdb or something?
+       The checkpoint is started because some operation forced a checkpoint.

12) Can you elaborate a bit here on who waits? Something like: the
backend that requested the checkpoint will wait until its completion ....
+       Wait for completion before returning.

13) "removing unneeded or flushing needed logical rewrite mapping files"
+       The checkpointer process is currently removing/flushing the logical

14) "old WAL files"
+       The checkpointer process is currently recycling old XLOG files.

Regards,
Bharath Rupireddy.



Here are some of my review comments on the latest patch:

+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>type</structfield> <type>text</type>
+      </para>
+      <para>
+       Type of checkpoint. See <xref linkend="checkpoint-types"/>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>kind</structfield> <type>text</type>
+      </para>
+      <para>
+       Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
+      </para></entry>
+     </row>

This looks a bit confusing. Two columns, one with the name "checkpoint
types" and another "checkpoint kinds". You can probably rename
checkpoint-kinds to checkpoint-flags and leave checkpoint-types as-is.

==

+
<entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+      <entry>One row only, showing the progress of the checkpoint.

Let's make this message consistent with the already existing message
for pg_stat_wal_receiver. See the description of the pg_stat_wal_receiver
view in the "Dynamic Statistics Views" table.

==

[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid               | 22043
type              | checkpoint
kind              | immediate force wait requested time

I think the output in the kind column can be displayed as {immediate,
force, wait, requested, time}. By the way, these are all checkpoint
flags, so it is better to display them as checkpoint flags instead of
checkpoint kind, as mentioned in one of my previous comments.
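
A rough sketch of one way the flags could be rendered in that form is
below. The CHECKPOINT_* bits are the existing flag macros from xlog.h;
the function itself is only illustrative and is not taken from the patch.

#include "access/xlog.h"        /* CHECKPOINT_* flag bits */
#include "lib/stringinfo.h"

static char *
checkpoint_flags_to_text(int flags)
{
    StringInfoData buf;

    initStringInfo(&buf);
    appendStringInfoChar(&buf, '{');

    /* append one name per set flag bit, comma-separating after the first */
#define ADD_FLAG(bit, name) \
    do { \
        if (flags & (bit)) \
            appendStringInfo(&buf, "%s%s", buf.len > 1 ? ", " : "", (name)); \
    } while (0)

    ADD_FLAG(CHECKPOINT_IS_SHUTDOWN, "shutdown");
    ADD_FLAG(CHECKPOINT_END_OF_RECOVERY, "end-of-recovery");
    ADD_FLAG(CHECKPOINT_IMMEDIATE, "immediate");
    ADD_FLAG(CHECKPOINT_FORCE, "force");
    ADD_FLAG(CHECKPOINT_FLUSH_ALL, "flush-all");
    ADD_FLAG(CHECKPOINT_WAIT, "wait");
    ADD_FLAG(CHECKPOINT_REQUESTED, "requested");
    ADD_FLAG(CHECKPOINT_CAUSE_XLOG, "wal");
    ADD_FLAG(CHECKPOINT_CAUSE_TIME, "time");
#undef ADD_FLAG

    appendStringInfoChar(&buf, '}');
    return buf.data;
}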

==

[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid               | 22043
type              | checkpoint
kind              | immediate force wait requested time
start_lsn         | 0/14C60F8
start_time        | 2022-03-03 18:59:56.018662+05:30
phase             | performing two phase checkpoint


This is the output I see when the checkpointer process has come out of
the two phase checkpoint and is currently writing checkpoint xlog
records and doing other stuff like updating control files etc. Is this
okay?

==

The output of log_checkpoint shows the number of buffers written is 3
whereas the output of pg_stat_progress_checkpoint shows it as 0. See
below:

2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB

--

[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid               | 22043
type              | checkpoint
kind              | immediate force wait requested time
start_lsn         | 0/14C60F8
start_time        | 2022-03-03 18:59:56.018662+05:30
phase             | finalizing
buffers_total     | 0
buffers_processed | 0
buffers_written   | 0

Any idea why this mismatch?

==

I think we can add a couple more pieces of information to this view -
the start_time for the buffer write operation and the start_time for the
buffer sync operation. These are two very time-consuming tasks in a
checkpoint, and people would find it useful to know how much time the
checkpoint is spending in its I/O phases. Thoughts?

--
With Regards,
Ashutosh Sharma.

On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> Thanks for reviewing.
>
> > > > I suggested upthread to store the starting timeline instead.  This way you can
> > > > deduce whether it's a restartpoint or a checkpoint, but you can also deduce
> > > > other information, like what was the starting WAL.
> > >
> > > I don't understand why we need the timeline here to just determine
> > > whether it's a restartpoint or checkpoint.
> >
> > I'm not saying it's necessary, I'm saying that for the same space usage we can
> > store something a bit more useful.  If no one cares about having the starting
> > timeline available for no extra cost then sure, let's just store the kind
> > directly.
>
> Fixed.
>
> > 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
> > directly instead of new function pg_stat_get_progress_checkpoint_kind?
> > + snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
> > + (flags == 0) ? "unknown" : "",
> > + (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
> > + (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
> > + (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
> > + (flags & CHECKPOINT_FORCE) ? "force " : "",
> > + (flags & CHECKPOINT_WAIT) ? "wait " : "",
> > + (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
> > + (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
> > + (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
>
> Fixed.
> ---
>
> > 5) Do we need a special phase for this checkpoint operation? I'm not
> > sure in which cases it will take a long time, but it looks like
> > there's a wait loop here.
> > vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
> > if (nvxids > 0)
> > {
> > do
> > {
> > pg_usleep(10000L); /* wait for 10 msec */
> > } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
> > }
>
> Yes. It is better to add a separate phase here.
> ---
>
> > Also, how about special phases for SyncPostCheckpoint(),
> > SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> > PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> > it might be increase in future (?)), TruncateSUBTRANS()?
>
> SyncPreCheckpoint() is just incrementing a counter and
> PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
> there is no need to add any phases for these as of now. We can add in
> the future if necessary. Added phases for SyncPostCheckpoint(),
> InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
> ---
>
> > 6) SLRU (Simple LRU) isn't a phase here, you can just say
> > PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
> > +
> > + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> > + PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
> >  CheckPointPredicate();
> >
> > And :s/checkpointing SLRU pages/checkpointing predicate lock pages
> >+                      WHEN 9 THEN 'checkpointing SLRU pages'
>
> Fixed.
> ---
>
> > 7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS
>
> I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
> it describes the purpose in less words.
>
> > And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
> > 'processing file sync requests'
>
> Fixed.
> ---
>
> > 8) :s/Finalizing/finalizing
> > +                      WHEN 14 THEN 'Finalizing'
>
> Fixed.
> ---
>
> > 9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
> > +                      WHEN 3 THEN 'checkpointing snapshots'
> > :s/checkpointing logical rewrite mappings/checkpointing logical
> > replication rewrite mapping files
> > +                      WHEN 4 THEN 'checkpointing logical rewrite mappings'
>
> Fixed.
> ---
>
> > 10) I'm not sure if it's discussed, how about adding the number of
> > snapshot/mapping files so far the checkpoint has processed in file
> > processing while loops of
> > CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> > be many logical snapshot or mapping files and users may be interested
> > in knowing the so-far-processed-file-count.
>
> I had thought about this while sharing the v1 patch and mentioned my
> views upthread. I feel it won't give meaningful progress information
> (It can be treated as statistics). Hence not included. Thoughts?
>
> > > > As mentioned upthread, there can be multiple backends that request a
> > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > of backend that are waiting for a new checkpoint.
> > >
> > > Yeah, you are right. Let's not go that path and store an array of
> > > pids. I don't see a strong use-case with the pid of the process
> > > requesting checkpoint. If required, we can add it later once the
> > > pg_stat_progress_checkpoint view gets in.
> >
> > I don't think that's really necessary to give the pid list.
> >
> > If you requested a new checkpoint, it doesn't matter if it's only your backend
> > that triggered it, another backend or a few other dozen, the result will be the
> > same and you have the information that the request has been seen.  We could
> > store just a bool for that but having a number instead also gives a bit more
> > information and may allow you to detect some broken logic on your client code
> > if it keeps increasing.
>
> It's a good metric to show in the view but the information is not
> readily available. Additional code is required to calculate the number
> of requests. Is it worth doing that? I feel this can be added later if
> required.
>
> Please find the v4 patch attached and share your thoughts.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > > 3) Why do we need this extra calculation for start_lsn?
> > > > Do you ever see a negative LSN or something here?
> > > > +    ('0/0'::pg_lsn + (
> > > > +        CASE
> > > > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > > +            ELSE (0)::numeric
> > > > +        END + (s.param3)::numeric)) AS start_lsn,
> > >
> > > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > > bigint type; thus the signed int64. This cast is OK as it wraps
> > > around, but that means we have to take care to correctly display the
> > > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > > the special-casing for negative values.
> >
> > Yes. The extra calculation is required here as we are storing unit64
> > value in the variable of type int64. When we convert uint64 to int64
> > then the bit pattern is preserved (so no data is lost). The high-order
> > bit becomes the sign bit and if the sign bit is set, both the sign and
> > magnitude of the value changes. To safely get the actual uint64 value
> > whatever was assigned, we need the above calculations.
> >
> > > > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > > the reasoning for having this function and it's named as *checkpoint*
> > > > when it doesn't do anything specific to the checkpoint at all?
> > >
> > > I hadn't thought of using the types' inout functions, but it looks
> > > like timestamp IO functions use a formatted timestring, which won't
> > > work with the epoch-based timestamp stored in the view.
> >
> > There is a variation of to_timestamp() which takes UNIX epoch (float8)
> > as an argument and converts it to timestamptz but we cannot directly
> > call this function with S.param4.
> >
> > TimestampTz
> > GetCurrentTimestamp(void)
> > {
> >     TimestampTz result;
> >     struct timeval tp;
> >
> >     gettimeofday(&tp, NULL);
> >
> >     result = (TimestampTz) tp.tv_sec -
> >         ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
> >     result = (result * USECS_PER_SEC) + tp.tv_usec;
> >
> >     return result;
> > }
> >
> > S.param4 contains the output of the above function
> > (GetCurrentTimestamp()) which returns Postgres epoch but the
> > to_timestamp() expects UNIX epoch as input. So some calculation is
> > required here. I feel the SQL 'to_timestamp(946684800 +
> > (S.param4::float / 1000000)) AS start_time' works fine. The value
> > '946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
> > SECS_PER_DAY). I am not sure whether it is good practice to use this
> > way. Kindly share your thoughts.
> >
> > Thanks & Regards,
> > Nitin Jadhav
> >
> > On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > >
> > > On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
> > > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > > 3) Why do we need this extra calculation for start_lsn?
> > > > Do you ever see a negative LSN or something here?
> > > > +    ('0/0'::pg_lsn + (
> > > > +        CASE
> > > > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > > +            ELSE (0)::numeric
> > > > +        END + (s.param3)::numeric)) AS start_lsn,
> > >
> > > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > > bigint type; thus the signed int64. This cast is OK as it wraps
> > > around, but that means we have to take care to correctly display the
> > > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > > the special-casing for negative values.
> > >
> > > As to whether it is reasonable: Generating 16GB of wal every second
> > > (2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
> > > has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
> > > database runtime; or about 17 years. Seeing that a cluster can be
> > > `pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
> > > least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0)),
> > > we can assume that clusters hitting LSN=2^63 will be a reasonable
> > > possibility within the next few years. As the lifespan of a PG release
> > > is about 5 years, it doesn't seem impossible that there will be actual
> > > clusters that are going to hit this naturally in the lifespan of PG15.
> > >
> > > It is also possible that someone fat-fingers pg_resetwal; and creates
> > > a cluster with LSN >= 2^63; resulting in negative values in the
> > > s.param3 field. Not likely, but we can force such situations; and as
> > > such we should handle that gracefully.
> > >
> > > > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > > the reasoning for having this function and it's named as *checkpoint*
> > > > when it doesn't do anything specific to the checkpoint at all?
> > >
> > > I hadn't thought of using the types' inout functions, but it looks
> > > like timestamp IO functions use a formatted timestring, which won't
> > > work with the epoch-based timestamp stored in the view.
> > >
> > > > Having 3 unnecessary functions that aren't useful to the users at all
> > > > in proc.dat will simply eatup the function oids IMO. Hence, I suggest
> > > > let's try to do without extra functions.
> > >
> > > I agree that (1) could be simplified, or at least fully expressed in
> > > SQL without exposing too many internals. If we're fine with exposing
> > > internals like flags and type layouts, then (2), and arguably (4), can
> > > be expressed in SQL as well.
> > >
> > > -Matthias



On Wed, Mar 2, 2022 at 7:15 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > > > As mentioned upthread, there can be multiple backends that request a
> > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > of backend that are waiting for a new checkpoint.
>
> It's a good metric to show in the view but the information is not
> readily available. Additional code is required to calculate the number
> of requests. Is it worth doing that? I feel this can be added later if
> required.

Is it that hard or costly to do?  Just sending a message to increment
the stat counter in RequestCheckpoint() would be enough.

Also, unless I'm missing something it's still only showing the initial
checkpoint flags, so it's *not* showing what the checkpoint is really
doing, only what the checkpoint may be doing if nothing else happens.
It just feels wrong.  You could even use that ckpt_flags info to know
that at least one backend has requested a new checkpoint, if you don't
want to have a number of backends.
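
A minimal sketch of such a counter bump, assuming a hypothetical
ckpt_new_requests field added to CheckpointerShmemStruct (it does not
exist today) that the checkpointer resets whenever it starts a new
checkpoint:

/* hypothetical helper in checkpointer.c, not code from the patch */
static void
CountCheckpointRequest(void)
{
    SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
    CheckpointerShmem->ckpt_new_requests++;    /* hypothetical field */
    SpinLockRelease(&CheckpointerShmem->ckpt_lck);
}

RequestCheckpoint() already takes that spinlock when it ORs the new
flags into ckpt_flags, so the increment could live directly in that
critical section at essentially no extra cost.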



Thanks for reviewing.

> 6) How about shutdown and end-of-recovery checkpoint? Are you planning
> to have an ereport_startup_progress mechanism as 0002?

I thought of including it earlier, but then I felt let's first make the
current patch stable. Once all the fields are properly decided and the
patch gets in, we can easily extend the functionality to the shutdown
and end-of-recovery cases. I have also observed that the timer
functionality won't work properly in case of shutdown, as we are doing
an immediate checkpoint. So this needs a lot of discussion and I would
like to handle it on a separate thread.
---

> 7) I think you don't need to call checkpoint_progress_start and
> pgstat_progress_update_param, any other progress reporting function
> for shutdown and end-of-recovery checkpoint right?

I had included the guards earlier and then removed them later based on
the discussion upthread.
---

> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> -[ RECORD 1 ]-----+-------------------------------------
> pid               | 22043
> type              | checkpoint
> kind              | immediate force wait requested time
> start_lsn         | 0/14C60F8
> start_time        | 2022-03-03 18:59:56.018662+05:30
> phase             | performing two phase checkpoint
>
>
> This is the output I see when the checkpointer process has come out of
> the two phase checkpoint and is currently writing checkpoint xlog
> records and doing other stuff like updating control files etc. Is this
> okay?

The idea behind choosing the phases is based on the operations that
take a longer time to execute. Since the work between the two phase
checkpoint and the post checkpoint cleanup won't take much time to
execute, I have not added any additional phase for that. But I also
agree that this gives wrong information to the user. How about
reporting the phase information at the end of each phase, like
"initializing", "initialization done", ..., "two phase checkpoint
done", "post checkpoint cleanup done", ..., "finalizing"? Except for
the first phase ("initializing") and the last phase ("finalizing"),
all the other phases describe the end of a certain operation. I feel
this gives correct information even though the phase name/description
does not represent the entire code block between two phases. For
example, if the current phase is "two phase checkpoint done", the user
can infer that the checkpointer has finished everything up to the two
phase checkpoint and is now doing the work that comes after it.
Thoughts?
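
An illustrative fragment of what "end of operation" phase reporting
could look like inside CheckPointGuts(); the phase constant is only a
placeholder in the spirit of the patch, not one of its actual macros:

    CheckPointTwoPhase(checkPointRedo);
    /* report that the two phase state has been checkpointed */
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
                                 PROGRESS_CHECKPOINT_PHASE_TWOPHASE_DONE);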

> The output of log_checkpoint shows the number of buffers written is 3
> whereas the output of pg_stat_progress_checkpoint shows it as 0. See
> below:
>
> 2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
> buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
> longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
>
> --
>
> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> -[ RECORD 1 ]-----+-------------------------------------
> pid               | 22043
> type              | checkpoint
> kind              | immediate force wait requested time
> start_lsn         | 0/14C60F8
> start_time        | 2022-03-03 18:59:56.018662+05:30
> phase             | finalizing
> buffers_total     | 0
> buffers_processed | 0
> buffers_written   | 0
>
> Any idea why this mismatch?

Good catch. In BufferSync() we have 'num_to_scan' (buffers_total),
which indicates the total number of buffers to be processed. Based on
that, the 'buffers_processed' and 'buffers_written' counters get
incremented; that is, these values may reach up to 'buffers_total'.
The current pg_stat_progress_checkpoint view shows the above
information. There is another place where 'ckpt_bufs_written' gets
incremented (in SlruInternalWritePage()). This increment is on top of
the 'buffers_total' value and it is included in the server log message
(checkpoint end) but not in the view. I am a bit confused here. If we
include this increment in the view then we cannot calculate the exact
'buffers_total' beforehand. Can we increment 'buffers_total' as well
when 'ckpt_bufs_written' gets incremented, so that the behaviour
matches the checkpoint-end message?  Please share your thoughts.
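
A minimal sketch of that idea, not taken from the patch: keep
backend-local running totals, have BufferSync() maintain them for its
own writes, and bump both of them (plus ckpt_bufs_written) whenever an
extra buffer is written elsewhere, e.g. for SLRU pages. The
PROGRESS_CHECKPOINT_* indexes are assumed names from the patch.

static int64 ckpt_progress_total;      /* also maintained by BufferSync() */
static int64 ckpt_progress_written;    /* also maintained by BufferSync() */

static void
checkpoint_progress_count_extra_buffer(void)
{
    const int   index[] = {PROGRESS_CHECKPOINT_BUFFERS_TOTAL,
                           PROGRESS_CHECKPOINT_BUFFERS_WRITTEN};
    int64       val[2];

    CheckpointStats.ckpt_bufs_written++;

    /* grow the total too, so "written" can never exceed "total" in the view */
    val[0] = ++ckpt_progress_total;
    val[1] = ++ckpt_progress_written;

    /* progress parameters hold absolute values, so publish the new totals */
    pgstat_progress_update_multi_param(2, index, val);
}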
---

> I think we can add a couple of more information to this view -
> start_time for buffer write operation and start_time for buffer sync
> operation. These are two very time consuming tasks in a checkpoint and
> people would find it useful to know how much time is being taken by
> the checkpoint in I/O operation phase. thoughts?

I felt the detailed progress is already shown for these 2 phases of
the checkpoint through 'buffers_processed', 'buffers_written' and
'files_synced'. Hence I did not think about adding the start times, but
if they are really required, I can add them.

> Is it that hard or costly to do?  Just sending a message to increment
> the stat counter in RequestCheckpoint() would be enough.
>
> Also, unless I'm missing something it's still only showing the initial
> checkpoint flags, so it's *not* showing what the checkpoint is really
> doing, only what the checkpoint may be doing if nothing else happens.
> It just feels wrong.  You could even use that ckpt_flags info to know
> that at least one backend has requested a new checkpoint, if you don't
> want to have a number of backends.

I think using ckpt_flags to display whether any new requests have been
made or not is a good idea. I will include it in the next patch.

Thanks & Regards,
Nitin Jadhav
On Thu, Mar 3, 2022 at 11:58 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Wed, Mar 2, 2022 at 7:15 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > > > As mentioned upthread, there can be multiple backends that request a
> > > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > > of backend that are waiting for a new checkpoint.
> >
> > It's a good metric to show in the view but the information is not
> > readily available. Additional code is required to calculate the number
> > of requests. Is it worth doing that? I feel this can be added later if
> > required.
>
> Is it that hard or costly to do?  Just sending a message to increment
> the stat counter in RequestCheckpoint() would be enough.
>
> Also, unless I'm missing something it's still only showing the initial
> checkpoint flags, so it's *not* showing what the checkpoint is really
> doing, only what the checkpoint may be doing if nothing else happens.
> It just feels wrong.  You could even use that ckpt_flags info to know
> that at least one backend has requested a new checkpoint, if you don't
> want to have a number of backends.



Please don't mix comments from multiple reviewers into a single reply.
It's hard to understand which comments are mine, Julien's, or from
others. Can you please respond to the email from each of us separately
with inline responses? That will be helpful for understanding your
thoughts on our review comments.

--
With Regards,
Ashutosh Sharma.

On Fri, Mar 4, 2022 at 4:59 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> Thanks for reviewing.
>
> > 6) How about shutdown and end-of-recovery checkpoint? Are you planning
> > to have an ereport_startup_progress mechanism as 0002?
>
> I thought of including it earlier then I felt lets first make the
> current patch stable. Once all the fields are properly decided and the
> patch gets in then we can easily extend the functionality to shutdown
> and end-of-recovery cases. I have also observed that the timer
> functionality wont work properly in case of shutdown as we are doing
> an immediate checkpoint. So this needs a lot of discussion and I would
> like to handle this on a separate thread.
> ---
>
> > 7) I think you don't need to call checkpoint_progress_start and
> > pgstat_progress_update_param, any other progress reporting function
> > for shutdown and end-of-recovery checkpoint right?
>
> I had included the guards earlier and then removed later based on the
> discussion upthread.
> ---
>
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> > start_lsn         | 0/14C60F8
> > start_time        | 2022-03-03 18:59:56.018662+05:30
> > phase             | performing two phase checkpoint
> >
> >
> > This is the output I see when the checkpointer process has come out of
> > the two phase checkpoint and is currently writing checkpoint xlog
> > records and doing other stuff like updating control files etc. Is this
> > okay?
>
> The idea behind choosing the phases is based on the functionality
> which takes longer time to execute. Since after two phase checkpoint
> till post checkpoint cleanup won't take much time to execute, I have
> not added any additional phase for that. But I also agree that this
> gives wrong information to the user. How about mentioning the phase
> information at the end of each phase like "Initializing",
> "Initialization done", ..., "two phase checkpoint done", "post
> checkpoint cleanup done", .., "finalizing". Except for the first phase
> ("initializing") and last phase ("finalizing"), all the other phases
> describe the end of a certain operation. I feel this gives correct
> information even though the phase name/description does not represent
> the entire code block between two phases. For example if the current
> phase is ''two phase checkpoint done". Then the user can infer that
> the checkpointer has done till two phase checkpoint and it is doing
> other stuff that are after that. Thoughts?
>
> > The output of log_checkpoint shows the number of buffers written is 3
> > whereas the output of pg_stat_progress_checkpoint shows it as 0. See
> > below:
> >
> > 2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
> > buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> > write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
> > longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
> >
> > --
> >
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> > start_lsn         | 0/14C60F8
> > start_time        | 2022-03-03 18:59:56.018662+05:30
> > phase             | finalizing
> > buffers_total     | 0
> > buffers_processed | 0
> > buffers_written   | 0
> >
> > Any idea why this mismatch?
>
> Good catch. In BufferSync() we have 'num_to_scan' (buffers_total)
> which indicates the total number of buffers to be processed. Based on
> that, the 'buffers_processed' and 'buffers_written' counter gets
> incremented. I meant these values may reach upto 'buffers_total'. The
> current pg_stat_progress_view support above information. There is
> another place when 'ckpt_bufs_written' gets incremented (In
> SlruInternalWritePage()). This increment is above the 'buffers_total'
> value and it is included in the server log message (checkpoint end)
> and not included in the view. I am a bit confused here. If we include
> this increment in the view then we cannot calculate the exact
> 'buffers_total' beforehand. Can we increment the 'buffers_toal' also
> when 'ckpt_bufs_written' gets incremented so that we can match the
> behaviour with checkpoint end message?  Please share your thoughts.
> ---
>
> > I think we can add a couple of more information to this view -
> > start_time for buffer write operation and start_time for buffer sync
> > operation. These are two very time consuming tasks in a checkpoint and
> > people would find it useful to know how much time is being taken by
> > the checkpoint in I/O operation phase. thoughts?
>
> I felt the detailed progress is getting shown for these 2 phases of
> the checkpoint like 'buffers_processed', 'buffers_written' and
> 'files_synced'. Hence I did not think about adding start time and If
> it is really required, then I can add.
>
> > Is it that hard or costly to do?  Just sending a message to increment
> > the stat counter in RequestCheckpoint() would be enough.
> >
> > Also, unless I'm missing something it's still only showing the initial
> > checkpoint flags, so it's *not* showing what the checkpoint is really
> > doing, only what the checkpoint may be doing if nothing else happens.
> > It just feels wrong.  You could even use that ckpt_flags info to know
> > that at least one backend has requested a new checkpoint, if you don't
> > want to have a number of backends.
>
> I think using ckpt_flags to display whether any new requests have been
> made or not is a good idea. I will include it in the next patch.
>
> Thanks & Regards,
> Nitin Jadhav
> On Thu, Mar 3, 2022 at 11:58 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > On Wed, Mar 2, 2022 at 7:15 PM Nitin Jadhav
> > <nitinjadhavpostgres@gmail.com> wrote:
> > >
> > > > > > As mentioned upthread, there can be multiple backends that request a
> > > > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > > > of backend that are waiting for a new checkpoint.
> > >
> > > It's a good metric to show in the view but the information is not
> > > readily available. Additional code is required to calculate the number
> > > of requests. Is it worth doing that? I feel this can be added later if
> > > required.
> >
> > Is it that hard or costly to do?  Just sending a message to increment
> > the stat counter in RequestCheckpoint() would be enough.
> >
> > Also, unless I'm missing something it's still only showing the initial
> > checkpoint flags, so it's *not* showing what the checkpoint is really
> > doing, only what the checkpoint may be doing if nothing else happens.
> > It just feels wrong.  You could even use that ckpt_flags info to know
> > that at least one backend has requested a new checkpoint, if you don't
> > want to have a number of backends.



> 1) Can we convert below into pgstat_progress_update_multi_param, just
> to avoid function calls?
> pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
> pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
>
> 2) Why are we not having special phase for CheckPointReplicationOrigin
> as it does good bunch of work (writing to disk, XLogFlush,
> durable_rename) especially when max_replication_slots is large?
>
> 3) I don't think "requested" is necessary here as it doesn't add any
> value or it's not a checkpoint kind or such, you can remove it.
>
> 4) s:/'recycling old XLOG files'/'recycling old WAL files'
> +                      WHEN 16 THEN 'recycling old XLOG files'
>
> 5) Can we place CREATE VIEW pg_stat_progress_checkpoint AS definition
> next to pg_stat_progress_copy in system_view.sql? It looks like all
> the progress reporting views are next to each other.

I will take care of these in the next patch.
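
For comment 1, a minimal sketch of the suggested conversion is below.
The PROGRESS_CHECKPOINT_* parameter names are taken from the patch under
review, and the phase value is only a placeholder:

    {
        const int   index[] = {PROGRESS_CHECKPOINT_LSN,
                               PROGRESS_CHECKPOINT_PHASE};
        int64       val[2];

        val[0] = (int64) checkPoint.redo;
        val[1] = PROGRESS_CHECKPOINT_PHASE_INIT;    /* placeholder phase */

        /* one shared-memory update instead of two separate calls */
        pgstat_progress_update_multi_param(2, index, val);
    }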
---

> 6) How about shutdown and end-of-recovery checkpoint? Are you planning
> to have an ereport_startup_progress mechanism as 0002?

I thought of including it earlier, but then I felt let's first make the
current patch stable. Once all the fields are properly decided and the
patch gets in, we can easily extend the functionality to the shutdown
and end-of-recovery cases. I have also observed that the timer
functionality won't work properly in case of shutdown, as we are doing
an immediate checkpoint. So this needs a lot of discussion and I would
like to handle it on a separate thread.
---

> 7) I think you don't need to call checkpoint_progress_start and
> pgstat_progress_update_param, any other progress reporting function
> for shutdown and end-of-recovery checkpoint right?

I had included the guards earlier and then removed them later based on
the discussion upthread.
---

> 8) Not for all kinds of checkpoints right? pg_stat_progress_checkpoint
> can't show progress report for shutdown and end-of-recovery
> checkpoint, I think you need to specify that here in wal.sgml and
> checkpoint.sgml.
> +   command <command>CHECKPOINT</command>. The checkpointer process running the
> +   checkpoint will report its progress in the
> +   <structname>pg_stat_progress_checkpoint</structname> view. See
> +   <xref linkend="checkpoint-progress-reporting"/> for details.
>
> 9) Can you add a test case for pg_stat_progress_checkpoint view? I
> think it's good to add one. See, below for reference:
> -- Add a trigger to catch and print the contents of the catalog view
> -- pg_stat_progress_copy during data insertion.  This allows to test
> -- the validation of some progress reports for COPY FROM where the trigger
> -- would fire.
> create function notice_after_tab_progress_reporting() returns trigger AS
> $$
> declare report record;
>
> 10) Typo: it's not "is happens"
> +       The checkpoint is happens without delays.
>
> 11) Can you be specific what are those "some operations" that forced a
> checkpoint? May be like, basebackup, createdb or something?
> +       The checkpoint is started because some operation forced a checkpoint.
>
> 12) Can you be a bit elobartive here who waits? Something like the
> backend that requested checkpoint will wait until it's completion ....
> +       Wait for completion before returning.
>
> 13) "removing unneeded or flushing needed logical rewrite mapping files"
> +       The checkpointer process is currently removing/flushing the logical
>
> 14) "old WAL files"
> +       The checkpointer process is currently recycling old XLOG files.

I will take care of these in the next patch.

Thanks & Regards,
Nitin Jadhav

On Wed, Mar 2, 2022 at 11:52 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> > > Also, how about special phases for SyncPostCheckpoint(),
> > > SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> > > PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> > > it might be increase in future (?)), TruncateSUBTRANS()?
> >
> > SyncPreCheckpoint() is just incrementing a counter and
> > PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
> > there is no need to add any phases for these as of now. We can add in
> > the future if necessary. Added phases for SyncPostCheckpoint(),
> > InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
>
> Okay.
>
> > > 10) I'm not sure if it's discussed, how about adding the number of
> > > snapshot/mapping files so far the checkpoint has processed in file
> > > processing while loops of
> > > CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> > > be many logical snapshot or mapping files and users may be interested
> > > in knowing the so-far-processed-file-count.
> >
> > I had thought about this while sharing the v1 patch and mentioned my
> > views upthread. I feel it won't give meaningful progress information
> > (It can be treated as statistics). Hence not included. Thoughts?
>
> Okay. If there are any complaints about it we can always add them later.
>
> > > > > As mentioned upthread, there can be multiple backends that request a
> > > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > > of backend that are waiting for a new checkpoint.
> > > >
> > > > Yeah, you are right. Let's not go that path and store an array of
> > > > pids. I don't see a strong use-case with the pid of the process
> > > > requesting checkpoint. If required, we can add it later once the
> > > > pg_stat_progress_checkpoint view gets in.
> > >
> > > I don't think that's really necessary to give the pid list.
> > >
> > > If you requested a new checkpoint, it doesn't matter if it's only your backend
> > > that triggered it, another backend or a few other dozen, the result will be the
> > > same and you have the information that the request has been seen.  We could
> > > store just a bool for that but having a number instead also gives a bit more
> > > information and may allow you to detect some broken logic on your client code
> > > if it keeps increasing.
> >
> > It's a good metric to show in the view but the information is not
> > readily available. Additional code is required to calculate the number
> > of requests. Is it worth doing that? I feel this can be added later if
> > required.
>
> Yes, we can always add it later if required.
>
> > Please find the v4 patch attached and share your thoughts.
>
> I reviewed v4 patch, here are my comments:
>
> 1) Can we convert below into pgstat_progress_update_multi_param, just
> to avoid function calls?
> pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
> pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
>
> 2) Why are we not having special phase for CheckPointReplicationOrigin
> as it does good bunch of work (writing to disk, XLogFlush,
> durable_rename) especially when max_replication_slots is large?
>
> 3) I don't think "requested" is necessary here as it doesn't add any
> value or it's not a checkpoint kind or such, you can remove it.
>
> 4) s:/'recycling old XLOG files'/'recycling old WAL files'
> +                      WHEN 16 THEN 'recycling old XLOG files'
>
> 5) Can we place CREATE VIEW pg_stat_progress_checkpoint AS definition
> next to pg_stat_progress_copy in system_view.sql? It looks like all
> the progress reporting views are next to each other.
>
> 6) How about shutdown and end-of-recovery checkpoint? Are you planning
> to have an ereport_startup_progress mechanism as 0002?
>
> 7) I think you don't need to call checkpoint_progress_start and
> pgstat_progress_update_param, any other progress reporting function
> for shutdown and end-of-recovery checkpoint right?
>
> 8) Not for all kinds of checkpoints right? pg_stat_progress_checkpoint
> can't show progress report for shutdown and end-of-recovery
> checkpoint, I think you need to specify that here in wal.sgml and
> checkpoint.sgml.
> +   command <command>CHECKPOINT</command>. The checkpointer process running the
> +   checkpoint will report its progress in the
> +   <structname>pg_stat_progress_checkpoint</structname> view. See
> +   <xref linkend="checkpoint-progress-reporting"/> for details.
>
> 9) Can you add a test case for pg_stat_progress_checkpoint view? I
> think it's good to add one. See, below for reference:
> -- Add a trigger to catch and print the contents of the catalog view
> -- pg_stat_progress_copy during data insertion.  This allows to test
> -- the validation of some progress reports for COPY FROM where the trigger
> -- would fire.
> create function notice_after_tab_progress_reporting() returns trigger AS
> $$
> declare report record;
>
> 10) Typo: it's not "is happens"
> +       The checkpoint is happens without delays.
>
> 11) Can you be specific what are those "some operations" that forced a
> checkpoint? May be like, basebackup, createdb or something?
> +       The checkpoint is started because some operation forced a checkpoint.
>
> 12) Can you be a bit elobartive here who waits? Something like the
> backend that requested checkpoint will wait until it's completion ....
> +       Wait for completion before returning.
>
> 13) "removing unneeded or flushing needed logical rewrite mapping files"
> +       The checkpointer process is currently removing/flushing the logical
>
> 14) "old WAL files"
> +       The checkpointer process is currently recycling old XLOG files.
>
> Regards,
> Bharath Rupireddy.



> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>type</structfield> <type>text</type>
> +      </para>
> +      <para>
> +       Type of checkpoint. See <xref linkend="checkpoint-types"/>.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>kind</structfield> <type>text</type>
> +      </para>
> +      <para>
> +       Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
> +      </para></entry>
> +     </row>
>
> This looks a bit confusing. Two columns, one with the name "checkpoint
> types" and another "checkpoint kinds". You can probably rename
> checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
> as-it-is.

Makes sense. I will change it in the next patch.
---

> +
<entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
> +      <entry>One row only, showing the progress of the checkpoint.
>
> Let's make this message consistent with the already existing message
> for pg_stat_wal_receiver. See description for pg_stat_wal_receiver
> view in "Dynamic Statistics Views" table.

Do you want me to change "One row only" to "Only one row"? If that is
the case, note that for the other views in the "Collected Statistics
Views" table it is referred to as "One row only". Let me know if you
are pointing out something else.
---

> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> -[ RECORD 1 ]-----+-------------------------------------
> pid               | 22043
> type              | checkpoint
> kind              | immediate force wait requested time
>
> I think the output in the kind column can be displayed as {immediate,
> force, wait, requested, time}. By the way these are all checkpoint
> flags so it is better to display it as checkpoint flags instead of
> checkpoint kind as mentioned in one of my previous comments.

I will update it in the next patch.
---

> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> -[ RECORD 1 ]-----+-------------------------------------
> pid               | 22043
> type              | checkpoint
> kind              | immediate force wait requested time
> start_lsn         | 0/14C60F8
> start_time        | 2022-03-03 18:59:56.018662+05:30
> phase             | performing two phase checkpoint
>
> This is the output I see when the checkpointer process has come out of
> the two phase checkpoint and is currently writing checkpoint xlog
> records and doing other stuff like updating control files etc. Is this
> okay?

The idea behind choosing the phases is based on the operations that
take a longer time to execute. Since the work between the two phase
checkpoint and the post checkpoint cleanup won't take much time to
execute, I have not added any additional phase for that. But I also
agree that this gives wrong information to the user. How about
reporting the phase information at the end of each phase, like
"initializing", "initialization done", ..., "two phase checkpoint
done", "post checkpoint cleanup done", ..., "finalizing"? Except for
the first phase ("initializing") and the last phase ("finalizing"),
all the other phases describe the end of a certain operation. I feel
this gives correct information even though the phase name/description
does not represent the entire code block between two phases. For
example, if the current phase is "two phase checkpoint done", the user
can infer that the checkpointer has finished everything up to the two
phase checkpoint and is now doing the work that comes after it.
Thoughts?
---

> The output of log_checkpoint shows the number of buffers written is 3
> whereas the output of pg_stat_progress_checkpoint shows it as 0. See
> below:
>
> 2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
> buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
> longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
>
> --
>
> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> -[ RECORD 1 ]-----+-------------------------------------
> pid               | 22043
> type              | checkpoint
> kind              | immediate force wait requested time
> start_lsn         | 0/14C60F8
> start_time        | 2022-03-03 18:59:56.018662+05:30
> phase             | finalizing
> buffers_total     | 0
> buffers_processed | 0
> buffers_written   | 0
>
> Any idea why this mismatch?

Good catch. In BufferSync() we have 'num_to_scan' (buffers_total),
which indicates the total number of buffers to be processed. Based on
that, the 'buffers_processed' and 'buffers_written' counters get
incremented; that is, these values may reach up to 'buffers_total'.
The current pg_stat_progress_checkpoint view shows the above
information. There is another place where 'ckpt_bufs_written' gets
incremented (in SlruInternalWritePage()). This increment is on top of
the 'buffers_total' value and it is included in the server log message
(checkpoint end) but not in the view. I am a bit confused here. If we
include this increment in the view then we cannot calculate the exact
'buffers_total' beforehand. Can we increment 'buffers_total' as well
when 'ckpt_bufs_written' gets incremented, so that the behaviour
matches the checkpoint-end message?  Please share your thoughts.
---

> I think we can add a couple of more information to this view -
> start_time for buffer write operation and start_time for buffer sync
> operation. These are two very time consuming tasks in a checkpoint and
> people would find it useful to know how much time is being taken by
> the checkpoint in I/O operation phase. thoughts?

The detailed progress is already shown for these 2 phases of the
checkpoint through 'buffers_processed', 'buffers_written' and
'files_synced'. Hence I did not think about adding the start times, but
if they are really required, I can add them.

Thanks & Regards,
Nitin Jadhav

On Thu, Mar 3, 2022 at 8:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Here are some of my review comments on the latest patch:
>
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>type</structfield> <type>text</type>
> +      </para>
> +      <para>
> +       Type of checkpoint. See <xref linkend="checkpoint-types"/>.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>kind</structfield> <type>text</type>
> +      </para>
> +      <para>
> +       Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
> +      </para></entry>
> +     </row>
>
> This looks a bit confusing. Two columns, one with the name "checkpoint
> types" and another "checkpoint kinds". You can probably rename
> checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
> as-it-is.
>
> ==
>
> +
<entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
> +      <entry>One row only, showing the progress of the checkpoint.
>
> Let's make this message consistent with the already existing message
> for pg_stat_wal_receiver. See description for pg_stat_wal_receiver
> view in "Dynamic Statistics Views" table.
>
> ==
>
> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> -[ RECORD 1 ]-----+-------------------------------------
> pid               | 22043
> type              | checkpoint
> kind              | immediate force wait requested time
>
> I think the output in the kind column can be displayed as {immediate,
> force, wait, requested, time}. By the way these are all checkpoint
> flags so it is better to display it as checkpoint flags instead of
> checkpoint kind as mentioned in one of my previous comments.
>
> ==
>
> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> -[ RECORD 1 ]-----+-------------------------------------
> pid               | 22043
> type              | checkpoint
> kind              | immediate force wait requested time
> start_lsn         | 0/14C60F8
> start_time        | 2022-03-03 18:59:56.018662+05:30
> phase             | performing two phase checkpoint
>
>
> This is the output I see when the checkpointer process has come out of
> the two phase checkpoint and is currently writing checkpoint xlog
> records and doing other stuff like updating control files etc. Is this
> okay?
>
> ==
>
> The output of log_checkpoint shows the number of buffers written is 3
> whereas the output of pg_stat_progress_checkpoint shows it as 0. See
> below:
>
> 2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
> buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
> longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
>
> --
>
> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> -[ RECORD 1 ]-----+-------------------------------------
> pid               | 22043
> type              | checkpoint
> kind              | immediate force wait requested time
> start_lsn         | 0/14C60F8
> start_time        | 2022-03-03 18:59:56.018662+05:30
> phase             | finalizing
> buffers_total     | 0
> buffers_processed | 0
> buffers_written   | 0
>
> Any idea why this mismatch?
>
> ==
>
> I think we can add a couple of more information to this view -
> start_time for buffer write operation and start_time for buffer sync
> operation. These are two very time consuming tasks in a checkpoint and
> people would find it useful to know how much time is being taken by
> the checkpoint in I/O operation phase. thoughts?
>
> --
> With Regards,
> Ashutosh Sharma.
>
> On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > Thanks for reviewing.
> >
> > > > > I suggested upthread to store the starting timeline instead.  This way you can
> > > > > deduce whether it's a restartpoint or a checkpoint, but you can also deduce
> > > > > other information, like what was the starting WAL.
> > > >
> > > > I don't understand why we need the timeline here to just determine
> > > > whether it's a restartpoint or checkpoint.
> > >
> > > I'm not saying it's necessary, I'm saying that for the same space usage we can
> > > store something a bit more useful.  If no one cares about having the starting
> > > timeline available for no extra cost then sure, let's just store the kind
> > > directly.
> >
> > Fixed.
> >
> > > 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
> > > directly instead of new function pg_stat_get_progress_checkpoint_kind?
> > > + snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
> > > + (flags == 0) ? "unknown" : "",
> > > + (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
> > > + (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
> > > + (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
> > > + (flags & CHECKPOINT_FORCE) ? "force " : "",
> > > + (flags & CHECKPOINT_WAIT) ? "wait " : "",
> > > + (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
> > > + (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
> > > + (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
> >
> > Fixed.
> > ---
> >
> > > 5) Do we need a special phase for this checkpoint operation? I'm not
> > > sure in which cases it will take a long time, but it looks like
> > > there's a wait loop here.
> > > vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
> > > if (nvxids > 0)
> > > {
> > > do
> > > {
> > > pg_usleep(10000L); /* wait for 10 msec */
> > > } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
> > > }
> >
> > Yes. It is better to add a separate phase here.
> > ---
> >
> > > Also, how about special phases for SyncPostCheckpoint(),
> > > SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> > > PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> > > it might be increase in future (?)), TruncateSUBTRANS()?
> >
> > SyncPreCheckpoint() is just incrementing a counter and
> > PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
> > there is no need to add any phases for these as of now. We can add in
> > the future if necessary. Added phases for SyncPostCheckpoint(),
> > InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
> > ---
> >
> > > 6) SLRU (Simple LRU) isn't a phase here, you can just say
> > > PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
> > > +
> > > + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> > > + PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
> > >  CheckPointPredicate();
> > >
> > > And :s/checkpointing SLRU pages/checkpointing predicate lock pages
> > >+                      WHEN 9 THEN 'checkpointing SLRU pages'
> >
> > Fixed.
> > ---
> >
> > > 7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS
> >
> > I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
> > it describes the purpose in less words.
> >
> > > And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
> > > 'processing file sync requests'
> >
> > Fixed.
> > ---
> >
> > > 8) :s/Finalizing/finalizing
> > > +                      WHEN 14 THEN 'Finalizing'
> >
> > Fixed.
> > ---
> >
> > > 9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
> > > +                      WHEN 3 THEN 'checkpointing snapshots'
> > > :s/checkpointing logical rewrite mappings/checkpointing logical
> > > replication rewrite mapping files
> > > +                      WHEN 4 THEN 'checkpointing logical rewrite mappings'
> >
> > Fixed.
> > ---
> >
> > > 10) I'm not sure if it's discussed, how about adding the number of
> > > snapshot/mapping files so far the checkpoint has processed in file
> > > processing while loops of
> > > CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> > > be many logical snapshot or mapping files and users may be interested
> > > in knowing the so-far-processed-file-count.
> >
> > I had thought about this while sharing the v1 patch and mentioned my
> > views upthread. I feel it won't give meaningful progress information
> > (It can be treated as statistics). Hence not included. Thoughts?
> >
> > > > > As mentioned upthread, there can be multiple backends that request a
> > > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > > of backend that are waiting for a new checkpoint.
> > > >
> > > > Yeah, you are right. Let's not go that path and store an array of
> > > > pids. I don't see a strong use-case with the pid of the process
> > > > requesting checkpoint. If required, we can add it later once the
> > > > pg_stat_progress_checkpoint view gets in.
> > >
> > > I don't think that's really necessary to give the pid list.
> > >
> > > If you requested a new checkpoint, it doesn't matter if it's only your backend
> > > that triggered it, another backend or a few other dozen, the result will be the
> > > same and you have the information that the request has been seen.  We could
> > > store just a bool for that but having a number instead also gives a bit more
> > > information and may allow you to detect some broken logic on your client code
> > > if it keeps increasing.
> >
> > It's a good metric to show in the view but the information is not
> > readily available. Additional code is required to calculate the number
> > of requests. Is it worth doing that? I feel this can be added later if
> > required.
> >
> > Please find the v4 patch attached and share your thoughts.
> >
> > Thanks & Regards,
> > Nitin Jadhav
> >
> > On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
> > <nitinjadhavpostgres@gmail.com> wrote:
> > >
> > > > > 3) Why do we need this extra calculation for start_lsn?
> > > > > Do you ever see a negative LSN or something here?
> > > > > +    ('0/0'::pg_lsn + (
> > > > > +        CASE
> > > > > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > > > +            ELSE (0)::numeric
> > > > > +        END + (s.param3)::numeric)) AS start_lsn,
> > > >
> > > > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > > > bigint type; thus the signed int64. This cast is OK as it wraps
> > > > around, but that means we have to take care to correctly display the
> > > > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > > > the special-casing for negative values.
> > >
> > > Yes. The extra calculation is required here as we are storing unit64
> > > value in the variable of type int64. When we convert uint64 to int64
> > > then the bit pattern is preserved (so no data is lost). The high-order
> > > bit becomes the sign bit and if the sign bit is set, both the sign and
> > > magnitude of the value changes. To safely get the actual uint64 value
> > > whatever was assigned, we need the above calculations.
> > >
> > > > > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > > > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > > > the reasoning for having this function and it's named as *checkpoint*
> > > > > when it doesn't do anything specific to the checkpoint at all?
> > > >
> > > > I hadn't thought of using the types' inout functions, but it looks
> > > > like timestamp IO functions use a formatted timestring, which won't
> > > > work with the epoch-based timestamp stored in the view.
> > >
> > > There is a variation of to_timestamp() which takes UNIX epoch (float8)
> > > as an argument and converts it to timestamptz but we cannot directly
> > > call this function with S.param4.
> > >
> > > TimestampTz
> > > GetCurrentTimestamp(void)
> > > {
> > >     TimestampTz result;
> > >     struct timeval tp;
> > >
> > >     gettimeofday(&tp, NULL);
> > >
> > >     result = (TimestampTz) tp.tv_sec -
> > >         ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
> > >     result = (result * USECS_PER_SEC) + tp.tv_usec;
> > >
> > >     return result;
> > > }
> > >
> > > S.param4 contains the output of the above function
> > > (GetCurrentTimestamp()) which returns Postgres epoch but the
> > > to_timestamp() expects UNIX epoch as input. So some calculation is
> > > required here. I feel the SQL 'to_timestamp(946684800 +
> > > (S.param4::float / 1000000)) AS start_time' works fine. The value
> > > '946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
> > > SECS_PER_DAY). I am not sure whether it is good practice to use this
> > > way. Kindly share your thoughts.
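> > >
> > > Purely as an illustration, the conversion can be checked standalone with
> > > an arbitrary Postgres-epoch microsecond value in place of S.param4:
> > >
> > >     SELECT to_timestamp(946684800 + (p::float8 / 1000000)) AS start_time
> > >     FROM (VALUES (700000000000000::bigint)) AS t(p);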
> > >
> > > Thanks & Regards,
> > > Nitin Jadhav
> > >
> > > On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
> > > <boekewurm+postgres@gmail.com> wrote:
> > > >
> > > > On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
> > > > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > > > 3) Why do we need this extra calculation for start_lsn?
> > > > > Do you ever see a negative LSN or something here?
> > > > > +    ('0/0'::pg_lsn + (
> > > > > +        CASE
> > > > > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > > > +            ELSE (0)::numeric
> > > > > +        END + (s.param3)::numeric)) AS start_lsn,
> > > >
> > > > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > > > bigint type; thus the signed int64. This cast is OK as it wraps
> > > > around, but that means we have to take care to correctly display the
> > > > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > > > the special-casing for negative values.
> > > >
> > > > As to whether it is reasonable: Generating 16GB of wal every second
> > > > (2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
> > > > has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
> > > > database runtime; or about 17 years. Seeing that a cluster can be
> > > > `pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
> > > > least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0)),
> > > > we can assume that clusters hitting LSN=2^63 will be a reasonable
> > > > possibility within the next few years. As the lifespan of a PG release
> > > > is about 5 years, it doesn't seem impossible that there will be actual
> > > > clusters that are going to hit this naturally in the lifespan of PG15.
> > > >
> > > > It is also possible that someone fat-fingers pg_resetwal; and creates
> > > > a cluster with LSN >= 2^63; resulting in negative values in the
> > > > s.param3 field. Not likely, but we can force such situations; and as
> > > > such we should handle that gracefully.
> > > >
> > > > > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > > > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > > > the reasoning for having this function and it's named as *checkpoint*
> > > > > when it doesn't do anything specific to the checkpoint at all?
> > > >
> > > > I hadn't thought of using the types' inout functions, but it looks
> > > > like timestamp IO functions use a formatted timestring, which won't
> > > > work with the epoch-based timestamp stored in the view.
> > > >
> > > > > Having 3 unnecessary functions that aren't useful to the users at all
> > > > > in proc.dat will simply eatup the function oids IMO. Hence, I suggest
> > > > > let's try to do without extra functions.
> > > >
> > > > I agree that (1) could be simplified, or at least fully expressed in
> > > > SQL without exposing too many internals. If we're fine with exposing
> > > > internals like flags and type layouts, then (2), and arguably (4), can
> > > > be expressed in SQL as well.
> > > >
> > > > -Matthias



> > 11) Can you be specific what are those "some operations" that forced a
> > checkpoint? May be like, basebackup, createdb or something?
> > +       The checkpoint is started because some operation forced a checkpoint.
> >
> I will take care in the next patch.

I feel mentioning/listing the specific operations makes the documentation
difficult to maintain. If we add any new functionality in the future that
needs a forced checkpoint, there is a high chance that we will miss
updating it here. Hence I have modified it to "The checkpoint is started
because some operation that requires a checkpoint forced it".

Fixed other comments as per the discussion above.
Please find the v5 patch attached and share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Mon, Mar 7, 2022 at 7:45 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > 1) Can we convert below into pgstat_progress_update_multi_param, just
> > to avoid function calls?
> > pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
> > pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> >
> > 2) Why are we not having special phase for CheckPointReplicationOrigin
> > as it does good bunch of work (writing to disk, XLogFlush,
> > durable_rename) especially when max_replication_slots is large?
> >
> > 3) I don't think "requested" is necessary here as it doesn't add any
> > value or it's not a checkpoint kind or such, you can remove it.
> >
> > 4) s:/'recycling old XLOG files'/'recycling old WAL files'
> > +                      WHEN 16 THEN 'recycling old XLOG files'
> >
> > 5) Can we place CREATE VIEW pg_stat_progress_checkpoint AS definition
> > next to pg_stat_progress_copy in system_view.sql? It looks like all
> > the progress reporting views are next to each other.
>
> I will take care in the next patch.
> ---
>
> > 6) How about shutdown and end-of-recovery checkpoint? Are you planning
> > to have an ereport_startup_progress mechanism as 0002?
>
> I thought of including it earlier, but then I felt let's first make the
> current patch stable. Once all the fields are properly decided and the
> patch gets in, we can easily extend the functionality to the shutdown
> and end-of-recovery cases. I have also observed that the timer
> functionality won't work properly in case of shutdown, as we are doing
> an immediate checkpoint. So this needs a lot of discussion and I would
> like to handle it on a separate thread.
> ---
>
> > 7) I think you don't need to call checkpoint_progress_start and
> > pgstat_progress_update_param, any other progress reporting function
> > for shutdown and end-of-recovery checkpoint right?
>
> I had included the guards earlier and then removed later based on the
> discussion upthread.
> ---
>
> > 8) Not for all kinds of checkpoints right? pg_stat_progress_checkpoint
> > can't show progress report for shutdown and end-of-recovery
> > checkpoint, I think you need to specify that here in wal.sgml and
> > checkpoint.sgml.
> > +   command <command>CHECKPOINT</command>. The checkpointer process running the
> > +   checkpoint will report its progress in the
> > +   <structname>pg_stat_progress_checkpoint</structname> view. See
> > +   <xref linkend="checkpoint-progress-reporting"/> for details.
> >
> > 9) Can you add a test case for pg_stat_progress_checkpoint view? I
> > think it's good to add one. See, below for reference:
> > -- Add a trigger to catch and print the contents of the catalog view
> > -- pg_stat_progress_copy during data insertion.  This allows to test
> > -- the validation of some progress reports for COPY FROM where the trigger
> > -- would fire.
> > create function notice_after_tab_progress_reporting() returns trigger AS
> > $$
> > declare report record;
> >
> > 10) Typo: it's not "is happens"
> > +       The checkpoint is happens without delays.
> >
> > 11) Can you be specific what are those "some operations" that forced a
> > checkpoint? May be like, basebackup, createdb or something?
> > +       The checkpoint is started because some operation forced a checkpoint.
> >
> > 12) Can you be a bit more elaborate here about who waits? Something like: the
> > backend that requested the checkpoint will wait until its completion ....
> > +       Wait for completion before returning.
> >
> > 13) "removing unneeded or flushing needed logical rewrite mapping files"
> > +       The checkpointer process is currently removing/flushing the logical
> >
> > 14) "old WAL files"
> > +       The checkpointer process is currently recycling old XLOG files.
>
> I will take care in the next patch.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Wed, Mar 2, 2022 at 11:52 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
> > <nitinjadhavpostgres@gmail.com> wrote:
> > > > Also, how about special phases for SyncPostCheckpoint(),
> > > > SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> > > > PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> > > > it might be increase in future (?)), TruncateSUBTRANS()?
> > >
> > > SyncPreCheckpoint() is just incrementing a counter and
> > > PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
> > > there is no need to add any phases for these as of now. We can add in
> > > the future if necessary. Added phases for SyncPostCheckpoint(),
> > > InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
> >
> > Okay.
> >
> > > > 10) I'm not sure if it's discussed, how about adding the number of
> > > > snapshot/mapping files so far the checkpoint has processed in file
> > > > processing while loops of
> > > > CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> > > > be many logical snapshot or mapping files and users may be interested
> > > > in knowing the so-far-processed-file-count.
> > >
> > > I had thought about this while sharing the v1 patch and mentioned my
> > > views upthread. I feel it won't give meaningful progress information
> > > (It can be treated as statistics). Hence not included. Thoughts?
> >
> > Okay. If there are any complaints about it we can always add them later.
> >
> > > > > > As mentioned upthread, there can be multiple backends that request a
> > > > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > > > of backend that are waiting for a new checkpoint.
> > > > >
> > > > > Yeah, you are right. Let's not go that path and store an array of
> > > > > pids. I don't see a strong use-case with the pid of the process
> > > > > requesting checkpoint. If required, we can add it later once the
> > > > > pg_stat_progress_checkpoint view gets in.
> > > >
> > > > I don't think that's really necessary to give the pid list.
> > > >
> > > > If you requested a new checkpoint, it doesn't matter if it's only your backend
> > > > that triggered it, another backend or a few other dozen, the result will be the
> > > > same and you have the information that the request has been seen.  We could
> > > > store just a bool for that but having a number instead also gives a bit more
> > > > information and may allow you to detect some broken logic on your client code
> > > > if it keeps increasing.
> > >
> > > It's a good metric to show in the view but the information is not
> > > readily available. Additional code is required to calculate the number
> > > of requests. Is it worth doing that? I feel this can be added later if
> > > required.
> >
> > Yes, we can always add it later if required.
> >
> > > Please find the v4 patch attached and share your thoughts.
> >
> > I reviewed v4 patch, here are my comments:
> >
> > 1) Can we convert below into pgstat_progress_update_multi_param, just
> > to avoid function calls?
> > pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
> > pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> >
> > 2) Why are we not having special phase for CheckPointReplicationOrigin
> > as it does good bunch of work (writing to disk, XLogFlush,
> > durable_rename) especially when max_replication_slots is large?
> >
> > 3) I don't think "requested" is necessary here as it doesn't add any
> > value or it's not a checkpoint kind or such, you can remove it.
> >
> > 4) s:/'recycling old XLOG files'/'recycling old WAL files'
> > +                      WHEN 16 THEN 'recycling old XLOG files'
> >
> > 5) Can we place CREATE VIEW pg_stat_progress_checkpoint AS definition
> > next to pg_stat_progress_copy in system_view.sql? It looks like all
> > the progress reporting views are next to each other.
> >
> > 6) How about shutdown and end-of-recovery checkpoint? Are you planning
> > to have an ereport_startup_progress mechanism as 0002?
> >
> > 7) I think you don't need to call checkpoint_progress_start and
> > pgstat_progress_update_param, any other progress reporting function
> > for shutdown and end-of-recovery checkpoint right?
> >
> > 8) Not for all kinds of checkpoints right? pg_stat_progress_checkpoint
> > can't show progress report for shutdown and end-of-recovery
> > checkpoint, I think you need to specify that here in wal.sgml and
> > checkpoint.sgml.
> > +   command <command>CHECKPOINT</command>. The checkpointer process running the
> > +   checkpoint will report its progress in the
> > +   <structname>pg_stat_progress_checkpoint</structname> view. See
> > +   <xref linkend="checkpoint-progress-reporting"/> for details.
> >
> > 9) Can you add a test case for pg_stat_progress_checkpoint view? I
> > think it's good to add one. See, below for reference:
> > -- Add a trigger to catch and print the contents of the catalog view
> > -- pg_stat_progress_copy during data insertion.  This allows to test
> > -- the validation of some progress reports for COPY FROM where the trigger
> > -- would fire.
> > create function notice_after_tab_progress_reporting() returns trigger AS
> > $$
> > declare report record;
> >
> > 10) Typo: it's not "is happens"
> > +       The checkpoint is happens without delays.
> >
> > 11) Can you be specific what are those "some operations" that forced a
> > checkpoint? May be like, basebackup, createdb or something?
> > +       The checkpoint is started because some operation forced a checkpoint.
> >
> > 12) Can you be a bit more elaborate here about who waits? Something like: the
> > backend that requested the checkpoint will wait until its completion ....
> > +       Wait for completion before returning.
> >
> > 13) "removing unneeded or flushing needed logical rewrite mapping files"
> > +       The checkpointer process is currently removing/flushing the logical
> >
> > 14) "old WAL files"
> > +       The checkpointer process is currently recycling old XLOG files.
> >
> > Regards,
> > Bharath Rupireddy.

Attachments
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> >
> > I think the output in the kind column can be displayed as {immediate,
> > force, wait, requested, time}. By the way these are all checkpoint
> > flags so it is better to display it as checkpoint flags instead of
> > checkpoint kind as mentioned in one of my previous comments.
>
> I will update in the next patch.

The current format matches with the server log message for the
checkpoint start in LogCheckpointStart(). Just to be consistent, I
have not changed the code.

I have taken care of the rest of the comments in v5 patch for which
there was clarity.

Thanks & Regards,
Nitin Jadhav

On Mon, Mar 7, 2022 at 8:15 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>type</structfield> <type>text</type>
> > +      </para>
> > +      <para>
> > +       Type of checkpoint. See <xref linkend="checkpoint-types"/>.
> > +      </para></entry>
> > +     </row>
> > +
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>kind</structfield> <type>text</type>
> > +      </para>
> > +      <para>
> > +       Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
> > +      </para></entry>
> > +     </row>
> >
> > This looks a bit confusing. Two columns, one with the name "checkpoint
> > types" and another "checkpoint kinds". You can probably rename
> > checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
> > as-it-is.
>
> Makes sense. I will change in the next patch.
> ---
>
> > +
<entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
> > +      <entry>One row only, showing the progress of the checkpoint.
> >
> > Let's make this message consistent with the already existing message
> > for pg_stat_wal_receiver. See description for pg_stat_wal_receiver
> > view in "Dynamic Statistics Views" table.
>
> You want me to change "One row only" to "Only one row"? If that is
> the case, then note that the other views in the "Collected Statistics Views"
> table are also described as "One row only". Let me know if you are
> pointing out something else.
> ---
>
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> >
> > I think the output in the kind column can be displayed as {immediate,
> > force, wait, requested, time}. By the way these are all checkpoint
> > flags so it is better to display it as checkpoint flags instead of
> > checkpoint kind as mentioned in one of my previous comments.
>
> I will update in the next patch.
> ---
>
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> > start_lsn         | 0/14C60F8
> > start_time        | 2022-03-03 18:59:56.018662+05:30
> > phase             | performing two phase checkpoint
> >
> > This is the output I see when the checkpointer process has come out of
> > the two phase checkpoint and is currently writing checkpoint xlog
> > records and doing other stuff like updating control files etc. Is this
> > okay?
>
> The phases were chosen based on the functionality that takes a long
> time to execute. Since the work between the two phase checkpoint and
> the post checkpoint cleanup doesn't take much time, I have not added
> an additional phase for it. But I also agree that this gives wrong
> information to the user. How about reporting the phase information at
> the end of each phase, like "initializing", "initialization done",
> ..., "two phase checkpoint done", "post checkpoint cleanup done", ...,
> "finalizing"? Except for the first phase ("initializing") and the last
> phase ("finalizing"), all the other phases describe the end of a
> certain operation. I feel this gives correct information even though
> the phase name/description does not represent the entire code block
> between two phases. For example, if the current phase is "two phase
> checkpoint done", then the user can infer that the checkpointer has
> finished everything up to the two phase checkpoint and is doing the
> work that comes after it. Thoughts?
> ---
>
> > The output of log_checkpoint shows the number of buffers written is 3
> > whereas the output of pg_stat_progress_checkpoint shows it as 0. See
> > below:
> >
> > 2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
> > buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> > write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
> > longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
> >
> > --
> >
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> > start_lsn         | 0/14C60F8
> > start_time        | 2022-03-03 18:59:56.018662+05:30
> > phase             | finalizing
> > buffers_total     | 0
> > buffers_processed | 0
> > buffers_written   | 0
> >
> > Any idea why this mismatch?
>
> Good catch. In BufferSync() we have 'num_to_scan' (buffers_total),
> which indicates the total number of buffers to be processed. Based on
> that, the 'buffers_processed' and 'buffers_written' counters get
> incremented, meaning these values may reach up to 'buffers_total'. The
> current pg_stat_progress_checkpoint view reports the above
> information. There is another place where 'ckpt_bufs_written' gets
> incremented (in SlruInternalWritePage()). That increment is on top of
> the 'buffers_total' value, and it is included in the server log
> message (checkpoint end) but not in the view. I am a bit confused
> here. If we include this increment in the view then we cannot
> calculate the exact 'buffers_total' beforehand. Can we increment
> 'buffers_total' as well when 'ckpt_bufs_written' gets incremented, so
> that we match the behaviour of the checkpoint-end message? Please
> share your thoughts.
> ---
>
> > I think we can add a couple of more information to this view -
> > start_time for buffer write operation and start_time for buffer sync
> > operation. These are two very time consuming tasks in a checkpoint and
> > people would find it useful to know how much time is being taken by
> > the checkpoint in I/O operation phase. thoughts?
>
> Detailed progress is already shown for these 2 phases of the
> checkpoint through 'buffers_processed', 'buffers_written' and
> 'files_synced'. Hence I did not think about adding the start times; if
> they are really required, I can add them.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Thu, Mar 3, 2022 at 8:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Here are some of my review comments on the latest patch:
> >
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>type</structfield> <type>text</type>
> > +      </para>
> > +      <para>
> > +       Type of checkpoint. See <xref linkend="checkpoint-types"/>.
> > +      </para></entry>
> > +     </row>
> > +
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>kind</structfield> <type>text</type>
> > +      </para>
> > +      <para>
> > +       Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
> > +      </para></entry>
> > +     </row>
> >
> > This looks a bit confusing. Two columns, one with the name "checkpoint
> > types" and another "checkpoint kinds". You can probably rename
> > checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
> > as-it-is.
> >
> > ==
> >
> > +
<entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
> > +      <entry>One row only, showing the progress of the checkpoint.
> >
> > Let's make this message consistent with the already existing message
> > for pg_stat_wal_receiver. See description for pg_stat_wal_receiver
> > view in "Dynamic Statistics Views" table.
> >
> > ==
> >
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> >
> > I think the output in the kind column can be displayed as {immediate,
> > force, wait, requested, time}. By the way these are all checkpoint
> > flags so it is better to display it as checkpoint flags instead of
> > checkpoint kind as mentioned in one of my previous comments.
> >
> > ==
> >
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> > start_lsn         | 0/14C60F8
> > start_time        | 2022-03-03 18:59:56.018662+05:30
> > phase             | performing two phase checkpoint
> >
> >
> > This is the output I see when the checkpointer process has come out of
> > the two phase checkpoint and is currently writing checkpoint xlog
> > records and doing other stuff like updating control files etc. Is this
> > okay?
> >
> > ==
> >
> > The output of log_checkpoint shows the number of buffers written is 3
> > whereas the output of pg_stat_progress_checkpoint shows it as 0. See
> > below:
> >
> > 2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
> > buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> > write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
> > longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
> >
> > --
> >
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> > start_lsn         | 0/14C60F8
> > start_time        | 2022-03-03 18:59:56.018662+05:30
> > phase             | finalizing
> > buffers_total     | 0
> > buffers_processed | 0
> > buffers_written   | 0
> >
> > Any idea why this mismatch?
> >
> > ==
> >
> > I think we can add a couple of more information to this view -
> > start_time for buffer write operation and start_time for buffer sync
> > operation. These are two very time consuming tasks in a checkpoint and
> > people would find it useful to know how much time is being taken by
> > the checkpoint in I/O operation phase. thoughts?
> >
> > --
> > With Regards,
> > Ashutosh Sharma.
> >
> > On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
> > <nitinjadhavpostgres@gmail.com> wrote:
> > >
> > > Thanks for reviewing.
> > >
> > > > > > I suggested upthread to store the starting timeline instead.  This way you can
> > > > > > deduce whether it's a restartpoint or a checkpoint, but you can also deduce
> > > > > > other information, like what was the starting WAL.
> > > > >
> > > > > I don't understand why we need the timeline here to just determine
> > > > > whether it's a restartpoint or checkpoint.
> > > >
> > > > I'm not saying it's necessary, I'm saying that for the same space usage we can
> > > > store something a bit more useful.  If no one cares about having the starting
> > > > timeline available for no extra cost then sure, let's just store the kind
> > > > directly.
> > >
> > > Fixed.
> > >
> > > > 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
> > > > directly instead of new function pg_stat_get_progress_checkpoint_kind?
> > > > + snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
> > > > + (flags == 0) ? "unknown" : "",
> > > > + (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
> > > > + (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
> > > > + (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
> > > > + (flags & CHECKPOINT_FORCE) ? "force " : "",
> > > > + (flags & CHECKPOINT_WAIT) ? "wait " : "",
> > > > + (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
> > > > + (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
> > > > + (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
> > >
> > > Fixed.
> > > ---
> > >
> > > > 5) Do we need a special phase for this checkpoint operation? I'm not
> > > > sure in which cases it will take a long time, but it looks like
> > > > there's a wait loop here.
> > > > vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
> > > > if (nvxids > 0)
> > > > {
> > > > do
> > > > {
> > > > pg_usleep(10000L); /* wait for 10 msec */
> > > > } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
> > > > }
> > >
> > > Yes. It is better to add a separate phase here.
> > > ---
> > >
> > > > Also, how about special phases for SyncPostCheckpoint(),
> > > > SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> > > > PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> > > > it might be increase in future (?)), TruncateSUBTRANS()?
> > >
> > > SyncPreCheckpoint() is just incrementing a counter and
> > > PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
> > > there is no need to add any phases for these as of now. We can add in
> > > the future if necessary. Added phases for SyncPostCheckpoint(),
> > > InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
> > > ---
> > >
> > > > 6) SLRU (Simple LRU) isn't a phase here, you can just say
> > > > PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
> > > > +
> > > > + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> > > > + PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
> > > >  CheckPointPredicate();
> > > >
> > > > And :s/checkpointing SLRU pages/checkpointing predicate lock pages
> > > >+                      WHEN 9 THEN 'checkpointing SLRU pages'
> > >
> > > Fixed.
> > > ---
> > >
> > > > 7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS
> > >
> > > I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
> > > it describes the purpose in less words.
> > >
> > > > And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
> > > > 'processing file sync requests'
> > >
> > > Fixed.
> > > ---
> > >
> > > > 8) :s/Finalizing/finalizing
> > > > +                      WHEN 14 THEN 'Finalizing'
> > >
> > > Fixed.
> > > ---
> > >
> > > > 9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
> > > > +                      WHEN 3 THEN 'checkpointing snapshots'
> > > > :s/checkpointing logical rewrite mappings/checkpointing logical
> > > > replication rewrite mapping files
> > > > +                      WHEN 4 THEN 'checkpointing logical rewrite mappings'
> > >
> > > Fixed.
> > > ---
> > >
> > > > 10) I'm not sure if it's discussed, how about adding the number of
> > > > snapshot/mapping files so far the checkpoint has processed in file
> > > > processing while loops of
> > > > CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> > > > be many logical snapshot or mapping files and users may be interested
> > > > in knowing the so-far-processed-file-count.
> > >
> > > I had thought about this while sharing the v1 patch and mentioned my
> > > views upthread. I feel it won't give meaningful progress information
> > > (It can be treated as statistics). Hence not included. Thoughts?
> > >
> > > > > > As mentioned upthread, there can be multiple backends that request a
> > > > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > > > of backend that are waiting for a new checkpoint.
> > > > >
> > > > > Yeah, you are right. Let's not go that path and store an array of
> > > > > pids. I don't see a strong use-case with the pid of the process
> > > > > requesting checkpoint. If required, we can add it later once the
> > > > > pg_stat_progress_checkpoint view gets in.
> > > >
> > > > I don't think that's really necessary to give the pid list.
> > > >
> > > > If you requested a new checkpoint, it doesn't matter if it's only your backend
> > > > that triggered it, another backend or a few other dozen, the result will be the
> > > > same and you have the information that the request has been seen.  We could
> > > > store just a bool for that but having a number instead also gives a bit more
> > > > information and may allow you to detect some broken logic on your client code
> > > > if it keeps increasing.
> > >
> > > It's a good metric to show in the view but the information is not
> > > readily available. Additional code is required to calculate the number
> > > of requests. Is it worth doing that? I feel this can be added later if
> > > required.
> > >
> > > Please find the v4 patch attached and share your thoughts.
> > >
> > > Thanks & Regards,
> > > Nitin Jadhav
> > >
> > > On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
> > > <nitinjadhavpostgres@gmail.com> wrote:
> > > >
> > > > > > 3) Why do we need this extra calculation for start_lsn?
> > > > > > Do you ever see a negative LSN or something here?
> > > > > > +    ('0/0'::pg_lsn + (
> > > > > > +        CASE
> > > > > > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > > > > +            ELSE (0)::numeric
> > > > > > +        END + (s.param3)::numeric)) AS start_lsn,
> > > > >
> > > > > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > > > > bigint type; thus the signed int64. This cast is OK as it wraps
> > > > > around, but that means we have to take care to correctly display the
> > > > > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > > > > the special-casing for negative values.
> > > >
> > > > Yes. The extra calculation is required here as we are storing a uint64
> > > > value in a variable of type int64. When we convert uint64 to int64,
> > > > the bit pattern is preserved (so no data is lost). The high-order
> > > > bit becomes the sign bit, and if the sign bit is set, both the sign and
> > > > magnitude of the value change. To safely get back the uint64 value
> > > > that was originally assigned, we need the above calculation.
> > > >
> > > > > > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > > > > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > > > > the reasoning for having this function and it's named as *checkpoint*
> > > > > > when it doesn't do anything specific to the checkpoint at all?
> > > > >
> > > > > I hadn't thought of using the types' inout functions, but it looks
> > > > > like timestamp IO functions use a formatted timestring, which won't
> > > > > work with the epoch-based timestamp stored in the view.
> > > >
> > > > There is a variation of to_timestamp() which takes UNIX epoch (float8)
> > > > as an argument and converts it to timestamptz but we cannot directly
> > > > call this function with S.param4.
> > > >
> > > > TimestampTz
> > > > GetCurrentTimestamp(void)
> > > > {
> > > >     TimestampTz result;
> > > >     struct timeval tp;
> > > >
> > > >     gettimeofday(&tp, NULL);
> > > >
> > > >     result = (TimestampTz) tp.tv_sec -
> > > >         ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
> > > >     result = (result * USECS_PER_SEC) + tp.tv_usec;
> > > >
> > > >     return result;
> > > > }
> > > >
> > > > S.param4 contains the output of the above function
> > > > (GetCurrentTimestamp()) which returns Postgres epoch but the
> > > > to_timestamp() expects UNIX epoch as input. So some calculation is
> > > > required here. I feel the SQL 'to_timestamp(946684800 +
> > > > (S.param4::float / 1000000)) AS start_time' works fine. The value
> > > > '946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
> > > > SECS_PER_DAY). I am not sure whether it is good practice to use this
> > > > way. Kindly share your thoughts.
> > > >
> > > > Thanks & Regards,
> > > > Nitin Jadhav
> > > >
> > > > On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
> > > > <boekewurm+postgres@gmail.com> wrote:
> > > > >
> > > > > On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
> > > > > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > > > > 3) Why do we need this extra calculation for start_lsn?
> > > > > > Do you ever see a negative LSN or something here?
> > > > > > +    ('0/0'::pg_lsn + (
> > > > > > +        CASE
> > > > > > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > > > > +            ELSE (0)::numeric
> > > > > > +        END + (s.param3)::numeric)) AS start_lsn,
> > > > >
> > > > > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > > > > bigint type; thus the signed int64. This cast is OK as it wraps
> > > > > around, but that means we have to take care to correctly display the
> > > > > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > > > > the special-casing for negative values.
> > > > >
> > > > > As to whether it is reasonable: Generating 16GB of wal every second
> > > > > (2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
> > > > > has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
> > > > > database runtime; or about 17 years. Seeing that a cluster can be
> > > > > `pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
> > > > > least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0)),
> > > > > we can assume that clusters hitting LSN=2^63 will be a reasonable
> > > > > possibility within the next few years. As the lifespan of a PG release
> > > > > is about 5 years, it doesn't seem impossible that there will be actual
> > > > > clusters that are going to hit this naturally in the lifespan of PG15.
> > > > >
> > > > > It is also possible that someone fat-fingers pg_resetwal; and creates
> > > > > a cluster with LSN >= 2^63; resulting in negative values in the
> > > > > s.param3 field. Not likely, but we can force such situations; and as
> > > > > such we should handle that gracefully.
> > > > >
> > > > > > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > > > > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > > > > the reasoning for having this function and it's named as *checkpoint*
> > > > > > when it doesn't do anything specific to the checkpoint at all?
> > > > >
> > > > > I hadn't thought of using the types' inout functions, but it looks
> > > > > like timestamp IO functions use a formatted timestring, which won't
> > > > > work with the epoch-based timestamp stored in the view.
> > > > >
> > > > > > Having 3 unnecessary functions that aren't useful to the users at all
> > > > > > in proc.dat will simply eatup the function oids IMO. Hence, I suggest
> > > > > > let's try to do without extra functions.
> > > > >
> > > > > I agree that (1) could be simplified, or at least fully expressed in
> > > > > SQL without exposing too many internals. If we're fine with exposing
> > > > > internals like flags and type layouts, then (2), and arguably (4), can
> > > > > be expressed in SQL as well.
> > > > >
> > > > > -Matthias



> > > > > As mentioned upthread, there can be multiple backends that request a
> > > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > > of backend that are waiting for a new checkpoint.
> >
> > It's a good metric to show in the view but the information is not
> > readily available. Additional code is required to calculate the number
> > of requests. Is it worth doing that? I feel this can be added later if
> > required.
>
> Is it that hard or costly to do?  Just sending a message to increment
> the stat counter in RequestCheckpoint() would be enough.
>
> Also, unless I'm missing something it's still only showing the initial
> checkpoint flags, so it's *not* showing what the checkpoint is really
> doing, only what the checkpoint may be doing if nothing else happens.
> It just feels wrong.  You could even use that ckpt_flags info to know
> that at least one backend has requested a new checkpoint, if you don't
> want to have a number of backends.

I just wanted to avoid extra calculations just to show the progress in
the view. Since it's a good metric, I have added an additional field
named 'next_flags' to the view which holds all possible flag values of
the next checkpoint. This gives more information than just saying
whether the new checkpoint is requested or not with the same memory. I
am updating the progress of 'next_flags' in
ImmediateCheckpointRequested() which gets called during buffer write
phase. I gave a thought to update the progress in other places also
but I feel updating in ImmediateCheckpointRequested() is enough as the
current checkpoint behaviour gets affected by only
CHECKPOINT_IMMEDIATE flag and all other checkpoint requests done in
case of createdb(), dropdb(), etc gets called with
CHECKPOINT_IMMEDIATE flag. I have updated this in the v5 patch. Please
share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Thu, Mar 3, 2022 at 11:58 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Wed, Mar 2, 2022 at 7:15 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > > > As mentioned upthread, there can be multiple backends that request a
> > > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > > of backend that are waiting for a new checkpoint.
> >
> > It's a good metric to show in the view but the information is not
> > readily available. Additional code is required to calculate the number
> > of requests. Is it worth doing that? I feel this can be added later if
> > required.
>
> Is it that hard or costly to do?  Just sending a message to increment
> the stat counter in RequestCheckpoint() would be enough.
>
> Also, unless I'm missing something it's still only showing the initial
> checkpoint flags, so it's *not* showing what the checkpoint is really
> doing, only what the checkpoint may be doing if nothing else happens.
> It just feels wrong.  You could even use that ckpt_flags info to know
> that at least one backend has requested a new checkpoint, if you don't
> want to have a number of backends.



On Tue, Mar 8, 2022 at 8:31 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > > -[ RECORD 1 ]-----+-------------------------------------
> > > pid               | 22043
> > > type              | checkpoint
> > > kind              | immediate force wait requested time
> > >
> > > I think the output in the kind column can be displayed as {immediate,
> > > force, wait, requested, time}. By the way these are all checkpoint
> > > flags so it is better to display it as checkpoint flags instead of
> > > checkpoint kind as mentioned in one of my previous comments.
> >
> > I will update in the next patch.
>
> The current format matches with the server log message for the
> checkpoint start in LogCheckpointStart(). Just to be consistent, I
> have not changed the code.
>

See below, how flags are shown in other sql functions like:

ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1);
                raw_flags                | combined_flags
-----------------------------------------+----------------
 {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
(1 row)

This looks more readable and is easier to understand for the
end-users. Further, comparing the way log messages are displayed with
the way SQL functions display their output doesn't look like the right
comparison to me. Obviously both should show matching data, but the way
it is shown doesn't need to be the same. In fact it is not in most of
the cases.

> I have taken care of the rest of the comments in v5 patch for which
> there was clarity.
>

Thank you very much. Will take a look at it later.

--
With Regards,
Ashutosh Sharma.



On Tue, Mar 08, 2022 at 08:57:23PM +0530, Nitin Jadhav wrote:
> 
> I just wanted to avoid extra calculations just to show the progress in
> the view. Since it's a good metric, I have added an additional field
> named 'next_flags' to the view which holds all possible flag values of
> the next checkpoint.

I still don't think that's ok.  IIUC the only way to know if the current
checkpoint is throttled or not is to be aware that the "next_flags" can apply
to the current checkpoint too, look for it and see if that changes the
semantics of what the view say the current checkpoint is.  Most users will get
it wrong.

> This gives more information than just saying
> whether the new checkpoint is requested or not with the same memory.

So that next_flags will be empty most of the time?  It seems confusing.

Again I would just display a bool flag saying whether a new checkpoint has been
explicitly requested or not, it seems enough.

If you're interested in that next checkpoint, you probably want a quick
completion of the current checkpoint first (and thus need to know if it's
throttled or not).  And then you will have to keep monitoring that view for the
next checkpoint anyway, and at that point the view will show the relevant
information.



> > The current format matches with the server log message for the
> > checkpoint start in LogCheckpointStart(). Just to be consistent, I
> > have not changed the code.
> >
>
> See below, how flags are shown in other sql functions like:
>
> ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1);
>                raw_flags                | combined_flags
> -----------------------------------------+----------------
> {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
> (1 row)
>
> This looks more readable and it's easy to understand for the
> end-users.. Further comparing the way log messages are displayed with
> the way sql functions display its output doesn't look like a right
> comparison to me. Obviously both should show matching data but the way
> it is shown doesn't need to be the same. In fact it is not in most of
> the cases.

Ok. I will take care of it in the next patch. I would like to handle this at
the SQL level in system_views.sql. The following expression can be used to
display the flags in the format described above.

      ( '{' ||
          CASE WHEN (S.param2 & 4) > 0 THEN 'immediate' ELSE '' END ||
          CASE WHEN (S.param2 & 4) > 0 AND (S.param2 & -8) > 0 THEN ', ' ELSE '' END ||
          CASE WHEN (S.param2 & 8) > 0 THEN 'force' ELSE '' END ||
          CASE WHEN (S.param2 & 8) > 0 AND (S.param2 & -16) > 0 THEN ', ' ELSE '' END ||
          CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all' ELSE '' END ||
          CASE WHEN (S.param2 & 16) > 0 AND (S.param2 & -32) > 0 THEN ', ' ELSE '' END ||
          CASE WHEN (S.param2 & 32) > 0 THEN 'wait' ELSE '' END ||
          CASE WHEN (S.param2 & 32) > 0 AND (S.param2 & -128) > 0 THEN ', ' ELSE '' END ||
          CASE WHEN (S.param2 & 128) > 0 THEN 'wal' ELSE '' END ||
          CASE WHEN (S.param2 & 128) > 0 AND (S.param2 & -256) > 0 THEN ', ' ELSE '' END ||
          CASE WHEN (S.param2 & 256) > 0 THEN 'time' ELSE '' END
          || '}'

Basically, a separate CASE expression decides whether a comma has to be
printed or not: it checks whether the preceding flag bit is set (so that
flag is displayed) and whether any higher-order bits are set (so more
flags will follow). Kindly let me know if you know of a better approach.
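
For illustration only (this is not what the attached patch does), the same
list could also be derived from a bit-to-name mapping; the literal 44 below
merely stands in for S.param2:

    SELECT '{' ||
           array_to_string(ARRAY(
               SELECT f.name
               FROM (VALUES (4,   'immediate'),
                            (8,   'force'),
                            (16,  'flush-all'),
                            (32,  'wait'),
                            (128, 'wal'),
                            (256, 'time')) AS f(bit, name)
               WHERE (44 & f.bit) > 0
               ORDER BY f.bit), ', ')
           || '}' AS flags;   -- gives {immediate, force, wait}

In the view definition itself, S.param2 would simply replace the literal in
the correlated subquery.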

Thanks & Regards,
Nitin Jadhav

On Wed, Mar 9, 2022 at 7:07 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> On Tue, Mar 8, 2022 at 8:31 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > > > -[ RECORD 1 ]-----+-------------------------------------
> > > > pid               | 22043
> > > > type              | checkpoint
> > > > kind              | immediate force wait requested time
> > > >
> > > > I think the output in the kind column can be displayed as {immediate,
> > > > force, wait, requested, time}. By the way these are all checkpoint
> > > > flags so it is better to display it as checkpoint flags instead of
> > > > checkpoint kind as mentioned in one of my previous comments.
> > >
> > > I will update in the next patch.
> >
> > The current format matches with the server log message for the
> > checkpoint start in LogCheckpointStart(). Just to be consistent, I
> > have not changed the code.
> >
>
> See below, how flags are shown in other sql functions like:
>
> ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1);
>                 raw_flags                | combined_flags
> -----------------------------------------+----------------
>  {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
> (1 row)
>
> This looks more readable and it's easy to understand for the
> end-users.. Further comparing the way log messages are displayed with
> the way sql functions display its output doesn't look like a right
> comparison to me. Obviously both should show matching data but the way
> it is shown doesn't need to be the same. In fact it is not in most of
> the cases.
>
> > I have taken care of the rest of the comments in v5 patch for which
> > there was clarity.
> >
>
> Thank you very much. Will take a look at it later.
>
> --
> With Regards,
> Ashutosh Sharma.



> > I just wanted to avoid extra calculations just to show the progress in
> > the view. Since it's a good metric, I have added an additional field
> > named 'next_flags' to the view which holds all possible flag values of
> > the next checkpoint.
>
> I still don't think that's ok.  IIUC the only way to know if the current
> checkpoint is throttled or not is to be aware that the "next_flags" can apply
> to the current checkpoint too, look for it and see if that changes the
> semantics of what the view say the current checkpoint is.  Most users will get
> it wrong.
>
> Again I would just display a bool flag saying whether a new checkpoint has been
> explicitly requested or not, it seems enough.

Ok. I agree that it is difficult to interpret it correctly. So even if
we say that a new checkpoint has been explicitly requested, the user may
not understand that it affects the current checkpoint's behaviour unless
the user knows the internals of the checkpoint. How about naming the
field 'throttled' (Yes/No), since our objective is to show whether the
current checkpoint is throttled or not?

Thanks & Regards,
Nitin Jadhav

On Wed, Mar 9, 2022 at 7:48 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Tue, Mar 08, 2022 at 08:57:23PM +0530, Nitin Jadhav wrote:
> >
> > I just wanted to avoid extra calculations just to show the progress in
> > the view. Since it's a good metric, I have added an additional field
> > named 'next_flags' to the view which holds all possible flag values of
> > the next checkpoint.
>
> I still don't think that's ok.  IIUC the only way to know if the current
> checkpoint is throttled or not is to be aware that the "next_flags" can apply
> to the current checkpoint too, look for it and see if that changes the
> semantics of what the view say the current checkpoint is.  Most users will get
> it wrong.
>
> > This gives more information than just saying
> > whether the new checkpoint is requested or not with the same memory.
>
> So that next_flags will be empty most of the time?  It seems confusing.
>
> Again I would just display a bool flag saying whether a new checkpoint has been
> explicitly requested or not, it seems enough.
>
> If you're interested in that next checkpoint, you probably want a quick
> completion of the current checkpoint first (and thus need to know if it's
> throttled or not).  And then you will have to keep monitoring that view for the
> next checkpoint anyway, and at that point the view will show the relevant
> information.



On Fri, Mar 11, 2022 at 02:41:23PM +0530, Nitin Jadhav wrote:
>
> Ok. I agree that it is difficult to interpret it correctly. So even if
> say that a new checkpoint has been explicitly requested, the user may
> not understand that it affects current checkpoint behaviour unless the
> user knows the internals of the checkpoint. How about naming the field
> to 'throttled' (Yes/ No) since our objective is to show that the
> current checkpoint is throttled or not.

-1

That "throttled" flag should be the same as having or not a "force" in the
flags.  We should be consistent and report information the same way, so either
a lot of flags (is_throttled, is_force...) or as now a single field containing
the set flags, so the current approach seems better.  Also, it wouldn't be much
better to show the checkpoint as not having the force flags and still not being
throttled.

Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
view?

CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
to detect that someone wants a new checkpoint afterwards, whatever it is and
whether or not they want the current checkpoint to finish quickly.  For this
flag I think it's better to not report it in the view flags but with a new
field, as discussed before, since that's really what it means.

CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
progress checkpoint, so it can be simply added to the view flags.
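
Purely as an illustration of those two bits (the literal 68 below just stands
in for the absorbed ckpt_flags value, and the column names are placeholders):

    SELECT (p & 64) > 0 AS new_checkpoint_requested, -- CHECKPOINT_REQUESTED = 0x0040
           (p & 4)  > 0 AS immediate_requested       -- CHECKPOINT_IMMEDIATE = 0x0004
    FROM (VALUES (68)) AS t(p);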



> >
> > Ok. I agree that it is difficult to interpret it correctly. So even if
> > say that a new checkpoint has been explicitly requested, the user may
> > not understand that it affects current checkpoint behaviour unless the
> > user knows the internals of the checkpoint. How about naming the field
> > to 'throttled' (Yes/ No) since our objective is to show that the
> > current checkpoint is throttled or not.
>
> -1
>
> That "throttled" flag should be the same as having or not a "force" in the
> flags.  We should be consistent and report information the same way, so either
> a lot of flags (is_throttled, is_force...) or as now a single field containing
> the set flags, so the current approach seems better.  Also, it wouldn't be much
> better to show the checkpoint as not having the force flags and still not being
> throttled.

I think your understanding is wrong here. The flag which affects
throttling behaviour is CHECKPOINT_IMMEDIATE. I am not suggesting
removing the existing 'flags' field of pg_stat_progress_checkpoint
view and adding a new field 'throttled'. The content of the 'flags'
field remains the same. I was suggesting replacing the 'next_flags'
field with 'throttled' field since the new request with
CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.

> CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
> to detect that someone wants a new checkpoint afterwards, whatever it's and
> whether or not the current checkpoint to be finished quickly.  For this flag I
> think it's better to not report it in the view flags but with a new field, as
> discussed before, as it's really what it means.

I understand your suggestion of adding a new field to indicate whether
any new requests have been made or not. Do you want this field to
represent only a new request, or should it also indicate that the
current checkpoint is being asked to finish quickly?

> CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
> progress checkpoint, so it can be simply added to the view flags.

As discussed upthread, it is not advisable to do so. The content of
'flags' remains the same throughout the checkpoint. We cannot add a new
checkpoint's flag (CHECKPOINT_IMMEDIATE) to the current one even
though it affects the current checkpoint's behaviour. The only thing we
can do is add a new field to show that the current checkpoint is
affected by new requests.

> Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
> CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
> view?

Where do you want to add this in the path?

I feel the naming of the new field is the confusing part here.

'next_flags' - It shows all the flag values of the next checkpoint.
From this the user can tell that a new request has been made, and if
CHECKPOINT_IMMEDIATE is set there, it also indicates that the current
checkpoint gets affected. You are not OK with this name as it confuses
the user.

'throttled' - The value will be set to Yes/No based on whether the
CHECKPOINT_IMMEDIATE bit is set in the new checkpoint request's flags.
This says that the current checkpoint is affected, and I thought it is
also an indication that a new request has been made. But there is a
confusion here too: if the current checkpoint starts with
CHECKPOINT_IMMEDIATE (which is described by the 'flags' field) and
there is no new request, then the value of this field is 'Yes' even
though the checkpoint is not actually throttling, which again confuses
the user.

'new request' - The value will be set to Yes/No depending on whether
any new checkpoint request has arrived. This just indicates whether new
requests have been made or not; it cannot be used to infer any other
information.

Thoughts?

Thanks & Regards,
Nitin Jadhav

On Fri, Mar 11, 2022 at 3:34 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Fri, Mar 11, 2022 at 02:41:23PM +0530, Nitin Jadhav wrote:
> >
> > Ok. I agree that it is difficult to interpret it correctly. So even if
> > say that a new checkpoint has been explicitly requested, the user may
> > not understand that it affects current checkpoint behaviour unless the
> > user knows the internals of the checkpoint. How about naming the field
> > to 'throttled' (Yes/ No) since our objective is to show that the
> > current checkpoint is throttled or not.
>
> -1
>
> That "throttled" flag should be the same as having or not a "force" in the
> flags.  We should be consistent and report information the same way, so either
> a lot of flags (is_throttled, is_force...) or as now a single field containing
> the set flags, so the current approach seems better.  Also, it wouldn't be much
> better to show the checkpoint as not having the force flags and still not being
> throttled.
>
> Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
> CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
> view?
>
> CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
> to detect that someone wants a new checkpoint afterwards, whatever it's and
> whether or not the current checkpoint to be finished quickly.  For this flag I
> think it's better to not report it in the view flags but with a new field, as
> discussed before, as it's really what it means.
>
> CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
> progress checkpoint, so it can be simply added to the view flags.



On Fri, Mar 11, 2022 at 04:59:11PM +0530, Nitin Jadhav wrote:
> > That "throttled" flag should be the same as having or not a "force" in the
> > flags.  We should be consistent and report information the same way, so either
> > a lot of flags (is_throttled, is_force...) or as now a single field containing
> > the set flags, so the current approach seems better.  Also, it wouldn't be much
> > better to show the checkpoint as not having the force flags and still not being
> > throttled.
> 
> I think your understanding is wrong here. The flag which affects
> throttling behaviour is CHECKPOINT_IMMEDIATE.

Yes sorry, that's what I meant and later used in the flags.

> I am not suggesting
> removing the existing 'flags' field of pg_stat_progress_checkpoint
> view and adding a new field 'throttled'. The content of the 'flags'
> field remains the same. I was suggesting replacing the 'next_flags'
> field with 'throttled' field since the new request with
> CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.

Are you saying that this new throttled flag will only be set by the overloaded
flags in ckpt_flags?  So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
flags that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
that's not throttled?

> > CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
> > to detect that someone wants a new checkpoint afterwards, whatever it's and
> > whether or not the current checkpoint to be finished quickly.  For this flag I
> > think it's better to not report it in the view flags but with a new field, as
> > discussed before, as it's really what it means.
> 
> I understand your suggestion of adding a new field to indicate whether
> any of the new requests have been made or not. You just want this
> field to represent only a new request or does it also represent the
> current checkpoint to finish quickly.

Only represent what it means: a new checkpoint is requested.  An additional
CHECKPOINT_IMMEDIATE flag is orthogonal to this flag and this information.

> > CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
> > progress checkpoint, so it can be simply added to the view flags.
> 
> As discussed upthread this is not advisable to do so. The content of
> 'flags' remains the same through the checkpoint. We cannot add a new
> checkpoint's flag (CHECKPOINT_IMMEDIATE ) to the current one even
> though it affects current checkpoint behaviour. Only thing we can do
> is to add a new field to show that the current checkpoint is affected
> with new requests.

I don't get it.  The checkpoint flags and the view flags (set by
pgstat_progrss_update*) are different, so why can't we add this flag to the
view flags?  The fact that checkpointer.c doesn't update the passed flag and
instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
an implementation detail, and the view shouldn't focus on which flags were
initially passed to the checkpointer but instead which flags the checkpointer
is actually enforcing, as that's what the user should be interested in.  If you
want to store it in another field internally but display it in the view with
the rest of the flags, I'm fine with it.

> > Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
> > CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
> > view?
> 
> Where do you want to add this in the path?

Same as in your current patch I guess.



> > I am not suggesting
> > removing the existing 'flags' field of pg_stat_progress_checkpoint
> > view and adding a new field 'throttled'. The content of the 'flags'
> > field remains the same. I was suggesting replacing the 'next_flags'
> > field with 'throttled' field since the new request with
> > CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
>
> Are you saying that this new throttled flag will only be set by the overloaded
> flags in ckpt_flags?

Yes. you are right.

> So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
> flags that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
> that's not throttled?

I think it's the reverse: a checkpoint with the CHECKPOINT_IMMEDIATE
flag is not throttled (delays between writes are disabled), and a
checkpoint without the CHECKPOINT_IMMEDIATE flag is throttled (delays
between writes are enabled).

> > > CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
> > > to detect that someone wants a new checkpoint afterwards, whatever it's and
> > > whether or not the current checkpoint to be finished quickly.  For this flag I
> > > think it's better to not report it in the view flags but with a new field, as
> > > discussed before, as it's really what it means.
> >
> > I understand your suggestion of adding a new field to indicate whether
> > any of the new requests have been made or not. You just want this
> > field to represent only a new request or does it also represent the
> > current checkpoint to finish quickly.
>
> Only represent what it means: a new checkpoint is requested.  An additional
> CHECKPOINT_IMMEDIATE flag is orthogonal to this flag and this information.

Thanks for the confirmation.

> > > CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
> > > progress checkpoint, so it can be simply added to the view flags.
> >
> > As discussed upthread this is not advisable to do so. The content of
> > 'flags' remains the same through the checkpoint. We cannot add a new
> > checkpoint's flag (CHECKPOINT_IMMEDIATE ) to the current one even
> > though it affects current checkpoint behaviour. Only thing we can do
> > is to add a new field to show that the current checkpoint is affected
> > with new requests.
>
> I don't get it.  The checkpoint flags and the view flags (set by
> pgstat_progrss_update*) are different, so why can't we add this flag to the
> view flags?  The fact that checkpointer.c doesn't update the passed flag and
> instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> an implementation detail, and the view shouldn't focus on which flags were
> initially passed to the checkpointer but instead which flags the checkpointer
> is actually enforcing, as that's what the user should be interested in.  If you
> want to store it in another field internally but display it in the view with
> the rest of the flags, I'm fine with it.

To stay in sync with the way the code behaves, it is better not to fold
the next checkpoint request's CHECKPOINT_IMMEDIATE into the current
checkpoint's 'flags' field. The current checkpoint starts with a
different set of flags, and when there is a new request (with
CHECKPOINT_IMMEDIATE) it just processes the pending operations quickly
so it can take up the next request. If we add this information to the
'flags' field of the view, it claims that the current checkpoint was
started with CHECKPOINT_IMMEDIATE, which is not true. Hence I had
thought of adding a new field ('next flags' or 'upcoming flags') which
contains all the flag values of the new checkpoint requests. This field
indicates whether the current checkpoint is throttled or not, and it
also indicates that there are new requests. Please share your thoughts.

Thanks & Regards,
Nitin Jadhav



On Fri, Mar 11, 2022 at 5:43 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Fri, Mar 11, 2022 at 04:59:11PM +0530, Nitin Jadhav wrote:
> > > That "throttled" flag should be the same as having or not a "force" in the
> > > flags.  We should be consistent and report information the same way, so either
> > > a lot of flags (is_throttled, is_force...) or as now a single field containing
> > > the set flags, so the current approach seems better.  Also, it wouldn't be much
> > > better to show the checkpoint as not having the force flags and still not being
> > > throttled.
> >
> > I think your understanding is wrong here. The flag which affects
> > throttling behaviour is CHECKPOINT_IMMEDIATE.
>
> Yes sorry, that's what I meant and later used in the flags.
>
> > I am not suggesting
> > removing the existing 'flags' field of pg_stat_progress_checkpoint
> > view and adding a new field 'throttled'. The content of the 'flags'
> > field remains the same. I was suggesting replacing the 'next_flags'
> > field with 'throttled' field since the new request with
> > CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
>
> Are you saying that this new throttled flag will only be set by the overloaded
> flags in ckpt_flags?  So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
> flags that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
> that's not throttled?
>
> > > CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
> > > to detect that someone wants a new checkpoint afterwards, whatever it's and
> > > whether or not the current checkpoint to be finished quickly.  For this flag I
> > > think it's better to not report it in the view flags but with a new field, as
> > > discussed before, as it's really what it means.
> >
> > I understand your suggestion of adding a new field to indicate whether
> > any of the new requests have been made or not. You just want this
> > field to represent only a new request or does it also represent the
> > current checkpoint to finish quickly.
>
> Only represent what it means: a new checkpoint is requested.  An additional
> CHECKPOINT_IMMEDIATE flag is orthogonal to this flag and this information.
>
> > > CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
> > > progress checkpoint, so it can be simply added to the view flags.
> >
> > As discussed upthread this is not advisable to do so. The content of
> > 'flags' remains the same through the checkpoint. We cannot add a new
> > checkpoint's flag (CHECKPOINT_IMMEDIATE ) to the current one even
> > though it affects current checkpoint behaviour. Only thing we can do
> > is to add a new field to show that the current checkpoint is affected
> > with new requests.
>
> I don't get it.  The checkpoint flags and the view flags (set by
> pgstat_progrss_update*) are different, so why can't we add this flag to the
> view flags?  The fact that checkpointer.c doesn't update the passed flag and
> instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> an implementation detail, and the view shouldn't focus on which flags were
> initially passed to the checkpointer but instead which flags the checkpointer
> is actually enforcing, as that's what the user should be interested in.  If you
> want to store it in another field internally but display it in the view with
> the rest of the flags, I'm fine with it.
>
> > > Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
> > > CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
> > > view?
> >
> > Where do you want to add this in the path?
>
> Same as in your current patch I guess.



On Mon, Mar 14, 2022 at 03:16:50PM +0530, Nitin Jadhav wrote:
> > > I am not suggesting
> > > removing the existing 'flags' field of pg_stat_progress_checkpoint
> > > view and adding a new field 'throttled'. The content of the 'flags'
> > > field remains the same. I was suggesting replacing the 'next_flags'
> > > field with 'throttled' field since the new request with
> > > CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
> >
> > Are you saying that this new throttled flag will only be set by the overloaded
> > flags in ckpt_flags?
>
> Yes. you are right.
>
> > So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
> > flags that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
> > that's not throttled?
>
> I think it's the reverse. A checkpoint with a CHECKPOINT_IMMEDIATE
> flags that's not throttled (disables delays between writes) and  a
> checkpoint without the CHECKPOINT_IMMEDIATE flag that's throttled
> (enables delays between writes)

Yes that's how it's supposed to work, but my point was that your suggested
'throttled' flag could say the opposite, which is bad.

> > I don't get it.  The checkpoint flags and the view flags (set by
> > pgstat_progrss_update*) are different, so why can't we add this flag to the
> > view flags?  The fact that checkpointer.c doesn't update the passed flag and
> > instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> > an implementation detail, and the view shouldn't focus on which flags were
> > initially passed to the checkpointer but instead which flags the checkpointer
> > is actually enforcing, as that's what the user should be interested in.  If you
> > want to store it in another field internally but display it in the view with
> > the rest of the flags, I'm fine with it.
>
> Just to be in sync with the way code behaves, it is better not to
> update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
> current checkpoint 'flags' field. Because the current checkpoint
> starts with a different set of flags and when there is a new request
> (with CHECKPOINT_IMMEDIATE), it just processes the pending operations
> quickly to take up next requests. If we update this information in the
> 'flags' field of the view, it says that the current checkpoint is
> started with CHECKPOINT_IMMEDIATE which is not true.

Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
be able to display that a new checkpoint was requested) and
CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
throttled anymore if it were.

I still don't understand why you want so much to display "how the checkpoint
was initially started" rather than "how the checkpoint is really behaving right
now".  The whole point of having a progress view is to have something dynamic
that reflects the current activity.

> Hence I had
> thought of adding a new field ('next flags' or 'upcoming flags') which
> contain all the flag values of new checkpoint requests. This field
> indicates whether the current checkpoint is throttled or not and also
> it indicates there are new requests.

I'm not opposed to having such a field, I'm opposed to having a view with "the
current checkpoint is throttled but if there are some flags in the next
checkpoint flags and those flags contain checkpoint immediate then the current
checkpoint isn't actually throttled anymore" behavior.



> > > I don't get it.  The checkpoint flags and the view flags (set by
> > > pgstat_progrss_update*) are different, so why can't we add this flag to the
> > > view flags?  The fact that checkpointer.c doesn't update the passed flag and
> > > instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> > > an implementation detail, and the view shouldn't focus on which flags were
> > > initially passed to the checkpointer but instead which flags the checkpointer
> > > is actually enforcing, as that's what the user should be interested in.  If you
> > > want to store it in another field internally but display it in the view with
> > > the rest of the flags, I'm fine with it.
> >
> > Just to be in sync with the way code behaves, it is better not to
> > update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
> > current checkpoint 'flags' field. Because the current checkpoint
> > starts with a different set of flags and when there is a new request
> > (with CHECKPOINT_IMMEDIATE), it just processes the pending operations
> > quickly to take up next requests. If we update this information in the
> > 'flags' field of the view, it says that the current checkpoint is
> > started with CHECKPOINT_IMMEDIATE which is not true.
>
> Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
> be able to display that a new checkpoint was requested)

I will take care in the next patch.

> > Hence I had
> > thought of adding a new field ('next flags' or 'upcoming flags') which
> > contain all the flag values of new checkpoint requests. This field
> > indicates whether the current checkpoint is throttled or not and also
> > it indicates there are new requests.
>
> I'm not opposed to having such a field, I'm opposed to having a view with "the
> current checkpoint is throttled but if there are some flags in the next
> checkpoint flags and those flags contain checkpoint immediate then the current
> checkpoint isn't actually throttled anymore" behavior.

I understand your point and I also agree that it becomes difficult for
the user to understand the context.

> and
> CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
> throttled anymore if it were.
>
> I still don't understand why you want so much to display "how the checkpoint
> was initially started" rather than "how the checkpoint is really behaving right
> now".  The whole point of having a progress view is to have something dynamic
> that reflects the current activity.

As of now I will not consider adding this information to the view. If
required and nobody opposes having this included in the 'flags' field
of the view, then I will consider adding.

Thanks & Regards,
Nitin Jadhav

On Mon, Mar 14, 2022 at 5:16 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Mon, Mar 14, 2022 at 03:16:50PM +0530, Nitin Jadhav wrote:
> > > > I am not suggesting
> > > > removing the existing 'flags' field of pg_stat_progress_checkpoint
> > > > view and adding a new field 'throttled'. The content of the 'flags'
> > > > field remains the same. I was suggesting replacing the 'next_flags'
> > > > field with 'throttled' field since the new request with
> > > > CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
> > >
> > > Are you saying that this new throttled flag will only be set by the overloaded
> > > flags in ckpt_flags?
> >
> > Yes. you are right.
> >
> > > So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
> > > flags that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
> > > that's not throttled?
> >
> > I think it's the reverse. A checkpoint with a CHECKPOINT_IMMEDIATE
> > flags that's not throttled (disables delays between writes) and  a
> > checkpoint without the CHECKPOINT_IMMEDIATE flag that's throttled
> > (enables delays between writes)
>
> Yes that's how it's supposed to work, but my point was that your suggested
> 'throttled' flag could say the opposite, which is bad.
>
> > > I don't get it.  The checkpoint flags and the view flags (set by
> > > pgstat_progrss_update*) are different, so why can't we add this flag to the
> > > view flags?  The fact that checkpointer.c doesn't update the passed flag and
> > > instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> > > an implementation detail, and the view shouldn't focus on which flags were
> > > initially passed to the checkpointer but instead which flags the checkpointer
> > > is actually enforcing, as that's what the user should be interested in.  If you
> > > want to store it in another field internally but display it in the view with
> > > the rest of the flags, I'm fine with it.
> >
> > Just to be in sync with the way code behaves, it is better not to
> > update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
> > current checkpoint 'flags' field. Because the current checkpoint
> > starts with a different set of flags and when there is a new request
> > (with CHECKPOINT_IMMEDIATE), it just processes the pending operations
> > quickly to take up next requests. If we update this information in the
> > 'flags' field of the view, it says that the current checkpoint is
> > started with CHECKPOINT_IMMEDIATE which is not true.
>
> Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
> be able to display that a new checkpoint was requested) and
> CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
> throttled anymore if it were.
>
> I still don't understand why you want so much to display "how the checkpoint
> was initially started" rather than "how the checkpoint is really behaving right
> now".  The whole point of having a progress view is to have something dynamic
> that reflects the current activity.
>
> > Hence I had
> > thought of adding a new field ('next flags' or 'upcoming flags') which
> > contain all the flag values of new checkpoint requests. This field
> > indicates whether the current checkpoint is throttled or not and also
> > it indicates there are new requests.
>
> I'm not opposed to having such a field, I'm opposed to having a view with "the
> current checkpoint is throttled but if there are some flags in the next
> checkpoint flags and those flags contain checkpoint immediate then the current
> checkpoint isn't actually throttled anymore" behavior.



Hi,

This is a long thread, sorry for asking if this has been asked before.

On 2022-03-08 20:25:28 +0530, Nitin Jadhav wrote:
>       * Sort buffers that need to be written to reduce the likelihood of random
> @@ -2129,6 +2132,8 @@ BufferSync(int flags)
>          bufHdr = GetBufferDescriptor(buf_id);
>  
>          num_processed++;
> +        pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
> +                                     num_processed);
>  
>          /*
>           * We don't need to acquire the lock here, because we're only looking
> @@ -2149,6 +2154,8 @@ BufferSync(int flags)
>                  TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
>                  PendingCheckpointerStats.m_buf_written_checkpoints++;
>                  num_written++;
> +                pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
> +                                             num_written);
>              }
>          }

Have you measured the performance effects of this? On fast storage with large
shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
be good to verify that.


> @@ -1897,6 +1897,112 @@ pg_stat_progress_basebackup| SELECT s.pid,
>      s.param4 AS tablespaces_total,
>      s.param5 AS tablespaces_streamed
>    FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
 
> +pg_stat_progress_checkpoint| SELECT s.pid,
> +        CASE s.param1
> +            WHEN 1 THEN 'checkpoint'::text
> +            WHEN 2 THEN 'restartpoint'::text
> +            ELSE NULL::text
> +        END AS type,
> +    (((((((
> +        CASE
> +            WHEN ((s.param2 & (1)::bigint) > 0) THEN 'shutdown '::text
> +            ELSE ''::text
> +        END ||
> +        CASE
> +            WHEN ((s.param2 & (2)::bigint) > 0) THEN 'end-of-recovery '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param2 & (4)::bigint) > 0) THEN 'immediate '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param2 & (8)::bigint) > 0) THEN 'force '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param2 & (16)::bigint) > 0) THEN 'flush-all '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param2 & (32)::bigint) > 0) THEN 'wait '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param2 & (128)::bigint) > 0) THEN 'wal '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param2 & (256)::bigint) > 0) THEN 'time '::text
> +            ELSE ''::text
> +        END) AS flags,
> +    (((((((
> +        CASE
> +            WHEN ((s.param3 & (1)::bigint) > 0) THEN 'shutdown '::text
> +            ELSE ''::text
> +        END ||
> +        CASE
> +            WHEN ((s.param3 & (2)::bigint) > 0) THEN 'end-of-recovery '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param3 & (4)::bigint) > 0) THEN 'immediate '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param3 & (8)::bigint) > 0) THEN 'force '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param3 & (16)::bigint) > 0) THEN 'flush-all '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param3 & (32)::bigint) > 0) THEN 'wait '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param3 & (128)::bigint) > 0) THEN 'wal '::text
> +            ELSE ''::text
> +        END) ||
> +        CASE
> +            WHEN ((s.param3 & (256)::bigint) > 0) THEN 'time '::text
> +            ELSE ''::text
> +        END) AS next_flags,
> +    ('0/0'::pg_lsn + (
> +        CASE
> +            WHEN (s.param4 < 0) THEN pow((2)::numeric, (64)::numeric)
> +            ELSE (0)::numeric
> +        END + (s.param4)::numeric)) AS start_lsn,
> +    to_timestamp(((946684800)::double precision + ((s.param5)::double precision / (1000000)::double precision))) AS start_time,
> +        CASE s.param6
> +            WHEN 1 THEN 'initializing'::text
> +            WHEN 2 THEN 'getting virtual transaction IDs'::text
> +            WHEN 3 THEN 'checkpointing replication slots'::text
> +            WHEN 4 THEN 'checkpointing logical replication snapshot files'::text
> +            WHEN 5 THEN 'checkpointing logical rewrite mapping files'::text
> +            WHEN 6 THEN 'checkpointing replication origin'::text
> +            WHEN 7 THEN 'checkpointing commit log pages'::text
> +            WHEN 8 THEN 'checkpointing commit time stamp pages'::text
> +            WHEN 9 THEN 'checkpointing subtransaction pages'::text
> +            WHEN 10 THEN 'checkpointing multixact pages'::text
> +            WHEN 11 THEN 'checkpointing predicate lock pages'::text
> +            WHEN 12 THEN 'checkpointing buffers'::text
> +            WHEN 13 THEN 'processing file sync requests'::text
> +            WHEN 14 THEN 'performing two phase checkpoint'::text
> +            WHEN 15 THEN 'performing post checkpoint cleanup'::text
> +            WHEN 16 THEN 'invalidating replication slots'::text
> +            WHEN 17 THEN 'recycling old WAL files'::text
> +            WHEN 18 THEN 'truncating subtransactions'::text
> +            WHEN 19 THEN 'finalizing'::text
> +            ELSE NULL::text
> +        END AS phase,
> +    s.param7 AS buffers_total,
> +    s.param8 AS buffers_processed,
> +    s.param9 AS buffers_written,
> +    s.param10 AS files_total,
> +    s.param11 AS files_synced
> +   FROM pg_stat_get_progress_info('CHECKPOINT'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
 
>  pg_stat_progress_cluster| SELECT s.pid,
>      s.datid,
>      d.datname,

This view is depressingly complicated. Added up, the view definitions
for the already existing pg_stat_progress* views amount to a measurable
part of the size of an empty database:

postgres[1160866][1]=# SELECT sum(octet_length(ev_action)), SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE ev_class::regclass::text LIKE '%progress%';
 
┌───────┬───────┐
│  sum  │  sum  │
├───────┼───────┤
│ 97410 │ 19786 │
└───────┴───────┘
(1 row)

and this view looks to be a good bit more complicated than the existing
pg_stat_progress* views.

Indeed:
template1[1165473][1]=# SELECT ev_class::regclass, length(ev_action), pg_column_size(ev_action) FROM pg_rewrite WHERE ev_class::regclass::text LIKE '%progress%' ORDER BY length(ev_action) DESC;
 
┌───────────────────────────────┬────────┬────────────────┐
│           ev_class            │ length │ pg_column_size │
├───────────────────────────────┼────────┼────────────────┤
│ pg_stat_progress_checkpoint   │  43290 │           5409 │
│ pg_stat_progress_create_index │  23293 │           4177 │
│ pg_stat_progress_cluster      │  18390 │           3704 │
│ pg_stat_progress_analyze      │  16121 │           3339 │
│ pg_stat_progress_vacuum       │  16076 │           3392 │
│ pg_stat_progress_copy         │  15124 │           3080 │
│ pg_stat_progress_basebackup   │   8406 │           2094 │
└───────────────────────────────┴────────┴────────────────┘
(7 rows)

pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664


pg_rewrite is the second biggest relation in an empty database already...

template1[1164827][1]=# SELECT relname, pg_total_relation_size(oid) FROM pg_class WHERE relkind = 'r' ORDER BY 2 DESC LIMIT 5;
 
┌────────────────┬────────────────────────┐
│    relname     │ pg_total_relation_size │
├────────────────┼────────────────────────┤
│ pg_proc        │                1212416 │
│ pg_rewrite     │                 745472 │
│ pg_attribute   │                 704512 │
│ pg_description │                 630784 │
│ pg_collation   │                 409600 │
└────────────────┴────────────────────────┘
(5 rows)

Greetings,

Andres Freund



On Fri, Mar 18, 2022 at 05:15:56PM -0700, Andres Freund wrote:
> Have you measured the performance effects of this? On fast storage with large
> shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
> be good to verify that.

I am wondering if we could make the function inlined at some point.
We could also play it safe and only update the counters every N loops
instead.

> This view is depressingly complicated. Added up the view definitions for
> the already existing pg_stat_progress* views add up to a measurable part of
> the size of an empty database:

Yeah.  I think that what's proposed could be simplified, and we had
better remove the fields that are not that useful.  First, do we have
any need for next_flags?  Second, is the start LSN really necessary
for monitoring purposes?  Not all the information in the first
parameter is useful, as well.  For example "shutdown" will never be
seen as it is not possible to use a session at this stage, no?  There
is also no gain in having "immediate", "flush-all", "force" and "wait"
(for this one if the checkpoint is requested the session doing the
work knows this information already).

A last thing is that we may gain in visibility by having more
attributes as an effect of splitting param2.  One thing that would make
sense is to track the reason why the checkpoint was triggered
separately (aka wal and time).  Should we instead use a text[] to list
all the parameters?  Using a space-separated list of items is not
intuitive IMO, and callers of this routine will likely have to parse
it.
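
For illustration, here is a rough sketch (reusing the flag bit values
from the proposed view; not a worked-out definition) of what a text[]
flags column could look like:

-- Hypothetical sketch: report the set flags as text[] rather than a
-- space-separated string, so callers don't have to parse it.
SELECT s.pid,
       ARRAY(SELECT f.label
             FROM (VALUES (4::bigint,   'immediate'),
                          (8::bigint,   'force'),
                          (16::bigint,  'flush-all'),
                          (32::bigint,  'wait'),
                          (128::bigint, 'wal'),
                          (256::bigint, 'time')) AS f(flag_bit, label)
             WHERE (s.param2 & f.flag_bit) > 0) AS flags
  FROM pg_stat_get_progress_info('CHECKPOINT') AS s(pid, datid, relid,
       param1, param2, param3, param4, param5, param6, param7, param8,
       param9, param10, param11, param12, param13, param14, param15,
       param16, param17, param18, param19, param20);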

Shouldn't we also track the number of files flushed in each sub-step?
In some deployments you could have a large number of 2PC files and
such.  We may want more information on such matters.

+                      WHEN 3 THEN 'checkpointing replication slots'
+                      WHEN 4 THEN 'checkpointing logical replication snapshot files'
+                      WHEN 5 THEN 'checkpointing logical rewrite mapping files'
+                      WHEN 6 THEN 'checkpointing replication origin'
+                      WHEN 7 THEN 'checkpointing commit log pages'
+                      WHEN 8 THEN 'checkpointing commit time stamp pages'
There is a lot of "checkpointing" here.  All those terms could be
shorter without losing their meaning.

This patch still needs some work, so I am marking it as RwF for now.
--
Michael

Attachments

Size of pg_rewrite (Was: Report checkpoint progress with pg_stat_progress_checkpoint)

From
Matthias van de Meent
Date:
On Sat, 19 Mar 2022 at 01:15, Andres Freund <andres@anarazel.de> wrote:
> pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
>
> pg_rewrite is the second biggest relation in an empty database already...

Yeah, that's not great. Thanks for nerd-sniping me into looking into
how views and pg_rewrite rules work, that was very interesting and I
learned quite a lot.

# Immediate potential, limited to progress views

I noticed that the CASE-WHEN (used in translating the progress stage
index to stage names) in those progress reporting views can be more
efficiently described (although with slightly worse behaviour around
undefined values) using text array lookups (as attached). That
resulted in somewhat smaller rewrite entries for the progress views
(toast compression was good old pglz):

template1=# SELECT sum(octet_length(ev_action)),
SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE
ev_class::regclass::text LIKE '%progress%';

master:
  sum  |  sum
-------+-------
 97277 | 19956
patched:
  sum  |  sum
-------+-------
 77069 | 18417

So this seems like a nice improvement of 20% uncompressed / 7% compressed.

I tested various cases of phase number to text translations: `CASE ..
WHEN`; `(ARRAY[]::text[])[index]` and `('{}'::text[])[index]`. See
results below:

postgres=# create or replace view arrayliteral_view as select
(ARRAY['a','b','c','d','e','f']::text[])[index] as name from tst
s(index);
CREATE VIEW
postgres=# create or replace view stringcast_view as select
('{a,b,c,d,e,f}'::text[])[index] as name from tst s(index);
CREATE VIEW
postgres=# create or replace view split_stringcast_view as select
(('{a,b,' || 'c,d,e,f}')::text[])[index] as name from tst s(index);
CREATE VIEW
postgres=# create or replace view case_view as select case index when
0 then 'a' when 1 then 'b' when 2 then 'c' when 3 then 'd' when 4 then
'e' when 5 then 'f' end as name from tst s(index);
CREATE VIEW


postgres=# select ev_class::regclass::text,
octet_length(ev_action), pg_column_size(ev_action) from pg_rewrite
where ev_class in ('arrayliteral_view'::regclass::oid,
'case_view'::regclass::oid, 'split_stringcast_view'::regclass::oid,
'stringcast_view'::regclass::oid);
       ev_class        | octet_length | pg_column_size
-----------------------+--------------+----------------
 arrayliteral_view     |         3311 |           1322
 stringcast_view       |         2610 |           1257
 case_view             |         5170 |           1412
 split_stringcast_view |         2847 |           1350

It seems to me that we could consider replacing the CASE statements
with array literals and lookups if we really value our template
database size. But, as text literal concatenations don't seem to get
constant folded before storing them in the rules table, this rewrite
of the views would result in long lines in the system_views.sql file,
or we'd have to deal with the additional overhead of the append
operator and cast nodes.
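
Applied to the checkpoint patch, the phase mapping would become a
single subscript into an array literal (a sketch only, reusing the
phase numbering from the proposed view; PostgreSQL arrays are 1-based,
so the phase codes 1..19 index directly, and an out-of-range code
yields NULL much like the CASE's ELSE branch):

-- Hypothetical sketch of the array-literal form for the 'phase' column.
SELECT ('{initializing,
          "getting virtual transaction IDs",
          "checkpointing replication slots",
          "checkpointing logical replication snapshot files",
          "checkpointing logical rewrite mapping files",
          "checkpointing replication origin",
          "checkpointing commit log pages",
          "checkpointing commit time stamp pages",
          "checkpointing subtransaction pages",
          "checkpointing multixact pages",
          "checkpointing predicate lock pages",
          "checkpointing buffers",
          "processing file sync requests",
          "performing two phase checkpoint",
          "performing post checkpoint cleanup",
          "invalidating replication slots",
          "recycling old WAL files",
          "truncating subtransactions",
          finalizing}'::text[])[s.param6] AS phase
  FROM pg_stat_get_progress_info('CHECKPOINT') AS s(pid, datid, relid,
       param1, param2, param3, param4, param5, param6, param7, param8,
       param9, param10, param11, param12, param13, param14, param15,
       param16, param17, param18, param19, param20);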

# Future work; nodeToString / readNode, all rewrite rules

Additionally, we might want to consider other changes like default (or
empty value) elision in nodeToString, if that is considered a
reasonable option and if we really want to reduce the size of the
pg_rewrite table.

I think a lot of space can be recovered from that: A manual removal of
what seemed to be fields with default values (and the removal of all
query location related fields) in the current definition of
pg_stat_progress_create_index reduces its uncompressed size from
23226B raw and 4204B compressed to 13821B raw and 2784B compressed,
for an on-disk space saving of 33% for this view's ev_action.

Do note, however, that that would add significant branching in the
nodeToString and readNode code, which might slow down that code
significantly. I'm not planning on working on that; but in my opinion
that is a viable path to reducing the size of new database catalogs.


-Matthias

PS. attached patch is not to be considered complete - it is a minimal
example of the array literal form. It fails regression tests because I
didn't bother updating or including the regression tests on system
views.

Attachments

Re: Size of pg_rewrite (Was: Report checkpoint progress with pg_stat_progress_checkpoint)

From
Andres Freund
Date:
Hi,

On April 8, 2022 7:52:07 AM PDT, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>On Sat, 19 Mar 2022 at 01:15, Andres Freund <andres@anarazel.de> wrote:
>> pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
>>
>> pg_rewrite is the second biggest relation in an empty database already...
>
>Yeah, that's not great. Thanks for nerd-sniping me into looking into
>how views and pg_rewrite rules work, that was very interesting and I
>learned quite a lot.

Thanks for looking!


># Immediately potential, limited to progress views
>
>I noticed that the CASE-WHEN (used in translating progress stage index
>to stage names) in those progress reporting views can be more
>efficiently described (althoug with slightly worse behaviour around
>undefined values) using text array lookups (as attached). That
>resulted in somewhat smaller rewrite entries for the progress views
>(toast compression was good old pglz):
>
>template1=# SELECT sum(octet_length(ev_action)),
>SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE
>ev_class::regclass::text LIKE '%progress%';
>
>master:
>  sum  |  sum
>-------+-------
> 97277 | 19956
>patched:
>  sum  |  sum
>-------+-------
> 77069 | 18417
>
>So this seems like a nice improvement of 20% uncompressed / 7% compressed.
>
>I tested various cases of phase number to text translations: `CASE ..
>WHEN`; `(ARRAY[]::text[])[index]` and `('{}'::text[])[index]`. See
>results below:
>
>postgres=# create or replace view arrayliteral_view as select
>(ARRAY['a','b','c','d','e','f']::text[])[index] as name from tst
>s(index);
>CREATE INDEX
>postgres=# create or replace view stringcast_view as select
>('{a,b,c,d,e,f}'::text[])[index] as name from tst s(index);
>CREATE INDEX
>postgres=# create or replace view split_stringcast_view as select
>(('{a,b,' || 'c,d,e,f}')::text[])[index] as name from tst s(index);
>CREATE VIEW
>postgres=# create or replace view case_view as select case index when
>0 then 'a' when 1 then 'b' when 2 then 'c' when 3 then 'd' when 4 then
>'e' when 5 then 'f' end as name from tst s(index);
>CREATE INDEX
>
>
>postgres=# postgres=# select ev_class::regclass::text,
>octet_length(ev_action), pg_column_size(ev_action) from pg_rewrite
>where ev_class in ('arrayliteral_view'::regclass::oid,
>'case_view'::regclass::oid, 'split_stringcast_view'::regclass::oid,
>'stringcast_view'::regclass::oid);
>       ev_class        | octet_length | pg_column_size
>-----------------------+--------------+----------------
> arrayliteral_view     |         3311 |           1322
> stringcast_view       |         2610 |           1257
> case_view             |         5170 |           1412
> split_stringcast_view |         2847 |           1350
>
>It seems to me that we could consider replacing the CASE statements
>with array literals and lookups if we really value our template
>database size. But, as text literal concatenations don't seem to get
>constant folded before storing them in the rules table, this rewrite
>of the views would result in long lines in the system_views.sql file,
>or we'd have to deal with the additional overhead of the append
>operator and cast nodes.

My inclination is that the mapping functions should be C functions. There's really no point in doing it in SQL and it comes at a noticeable price. And, if done in C, we can fix mistakes in minor releases, which we can't in SQL.
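
To visualise that (the function names below are invented for
illustration and do not exist in core): if the mappings lived in C
functions, the view body would shrink to roughly this:

-- Hypothetical sketch: pg_checkpoint_flags_to_text() and
-- pg_checkpoint_phase_name() stand in for C-level mapping functions,
-- which could also be corrected in minor releases.
SELECT s.pid,
       pg_checkpoint_flags_to_text(s.param2) AS flags,
       pg_checkpoint_phase_name(s.param6)    AS phase,
       s.param7 AS buffers_total,
       s.param8 AS buffers_processed,
       s.param9 AS buffers_written
  FROM pg_stat_get_progress_info('CHECKPOINT') AS s(pid, datid, relid,
       param1, param2, param3, param4, param5, param6, param7, param8,
       param9, param10, param11, param12, param13, param14, param15,
       param16, param17, param18, param19, param20);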


># Future work; nodeToString / readNode, all rewrite rules
>
>Additionally, we might want to consider other changes like default (or
>empty value) elision in nodeToString, if that is considered a
>reasonable option and if we really want to reduce the size of the
>pg_rewrite table.
>
>I think a lot of space can be recovered from that: A manual removal of
>what seemed to be fields with default values (and the removal of all
>query location related fields) in the current definition of
>pg_stat_progress_create_index reduces its uncompressed size from
>23226B raw and 4204B compressed to 13821B raw and 2784B compressed,
>for an on-disk space saving of 33% for this view's ev_action.
>
>Do note, however, that that would add significant branching in the
>nodeToString and readNode code, which might slow down that code
>significantly. I'm not planning on working on that; but in my opinion
>that is a viable path to reducing the size of new database catalogs.

We should definitely be careful about that. I do agree that there's a lot of efficiency to be gained in the serialization format. Once we have the automatic node func generation in place, we could have one representation for human consumption, and one for density...

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Hi,

Here is the updated patch which fixes the previous comments discussed
in this thread. I am sorry for the long gap in the discussion. Kindly
let me know if I have missed any of the comments or anything new.

Thanks & Regards,
Nitin Jadhav

On Fri, Mar 18, 2022 at 4:52 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > > > I don't get it.  The checkpoint flags and the view flags (set by
> > > > pgstat_progrss_update*) are different, so why can't we add this flag to the
> > > > view flags?  The fact that checkpointer.c doesn't update the passed flag and
> > > > instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> > > > an implementation detail, and the view shouldn't focus on which flags were
> > > > initially passed to the checkpointer but instead which flags the checkpointer
> > > > is actually enforcing, as that's what the user should be interested in.  If you
> > > > want to store it in another field internally but display it in the view with
> > > > the rest of the flags, I'm fine with it.
> > >
> > > Just to be in sync with the way code behaves, it is better not to
> > > update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
> > > current checkpoint 'flags' field. Because the current checkpoint
> > > starts with a different set of flags and when there is a new request
> > > (with CHECKPOINT_IMMEDIATE), it just processes the pending operations
> > > quickly to take up next requests. If we update this information in the
> > > 'flags' field of the view, it says that the current checkpoint is
> > > started with CHECKPOINT_IMMEDIATE which is not true.
> >
> > Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
> > be able to display that a new checkpoint was requested)
>
> I will take care in the next patch.
>
> > > Hence I had
> > > thought of adding a new field ('next flags' or 'upcoming flags') which
> > > contain all the flag values of new checkpoint requests. This field
> > > indicates whether the current checkpoint is throttled or not and also
> > > it indicates there are new requests.
> >
> > I'm not opposed to having such a field, I'm opposed to having a view with "the
> > current checkpoint is throttled but if there are some flags in the next
> > checkpoint flags and those flags contain checkpoint immediate then the current
> > checkpoint isn't actually throttled anymore" behavior.
>
> I understand your point and I also agree that it becomes difficult for
> the user to understand the context.
>
> > and
> > CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
> > throttled anymore if it were.
> >
> > I still don't understand why you want so much to display "how the checkpoint
> > was initially started" rather than "how the checkpoint is really behaving right
> > now".  The whole point of having a progress view is to have something dynamic
> > that reflects the current activity.
>
> As of now I will not consider adding this information to the view. If
> required and nobody opposes having this included in the 'flags' field
> of the view, then I will consider adding.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Mon, Mar 14, 2022 at 5:16 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > On Mon, Mar 14, 2022 at 03:16:50PM +0530, Nitin Jadhav wrote:
> > > > > I am not suggesting
> > > > > removing the existing 'flags' field of pg_stat_progress_checkpoint
> > > > > view and adding a new field 'throttled'. The content of the 'flags'
> > > > > field remains the same. I was suggesting replacing the 'next_flags'
> > > > > field with 'throttled' field since the new request with
> > > > > CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
> > > >
> > > > Are you saying that this new throttled flag will only be set by the overloaded
> > > > flags in ckpt_flags?
> > >
> > > Yes. you are right.
> > >
> > > > So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
> > > > flags that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
> > > > that's not throttled?
> > >
> > > I think it's the reverse. A checkpoint with a CHECKPOINT_IMMEDIATE
> > > flags that's not throttled (disables delays between writes) and  a
> > > checkpoint without the CHECKPOINT_IMMEDIATE flag that's throttled
> > > (enables delays between writes)
> >
> > Yes that's how it's supposed to work, but my point was that your suggested
> > 'throttled' flag could say the opposite, which is bad.
> >
> > > > I don't get it.  The checkpoint flags and the view flags (set by
> > > > pgstat_progrss_update*) are different, so why can't we add this flag to the
> > > > view flags?  The fact that checkpointer.c doesn't update the passed flag and
> > > > instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> > > > an implementation detail, and the view shouldn't focus on which flags were
> > > > initially passed to the checkpointer but instead which flags the checkpointer
> > > > is actually enforcing, as that's what the user should be interested in.  If you
> > > > want to store it in another field internally but display it in the view with
> > > > the rest of the flags, I'm fine with it.
> > >
> > > Just to be in sync with the way code behaves, it is better not to
> > > update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
> > > current checkpoint 'flags' field. Because the current checkpoint
> > > starts with a different set of flags and when there is a new request
> > > (with CHECKPOINT_IMMEDIATE), it just processes the pending operations
> > > quickly to take up next requests. If we update this information in the
> > > 'flags' field of the view, it says that the current checkpoint is
> > > started with CHECKPOINT_IMMEDIATE which is not true.
> >
> > Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
> > be able to display that a new checkpoint was requested) and
> > CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
> > throttled anymore if it were.
> >
> > I still don't understand why you want so much to display "how the checkpoint
> > was initially started" rather than "how the checkpoint is really behaving right
> > now".  The whole point of having a progress view is to have something dynamic
> > that reflects the current activity.
> >
> > > Hence I had
> > > thought of adding a new field ('next flags' or 'upcoming flags') which
> > > contain all the flag values of new checkpoint requests. This field
> > > indicates whether the current checkpoint is throttled or not and also
> > > it indicates there are new requests.
> >
> > I'm not opposed to having such a field, I'm opposed to having a view with "the
> > current checkpoint is throttled but if there are some flags in the next
> > checkpoint flags and those flags contain checkpoint immediate then the current
> > checkpoint isn't actually throttled anymore" behavior.

Attachments
> Have you measured the performance effects of this? On fast storage with large
> shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
> be good to verify that.

To understand the performance effects of the above, I have taken the
average of five checkpoints with the patch and without the patch in my
environment. Here are the results.
With patch: 269.65 s
Without patch: 269.60 s

It looks fine. Please share your views.

> This view is depressingly complicated. Added up the view definitions for
> the already existing pg_stat_progress* views add up to a measurable part of
> the size of an empty database:

Thank you so much for sharing the detailed analysis. We can remove a
few fields which are not so important to make it simple.

Thanks & Regards,
Nitin Jadhav

On Sat, Mar 19, 2022 at 5:45 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> This is a long thread, sorry for asking if this has been asked before.
>
> On 2022-03-08 20:25:28 +0530, Nitin Jadhav wrote:
> >        * Sort buffers that need to be written to reduce the likelihood of random
> > @@ -2129,6 +2132,8 @@ BufferSync(int flags)
> >               bufHdr = GetBufferDescriptor(buf_id);
> >
> >               num_processed++;
> > +             pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
> > +                                                                      num_processed);
> >
> >               /*
> >                * We don't need to acquire the lock here, because we're only looking
> > @@ -2149,6 +2154,8 @@ BufferSync(int flags)
> >                               TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
> >                               PendingCheckpointerStats.m_buf_written_checkpoints++;
> >                               num_written++;
> > +                             pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
> > +                                                                                      num_written);
> >                       }
> >               }
>
> Have you measured the performance effects of this? On fast storage with large
> shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
> be good to verify that.
>
>
> > @@ -1897,6 +1897,112 @@ pg_stat_progress_basebackup| SELECT s.pid,
> >      s.param4 AS tablespaces_total,
> >      s.param5 AS tablespaces_streamed
> >    FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
> > +pg_stat_progress_checkpoint| SELECT s.pid,
> > +        CASE s.param1
> > +            WHEN 1 THEN 'checkpoint'::text
> > +            WHEN 2 THEN 'restartpoint'::text
> > +            ELSE NULL::text
> > +        END AS type,
> > +    (((((((
> > +        CASE
> > +            WHEN ((s.param2 & (1)::bigint) > 0) THEN 'shutdown '::text
> > +            ELSE ''::text
> > +        END ||
> > +        CASE
> > +            WHEN ((s.param2 & (2)::bigint) > 0) THEN 'end-of-recovery '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param2 & (4)::bigint) > 0) THEN 'immediate '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param2 & (8)::bigint) > 0) THEN 'force '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param2 & (16)::bigint) > 0) THEN 'flush-all '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param2 & (32)::bigint) > 0) THEN 'wait '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param2 & (128)::bigint) > 0) THEN 'wal '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param2 & (256)::bigint) > 0) THEN 'time '::text
> > +            ELSE ''::text
> > +        END) AS flags,
> > +    (((((((
> > +        CASE
> > +            WHEN ((s.param3 & (1)::bigint) > 0) THEN 'shutdown '::text
> > +            ELSE ''::text
> > +        END ||
> > +        CASE
> > +            WHEN ((s.param3 & (2)::bigint) > 0) THEN 'end-of-recovery '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param3 & (4)::bigint) > 0) THEN 'immediate '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param3 & (8)::bigint) > 0) THEN 'force '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param3 & (16)::bigint) > 0) THEN 'flush-all '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param3 & (32)::bigint) > 0) THEN 'wait '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param3 & (128)::bigint) > 0) THEN 'wal '::text
> > +            ELSE ''::text
> > +        END) ||
> > +        CASE
> > +            WHEN ((s.param3 & (256)::bigint) > 0) THEN 'time '::text
> > +            ELSE ''::text
> > +        END) AS next_flags,
> > +    ('0/0'::pg_lsn + (
> > +        CASE
> > +            WHEN (s.param4 < 0) THEN pow((2)::numeric, (64)::numeric)
> > +            ELSE (0)::numeric
> > +        END + (s.param4)::numeric)) AS start_lsn,
> > +    to_timestamp(((946684800)::double precision + ((s.param5)::double precision / (1000000)::double precision))) AS start_time,
> > +        CASE s.param6
> > +            WHEN 1 THEN 'initializing'::text
> > +            WHEN 2 THEN 'getting virtual transaction IDs'::text
> > +            WHEN 3 THEN 'checkpointing replication slots'::text
> > +            WHEN 4 THEN 'checkpointing logical replication snapshot files'::text
> > +            WHEN 5 THEN 'checkpointing logical rewrite mapping files'::text
> > +            WHEN 6 THEN 'checkpointing replication origin'::text
> > +            WHEN 7 THEN 'checkpointing commit log pages'::text
> > +            WHEN 8 THEN 'checkpointing commit time stamp pages'::text
> > +            WHEN 9 THEN 'checkpointing subtransaction pages'::text
> > +            WHEN 10 THEN 'checkpointing multixact pages'::text
> > +            WHEN 11 THEN 'checkpointing predicate lock pages'::text
> > +            WHEN 12 THEN 'checkpointing buffers'::text
> > +            WHEN 13 THEN 'processing file sync requests'::text
> > +            WHEN 14 THEN 'performing two phase checkpoint'::text
> > +            WHEN 15 THEN 'performing post checkpoint cleanup'::text
> > +            WHEN 16 THEN 'invalidating replication slots'::text
> > +            WHEN 17 THEN 'recycling old WAL files'::text
> > +            WHEN 18 THEN 'truncating subtransactions'::text
> > +            WHEN 19 THEN 'finalizing'::text
> > +            ELSE NULL::text
> > +        END AS phase,
> > +    s.param7 AS buffers_total,
> > +    s.param8 AS buffers_processed,
> > +    s.param9 AS buffers_written,
> > +    s.param10 AS files_total,
> > +    s.param11 AS files_synced
> > +   FROM pg_stat_get_progress_info('CHECKPOINT'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
> >  pg_stat_progress_cluster| SELECT s.pid,
> >      s.datid,
> >      d.datname,
>
> This view is depressingly complicated. Added up the view definitions for
> the already existing pg_stat_progress* views add up to a measurable part of
> the size of an empty database:
>
> postgres[1160866][1]=# SELECT sum(octet_length(ev_action)), SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE ev_class::regclass::text LIKE '%progress%';
> ┌───────┬───────┐
> │  sum  │  sum  │
> ├───────┼───────┤
> │ 97410 │ 19786 │
> └───────┴───────┘
> (1 row)
>
> and this view looks to be a good bit more complicated than the existing
> pg_stat_progress* views.
>
> Indeed:
> template1[1165473][1]=# SELECT ev_class::regclass, length(ev_action), pg_column_size(ev_action) FROM pg_rewrite WHERE ev_class::regclass::text LIKE '%progress%' ORDER BY length(ev_action) DESC;
> ┌───────────────────────────────┬────────┬────────────────┐
> │           ev_class            │ length │ pg_column_size │
> ├───────────────────────────────┼────────┼────────────────┤
> │ pg_stat_progress_checkpoint   │  43290 │           5409 │
> │ pg_stat_progress_create_index │  23293 │           4177 │
> │ pg_stat_progress_cluster      │  18390 │           3704 │
> │ pg_stat_progress_analyze      │  16121 │           3339 │
> │ pg_stat_progress_vacuum       │  16076 │           3392 │
> │ pg_stat_progress_copy         │  15124 │           3080 │
> │ pg_stat_progress_basebackup   │   8406 │           2094 │
> └───────────────────────────────┴────────┴────────────────┘
> (7 rows)
>
> pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
>
>
> pg_rewrite is the second biggest relation in an empty database already...
>
> template1[1164827][1]=# SELECT relname, pg_total_relation_size(oid) FROM pg_class WHERE relkind = 'r' ORDER BY 2 DESC LIMIT 5;
> ┌────────────────┬────────────────────────┐
> │    relname     │ pg_total_relation_size │
> ├────────────────┼────────────────────────┤
> │ pg_proc        │                1212416 │
> │ pg_rewrite     │                 745472 │
> │ pg_attribute   │                 704512 │
> │ pg_description │                 630784 │
> │ pg_collation   │                 409600 │
> └────────────────┴────────────────────────┘
> (5 rows)
>
> Greetings,
>
> Andres Freund



> > Have you measured the performance effects of this? On fast storage with large
> > shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
> > be good to verify that.
>
> I am wondering if we could make the function inlined at some point.
> We could also play it safe and only update the counters every N loops
> instead.

The idea looks good, but based on the performance numbers shared above,
it does not affect performance. So we can keep the current approach, as
it gives more accurate progress.
---

> > This view is depressingly complicated. Added together, the view definitions
> > for the already existing pg_stat_progress* views amount to a measurable part
> > of the size of an empty database:
>
> Yeah.  I think that what's proposed could be simplified, and we had
> better remove the fields that are not that useful.  First, do we have
> any need for next_flags?

"next_flags" is removed in the v6 patch. Added a "new_requests" field
to get to know whether the current checkpoint is being throttled or
not. Please share your views on this.
---

> Second, is the start LSN really necessary
> for monitoring purposes?

IMO, the start LSN is necessary for debugging when a checkpoint takes longer than expected.
---

> Not all the information in the first
> parameter is useful, as well.  For example "shutdown" will never be
> seen as it is not possible to use a session at this stage, no?

I understand that "shutdown" and "end-of-recovery" will never be seen
and I have removed it in the v6 patch.
---

> There
> is also no gain in having "immediate", "flush-all", "force" and "wait"
> (for this one if the checkpoint is requested the session doing the
> work knows this information already).

"immediate" is required to understand whether the current checkpoint
is throttled or not. I am not sure about other flags "flush-all",
"force" and "wait". I have just supported all the flags to match the
'checkpoint start' log message. Please share your views. If it is not
really required, I will remove it in the next patch.
---

> A last thing is that we may gain in visibility by having more
> attributes as an effect of splitting param2.  One thing that would make
> sense is to track the reason why the checkpoint was triggered
> separately (aka wal and time).  Should we use a text[] instead to list
> all the parameters?  Using a space-separated list of items is
> not intuitive IMO, and callers of this routine will likely parse
> that.

If I understand the above comment correctly, you are saying to
introduce a new field, say "reason" (possible values are either wal
or time), while the "flags" field continues to represent the other
flags like "immediate", etc. The idea looks good. We can introduce
the new "reason" field, and the "flags" field can be renamed to
"throttled" (true/false) if we decide not to support the other flags
"flush-all", "force" and "wait".
---

> +                      WHEN 3 THEN 'checkpointing replication slots'
> +                      WHEN 4 THEN 'checkpointing logical replication snapshot files'
> +                      WHEN 5 THEN 'checkpointing logical rewrite mapping files'
> +                      WHEN 6 THEN 'checkpointing replication origin'
> +                      WHEN 7 THEN 'checkpointing commit log pages'
> +                      WHEN 8 THEN 'checkpointing commit time stamp pages'
> There is a lot of "checkpointing" here.  All those terms could be
> shorter without losing their meaning.

I will try to make them shorter in the next patch.
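
For illustration, the shorter labels could simply drop the repeated
"checkpointing" prefix. A sketch covering only the phases quoted above
(parameter numbering as in the view definition quoted earlier; the
'CHECKPOINT' progress command exists only with the patch applied):

    -- Hypothetical shortened phase labels for a subset of the phases.
    SELECT CASE s.param6
               WHEN 3 THEN 'replication slots'
               WHEN 4 THEN 'logical replication snapshot files'
               WHEN 5 THEN 'logical rewrite mapping files'
               WHEN 6 THEN 'replication origin'
               WHEN 7 THEN 'commit log (CLOG) pages'
               WHEN 8 THEN 'commit timestamp pages'
           END AS phase
    FROM pg_stat_get_progress_info('CHECKPOINT') s;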
---

Please share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Tue, Apr 5, 2022 at 3:15 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Mar 18, 2022 at 05:15:56PM -0700, Andres Freund wrote:
> > Have you measured the performance effects of this? On fast storage with large
> > shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
> > be good to verify that.
>
> I am wondering if we could make the function inlined at some point.
> We could also play it safe and only update the counters every N loops
> instead.
>
> > This view is depressingly complicated. Added together, the view definitions
> > for the already existing pg_stat_progress* views amount to a measurable part
> > of the size of an empty database:
>
> Yeah.  I think that what's proposed could be simplified, and we had
> better remove the fields that are not that useful.  First, do we have
> any need for next_flags?  Second, is the start LSN really necessary
> for monitoring purposes?  Not all the information in the first
> parameter is useful, as well.  For example "shutdown" will never be
> seen as it is not possible to use a session at this stage, no?  There
> is also no gain in having "immediate", "flush-all", "force" and "wait"
> (for this one if the checkpoint is requested the session doing the
> work knows this information already).
>
> A last thing is that we may gain in visibility by having more
> attributes as an effect of splitting param2.  One thing that would make
> sense is to track the reason why the checkpoint was triggered
> separately (aka wal and time).  Should we use a text[] instead to list
> all the parameters?  Using a space-separated list of items is
> not intuitive IMO, and callers of this routine will likely parse
> that.
>
> Shouldn't we also track the number of files flushed in each sub-step?
> In some deployments you could have a large number of 2PC files and
> such.  We may want more information on such matters.
>
> +                      WHEN 3 THEN 'checkpointing replication slots'
> +                      WHEN 4 THEN 'checkpointing logical replication snapshot files'
> +                      WHEN 5 THEN 'checkpointing logical rewrite mapping files'
> +                      WHEN 6 THEN 'checkpointing replication origin'
> +                      WHEN 7 THEN 'checkpointing commit log pages'
> +                      WHEN 8 THEN 'checkpointing commit time stamp pages'
> There is a lot of "checkpointing" here.  All those terms could be
> shorter without losing their meaning.
>
> This patch still needs some work, so I am marking it as RwF for now.
> --
> Michael



Hi,

On 2022-06-13 19:08:35 +0530, Nitin Jadhav wrote:
> > Have you measured the performance effects of this? On fast storage with large
> > shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
> > be good to verify that.
> 
> To understand the performance effects of the above, I have taken the
> average of five checkpoints with the patch and without the patch in my
> environment. Here are the results.
> With patch: 269.65 s
> Without patch: 269.60 s

Those look like timed checkpoints - if the checkpoints are sleeping a
part of the time, you're not going to see any potential overhead.

To see whether this has an effect you'd have to make sure there's a
certain number of dirty buffers (e.g. by doing CREATE TABLE AS
some_query) and then do a manual checkpoint and time how long that
takes.
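
For reference, a minimal sketch of that procedure in psql (table name, row
count and filler column are arbitrary placeholders):

    -- Dirty a predictable number of buffers, then time a manual checkpoint.
    \timing on
    CREATE TABLE ckpt_bench AS
        SELECT g AS id, repeat('x', 100) AS filler
        FROM generate_series(1, 8000000) AS g;
    CHECKPOINT;
    \timing off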

Greetings,

Andres Freund



> > To understand the performance effects of the above, I have taken the
> > average of five checkpoints with the patch and without the patch in my
> > environment. Here are the results.
> > With patch: 269.65 s
> > Without patch: 269.60 s
>
> Those look like timed checkpoints - if the checkpoints are sleeping a
> part of the time, you're not going to see any potential overhead.

Yes. The above data is collected from timed checkpoints.

create table t1(a int);
insert into t1 select * from generate_series(1,10000000);

I generated a lot of data by using the above queries which would in
turn trigger the checkpoint (wal).
---

> To see whether this has an effect you'd have to make sure there's a
> certain number of dirty buffers (e.g. by doing CREATE TABLE AS
> some_query) and then do a manual checkpoint and time how long that
> times.

For this case I have generated data using the queries below.

create table t1(a int);
insert into t1 select * from generate_series(1,8000000);

This does not trigger the checkpoint automatically. I have issued the
CHECKPOINT manually and measured the performance by considering an
average of 5 checkpoints. Here are the details.

With patch: 2.457 s
Without patch: 2.334 s

Please share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Thu, Jul 7, 2022 at 5:34 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-06-13 19:08:35 +0530, Nitin Jadhav wrote:
> > > Have you measured the performance effects of this? On fast storage with large
> > > shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
> > > be good to verify that.
> >
> > To understand the performance effects of the above, I have taken the
> > average of five checkpoints with the patch and without the patch in my
> > environment. Here are the results.
> > With patch: 269.65 s
> > Without patch: 269.60 s
>
> Those look like timed checkpoints - if the checkpoints are sleeping a
> part of the time, you're not going to see any potential overhead.
>
> To see whether this has an effect you'd have to make sure there's a
> certain number of dirty buffers (e.g. by doing CREATE TABLE AS
> some_query) and then do a manual checkpoint and time how long that
> times.
>
> Greetings,
>
> Andres Freund



Hi,

On 7/28/22 11:38 AM, Nitin Jadhav wrote:
>>> To understand the performance effects of the above, I have taken the
>>> average of five checkpoints with the patch and without the patch in my
>>> environment. Here are the results.
>>> With patch: 269.65 s
>>> Without patch: 269.60 s
>>
>> Those look like timed checkpoints - if the checkpoints are sleeping a
>> part of the time, you're not going to see any potential overhead.
> 
> Yes. The above data is collected from timed checkpoints.
> 
> create table t1(a int);
> insert into t1 select * from generate_series(1,10000000);
> 
> I generated a lot of data by using the above queries which would in
> turn trigger the checkpoint (wal).
> ---
> 
>> To see whether this has an effect you'd have to make sure there's a
>> certain number of dirty buffers (e.g. by doing CREATE TABLE AS
>> some_query) and then do a manual checkpoint and time how long that
>> times.
> 
> For this case I have generated data by using below queries.
> 
> create table t1(a int);
> insert into t1 select * from generate_series(1,8000000);
> 
> This does not trigger the checkpoint automatically. I have issued the
> CHECKPOINT manually and measured the performance by considering an
> average of 5 checkpoints. Here are the details.
> 
> With patch: 2.457 s
> Without patch: 2.334 s
> 
> Please share your thoughts.
> 

v6 was not applying anymore, due to a change in 
doc/src/sgml/ref/checkpoint.sgml done by b9eb0ff09e (Rename 
pg_checkpointer predefined role to pg_checkpoint).

Please find attached a rebase in v7.

While working on this rebase, I also noticed that "pg_checkpointer" is 
still mentioned in some translation files:
"
$ git grep pg_checkpointer
src/backend/po/de.po:msgid "must be superuser or have privileges of 
pg_checkpointer to do CHECKPOINT"
src/backend/po/ja.po:msgid "must be superuser or have privileges of 
pg_checkpointer to do CHECKPOINT"
src/backend/po/ja.po:msgstr 
"CHECKPOINTを実行するにはスーパーユーザーであるか、またはpg_checkpointerの権限を持つ必要があります"
src/backend/po/sv.po:msgid "must be superuser or have privileges of 
pg_checkpointer to do CHECKPOINT"
"

I'm not familiar with how the translation files are handled (they seem to 
have their own set of commits, see 3c0bcdbc66 for example), but I wanted to 
mention that "pg_checkpointer" still appears there. That may be expected, 
since the last commit related to translation files (3c0bcdbc66) is older 
than the one that renamed pg_checkpointer to pg_checkpoint (b9eb0ff09e).

That said, back to this patch: I did not look closely but noticed that 
the buffers_total reported by pg_stat_progress_checkpoint:

postgres=# select type,flags,start_lsn,phase,buffers_total,new_requests from pg_stat_progress_checkpoint;
    type    |         flags         | start_lsn  |         phase         | buffers_total | new_requests
------------+-----------------------+------------+-----------------------+---------------+--------------
 checkpoint | immediate force wait  | 1/E6C523A8 | checkpointing buffers |       1024275 | false
(1 row)

is a little bit different from what is logged once completed:

2022-11-04 08:18:50.806 UTC [3488442] LOG:  checkpoint complete: wrote 1024278 buffers (97.7%);

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments
> v6 was not applying anymore, due to a change in
> doc/src/sgml/ref/checkpoint.sgml done by b9eb0ff09e (Rename
> pg_checkpointer predefined role to pg_checkpoint).
>
> Please find attached a rebase in v7.
>
> While working on this rebase, I also noticed that "pg_checkpointer" is
> still mentioned in some translation files:

Thanks for rebasing the patch and sharing the information.
---

> That said, back to this patch: I did not look closely but noticed that
> the buffers_total reported by pg_stat_progress_checkpoint:
>
> postgres=# select type,flags,start_lsn,phase,buffers_total,new_requests from pg_stat_progress_checkpoint;
>     type    |         flags         | start_lsn  |         phase         | buffers_total | new_requests
> ------------+-----------------------+------------+-----------------------+---------------+--------------
>  checkpoint | immediate force wait  | 1/E6C523A8 | checkpointing buffers |       1024275 | false
> (1 row)
>
> is a little bit different from what is logged once completed:
>
> 2022-11-04 08:18:50.806 UTC [3488442] LOG:  checkpoint complete: wrote 1024278 buffers (97.7%);

This is because the count shown in the checkpoint complete message
includes the additional increment done during SlruInternalWritePage().
We cannot know about that increment until it actually happens, hence it
was not considered in the patch. To make the view consistent with the
checkpoint complete message, we would have to increment all three
counters here: buffers_total, buffers_processed and buffers_written.
That means the total number of buffers calculated earlier may not
always stay the same. If this looks good, I will update it in the next
patch.

Thanks & Regards,
Nitin Jadhav

On Fri, Nov 4, 2022 at 1:57 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On 7/28/22 11:38 AM, Nitin Jadhav wrote:
> >>> To understand the performance effects of the above, I have taken the
> >>> average of five checkpoints with the patch and without the patch in my
> >>> environment. Here are the results.
> >>> With patch: 269.65 s
> >>> Without patch: 269.60 s
> >>
> >> Those look like timed checkpoints - if the checkpoints are sleeping a
> >> part of the time, you're not going to see any potential overhead.
> >
> > Yes. The above data is collected from timed checkpoints.
> >
> > create table t1(a int);
> > insert into t1 select * from generate_series(1,10000000);
> >
> > I generated a lot of data by using the above queries which would in
> > turn trigger the checkpoint (wal).
> > ---
> >
> >> To see whether this has an effect you'd have to make sure there's a
> >> certain number of dirty buffers (e.g. by doing CREATE TABLE AS
> >> some_query) and then do a manual checkpoint and time how long that
> >> times.
> >
> > For this case I have generated data by using below queries.
> >
> > create table t1(a int);
> > insert into t1 select * from generate_series(1,8000000);
> >
> > This does not trigger the checkpoint automatically. I have issued the
> > CHECKPOINT manually and measured the performance by considering an
> > average of 5 checkpoints. Here are the details.
> >
> > With patch: 2.457 s
> > Without patch: 2.334 s
> >
> > Please share your thoughts.
> >
>
> v6 was not applying anymore, due to a change in
> doc/src/sgml/ref/checkpoint.sgml done by b9eb0ff09e (Rename
> pg_checkpointer predefined role to pg_checkpoint).
>
> Please find attached a rebase in v7.
>
> While working on this rebase, I also noticed that "pg_checkpointer" is
> still mentioned in some translation files:
> "
> $ git grep pg_checkpointer
> src/backend/po/de.po:msgid "must be superuser or have privileges of
> pg_checkpointer to do CHECKPOINT"
> src/backend/po/ja.po:msgid "must be superuser or have privileges of
> pg_checkpointer to do CHECKPOINT"
> src/backend/po/ja.po:msgstr
> "CHECKPOINTを実行するにはスーパーユーザーであるか、またはpg_checkpointerの権限を持つ必要があります"
> src/backend/po/sv.po:msgid "must be superuser or have privileges of
> pg_checkpointer to do CHECKPOINT"
> "
>
> I'm not familiar with how the translation files are handled (looks like
> they have their own set of commits, see 3c0bcdbc66 for example) but
> wanted to mention that "pg_checkpointer" is still mentioned (even if
> that may be expected as the last commit related to translation files
> (aka 3c0bcdbc66) is older than the one that renamed pg_checkpointer to
> pg_checkpoint (aka b9eb0ff09e)).
>
> That said, back to this patch: I did not look closely but noticed that
> the buffers_total reported by pg_stat_progress_checkpoint:
>
> postgres=# select type,flags,start_lsn,phase,buffers_total,new_requests from pg_stat_progress_checkpoint;
>     type    |         flags         | start_lsn  |         phase         | buffers_total | new_requests
> ------------+-----------------------+------------+-----------------------+---------------+--------------
>  checkpoint | immediate force wait  | 1/E6C523A8 | checkpointing buffers |       1024275 | false
> (1 row)
>
> is a little bit different from what is logged once completed:
>
> 2022-11-04 08:18:50.806 UTC [3488442] LOG:  checkpoint complete: wrote 1024278 buffers (97.7%);
>
> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com



On Fri, Nov 4, 2022 at 4:27 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> Please find attached a rebase in v7.

I don't think it's a good thing that this patch is using the
progress-reporting machinery. The point of that machinery is that we
want any backend to be able to report progress for any command it
happens to be running, and we don't know which command that will be at
any given point in time, or how many backends will be running any
given command at once. So we need some generic set of counters that
can be repurposed for whatever any particular backend happens to be
doing right at the moment.

But none of that applies to the checkpointer. Any information about
the checkpointer that we want to expose can just be advertised in a
dedicated chunk of shared memory, perhaps even by simply adding it to
CheckpointerShmemStruct. Then you can give the fields whatever names,
types, and sizes you like, and you don't have to do all of this stuff
with mapping down to integers and back. The only real disadvantage
that I can see is then you have to think a bit harder about what the
concurrency model is here, and maybe you end up reimplementing
something similar to what the progress-reporting stuff does for you,
and *maybe* that is a sufficient reason to do it this way.

But I'm doubtful. This feels like a square-peg-round-hole situation.

--
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On 2022-11-04 09:25:52 +0100, Drouvot, Bertrand wrote:
>  
> @@ -7023,29 +7048,63 @@ static void
>  CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
>  {
>      CheckPointRelationMap();
> +
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS);
>      CheckPointReplicationSlots();
> +
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS);
>      CheckPointSnapBuild();
> +
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS);
>      CheckPointLogicalRewriteHeap();
> +
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_REPLI_ORIGIN);
>      CheckPointReplicationOrigin();
>  
>      /* Write out all dirty data in SLRUs and the main buffer pool */
>      TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
>      CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
> +
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES);
>      CheckPointCLOG();
> +
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES);
>      CheckPointCommitTs();
> +
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES);
>      CheckPointSUBTRANS();
> +
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES);
>      CheckPointMultiXact();
> +
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES);
>      CheckPointPredicate();
> +
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_BUFFERS);
>      CheckPointBuffers(flags);
>  
>      /* Perform all queued up fsyncs */
>      TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
>      CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_SYNC_FILES);
>      ProcessSyncRequests();
>      CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
>      TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
>  
>      /* We deliberately delay 2PC checkpointing as long as possible */
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_TWO_PHASE);
>      CheckPointTwoPhase(checkPointRedo);
>  }

This is quite the code bloat. Can we make this less duplicative?


> +CREATE VIEW pg_stat_progress_checkpoint AS
> +    SELECT
> +        S.pid AS pid,
> +        CASE S.param1 WHEN 1 THEN 'checkpoint'
> +                      WHEN 2 THEN 'restartpoint'
> +                      END AS type,
> +        ( CASE WHEN (S.param2 & 4) > 0 THEN 'immediate ' ELSE '' END ||
> +          CASE WHEN (S.param2 & 8) > 0 THEN 'force ' ELSE '' END ||
> +          CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all ' ELSE '' END ||
> +          CASE WHEN (S.param2 & 32) > 0 THEN 'wait ' ELSE '' END ||
> +          CASE WHEN (S.param2 & 128) > 0 THEN 'wal ' ELSE '' END ||
> +          CASE WHEN (S.param2 & 256) > 0 THEN 'time ' ELSE '' END
> +        ) AS flags,
> +        ( '0/0'::pg_lsn +
> +          ((CASE
> +                WHEN S.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
> +                ELSE 0::numeric
> +            END) +
> +           S.param3::numeric)
> +        ) AS start_lsn,

I don't think we should embed this much complexity in the view
definitions. It's hard to read, it bloats the catalog, and we can't fix them
once released.  This stuff seems like it should be in a helper function.

I don't have any idea what that pow stuff is supposed to be doing.
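
For context: the progress parameters are signed 64-bit integers while LSNs
are unsigned, so an LSN with its high bit set comes back negative and adding
2^64 restores the original value. A helper function along these lines (name
hypothetical, not part of the patch) would keep that arithmetic out of the
view definition:

    -- Hypothetical helper: convert a signed bigint progress parameter back
    -- into the unsigned LSN it was stored from.
    CREATE FUNCTION pg_progress_param_to_lsn(param bigint) RETURNS pg_lsn
    LANGUAGE sql IMMUTABLE AS
    $$
        SELECT '0/0'::pg_lsn +
               (CASE WHEN param < 0 THEN pow(2::numeric, 64::numeric)
                     ELSE 0::numeric END + param::numeric)
    $$;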


> +        to_timestamp(946684800 + (S.param4::float8 / 1000000)) AS start_time,

I don't think this is a reasonable path - it embeds way too much low-level
detail about the timestamp format in the view definition. Why do we need to
do this?
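
For context, 946684800 is just the offset between the Unix epoch and the
PostgreSQL epoch of 2000-01-01, and param4 is assumed to hold a TimestampTz
value in microseconds. A quick sanity check of the constant:

    -- Both expressions refer to the PostgreSQL epoch (2000-01-01 00:00:00+00).
    SELECT to_timestamp(946684800) AS pg_epoch,
           extract(epoch FROM timestamptz '2000-01-01 00:00:00+00') AS unix_seconds;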



Greetings,

Andres Freund



On Wed, Nov 16, 2022 at 1:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 4:27 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> > Please find attached a rebase in v7.
>
> I don't think it's a good thing that this patch is using the
> progress-reporting machinery. The point of that machinery is that we
> want any backend to be able to report progress for any command it
> happens to be running, and we don't know which command that will be at
> any given point in time, or how many backends will be running any
> given command at once. So we need some generic set of counters that
> can be repurposed for whatever any particular backend happens to be
> doing right at the moment.

Hm.

> But none of that applies to the checkpointer. Any information about
> the checkpointer that we want to expose can just be advertised in a
> dedicated chunk of shared memory, perhaps even by simply adding it to
> CheckpointerShmemStruct. Then you can give the fields whatever names,
> types, and sizes you like, and you don't have to do all of this stuff
> with mapping down to integers and back. The only real disadvantage
> that I can see is then you have to think a bit harder about what the
> concurrency model is here, and maybe you end up reimplementing
> something similar to what the progress-reporting stuff does for you,
> and *maybe* that is a sufficient reason to do it this way.

-1 for CheckpointerShmemStruct as it is being used for running
checkpoints and I don't think adding stats to it is a great idea.
Instead, extending PgStat_CheckpointerStats and using shared memory
stats for reporting progress/last checkpoint related stats is a good
idea IMO. I also think that a new pg_stat_checkpoint view is needed
because, right now, the PgStat_CheckpointerStats stats are exposed via
the pg_stat_bgwriter view; having a separate view for checkpoint stats
is good here. Also, removing CheckpointStatsData and moving all of
those members to PgStat_CheckpointerStats, of course, by being careful
about the amount of shared memory required, is also a good idea IMO.
Going forward, PgStat_CheckpointerStats and pg_stat_checkpoint view
can be a single point of location for all the checkpoint related
stats.
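
As a rough sketch (not the proposed patch), such a view could initially just
re-expose the checkpoint-related counters that already sit behind
pg_stat_bgwriter, with the final column set left for discussion:

    -- Hypothetical view reusing the existing checkpointer stats functions.
    CREATE VIEW pg_stat_checkpoint AS
        SELECT
            pg_stat_get_bgwriter_timed_checkpoints()       AS checkpoints_timed,
            pg_stat_get_bgwriter_requested_checkpoints()   AS checkpoints_req,
            pg_stat_get_checkpoint_write_time()            AS checkpoint_write_time,
            pg_stat_get_checkpoint_sync_time()             AS checkpoint_sync_time,
            pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
            pg_stat_get_bgwriter_stat_reset_time()         AS stats_reset;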

Thoughts?

In fact, I was recently having an off-list chat with Bertrand Drouvot
about the above idea.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Hi,

On 2022-11-16 16:01:55 +0530, Bharath Rupireddy wrote:
> -1 for CheckpointerShmemStruct as it is being used for running
> checkpoints and I don't think adding stats to it is a great idea.

Why?  Imo the data needed for progress reporting aren't really "stats". We'd
not accumulate counters over time, just for the current checkpoint.

I think it might even be useful for other parts of the system to know what the
checkpointer is doing, e.g. bgwriter or autovacuum could adapt their behaviour
if the checkpointer can't keep up. Somehow it'd feel wrong to use the stats system
as the source of such adjustments - but perhaps my gut feeling on that isn't
right.

The best argument for combining progress reporting with accumulating stats is
that we could likely share some of the code. Having accumulated stats for all
the checkpoint phases would e.g. be quite valuable.


> Instead, extending PgStat_CheckpointerStats and using shared memory
> stats for reporting progress/last checkpoint related stats is a good
> idea IMO

There's certainly some potential for deduplicating state and to make stats
updated more frequently. But that doesn't necessarily mean that putting the
checkpoint progress into PgStat_CheckpointerStats is a good idea (nor the
opposite).


> I also think that a new pg_stat_checkpoint view is needed
> because, right now, the PgStat_CheckpointerStats stats are exposed via
> the pg_stat_bgwriter view, having a separate view for checkpoint stats
> is good here.

I agree that we should do that, but largely independent of the architectural
question at hand.

Greetings,

Andres Freund



On Wed, Nov 16, 2022 at 5:32 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> -1 for CheckpointerShmemStruct as it is being used for running
> checkpoints and I don't think adding stats to it is a great idea.
> Instead, extending PgStat_CheckpointerStats and using shared memory
> stats for reporting progress/last checkpoint related stats is a good
> idea IMO.

I agree with Andres: progress reporting isn't really quite the same
thing as stats, and either place seems like it could be reasonable. I
don't presently have an opinion on which is a better fit, but I don't
think the fact that CheckpointerShmemStruct is used for running
checkpoints rules anything out. Progress reporting is *also* about
running checkpoints. Any historical data you want to expose might not
be about running checkpoints, but, uh, so what? I don't really see
that as a strong argument against it fitting into this struct.

> I also think that a new pg_stat_checkpoint view is needed
> because, right now, the PgStat_CheckpointerStats stats are exposed via
> the pg_stat_bgwriter view, having a separate view for checkpoint stats
> is good here.

Yep.

> Also, removing CheckpointStatsData and moving all of
> those members to PgStat_CheckpointerStats, of course, by being careful
> about the amount of shared memory required, is also a good idea IMO.
> Going forward, PgStat_CheckpointerStats and pg_stat_checkpoint view
> can be a single point of location for all the checkpoint related
> stats.

I'm not sure that I completely follow this part, or that I agree with
it. I have never really understood why we drive background writer or
checkpointer statistics through the statistics collector. Here again,
for things like table statistics, there is no choice, because we could
have an unbounded number of tables and need to keep statistics about
all of them. The statistics collector can handle that by allocating
more memory as required. But there is only one background writer and
only one checkpointer, so that is not needed in those cases. Why not
just have them expose anything they want to expose through shared
memory directly?

If the statistics collector provides services that we care about, like
persisting data across restarts or making snapshots for transactional
behavior, then those might be reasons to go through it even for the
background writer or checkpointer. But if so, we should be explicit
about what the reasons are, both in the mailing list discussion and in
code comments. Otherwise I fear that we'll just end up doing something
in a more complicated way than is really necessary.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On 2022-11-16 14:19:32 -0500, Robert Haas wrote:
> I have never really understood why we drive background writer or
> checkpointer statistics through the statistics collector.

To some degree it is required for durability - the stats system needs to know
how to write out those stats. But that wasn't ever a good reason to send
messages to the stats collector - it could just read the stats from shared
memory after all.

There's also integration with snapshots of the stats, resetting them, etc.

There's also the complexity that some of the stats e.g. for checkpointer
aren't about work the checkpointer did, but just have ended up there for
historical raisins. E.g. the number of fsyncs and writes done by backends.

See below:

> Here again, for things like table statistics, there is no choice, because we
> could have an unbounded number of tables and need to keep statistics about
> all of them. The statistics collector can handle that by allocating more
> memory as required. But there is only one background writer and only one
> checkpointer, so that is not needed in those cases. Why not just have them
> expose anything they want to expose through shared memory directly?

That's how it is in 15+. The memory for "fixed-numbered" or "global"
statistics are maintained by the stats system, but in plain shared memory,
allocated at server start. Not via the hash table.

Right now stats updates for the checkpointer use the "changecount" approach to
updates. For now that makes sense, because we update the stats only
occasionally (after a checkpoint or when writing in CheckpointWriteDelay()) -
a stats viewer seeing the checkpoint count go up, without yet seeing the
corresponding buffers written would be misleading.

I don't think we'd want every buffer write or whatnot to go through the
changecount mechanism; on some non-x86 platforms that could be noticeable. But
if we didn't stage the stats updates locally I think we could make most of the
stats changes without that overhead.  For updates that just increment a single
counter there's simply no benefit in the changecount mechanism afaict.

I didn't want to do that change during the initial shared memory stats work,
it already was bigger than I could handle...


It's not quite clear to me what the best path forward is for
buf_written_backend / buf_fsync_backend, which currently are reported via the
checkpointer stats. I think the best path might be to stop counting them via
the CheckpointerShmem->num_backend_writes etc and just populate the fields in
the view (for backward compat) via the proposed [1] pg_stat_io patch.  Doing
that accounting with CheckpointerCommLock held exclusively isn't free.



> If the statistics collector provides services that we care about, like
> persisting data across restarts or making snapshots for transactional
> behavior, then those might be reasons to go through it even for the
> background writer or checkpointer. But if so, we should be explicit
> about what the reasons are, both in the mailing list discussion and in
> code comments. Otherwise I fear that we'll just end up doing something
> in a more complicated way than is really necessary.

I tried to provide at least some of that in the comments at the start of
pgstat.c in 15+. There's very likely more that should be added, but I think
it's a decent start.

Greetings,

Andres Freund


[1] https://www.postgresql.org/message-id/CAOtHd0ApHna7_p6mvHoO%2BgLZdxjaQPRemg3_o0a4ytCPijLytQ%40mail.gmail.com



On Thu, Nov 17, 2022 at 12:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > I also think that a new pg_stat_checkpoint view is needed
> > because, right now, the PgStat_CheckpointerStats stats are exposed via
> > the pg_stat_bgwriter view, having a separate view for checkpoint stats
> > is good here.
>
> Yep.

On Wed, Nov 16, 2022 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
>
> > I also think that a new pg_stat_checkpoint view is needed
> > because, right now, the PgStat_CheckpointerStats stats are exposed via
> > the pg_stat_bgwriter view, having a separate view for checkpoint stats
> > is good here.
>
> I agree that we should do that, but largely independent of the architectural
> question at hand.

Thanks. I quickly prepared a patch introducing pg_stat_checkpointer
view and posted it here -
https://www.postgresql.org/message-id/CALj2ACVxX2ii%3D66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg%40mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Wed, Nov 16, 2022 at 2:52 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think we'd want every buffer write or whatnot go through the
> changecount mechanism, on some non-x86 platforms that could be noticable. But
> if we didn't stage the stats updates locally I think we could make most of the
> stats changes without that overhead.  For updates that just increment a single
> counter there's simply no benefit in the changecount mechanism afaict.

You might be right, but I'm not sure whether it's worth stressing
about. The progress reporting mechanism uses the st_changecount
mechanism, too, and as far as I know nobody's complained about that
having too much overhead. Maybe they have, though, and I've just
missed it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On 2022-11-17 09:03:32 -0500, Robert Haas wrote:
> On Wed, Nov 16, 2022 at 2:52 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't think we'd want every buffer write or whatnot go through the
> > changecount mechanism, on some non-x86 platforms that could be noticable. But
> > if we didn't stage the stats updates locally I think we could make most of the
> > stats changes without that overhead.  For updates that just increment a single
> > counter there's simply no benefit in the changecount mechanism afaict.
>
> You might be right, but I'm not sure whether it's worth stressing
> about. The progress reporting mechanism uses the st_changecount
> mechanism, too, and as far as I know nobody's complained about that
> having too much overhead. Maybe they have, though, and I've just
> missed it.

I've seen it in profiles, although not as the major contributor. Most places
do a reasonable amount of work between calls though.

As an experiment, I added a progress report to BufferSync()'s first loop
(i.e. where it checks all buffers). On a 128GB shared_buffers cluster that
increases the time for a do-nothing checkpoint from ~235ms to ~280ms. If I
remove the changecount stuff and use a single write + write barrier, it ends
up as 250ms. Inlining brings it down a bit further, to 247ms.

Obviously this is a very extreme case - we do only very little work between
the progress report calls. But it does seem to show that the overhead is not
entirely negligible.


I think pgstat_progress_start_command() needs the changecount stuff, as does
pgstat_progress_update_multi_param(). But for anything updating a single
parameter at a time it really doesn't do anything useful on a platform that
doesn't tear 64bit writes (so it could be #ifdef
PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY).


Out of further curiosity I wanted to test the impact when the loop doesn't
even do a LockBufHdr() and added an unlocked pre-check. 109ms without
progress. 138ms with. 114ms with the simplified
pgstat_progress_update_param(). 108ms after inlining the simplified
pgstat_progress_update_param().

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> I think pgstat_progress_start_command() needs the changecount stuff, as does
> pgstat_progress_update_multi_param(). But for anything updating a single
> parameter at a time it really doesn't do anything useful on a platform that
> doesn't tear 64bit writes (so it could be #ifdef
> PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY).

Seems safe to restrict it to that case.

            regards, tom lane



On Thu, Nov 17, 2022 at 11:24 AM Andres Freund <andres@anarazel.de> wrote:
> As an experiment, I added a progress report to BufferSync()'s first loop
> (i.e. where it checks all buffers). On a 128GB shared_buffers cluster that
> increases the time for a do-nothing checkpoint from ~235ms to ~280ms. If I
> remove the changecount stuff and use a single write + write barrier, it ends
> up as 250ms. Inlining brings it down a bit further, to 247ms.

OK, I'd say that's pretty good evidence that we can't totally
disregard the issue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On 2022-11-04 09:25:52 +0100, Drouvot, Bertrand wrote:
> Please find attached a rebase in v7.

cfbot complains that the docs don't build:
https://cirrus-ci.com/task/6694349031866368?logs=docs_build#L296

[03:24:27.317] ref/checkpoint.sgml:66: element para: validity error : Element para is not declared in para list of possible children

I've marked the patch as waiting-on-author for now.


There's been a bunch of architectural feedback too, but tbh, I don't know if
we came to any conclusion on that front...

Greetings,

Andres Freund



On Thu, 8 Dec 2022 at 00:33, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-11-04 09:25:52 +0100, Drouvot, Bertrand wrote:
> > Please find attached a rebase in v7.
>
> cfbot complains that the docs don't build:
> https://cirrus-ci.com/task/6694349031866368?logs=docs_build#L296
>
> [03:24:27.317] ref/checkpoint.sgml:66: element para: validity error : Element para is not declared in para list of possible children
>
> I've marked the patch as waitin-on-author for now.
>
>
> There's been a bunch of architectural feedback too, but tbh, I don't know if
> we came to any conclusion on that front...

There have been no updates on this thread for some time, so this has
been switched to Returned with Feedback. Feel free to open it in the
next commitfest if you plan to continue working on this.

Regards,
Vignesh