Обсуждение: [Proposal] Adding callback support for custom statistics kinds
Hi, I'd like to propose $SUBJECT to serialize additional per-entry data beyond the standard statistics entries. Currently, custom statistics kinds can store their standard entry data in the main "pgstat.stat" file, but there is no mechanism for extensions to persist extra data stored in the entry. A common use case is extensions that register a custom kind and, besides standard counters, need to track variable-length data stored in a dsa_pointer. This proposal adds optional "to_serialized_extra" and "from_serialized_extra" callbacks to "PgStat_KindInfo" that allow custom kinds to write and read from extra data in a separate files (pgstat.<kind>.stat). The callbacks give extensions direct access to the file pointer so they can read and write data in any format, while the core "pgstat" infrastructure manages opening, closing, renaming, and cleanup, just as it does with "pgstat.stat". A concrete use case is pg_stat_statements. If it were to use custom stats kinds to track statement counters, it could also track query text stored in DSA. The callbacks allow saving the query text referenced by the dsa_pointer and restoring it after a clean shutdown. Since DSA (and more specifically DSM) cannot be attached by the postmaster, an extension cannot use "on_shmem_exit" or "shmem_startup_hook" to serialize or restore this data. This is why pgstat handles serialization during checkpointer shutdown and startup, allowing a single backend to manage it safely. I considered adding hooks to the existing pgstat code paths (pgstat_before_server_shutdown, pgstat_discard_stats, and pgstat_restore_stats), but that felt too unrestricted. Using per-kind callbacks provides more control. There are already "to_serialized_name" and "from_serialized_name" callbacks used to store and read entries by "name" instead of "PgStat_HashKey", currently used by replication slot stats. Those remain unchanged, as they serve a separate purpose. Other design points: 1. Filenames use "pgstat.<kind>.stat" based on the numeric kind ID. This avoids requiring extensions to provide names and prevents issues with spaces or special characters. 2. Both callbacks must be registered together. Serializing without deserializing would leave orphaned files behind, and I cannot think of a reason to allow this. 3. "write_chunk", "read_chunk", "write_chunk_s", and "read_chunk_s" are renamed to "pgstat_write_chunk", etc., and moved to "pgstat_internal.h" so extensions can use them without re-implementing these functions. 4. These callbacks are valid only for custom, variable-numbered statistics kinds. Custom fixed kinds may not benefit, but could be considered in the future. Attached 0001 is the proposed change, still in POC form. The second patch contains tests in "injection_points" to demonstrate this proposal, and is not necessarily intended for commit. Looking forward to your feedback! -- Sami Imseih Amazon Web Services (AWS)
Вложения
On Wed, Oct 22, 2025 at 03:24:11PM -0500, Sami Imseih wrote: > I'd like to propose $SUBJECT to serialize additional per-entry data beyond > the standard statistics entries. Currently, custom statistics kinds can store > their standard entry data in the main "pgstat.stat" file, but there is no > mechanism for extensions to persist extra data stored in the entry. A common > use case is extensions that register a custom kind and, besides > standard counters, > need to track variable-length data stored in a dsa_pointer. Thanks for sending a proposal in this direction. > A concrete use case is pg_stat_statements. If it were to use custom > stats kinds to track statement counters, it could also track query text > stored in DSA. The callbacks allow saving the query text referenced by the > dsa_pointer and restoring it after a clean shutdown. Since DSA > (and more specifically DSM) cannot be attached by the postmaster, an > extension cannot use "on_shmem_exit" or "shmem_startup_hook" > to serialize or restore this data. This is why pgstat handles > serialization during checkpointer shutdown and startup, allowing a single > backend to manage it safely. Agreed that it would be better to split the query text in a file of its own and now bloat the "main" pgstats file with this data, A real risk is that many PGSS entries with a bunch of queries would cause the file to be just full of the PGSS contents. > I considered adding hooks to the existing pgstat code paths > (pgstat_before_server_shutdown, pgstat_discard_stats, and > pgstat_restore_stats), but that felt too unrestricted. Using per-kind > callbacks provides more control. Per-kind callbacks to control all that makes sense here. > There are already "to_serialized_name" and "from_serialized_name" > callbacks used to store and read entries by "name" instead of > "PgStat_HashKey", currently used by replication slot stats. Those > remain unchanged, as they serve a separate purpose. > > Other design points: > > 1. Filenames use "pgstat.<kind>.stat" based on the numeric kind ID. > This avoids requiring extensions to provide names and prevents issues > with spaces or special characters. Hmm. Is that really what we want here? This pretty says that one single custom kind would never be able use multiple files, ever. > 2. Both callbacks must be registered together. Serializing without > deserializing would leave orphaned files behind, and I cannot think of a > reason to allow this. Hmm. Okay. > 3. "write_chunk", "read_chunk", "write_chunk_s", and > "read_chunk_s" are renamed to "pgstat_write_chunk", etc., and > moved to "pgstat_internal.h" so extensions can use them without > re-implementing these functions. Exposing the write and read chunk APIs and renaming them sounds good here, designed as they are now with a FILE* defined by the caller. It's good to share these for consistency across custom and built-in stats kinds. > 4. These callbacks are valid only for custom, variable-numbered statistics > kinds. Custom fixed kinds may not benefit, but could be considered in the > future. Pushing custom data for fixed-sized stats may be interesting, though like you I am not sure what a good use-case would look like. So discarding this case for now sounds fine to me. > Attached 0001 is the proposed change, still in POC form. Hmm. I would like to propose something a bit more flexible, refactoring and reusing some of the existing callbacks, among the following lines: - Rather than introducing a second callback able to do more serialization work, let's expand a bit the responsibility of to_serialized_name and from_serialized_name to be able to work in a more extended way, renaming them to "to/from_serialized_entry", which are now limited to return a NameData with pgstat.c enforcing the data written to the pgstats to be of NAMEDATALEN. The idea would be to let the callbacks push some custom data where they want. - The to_serialized_name path of pgstat_write_statsfile() would then be changed as follows: -- push a PGSTAT_FILE_ENTRY_NAME -- Write the key write_chunk_s. -- Call the callback to push some custom per-entry data. -- Finish with the main chunk of data, of size pgstat_get_entry_len(). - The fd or FILE* of the "main" pgstats file should be added as argument of both routines (not mandatory, but we are likely going to need that if we want to add more "custom" data in the main pgstats file before writing or reading a chunk). For example, for a PGSS text file, we would likely write two fields to the main data file: an offset and a length to be able to retrieve a query string, from a secondary file. - FDs where the data is written while we are in the to/from serialize can be handled within the code paths specific to the stats kind code. The first time a serialized callback of a stats kind is called, the extra file(s) is(are) opened. This may come at the cost of one new callback: at the end of the read and writes of the stats data, we would need an extra look that's able to perform cleanup actions, which would be here to make sure that the fds opened for the extra files are closed when we are done. The close of each file is equivalent to the pgstat_close_file() done in the patch, except that we'd loop over a callback that would do the cleanup job once we are done reading or writing a file. One step that can be customized in this new "end" callback is if a stats kind may decide to unlink() a previous file, as we do for the main pgstats file, or keep one or more files around. That would be up to the extension developer. We should be able to reuse or rework reset_all_cb() with a status given to it, depending on if we are dealing with a failure or a success path. Currently, reset_all_cb() is only used in a failure path, the idea would be to extend it for the success case. > The second patch > contains tests in "injection_points" to demonstrate this proposal, and is not > necessarily intended for commit. Having coverage for these kinds of APIs is always good, IMO. We need coverage for extension code. -- Michael
Вложения
Thanks for the feedback! > > Other design points: > > > > 1. Filenames use "pgstat.<kind>.stat" based on the numeric kind ID. > > This avoids requiring extensions to provide names and prevents issues > > with spaces or special characters. > > Hmm. Is that really what we want here? This pretty says that one > single custom kind would never be able use multiple files, ever. Perhaps if someone wants to have separate files for each different types of data, we should be able to support multiple files. I think we can add an option for the number of files and they can then be named "pgstat.<kind>.1.stat", pgstat.<kind>.2.stat", etc. I rather avoid having the extension provide a set of files names. So as arguments to the callback, besides the main file pointer ( as you mention below), we also provide the list of custom file pointers. what do you think? > Hmm. I would like to propose something a bit more flexible, > refactoring and reusing some of the existing callbacks, among the > following lines: > - Rather than introducing a second callback able to do more > serialization work, let's expand a bit the responsibility of > to_serialized_name and from_serialized_name to be able to work in a > more extended way, renaming them to "to/from_serialized_entry", which Sure, we can go that route. > - The fd or FILE* of the "main" pgstats file should be added as > argument of both routines (not mandatory, but we are likely going to > need that if we want to add more "custom" data in the main pgstats > file before writing or reading a chunk). For example, for a PGSS text > file, we would likely write two fields to the main data file: an > offset and a length to be able to retrieve a query string, from a > secondary file. Yeah, that could be a good idea for pg_s_s, if we don't want to store the key alongside the query text. Make more sense. > - FDs where the data is written while we are in the to/from serialize > can be handled within the code paths specific to the stats kind code. > The first time a serialized callback of a stats kind is called, the > extra file(s) is(are) opened. This may come at the cost of one new > callback: at the end of the read and writes of the stats data, we > would need an extra look that's able to perform cleanup actions, which > would be here to make sure that the fds opened for the extra files are > closed when we are done. The close of each file is equivalent to the > pgstat_close_file() done in the patch, except that we'd loop over a > callback that would do the cleanup job once we are done reading or > writing a file. One step that can be customized in this new "end" > callback is if a stats kind may decide to unlink() a previous file, as > we do for the main pgstats file, or keep one or more files around. > That would be up to the extension developer. We should be able to > reuse or rework reset_all_cb() with a status given to it, depending on > if we are dealing with a failure or a success path. Currently, > reset_all_cb() is only used in a failure path, the idea would be to > extend it for the success case. I will provide a patch with the recommendations. -- Sami
On Thu, Oct 23, 2025 at 04:35:58PM -0500, Sami Imseih wrote: > Perhaps if someone wants to have separate files for each different > types of data, > we should be able to support multiple files. I think we can add an > option for the > number of files and they can then be named "pgstat.<kind>.1.stat", > pgstat.<kind>.2.stat", > etc. I rather avoid having the extension provide a set of files names. > So as arguments to the callback, besides the main file pointer ( as > you mention below), > we also provide the list of custom file pointers. > > what do you think? My worry here is the lack of flexibility regarding stats that could be split depending on the objects whose data needs to be flushed. For example, stats split across multiple databases (like our good-old pre-v14 pgstats, but on a per-kind basis). So I don't think that we can really assume that the list of file names should be fixed when we begin the read/write process of the main pgstats file. -- Michael
Вложения
> On Thu, Oct 23, 2025 at 04:35:58PM -0500, Sami Imseih wrote: > > Perhaps if someone wants to have separate files for each different > > types of data, > > we should be able to support multiple files. I think we can add an > > option for the > > number of files and they can then be named "pgstat.<kind>.1.stat", > > pgstat.<kind>.2.stat", > > etc. I rather avoid having the extension provide a set of files names. > > So as arguments to the callback, besides the main file pointer ( as > > you mention below), > > we also provide the list of custom file pointers. > > > > what do you think? > > My worry here is the lack of flexibility regarding stats that could be > split depending on the objects whose data needs to be flushed. For > example, stats split across multiple databases (like our good-old > pre-v14 pgstats, but on a per-kind basis). So I don't think that we > can really assume that the list of file names should be fixed when we > begin the read/write process of the main pgstats file. I was trying to avoid an extra field in PgStat_KindInfo if possible, but it's worthwhile to provide more flexibility to an extension. I will go with this. -- Sami Imseih Amazon Web Services (AWS)
On Thu, Oct 23, 2025 at 07:57:38PM -0500, Sami Imseih wrote: > I was trying to avoid an extra field in PgStat_KindInfo if possible, but > it's worthwhile to provide more flexibility to an extension. I will go > with this. Yes, I don't think that we will be able to avoid some refactoring of the existing callbacks. The introduction of a new one may not be completely necessary, though, especially if we reuse the reset callback to be called when the stats read and write finish to close any fds we may have opened when processing. Maintaining the state of the files opened within each stat kind code across multiple calls of the new "serialized" callback feels a bit more natural and more flexible, at least it's my take on the matter. -- Michael
Вложения
> > Hmm. I would like to propose something a bit more flexible, > > refactoring and reusing some of the existing callbacks, among the > > following lines: > > - Rather than introducing a second callback able to do more > > serialization work, let's expand a bit the responsibility of > > to_serialized_name and from_serialized_name to be able to work in a > > more extended way, renaming them to "to/from_serialized_entry", which > > Sure, we can go that route. I started reworking the patch, but then I realized that I don't like this approach of using the same callback to support serializing NameData and serializing extra data. In the existing "to_serialized_name" callback , NameData is serialized instead of the hash key, meaning that the "from_serialized_name" must be called before we create an entry. The callback translates the NameData to an objid as is the case with replication slots, and the key is then used to create the entry. However, in the case of serializing extra data, we want to have already created the entry by the time we call the callback. For example populating non-key fields of an entry with a dsa_pointer after reading some serialized data into dsa. If we do want to support a single callback, we would need extra metadata in the Kind registration to let the extension tell us what the callback is used for and to either trigger the callback before or after entry creation. I am not very thrilled about doing something like this, as I see 2 very different use-cases here. What do you think? -- Sami Imseih Amazon Web Services (AWS)
On Mon, Nov 10, 2025 at 01:56:23PM -0600, Sami Imseih wrote: > I started reworking the patch, but then I realized that I don't like this > approach of using the same callback to support serializing NameData and > serializing extra data. In the existing "to_serialized_name" callback > , NameData is serialized instead of the hash key, meaning that the > "from_serialized_name" must be called before we create an entry. The > callback translates the NameData to an objid as is the case with replication > slots, and the key is then used to create the entry. Thanks for looking at that. > However, in the case of serializing extra data, we want to have already > created the entry by the time we call the callback. For example populating > non-key fields of an entry with a dsa_pointer after reading some serialized > data into dsa. > > If we do want to support a single callback, we would need extra metadata in > the Kind registration to let the extension tell us what the callback is used > for and to either trigger the callback before or after entry creation. I am > not very thrilled about doing something like this, as I see 2 very different > use-cases here. Ah, I see your point. By keeping two callbacks, one to translate a key to/from a different field (NameData currently, but it could be something else with a different size), we would for example be able to keep very simple the checks for duplicated entries when reading the file. Agreed that it would be good to keep the key lookups as stable as we can. So, what you are suggested is a second callback once we have called read_chunk() and write_chunk() for a PGSTAT_FILE_ENTRY_HASH or a PGSTAT_FILE_ENTRY_NAME and let a stats kind write in the main file and/or one or more extra files the data they want? I'd be fine with that, yes, and that should work with the PGSS case in mind. -- Michael
Вложения
Sorry for the delay here.
v1 is the first attempt to address the feedback from the POC.
1/ A user is now able to register as many extra files as they
wish, and the files will be named pgstat.<kind_id>.<file no>.stat,
where file_no starts at 0 up to the number of files specified
by the user with .num_serialized_extra_files.
2/ The callbacks now provide both the core stat file as a FILE
pointer and an array of FILE pointers for the extra files.
IN the write callback, the extra file pointer is accessed
like extra_files[0], extra_files[1], etc., and the same for
the read callback.
3/ The patch centralizes the creation and cleanup of the files
with 2 new routines pgstat_allocate_files and pgstat_cleanup_files,
which both operate on a new local struct which tracks the file
names and descriptors in the read and write stats routines.
```
typedef struct PgStat_SerializeFiles
{
char **tmpfiles;
char **statfiles;
FILE **fd;
int num_files;
} PgStat_SerializeFiles;
```
plug-ins are not made aware of this struct because they don't need
to. The callbacks are provided the FILE pointers they need to care
about for their kind only.
4/ In terms of testing, patch 0002, I did not want to invent a new module
for custom kinds, so I piggybacked off og injection_points as I did in the
POC, but I added on the existing recovery tests, because essentially that
is what we care. Does the data persist after a clean shutdown? do the
.tmp files get removed properly? etc. So I added tests in
recovery/t/029_stats_restart.pl for this.
--
Sami Imseih
Amazon Web Services (AWS)
Вложения
It just occurred to me that the documentation [0] should be updated to describe the callbacks. I will do that in the next revision. [0] https://www.postgresql.org/docs/current/xfunc-c.html#XFUNC-ADDIN-CUSTOM-CUMULATIVE-STATISTICS -- Sami Imseih Amazon Web Services (AWS)
On Wed, Nov 19, 2025 at 08:10:43PM -0600, Sami Imseih wrote: > It just occurred to me that the documentation [0] should be > updated to describe the callbacks. I will do that in the next > revision. > > [0] https://www.postgresql.org/docs/current/xfunc-c.html#XFUNC-ADDIN-CUSTOM-CUMULATIVE-STATISTICS Hmm. Based on what I can read from the patch, you are still enforcing file name patterns in the backend, as of: + extra->statfiles[i] = psprintf("%s/pgstat.%d.%d.stat", + PGSTAT_STAT_PERMANENT_DIRECTORY, kind, i); My take (also mentioned upthread) is that this design should go the other way around, where modules have the possibility to define their own file names, and some of them could be generated on-the-fly when writing the files, for example for a per-file database split, or the object ID itself. The important part for variable-numbered stats is that the keys of the entries have to be in the main pgstats file. Then, the extra data is loaded back based on the data in the entry key, based on a file name that only a custom stats kind knows about (fd and file name). It means that the custom stats kind needs to track the files it has to clean up by itself in this scheme. We could pass back to the startup process some fds that it cleans up, but it feels simpler here to let the custom code do what they want, instead, rather than having an array that tracks the file names and/or their fds. -- Michael
Вложения
> > It just occurred to me that the documentation [0] should be > > updated to describe the callbacks. I will do that in the next > > revision. > > > > [0] https://www.postgresql.org/docs/current/xfunc-c.html#XFUNC-ADDIN-CUSTOM-CUMULATIVE-STATISTICS > > Hmm. Based on what I can read from the patch, you are still enforcing > file name patterns in the backend, as of: > + extra->statfiles[i] = psprintf("%s/pgstat.%d.%d.stat", > + PGSTAT_STAT_PERMANENT_DIRECTORY, kind, i); > > My take (also mentioned upthread) is that this design should go the > other way around, where modules have the possibility to define their > own file names, and some of them could be generated on-the-fly when > writing the files, for example for a per-file database split, or the > object ID itself. The way I thought about it is that extension developer can just provide the number of files they need and the they are then given a list of file pointers that they need. They can then manage what each file is used for. They also don't need to worry about naming the files, all they need to do is track what each file in the list does. > The important part for variable-numbered stats is that the keys of the > entries have to be in the main pgstats file. Then, the extra data is > loaded back based on the data in the entry key, based on a file name > that only a custom stats kind knows about (fd and file name). It > means that the custom stats kind needs to track the files it has to > clean up by itself in this scheme. We could pass back to the startup > process some fds that it cleans up, but it feels simpler here to let > the custom code do what they want, instead, rather than having an > array that tracks the file names and/or their fds. yeah, I was leaning towards putting more responsibility on pgstat to manage these extra files, but you are suggesting that we just let the extension manage the create/cleanup of these files as well. After re-reading your earlier suggestions, this sounds like a third callback that is used for file cleanup, and this callback could be the existing reset_all_cb. Also, instead of reset_all_cb being called during pgstat_reset_after_failure, it can be called during the success case, i.e, a new pgstat_reset_after_success. reset_all_cb also carries a status argument so the extension knows what to do in the case of success or failure. This also means we need to also update all existing callbacks to do work in the failed status. Is that correct? -- Sami
> After re-reading your earlier suggestions, this sounds like a third > callback that is used for file cleanup, and this callback could be > the existing reset_all_cb. Also, instead of reset_all_cb being called > during pgstat_reset_after_failure, it can be called during the success > case, i.e, a new pgstat_reset_after_success. reset_all_cb also > carries a status argument so the extension knows what to do > in the case of success or failure. > This also means we need to also update all existing callbacks to > do work in the failed status. After second thought, I am not too thrilled with extending reset_all_cb to take responsibility for file cleanup, etc. I think it should just remain used to reset stats only. I think the best way forward will be to introduce a callback to be used by custom kinds only. This callback will be responsible for cleaning up files and related resources at the end of the write stats, read stats, and discard stats paths. The callback will provide back to the extension a status (READ, WRITE, DISCARD) and the extension will know how to clean up the resources it created depending on the situation. So, unlike my original proposal, this puts more responsibility on the extension to track and clean up its files, but this seems like the best approach to take here. Also, I am now leaning towards creating a separate test module rather than trying to do too much unrelated testing in injection points. It is definitely convenient to use injection points, but I think we can do better testing with a separate module. This module can also serve as an example for extension developers. what do you think? -- Sami Imseih Amazon Web Services (AWS)
On Mon, Nov 24, 2025 at 06:18:26PM -0600, Sami Imseih wrote: > After second thought, I am not too thrilled with extending reset_all_cb > to take responsibility for file cleanup, etc. I think it should just remain > used to reset stats only. > > I think the best way forward will be to introduce a callback to be used by > custom kinds only. This callback will be responsible for cleaning up files > and related resources at the end of the write stats, read stats, and discard > stats paths. The callback will provide back to the extension a status > (READ, WRITE, DISCARD) and the extension will know how to clean up the > resources it created depending on the situation. I guess that READ and WRITE are the cases that happen on success of these respective operations. DISCARD is the failure case when one of these fail. > So, unlike my original proposal, this puts more responsibility on the > extension to track and clean up its files, but this seems like the best > approach to take here. That may be something we should do anyway. It means that the modules are responsible for the tracking the file(s) they open, still they could also decide operations different than the backend for the main pgstats file, optionally, depending on the state of the reads and writes (aka success or failure of these). > Also, I am now leaning towards creating a separate test module rather than > trying to do too much unrelated testing in injection points. It is definitely > convenient to use injection points, but I think we can do better testing with > a separate module. This module can also serve as an example for extension > developers. You are right that it may be cleaner this way. Do you think that it could make sense to move some of the existing "template" code of injection_points there? One part of the proposed patch that felt independent to me was the renaming and publishing of the two write/read routines for the stats files, so I have extracted that in your first patch to reduce the blast, and applied that as it can also be useful on its own. -- Michael