Обсуждение: Skip unregistered custom kinds on stats load
Hi, While working on something nearby, I observed that if a custom pgstats kind is not registered while the on-disk pgstats.stat file contains entries for this custom kind, it results in all of the stats, including the built-in ones, being discarded. This can happen if you initially register this custom kind, cleanly shut down the database, then start it up without the kind being registered (i.e. not including the extension that loads this custom kind in shared_preload_libraries). This can be observed in 001_stat.pl in the injection_points tests, where the last $node->start shows a corrupted pgstat.stat file message because "injection_points_fixed" and "injection_points" were removed from shared_preload_libraries after a clean shutdown. ``` 2025-10-20 18:38:48.664 UTC startup[93662] LOG: database system was shut down at 2025-10-20 18:38:48 UTC 2025-10-20 18:38:48.665 UTC startup[93662] WARNING: could not find information of kind 26 for entry of type F 2025-10-20 18:38:48.665 UTC startup[93662] LOG: corrupted statistics file "pg_stat/pgstat.stat" 2025-10-20 18:38:48.666 UTC postmaster[93643] LOG: database system is ready to accept connections 2025-10-20 18:38:48.750 UTC postmaster[93643] LOG: received immediate shutdown request 2025-10-20 18:38:48.757 UTC postmaster[93643] LOG: database system is shut down ``` While I do not think this is a bug, I also do not think this is not the ideal behavior. Unlike built-in stats, custom stats are not guaranteed to be registered, and completely resetting all stats in that edge case seems too strict. I think this can be better handled by tracking the lengths of the kinds in the pgstat.stat file when saving, and re-reading the lengths on startup. This would allow skipping past any unrecognized custom data. We need to save the lengths since we have no way to know how much to skip when re-reading. The data can be saved at the start of the file, after the FORMAT, so the extra space required is minimal. With the attached patch, this is the warning message instead. ``` 2025-10-20 18:25:51.763 UTC startup[47833] LOG: database system was shut down at 2025-10-20 18:25:51 UTC 2025-10-20 18:25:51.764 UTC startup[47833] WARNING: found unregistered custom stats kind 26 for entry of type F, skipping 2025-10-20 18:25:51.765 UTC postmaster[47814] LOG: database system is ready to accept connections 2025-10-20 18:25:51.848 UTC postmaster[47814] LOG: received immediate shutdown request 2025-10-20 18:25:51.855 UTC postmaster[47814] LOG: database system is shut down ``` Thoughts? -- Sami Imseih Amazon Web Services (AWS)
Вложения
On Mon, Oct 20, 2025 at 01:39:37PM -0500, Sami Imseih wrote:
> I think this can be better handled by tracking the lengths of the
> kinds in the pgstat.stat file when saving, and re-reading the lengths
> on startup. This would allow skipping past any unrecognized custom
> data. We need to save the lengths since we have no way to know how
> much to skip when re-reading. The data can be saved at the start of
> the file, after the FORMAT, so the extra space required is minimal.
+ /* Write the lengths of all stats kinds */
+ for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
+ {
+ size_t kind_len = 0;
+
+ const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
+
+ if (kind_info)
+ kind_len = pgstat_get_entry_len(kind);
+
+ kind_lengths[kind - 1] = kind_len;
+ write_chunk_s(fpout, &kind_len);
+ }
The fun does not stop there. This also means that the data tracked in
the file becomes incorrect if a server tries to reuse the same ID
across two restarts with a different kind attached to them. The point
is that giving up is kind of always the safest bet and acts as a
sanity measure, giving priority to the availability of the server.
Perhaps we should document better all these properties, meaning that
keeping a file compatible requires the stats kind to be loaded, even
if it means replacing a past one with an "idle" kind, that could
discard past stats that are not supported anymore so as the same stats
kind can be reused in the future (that could be also settable as a GUC
defined by the library loaded with shared_preload_libraries from where
the custom stats kind originates).
--
Michael
Вложения
> The fun does not stop there. This also means that the data tracked in > the file becomes incorrect if a server tries to reuse the same ID > across two restarts with a different kind attached to them. The point > is that giving up is kind of always the safest bet and acts as a > sanity measure, giving priority to the availability of the server. hmm, that is a different case, right? When we restart with a different kind ( different struct ) but the same kind ID, we would find the kind when reading but it may have a different length, and in that case we will fail reading the entry, and actually we will have a legitimate corrupt stats file in that case. Here is what I see in that case. ``` 2025-10-20 19:41:14.365 CDT [36636] WARNING: could not read entry of type 2025-10-20 19:41:14.365 CDT [36636] LOG: corrupted statistics file "pg_stat/pgstat.stat" ``` The more worrying case is if the struct of this other kind has the same length, and the data ends up being read back into the wrong fields. [0] https://github.com/postgres/postgres/blob/master/src/backend/utils/activity/pgstat.c#L1851-L1868 -- Sami
I hit send too fast. Link [0] referenced above was just pointing out the part of the code that reads a chunk of data using the length of the kind entry. -- Sami
On Mon, Oct 20, 2025 at 07:54:09PM -0500, Sami Imseih wrote: > The more worrying case is if the struct of this other kind has > the same length, and the data ends up being read back > into the wrong fields. More to the point. If I were to put defenses when reading back a stats file that includes custom stats kinds before reading the file entries and checking if its contents are OK to be used by the server, validating the size of the on-disk entries is not sufficient. We should write down and cross-check more data when reading them back with what has been loaded by the postmaster: - Name of custom callback, that should match. - Its reserved ID. - Written size on disk, to ensure that what is read from the file is OK to load. - Likely if it is a fixed_amount or not, not really mandatory IMO. Then, based on if these contents match with what has been loaded from s_p_l, we could issue a harder failure that prevents startup and not throw about the stats file. If one needs to ditch some stats, they would still need to load a dummy version of the custom stats kind they want to drop, dropping the entries on read. That's still an existing requirement, or have a tool that's able to read, filter and rewrite an existing stats file (refactoring pgstat_shmem.c to expose some of its routines for this purpose has been on my TODO list of things for some time). The existing implementation is a choice of simplicity, prioritizing availability, FWIW, to keep the contents of the on-disk file simpler. Being able for an extension developer to drop stats in an existing file would still need to be handled on the developer side. So I am not really convinced that this is something we need to treat that as a problem in search of a solution. -- Michael
Вложения
> > The more worrying case is if the struct of this other kind has > > the same length, and the data ends up being read back > > into the wrong fields. > > More to the point. If I were to put defenses when reading back a > stats file that includes custom stats kinds before reading the file > entries and checking if its contents are OK to be used by the server, > validating the size of the on-disk entries is not sufficient. We > should write down and cross-check more data when reading them back > with what has been loaded by the postmaster: > - Name of custom callback, that should match. > - Its reserved ID. > - Written size on disk, to ensure that what is read from the file is > OK to load. > - Likely if it is a fixed_amount or not, not really mandatory IMO. The proposal is not about validating that the data being read back belongs to the same registered kind. It is about ensuring that if a kind is not registered, we simply skip that portion of the pgstat.stat file. It's a very narrow case, perhaps too narrow to be worthwhile. > Then, based on if these contents match with what has been loaded from > s_p_l, we could issue a harder failure that prevents startup and not > throw about the stats file. If one needs to ditch some stats, they > would still need to load a dummy version of the custom stats kind they > want to drop, dropping the entries on read. That's still an existing > requirement, or have a tool that's able to read, filter and rewrite an > existing stats file (refactoring pgstat_shmem.c to expose some of its > routines for this purpose has been on my TODO list of things for some > time). Hmm, since the pgstat.stat file is managed by core and not by the extension, I think this should be handled semi-automatically by pgstats. Even with the checks you mention above, validating that we are indeed loading the same kind will never be totally foolproof. I think we can validate kinds better by adding some MAGIC NUMBER that is set by the extension during registration and tracked in pgstats.stat. If the kind ID and MAGIC NUMBER (for example, FORMAT ID) match, then we have more of a guarantee that this is the same extension. Also, this is probably a good idea to support extension upgrades. -- Sami
On Tue, Oct 21, 2025 at 01:26:30PM -0500, Sami Imseih wrote: > Hmm, since the pgstat.stat file is managed by core and not by the > extension, I think this should be handled semi-automatically by > pgstats. Even with the checks you mention above, validating that we > are indeed loading the same kind will never be totally foolproof. I > think we can validate kinds better by adding some MAGIC NUMBER that is > set by the extension during registration and tracked in pgstats.stat. > If the kind ID > and MAGIC NUMBER (for example, FORMAT ID) match, then we have more of > a guarantee that this is the same extension. Also, this is probably a > good idea to support extension upgrades. Probably. Still, it does not feel like something we need to push strongly in a given direction, because we may pay in flexibility compared to what's on HEAD. With a single kind ID forcing how large a fixed-sized entry or one variable-sized entry should be based on what kind of data is loaded, versioning something based on a single kind is impossible (or the same size is used, leading to different contents translated). Of course, one could use multiple kind IDs if they want to enforce different stats versions in a single major version of the backend. To me, this property comes down that the version of the pgstats file cannot change in a single major version, implying that the same rule is also pushed to the custom kinds: if you begin to use a kind ID in a given major version, do not change the stats format, limitting new changes to a new major version. Or an upgrade format path to use is something an extension needs to handle on its own. I'd bet that stats upgrade requirements for the same major version are not going to be that common, the future may tell a different story depending on the use cases that pop. -- Michael
Вложения
On 2025-Oct-21, Michael Paquier wrote: > The existing implementation is a choice of simplicity, prioritizing > availability, FWIW, to keep the contents of the on-disk file simpler. FWIW I don't necessarily agree with what Sami is proposing, but (if I understand the situation correctly) I think throwing away the whole contents of stats when a module is removed is the opposite of availability, as it will lead to autovacuum lacking data and thus bloat problems. This is of course always problematic after a crash (unless somebody has fixed this already?) but at least it is supposed to be a known issue. With these custom stat kind thingies, nobody knows that this problem exists. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Por suerte hoy explotó el califont porque si no me habría muerto de aburrido" (Papelucho)
On Wed, Oct 22, 2025 at 08:44:27AM +0300, Alvaro Herrera wrote: > FWIW I don't necessarily agree with what Sami is proposing, but (if I > understand the situation correctly) I think throwing away the whole > contents of stats when a module is removed is the opposite of > availability, as it will lead to autovacuum lacking data and thus bloat > problems. This is of course always problematic after a crash (unless > somebody has fixed this already?) but at least it is supposed to be a > known issue. A first part here is something that Bertrand Drouvot has been working on: being able to rebuild the table stats during WAL replay requires us to move the stats to be relfilenode-based, so as the startup could know where to recreate some of the numbers lost during crash recovery from WAL, giving autovacuum a safety net. > With these custom stat kind thingies, nobody knows that this problem > exists. Noted. -- Michael
Вложения
Hi, On Wed, Oct 22, 2025 at 02:51:31PM +0900, Michael Paquier wrote: > > A first part here is something that Bertrand Drouvot has been working > on: being able to rebuild the table stats during WAL replay requires > us to move the stats to be relfilenode-based, so as the startup could > know where to recreate some of the numbers lost during crash recovery > from WAL, giving autovacuum a safety net. Yeah, I resumed working on it recently and should be able to share a patch next week (based on what we discussed in [1]). Next week patch will be the first step i.e replace (i.e get rid) PGSTAT_KIND_RELATION by a brand new PGSTAT_KIND_RELFILENODE and move all the existing stats that are currently under the PGSTAT_KIND_RELATION to this new PGSTAT_KIND_RELFILENODE. It will do this by keeping the pg_stat_all_tables|indexes and pg_statio_all_tables|indexes on top of the PGSTAT_KIND_RELFILENODE and ensure that a relation rewrite keeps those stats. Next step will be that, once done, we could work from there to add new stats (add writes counters (heap_blks_written and friends) and ensure that some counters (n_dead_tup and friends) are replicated). [1]: https://www.postgresql.org/message-id/aN07p71KNFR2HdaD%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> On Tue, Oct 21, 2025 at 01:26:30PM -0500, Sami Imseih wrote: > > Hmm, since the pgstat.stat file is managed by core and not by the > > extension, I think this should be handled semi-automatically by > > pgstats. Even with the checks you mention above, validating that we > > are indeed loading the same kind will never be totally foolproof. I > > think we can validate kinds better by adding some MAGIC NUMBER that is > > set by the extension during registration and tracked in pgstats.stat. > > If the kind ID > > and MAGIC NUMBER (for example, FORMAT ID) match, then we have more of > > a guarantee that this is the same extension. Also, this is probably a > > good idea to support extension upgrades. > > Probably. Still, it does not feel like something we need to push > strongly in a given direction, because we may pay in flexibility > compared to what's on HEAD. > > With a single kind ID forcing how large a fixed-sized entry or one > variable-sized entry should be based on what kind of data is loaded, > versioning something based on a single kind is impossible (or the same > size is used, leading to different contents translated). Of course, > one could use multiple kind IDs if they want to enforce different > stats versions in a single major version of the backend. To me, this > property comes down that the version of the pgstats file cannot change > in a single major version, implying that the same rule is also pushed > to the custom kinds: if you begin to use a kind ID in a given major > version, do not change the stats format, limitting new changes to a > new major version. Or an upgrade format path to use is something an > extension needs to handle on its own. I'd bet that stats upgrade > requirements for the same major version are not going to be that > common, the future may tell a different story depending on the use > cases that pop. OK, fair enough. I don't have enough compelling use-cases to push back harder on this. Initially, my focus was on the fact that unlike built-in stats, custom stats may not have a registered kind and disposing of all the stats in that case did not seem like the appropriate behavior, or least too strict of a behavior. But also, stats themselves are for the most part are consumed in a way that can tolerate stats being reset, because most of the time a user is taking deltas of stats between snapshots. The autovacuum case is special, and the work that Bertrand is doing will be beneficial to ensure that a/v has the correct stats after crash recovery. -- Sami
On Wed, Oct 22, 2025 at 03:57:15PM -0500, Sami Imseih wrote: > OK, fair enough. I don't have enough compelling use-cases to push > back harder on this. Initially, my focus was on the fact that unlike > built-in stats, custom stats may not have a registered kind and > disposing of all the stats in that case did not seem like the appropriate > behavior, or least too strict of a behavior. Two things that may be useful here, as long as I have this problem in mind.. First, could it could be useful to have an in-core function able to remove from shared memory all the entries related to a specific stats kind, taking in input a kind ID? At least for variable-sized data in the shared hashtable, that could become useful. A second potential improvement would be to bypass the write of a fixed-numbered stats kind if we detect that nothing needs to be written because there is no data, after a reset or if we detect it as full of zeroes. The format of the file with PGSTAT_FILE_ENTRY_FIXED makes this flexibility possible, and we could extend the snapshot_cb and pgstat_build_snapshot_fixed() so as they return a boolean flag to let the upper layer decide if some data should be written or not. If you combine the first system function with resets and the tweak for the fixed-numbered stats, then one could reset/remove the stats, drop a custom kind from s_p_l, and still preserve the rest of the stats. That's more steps than doing all that automatically, but that feels also a bit more bullet-proof if one wants to "upgrade" a stats kind by reusing the same kind ID across two restarts. -- Michael
Вложения
Hi, On Wed, Oct 22, 2025 at 02:41:42PM +0000, Bertrand Drouvot wrote: > Hi, > > On Wed, Oct 22, 2025 at 02:51:31PM +0900, Michael Paquier wrote: > > > > A first part here is something that Bertrand Drouvot has been working > > on: being able to rebuild the table stats during WAL replay requires > > us to move the stats to be relfilenode-based, so as the startup could > > know where to recreate some of the numbers lost during crash recovery > > from WAL, giving autovacuum a safety net. > > Yeah, I resumed working on it recently and should be able to share a patch > next week (based on what we discussed in [1]). FWIW I changed the design so expect some delay. Will share the design decision with the patch(es) once ready. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com