Обсуждение: Refactor query normalization into core query jumbling
Hi, A while back, there was a discussion about moving the query normalization code out of pg_stat_statements and into core query jumbling [0]. This conversation also led to hardening the hooks (currently only post_parse_analyze) that receive JumbleState [1]. For the first point, `generate_normalized_query` takes the query string and the jumbleState with constant locations from core query jumbling. There is nothing uniquely special about pg_stat_statements in this code, and it can be used by other extensions. Extensions could even pass their own query string and JumbleState if they choose. Patch 0001 moves this function and its helpers to queryjumblefuncs.c. A Github search also shows that there are quite a few extensions that may be copying this code [2] [3], which means they will lose out on potential improvements and fixes. As part of this patch, the moved code itself did not change, but I did improve a comment: ``` * If query_loc > 0, then "query" has been advanced by that much compared to * the original string start, so we need to translate the provided locations * to compensate. (This lets us avoid re-scanning statements before the one * of interest, so it's worth doing.) ``` This comment appeared at the top of generate_normalized_query and fill_in_constant_lengths. I simplified it and moved a shortened version inside the functions. Also, I did not think `fill_in_constant_lengths` should be a global function, as I cannot think of a good reason it would be used on its own, though someone may have a different opinion there. For the second point, since JumbleState can be shared by multiple extensions, hooks should receive it as a const pointer. This signals read-only intent and prevents extensions from accidentally modifying it through the hook. This change is in 0002. This does mean that extensions will need to update their hooks, but I do not see that as an issue for a major version. Thoughts? [0] https://postgr.es/m/aQA9v9nLu5qsX8IE%40paquier.xyz [1] https://postgr.es/m/202510281023.4u5aszccvsct%40alvherre.pgsql [2] https://github.com/search?q=fill_in_constant_lengths&type=code [3] https://github.com/search?q=generate_normalized_query&type=code -- Sami Imseih Amazon Web Services (AWS)
Вложения
Hi, On Thu, Dec 18, 2025 at 06:44:17PM -0600, Sami Imseih wrote: > For the second point, since JumbleState can be shared by multiple > extensions, hooks should receive it as a const pointer. This > signals read-only intent and prevents extensions from > accidentally modifying it through the hook. This change is in > 0002. This does mean that extensions will need to update their > hooks, but I do not see that as an issue for a major version. I can see that 0001 is doing: -static void fill_in_constant_lengths(JumbleState *jstate, const char *query, and then: +static void +fill_in_constant_lengths(const JumbleState *jstate, const char *query, Should the const addition be in 0002? But... While looking at fill_in_constant_lengths(), I can see that it is doing: locs = jstate->clocations; and then things like: locs[i].length = -1 or locs[i].length = strlen(yyextra.scanbuf + loc) While this is technically correct so the compiler does not complain (because clocations is a non const pointer in JumbleState and the added const does not apply to what clocations points to), I think that adding const here is misleading. Indeed, the function clearly modifies data accessible through the parameter. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> While this is technically correct so the compiler does not complain (because
> clocations is a non const pointer in JumbleState and the added const does not
> apply to what clocations points to), I think that adding const here is misleading.
Yes, I am not happy with this. I initially thought it would be OK to
modify the JumbleState as long as it is done in a core function, but
after much thought, this is neither a good idea nor safe. If a pointer
is marked as a Constant, we should not modify it.
So, I went back to think about this, and the core problem as I see it
is that multiple hooks on the chain can modify the constant lengths. For
example, pg_stat_statements can now modify the constant lengths in
JumbleState via fill_in_constant_lengths, and the same JumbleState can
have its constant locations modified in a different way.
At this time, constant lengths are the only part of the JumbleState that
is not set by core. So, I don't think post_parse_analyze receiving
JumbleState as a constant is the right approach, nor do we need it.
I think we should allow JumbleState to define a normalization callback,
which defaults to a core normalization function rather than an
extension specific one:
```
jstate->normalize_query = GenerateNormalizedQuery;
```
This way, any extension that wishes to return a normalized string from
the same JumbleState can invoke this callback and get consistent results.
pg_stat_statements and other extensions with a need to normalize a query
string based on the locations of a JumbleState do not need to care about the
internals of normalization, they simply invoke the callback and
receive the final
string.
So v2-0001 implements this callback and moves the normalization logic
into core. Both changes must be done in the same patch. The comments
are also updated where they are no longer applicable or could be improved.
One additional improvement that this patch did not include is a bool in
JumbleState that tracks whether a normalized string has already been
generated. This way, repeated calls to the callback would not need to
regenerate the string; only the first call would perform the work,
while subsequent calls could simply return the previously computed
normalized string.
I do like the simplicity of this approach and it removes pg_stat_statements
from having to own the normalization code and how well different extensions
sharing the same JumbleState can play together. Not yet sure if this is the
correct direction, and I am open to other ideas.
--
Sami Imseih
Amazon Web Services (AWS)
Вложения
On Fri, Dec 19, 2025 at 03:36:16PM -0600, Sami Imseih wrote: > This way, any extension that wishes to return a normalized string from > the same JumbleState can invoke this callback and get consistent results. > pg_stat_statements and other extensions with a need to normalize a query > string based on the locations of a JumbleState do not need to care about the > internals of normalization, they simply invoke the callback and > receive the final > string. Hmm. I did not wrap completely my head with your problem, but, assuming that what you are proposing goes in the right direction, I am wondering if we should not expose a bit more the jumble query APIs so as the normal default callback can be reused by out-of-core rather than hide it entirely. This would mean exposing GenerateNormalizedQuery(), which also giving a way for callers of JumbleQuery() to pass down a custom callback? This would imply thinking harder about the initialization state we expect in the structure, but I think that we should try to design things so as extensions do not need to copy-paste more code from the core tree at the end, just less of it. Of course, this sentence is written with the same line of thoughts as previously mentioned in the other thread we have discussed: extensions should not be allowed to update a JumbleState after it's been set by the backend code, so as once the same JumbleState pointer is passed down across multiple extensions they don't get confused. If an extension wants to use their own policy within the JumbleState, they had better recreate a new independent one if they are unhappy about has been generated previously. -- Michael