Обсуждение: Refactor query normalization into core query jumbling

Поиск
Список
Период
Сортировка

Refactor query normalization into core query jumbling

От
Sami Imseih
Дата:
Hi,

A while back, there was a discussion about moving the query
normalization code out of pg_stat_statements and into core
query jumbling [0]. This conversation also led to hardening
the hooks (currently only post_parse_analyze) that receive
JumbleState [1].

For the first point, `generate_normalized_query` takes the query
string and the jumbleState with constant locations from core
query jumbling. There is nothing uniquely special about
pg_stat_statements in this code, and it can be used by other extensions.
Extensions could even pass their own query string and JumbleState
if they choose. Patch 0001 moves this function and its helpers to
queryjumblefuncs.c.

A Github search also shows that there are quite a few extensions
that may be copying this code [2] [3], which means they will lose
out on potential improvements and fixes.

As part of this patch, the moved code itself did not change,
but I did improve a comment:

```
* If query_loc > 0, then "query" has been advanced by that much compared to
* the original string start, so we need to translate the provided locations
* to compensate.  (This lets us avoid re-scanning statements before the one
* of interest, so it's worth doing.)
```

This comment appeared at the top of generate_normalized_query and
fill_in_constant_lengths. I simplified it and moved a shortened
version inside the functions.

Also, I did not think `fill_in_constant_lengths` should be a global
function, as I cannot think of a good reason it would be used
on its own, though someone may have a different opinion there.

For the second point, since JumbleState can be shared by multiple
extensions, hooks should receive it as a const pointer. This
signals read-only intent and prevents extensions from
accidentally modifying it through the hook. This change is in
0002. This does mean that extensions will need to update their
hooks, but I do not see that as an issue for a major version.

Thoughts?

[0] https://postgr.es/m/aQA9v9nLu5qsX8IE%40paquier.xyz
[1] https://postgr.es/m/202510281023.4u5aszccvsct%40alvherre.pgsql
[2] https://github.com/search?q=fill_in_constant_lengths&type=code
[3] https://github.com/search?q=generate_normalized_query&type=code

--
Sami Imseih
Amazon Web Services (AWS)

Вложения

Re: Refactor query normalization into core query jumbling

От
Bertrand Drouvot
Дата:
Hi,

On Thu, Dec 18, 2025 at 06:44:17PM -0600, Sami Imseih wrote:
> For the second point, since JumbleState can be shared by multiple
> extensions, hooks should receive it as a const pointer. This
> signals read-only intent and prevents extensions from
> accidentally modifying it through the hook. This change is in
> 0002. This does mean that extensions will need to update their
> hooks, but I do not see that as an issue for a major version.

I can see that 0001 is doing:

-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,

and then:

+static void
+fill_in_constant_lengths(const JumbleState *jstate, const char *query,

Should the const addition be in 0002? But...

While looking at fill_in_constant_lengths(), I can see that it is doing:

locs = jstate->clocations;

and then things like:

locs[i].length = -1

or

locs[i].length = strlen(yyextra.scanbuf + loc)

While this is technically correct so the compiler does not complain (because
clocations is a non const pointer in JumbleState and the added const does not
apply to what clocations points to), I think that adding const here is misleading.

Indeed, the function clearly modifies data accessible through the parameter.

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Refactor query normalization into core query jumbling

От
Sami Imseih
Дата:
> While this is technically correct so the compiler does not complain (because
> clocations is a non const pointer in JumbleState and the added const does not
> apply to what clocations points to), I think that adding const here is misleading.

Yes, I am not happy with this. I initially thought it would be OK to
modify the JumbleState as long as it is done in a core function, but
after much thought, this is neither a good idea nor safe. If a pointer
is marked as a Constant, we should not modify it.

So, I went back to think about this, and the core problem as I see it
is that multiple hooks on the chain can modify the constant lengths. For
example, pg_stat_statements can now modify the constant lengths in
JumbleState via fill_in_constant_lengths, and the same JumbleState can
have its constant locations modified in a different way.

At this time, constant lengths are the only part of the JumbleState that
is not set by core. So, I don't think post_parse_analyze receiving
JumbleState as a constant is the right approach, nor do we need it.

I think we should allow JumbleState to define a normalization callback,
which defaults to a core normalization function rather than an
extension specific one:

```
    jstate->normalize_query = GenerateNormalizedQuery;
```

This way, any extension that wishes to return a normalized string from
the same JumbleState can invoke this callback and get consistent results.
pg_stat_statements and other extensions with a need to normalize a query
string based on the locations of a JumbleState do not need to care about the
internals of normalization, they simply invoke the callback and
receive the final
string.

So v2-0001 implements this callback and moves the normalization logic
into core. Both changes must be done in the same patch.  The comments
are also updated where they are no longer applicable or could be improved.

One additional improvement that this patch did not include is a bool in
JumbleState that tracks whether a normalized string has already been
generated. This way, repeated calls to the callback would not need to
regenerate the string; only the first call would perform the work,
while subsequent calls could simply return the previously computed
normalized string.

I do like the simplicity of this approach and it removes pg_stat_statements
from having to own the normalization code and how well different extensions
sharing the same JumbleState can play together. Not yet sure if this is the
correct direction, and I am open to other ideas.

--
Sami Imseih
Amazon Web Services (AWS)

Вложения

Re: Refactor query normalization into core query jumbling

От
Michael Paquier
Дата:
On Fri, Dec 19, 2025 at 03:36:16PM -0600, Sami Imseih wrote:
> This way, any extension that wishes to return a normalized string from
> the same JumbleState can invoke this callback and get consistent results.
> pg_stat_statements and other extensions with a need to normalize a query
> string based on the locations of a JumbleState do not need to care about the
> internals of normalization, they simply invoke the callback and
> receive the final
> string.

Hmm.  I did not wrap completely my head with your problem, but,
assuming that what you are proposing goes in the right direction, I am
wondering if we should not expose a bit more the jumble query APIs so
as the normal default callback can be reused by out-of-core rather
than hide it entirely.  This would mean exposing
GenerateNormalizedQuery(), which also giving a way for callers of
JumbleQuery() to pass down a custom callback?  This would imply
thinking harder about the initialization state we expect in the
structure, but I think that we should try to design things so as
extensions do not need to copy-paste more code from the core tree at
the end, just less of it.

Of course, this sentence is written with the same line of thoughts as
previously mentioned in the other thread we have discussed: extensions
should not be allowed to update a JumbleState after it's been set by
the backend code, so as once the same JumbleState pointer is passed
down across multiple extensions they don't get confused.  If an
extension wants to use their own policy within the JumbleState, they
had better recreate a new independent one if they are unhappy about
has been generated previously.
--
Michael

Вложения