Discussion: Cleaning up PREPARE query strings?
Hi,
Currently, prepared statements store the whole query string that was submitted
by the client at the time of the PREPARE, as-is. This is usually fine, but if
that query was a multi-statement query string it can lead to a waste of memory.
There are some patterns that are more likely to incur such overhead; mine is
an application with a fixed set of prepared statements that are sent at
connection start using a single query to avoid extra round trips.
One naive example of the outcome is as follows:
#= PREPARE s1 AS SELECT 1\; PREPARE s2(text) AS SELECT oid FROM pg_class WHERE
relname = $1\; PREPARE s3(int, int) AS SELECT $1 + $2;
PREPARE
PREPARE
PREPARE
=# SELECT name, statement FROM pg_prepared_statements ;
name | statement
------+----------------------------------------------------------------------------
s1 | PREPARE s1 AS SELECT 1; PREPARE s2(text) AS SELECT oid FROM pg_class WHERE+
| relname = $1; PREPARE s3(int, int) AS SELECT $1 + $2;
s2 | PREPARE s1 AS SELECT 1; PREPARE s2(text) AS SELECT oid FROM pg_class WHERE+
| relname = $1; PREPARE s3(int, int) AS SELECT $1 + $2;
s3 | PREPARE s1 AS SELECT 1; PREPARE s2(text) AS SELECT oid FROM pg_class WHERE+
| relname = $1; PREPARE s3(int, int) AS SELECT $1 + $2;
(3 rows)
The more prepared statements you have, the bigger the waste. This is also not
particularly readable for people who want to rely on the pg_prepared_statements
view, as you need to parse the query again yourself to figure out which part
is the associated query.
I assume that some other patterns could lead to other kinds of problems. For
instance, if the query string includes a prepared statement and some DML, it
could lead an automated program to replay both the PREPARE and the DML when
only the PREPARE was intended.
I'm attaching a POC patch to fix that behavior by teaching PREPARE to clean the
passed query text the same way as pg_stat_statements does. Since it relies on
the location saved during parsing, the overhead should be minimal, and only
present when some space can actually be saved. Note that I first tried to have
the cleanup done in CreateCachedPlan so that it's done everywhere, including
things like the extended protocol, but this led to too many issues so I ended
up doing it for an explicit PREPARE statement only.
With this patch applied, the above scenario gives this new output:
=# SELECT name, statement FROM pg_prepared_statements ;
name | statement
------+----------------------------------------------------
s1 | PREPARE s1 AS SELECT 1
s2 | PREPARE s2(text) AS SELECT oid FROM pg_class WHERE+
| relname = $1
s3 | PREPARE s3(int, int) AS SELECT $1 + $2
(3 rows)
One possible issue is that any comment present at the beginning of the query
text would be discarded. I'm not sure if that's something used by e.g.
pg_hint_plan, but if it is, it's always possible to put the comment in front of
the SELECT (or other actual first keyword) rather than the PREPARE itself to
preserve it.
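For readers unfamiliar with the mechanism referenced here: pg_stat_statements isolates one statement from a multi-statement string using the location and length the parser recorded, then trims surrounding whitespace and the trailing semicolon. Below is a stand-alone approximation of that behavior; the function name and exact trimming rules are illustrative, not the backend's actual CleanQuerytext():

```c
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical stand-in for the backend's CleanQuerytext(): given the full
 * source string and a raw statement's location/length, return a pointer to
 * the statement with leading whitespace skipped, and shrink *len so that
 * trailing whitespace and the terminating semicolon are excluded.
 */
static const char *
clean_querytext(const char *query, int *location, int *len)
{
    if (*location >= 0)
    {
        /* confine ourselves to the indicated statement */
        query += *location;
        if (*len <= 0)
            *len = (int) strlen(query);
    }
    else
        *len = (int) strlen(query);

    /* skip leading whitespace */
    while (*len > 0 && isspace((unsigned char) *query))
    {
        query++;
        (*len)--;
    }

    /* trim trailing whitespace and the statement-terminating semicolon */
    while (*len > 0 && (isspace((unsigned char) query[*len - 1]) ||
                        query[*len - 1] == ';'))
        (*len)--;

    return query;
}
```

Called with the location/length the parser recorded for the second statement of a multi-statement string, this yields a pointer/length covering only that statement, which is the behavior the patch reuses for PREPARE.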
Attachments
Julien Rouhaud <rjuju123@gmail.com> writes:
> I'm attaching a POC patch to fix that behavior by teaching PREPARE to clean the
> passed query text the same way as pg_stat_statements.
This patch invalidates all the location fields in the parsed query:
they could not be used to generate sane error cursors referencing the
truncated string. I'm not sure how many places try to generate such
errors post-parsing, but it's more than zero, and I've long had
ambitions of trying to extend that substantially (e.g., allowing
execution-time errors from functions to point at the relevant function
call).
Certainly the patch could be extended to update all those fields,
but that increases its complexity very significantly. I doubt
that it's worth it. My reaction to your example is more like
"if that bothers you, don't do it that way".
regards, tom lane
Hi,

I think this is a good idea.

> Julien Rouhaud <rjuju123@gmail.com> writes:
> > I'm attaching a POC patch to fix that behavior by teaching PREPARE to clean the
> > passed query text the same way as pg_stat_statements.
>
> This patch invalidates all the location fields in the parsed query:
> they could not be used to generate sane error cursors referencing the
> truncated string.

I am not sure why we need to update the location of the rawstmt at all.
CreateCachedPlan does not seem to care about the location and length of query
strings, AFAICT.

```
rawstmt->stmt_location = 0;
```

This is somewhat tangential, but I am now also wondering if CleanQuerytext
itself should not mess with the supplied statement location and length, but
instead send back the location and length of the isolated query text within
the multi-statement query separately. From

```
extern const char *CleanQuerytext(const char *query, int *location, int *len);
```

to

```
extern const char *CleanQuerytext(const char *query, int location, int len,
                                  int *new_location, int *new_len);
```

It looks wrong that CleanQuerytext is modifying any of the original locations
and lengths.

--
Sami Imseih
Amazon Web Services (AWS)
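A minimal sketch of the non-mutating signature suggested above, as a stand-alone function. This is illustrative only: the trimming rules approximate rather than reproduce the backend's CleanQuerytext, and the function name is made up.

```c
#include <ctype.h>
#include <string.h>

/*
 * Sketch of the proposed API change: instead of mutating the caller's
 * location/len, report the isolated statement's position within the full
 * source string through separate out-parameters.
 */
static const char *
clean_querytext_const(const char *query, int location, int len,
                      int *new_location, int *new_len)
{
    const char *stmt = query;

    if (location >= 0)
    {
        stmt += location;
        if (len <= 0)
            len = (int) strlen(stmt);
    }
    else
        len = (int) strlen(query);

    /* skip leading whitespace */
    while (len > 0 && isspace((unsigned char) *stmt))
    {
        stmt++;
        len--;
    }

    /* trim trailing whitespace and the terminating semicolon */
    while (len > 0 && (isspace((unsigned char) stmt[len - 1]) ||
                       stmt[len - 1] == ';'))
        len--;

    /* tell the caller where the cleaned text sits in the original string */
    *new_location = (int) (stmt - query);
    *new_len = len;
    return stmt;
}
```

The caller's original location/len stay untouched, so anything that still needs the raw statement span (error cursors, pg_stat_statements) can keep using them.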
Hi,

On Wed, Dec 24, 2025 at 11:21:00AM -0500, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > I'm attaching a POC patch to fix that behavior by teaching PREPARE to clean the
> > passed query text the same way as pg_stat_statements.
>
> This patch invalidates all the location fields in the parsed query:
> they could not be used to generate sane error cursors referencing the
> truncated string. I'm not sure how many places try to generate such
> errors post-parsing, but it's more than zero, and I've long had
> ambitions of trying to extend that substantially (e.g., allowing
> execution-time errors from functions to point at the relevant function
> call).

Ah, I wasn't aware of that. After a quick look I'm assuming it's all the
callers of executor_errposition()? There aren't a lot of them, and I've
unfortunately been unable to hit any of them. Every scenario I tried was
always caught at planning time. Do you have any example of how to exercise
them?

> Certainly the patch could be extended to update all those fields,
> but that increases its complexity very significantly. I doubt
> that it's worth it. My reaction to your example is more like
> "if that bothers you, don't do it that way".

I agree that updating the location fields in the whole parse tree is a
non-starter, but I don't think that it's the only option. If it's indeed the
cases going through executor_errposition(), they rely on having the executor
state at hand to retrieve the query string. We could simply have an extra
"query string offset" field stored there that would always be 0 except when
PREPARE changes the input query string, and then apply that offset when doing
the location-to-character-number calculation. And that offset should already
be returned by CleanQuerytext(), so it wouldn't add extra complication.
I was looking a bit more here, and I found that this patch breaks if
pg_stat_statements is enabled.
```
postgres=# SELECT 'bingo'\; PREPARE q1 AS SELECT 1 AS a \; SELECT 42;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?>
```
This is because the raw statement location is set to 0 with the patch, which
breaks generate_normalized_query when trying to deal with a multi-command
statement. So, I don't think the way v1 is doing things will actually work.
I am thinking that storing the statement length and location in the entry
of prepared_queries may be the better option. Having that info, the
cleaned up query text can be generated on the fly whenever
pg_prepared_statement is called.
We may not need to call CleanQuerytext every time; however, it does have the
benefit of removing leading and trailing whitespace and the terminating
semicolon.
What do you think?
--
Sami Imseih
Amazon Web Services (AWS)
Hi,

On Tue, Jan 13, 2026 at 02:24:51PM -0600, Sami Imseih wrote:
> I was looking a bit more here, and I found that this patch breaks if
> pg_stat_statements is enabled.
>
> ```
> postgres=# SELECT 'bingo'\; PREPARE q1 AS SELECT 1 AS a \; SELECT 42;
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> The connection to the server was lost. Attempting reset: Failed.
> !?>
> ```
>
> This is because the raw statement location is set to 0 with the patch, which
> breaks generate_normalized_query when trying to deal with a multi-command
> statement. So, I don't think the way v1 is doing things will actually work.

That's very strange. I'm pretty sure that I tried that, since the earlier
approach I mentioned, doing it in CreateCachedPlan(), had that exact problem.

Also, src/test/recovery/t/027_stream_regress.pl does run the full regression
tests with pg_stat_statements enabled, and it doesn't fail.

I'm not sure what is different in your case.
> That's very strange. I'm pretty sure that I tried that, since the earlier
> approach I mentioned, doing it in CreateCachedPlan(), had that exact problem.
>
> Also, src/test/recovery/t/027_stream_regress.pl does run the full regression
> tests with pg_stat_statements enabled, and it doesn't fail.
>
> I'm not sure what is different in your case.

The regression tests succeed because by the time the multi-statement command
is executed, the queries involved have already been tracked by
pg_stat_statements, so they don't need to get stored again, and there is no
reason to go through generate_normalized_query. In this case, SELECT $1 is
already tracked. To repro the crash, just reset pg_stat_statements beforehand.

```
diff --git a/src/test/regress/sql/prepare.sql b/src/test/regress/sql/prepare.sql
index 0e7fe44725e..51421357f26 100644
--- a/src/test/regress/sql/prepare.sql
+++ b/src/test/regress/sql/prepare.sql
@@ -4,6 +4,9 @@
 SELECT name, statement, parameter_types, result_types FROM
 pg_prepared_statements;
 
+CREATE EXTENSION pg_stat_statements;
+SELECT pg_stat_statements_reset();
+
 SELECT 'bingo'\; PREPARE q1 AS SELECT 1 AS a \; SELECT 42;
 EXECUTE q1;
```

For this test, we should either reset statements before, or construct a
statement for the test that has not been tracked. We can maybe do that with a
query against a newly created table in prepare.sql. I like the latter more.

--
Sami Imseih
Amazon Web Services (AWS)
Hi,

On Wed, Jan 14, 2026 at 09:56:21AM -0600, Sami Imseih wrote:
> > That's very strange. I'm pretty sure that I tried that, since the earlier
> > approach I mentioned, doing it in CreateCachedPlan(), had that exact
> > problem.
> >
> > Also, src/test/recovery/t/027_stream_regress.pl does run the full
> > regression tests with pg_stat_statements enabled, and it doesn't fail.
> >
> > I'm not sure what is different in your case.
>
> The regression tests succeed because by the time the multi-statement command
> is executed, the queries involved have already been tracked by
> pg_stat_statements, so they don't need to get stored again, and there is no
> reason to go through generate_normalized_query. In this case, SELECT $1 is
> already tracked.

I see. I'm not sure why my first prototype used to crash with that exact same
problem in 027_stream_regress but not this one, but anyway.

I only had a quick look at the code, but unless I'm missing something it's
mostly an oversight, as I pass "new_query" to CreateCachedPlan() but still pass
the original query string to pg_analyze_and_rewrite_varparams(). Using
"new_query" there too should fix the problem?

In an earlier message you wrote:

> I am thinking that storing the statement length and location in the entry
> of prepared_queries may be the better option. Having that info, the
> cleaned up query text can be generated on the fly whenever
> pg_prepared_statement is called.

I don't like this approach, as it has more execution-time overhead, but also
doesn't fix at all the memory waste in the prepared statements cache.

> To repro the crash, just reset pg_stat_statements beforehand.
>
> ```
> diff --git a/src/test/regress/sql/prepare.sql b/src/test/regress/sql/prepare.sql
> index 0e7fe44725e..51421357f26 100644
> --- a/src/test/regress/sql/prepare.sql
> +++ b/src/test/regress/sql/prepare.sql
> @@ -4,6 +4,9 @@
>
>  SELECT name, statement, parameter_types, result_types FROM
>  pg_prepared_statements;
>
> +CREATE EXTENSION pg_stat_statements;
> +SELECT pg_stat_statements_reset();
> +
>  SELECT 'bingo'\; PREPARE q1 AS SELECT 1 AS a \; SELECT 42;
>  EXECUTE q1;
> ```
>
> For this test, we should either reset statements before, or construct a
> statement for the test that has not been tracked. We can maybe
> do that with a query against a newly created table in prepare.sql.
> I like the latter more.

FWIW I think that this should belong to pg_stat_statements testing, not the
main regression tests. This would also ensure that we see consistent results
in some other scenarios.
> I only had a quick look at the code, but unless I'm missing something it's
> mostly an oversight, as I pass "new_query" to CreateCachedPlan() but still
> pass the original query string to pg_analyze_and_rewrite_varparams(). Using
> "new_query" there too should fix the problem?

That would be a fix, but that seems way too invasive to me. Not sure if it may
break other things.

> > I am thinking that storing the statement length and location in the entry
> > of prepared_queries may be the better option. Having that info, the
> > cleaned up query text can be generated on the fly whenever
> > pg_prepared_statement is called.
>
> I don't like this approach, as it has more execution-time overhead, but also
> doesn't fix at all the memory waste in the prepared statements cache.

ISTM that your proposal will actually use more memory because
pstate->p_sourcetext does not get free'd, but we must now allocate space for a
new "cleaned" query.

As far as execution overhead, this will only be paid when you query
pg_prepared_statement, and we can even optimize this a bit by only cleaning
once and reusing the results.

> FWIW I think that this should belong to pg_stat_statements testing, not the
> main regression tests. This would also ensure that we see consistent results
> in some other scenarios.

I agree.

--
Sami Imseih
Amazon Web Services (AWS)
Hi,

On Thu, Jan 15, 2026 at 01:13:36PM -0600, Sami Imseih wrote:
> > I only had a quick look at the code, but unless I'm missing something it's
> > mostly an oversight, as I pass "new_query" to CreateCachedPlan() but still
> > pass the original query string to pg_analyze_and_rewrite_varparams().
> > Using "new_query" there too should fix the problem?
>
> That would be a fix, but that seems way too invasive to me. Not sure if it
> may break other things.
>
> > > I am thinking that storing the statement length and location in the
> > > entry of prepared_queries may be the better option. Having that info,
> > > the cleaned up query text can be generated on the fly whenever
> > > pg_prepared_statement is called.
> >
> > I don't like this approach, as it has more execution-time overhead, but
> > also doesn't fix at all the memory waste in the prepared statements cache.
>
> ISTM that your proposal will actually use more memory because
> pstate->p_sourcetext does not get free'd, but we must now allocate
> space for a new "cleaned" query.

I'm not sure that I understand. Sure, we allocate a new temporary buffer for
the cleaned-up query string, but this will be freed soon. The query string
stored in the prepared statement will stay in memory as long as the prepared
statement exists, so any cleanup can actually save a lot of memory.

I'm attaching a v2 that fixes the crash you described earlier, along with some
minimal regression tests in pg_stat_statements to cover it.

> As far as execution overhead, this will only be paid when you query
> pg_prepared_statement, and we can even optimize this a bit by only
> cleaning once and reusing the results.

Actually, I thought that there was a requirement to have a zero-terminated
string in the API to transform C strings to text datums, but there is not.
It's actually slightly faster if you know the size, as there is one less
strlen() call in that case. This could be done in both cases though.

> > FWIW I think that this should belong to pg_stat_statements testing, not
> > the main regression tests. This would also ensure that we see consistent
> > results in some other scenarios.
>
> I agree.

As mentioned, basic tests were added in v2.
Attachments
> > ISTM that your proposal will actually use more memory because
> > pstate->p_sourcetext does not get free'd, but we must now allocate
> > space for a new "cleaned" query.
>
> I'm not sure that I understand. Sure, we allocate a new temporary buffer for
> the cleaned-up query string, but this will be freed soon. The query string
> stored in the prepared statement will stay in memory as long as the prepared
> statement exists, so any cleanup can actually save a lot of memory.
My point is that pstate->p_sourcetext doesn't get freed just because we're not
referencing it from the CachedPlanSource in prepared_queries. Instead, with
multi-statement strings, prepared_queries now references a newly allocated
string, new_query, which will be around until DEALLOCATE.
```
+ tmp = palloc(rawstmt->stmt_len + 1);
+ strlcpy(tmp, cleaned, rawstmt->stmt_len + 1);
+
+ new_query = tmp;
```
So this patch always increases memory usage for multi-statement strings;
we have both the original multi-statement string and the cleaned up copy
around.
--
Sami Imseih
Amazon Web Services (AWS)
Hi,

On Sun, Jan 18, 2026 at 10:16:16AM -0600, Sami Imseih wrote:
> > > ISTM that your proposal will actually use more memory because
> > > pstate->p_sourcetext does not get free'd, but we must now allocate
> > > space for a new "cleaned" query.
> >
> > I'm not sure that I understand. Sure, we allocate a new temporary buffer
> > for the cleaned-up query string, but this will be freed soon. The query
> > string stored in the prepared statement will stay in memory as long as
> > the prepared statement exists, so any cleanup can actually save a lot of
> > memory.
>
> My point is that pstate->p_sourcetext doesn't get freed just because we're
> not referencing it from the CachedPlanSource in prepared_queries. Instead,
> with multi-statement strings, prepared_queries now references a newly
> allocated string, new_query, which will be around until DEALLOCATE.
>
> ```
> + tmp = palloc(rawstmt->stmt_len + 1);
> + strlcpy(tmp, cleaned, rawstmt->stmt_len + 1);
> +
> + new_query = tmp;
> ```
>
> So this patch always increases memory usage for multi-statement strings;
> we have both the original multi-statement string and the cleaned up copy
> around.

I think that you have a misunderstanding of how memory contexts work in
postgres. Both pstate->p_sourcetext and new_query will be freed when their
underlying memory context is reset, whether or not they're individually
freed, so they won't survive until the next EXECUTE or DEALLOCATE. It's
actually usually worse to issue individual pfree() calls, as it's just
unneeded overhead.

CreateCachedPlan() creates a dedicated memory context to save all its
information, including a new copy of the passed query string, and that context
will eventually be linked to a permanent one so that it can survive the
original query's execution lifetime. So with this patch we have an opportunity
to allocate less data in the permanent memory context, which as I said only
requires a transient extra allocation for the relevant portion of the current
query string.
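A toy region allocator can illustrate the memory-context behavior described above: allocations belong to a context, and resetting the context releases everything it owns at once, so individual frees are unnecessary. This is only a sketch of the idea; the names are invented and PostgreSQL's actual MemoryContext implementation is far more elaborate.

```c
#include <stdlib.h>

/* Header prepended to every allocation so the context can find it later. */
typedef struct Chunk
{
    struct Chunk *next;
} Chunk;

/* A toy "memory context": a linked list of allocations it owns. */
typedef struct ToyContext
{
    Chunk *chunks;
    int    nallocs;
} ToyContext;

/* Allocate memory owned by ctx; freed en masse by toy_reset(). */
static void *
toy_alloc(ToyContext *ctx, size_t size)
{
    Chunk *c = malloc(sizeof(Chunk) + size);

    c->next = ctx->chunks;
    ctx->chunks = c;
    ctx->nallocs++;
    return c + 1;               /* usable memory starts after the header */
}

/* Release every allocation owned by ctx in one pass. */
static void
toy_reset(ToyContext *ctx)
{
    while (ctx->chunks)
    {
        Chunk *next = ctx->chunks->next;

        free(ctx->chunks);
        ctx->chunks = next;
    }
    ctx->nallocs = 0;
}
```

In this analogy, strings like p_sourcetext and new_query live in a short-lived context that gets reset when the query finishes, while the CachedPlanSource's copy lives in a long-lived context that survives until DEALLOCATE.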
> I think that you have a misunderstanding of how memory context works in
> postgres.
No, I understand how memory contexts work, but I skipped over the fact that
the cleaned-up query text is copied into the CachedPlanSource's memory context
during CreateCachedPlan:
```
plansource->query_string = pstrdup(query_string);
```
which is long-lived and will remain until DEALLOCATE, while
pstate->p_sourcetext and new_query are in the current execution's memory
context.
The patch is indeed better from a memory perspective. My apologies.
However, the error reporting does break with the patch. Notice how with the
patch the cursor for the error reporting shifts incorrectly. This is due to
the fact that rawstmt->stmt_location/length are no longer representative of
the original query text.
## unpatched
```
postgres=# SELECT 1 \; PREPARE stmt AS SELECT nonexistent_column
FROM users\; SELECT 2;
?column?
----------
1
(1 row)
ERROR: relation "users" does not exist
LINE 1: ... ; PREPARE stmt AS SELECT nonexistent_column FROM users; SEL...
^
```
## patched
```
postgres=# SELECT 1 \; PREPARE stmt AS SELECT nonexistent_column
FROM users\; SELECT 2;
?column?
----------
1
(1 row)
ERROR: relation "users" does not exist
LINE 1: ...LECT 1 ; PREPARE stmt AS SELECT nonexistent_column FROM u...
^
```
--
Sami Imseih
Amazon Web Services (AWS)
Hi,

On Mon, Jan 19, 2026 at 12:43:53AM -0600, Sami Imseih wrote:
> However, the error reporting does break with the patch. Notice how with the
> patch the cursor for the error reporting shifts incorrectly. This is due to
> the fact that rawstmt->stmt_location/length are no longer representative of
> the original query text.
>
> ## unpatched
> ```
> ERROR:  relation "users" does not exist
> LINE 1: ... ; PREPARE stmt AS SELECT nonexistent_column FROM users; SEL...
>                                                              ^
> ```
>
> ## patched
> ```
> ERROR:  relation "users" does not exist
> LINE 1: ...LECT 1 ; PREPARE stmt AS SELECT nonexistent_column FROM u...
>                                                              ^
> ```

This was already reported by Tom Lane in his first message, although his
complaint was about execution-time error reporting while this happens during
parse analysis. However, I think that the exact same approach can be used to
fix both: either updating the position of every single element (which no one
wants) or teaching the executor (and evidently the planstate) about a new
"query offset" so that parser_errposition and executor_errposition report the
correct location. I'm still waiting to hear whether the latter would be
acceptable or not before implementing it.

Note that I wasn't able to hit the execution-time error, but with the
parse-analysis-time error I can at least have some regression tests, so thanks
a lot!

In the meantime, the cfbot shows that a rebase is needed, so v3 attached.
Attachments
> This was already reported by Tom Lane in his first message, although his
> complaint was about execution-time error reporting while this happens during
> parse analysis. However, I think that the exact same approach can be used to
> fix both: either updating the position of every single element (which no one
> wants) or teaching the executor (and evidently the planstate) about a new
> "query offset" so that parser_errposition and executor_errposition report
> the correct location.

Maybe this is what you also mean, but I think it should be enough to just tell
Query about the updated location = 0, and we don't need to touch anything else
(the raw parsetree or the copy of it in plansource). This will ensure that the
error tracking relies on the original raw parsetree, while consumers of the
updated statement query location and lengths get the updated values based on
the cleaned-up query text that is stored in prepared statements.

I tested the below with the tests you have in v3 so far and it works. I also
tested the error position and it looks correct.

Not sure if I am missing something.
```
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -27,6 +27,7 @@
 #include "commands/prepare.h"
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/queryjumble.h"
 #include "parser/parse_coerce.h"
 #include "parser/parse_collate.h"
 #include "parser/parse_expr.h"
@@ -64,6 +65,8 @@ PrepareQuery(ParseState *pstate, PrepareStmt *stmt,
     Oid        *argtypes = NULL;
     int         nargs;
     List       *query_list;
+    const char *new_query;
+    ListCell   *lc;
 
     /*
      * Disallow empty-string statement name (conflicts with protocol-level
@@ -82,12 +85,31 @@ PrepareQuery(ParseState *pstate, PrepareStmt *stmt,
     rawstmt->stmt = stmt->query;
     rawstmt->stmt_location = stmt_location;
     rawstmt->stmt_len = stmt_len;
+    new_query = pstate->p_sourcetext;
+
+    /*
+     * If we have a multi-statement string, confine the query to the query we
+     * care about.
+     */
+    if (stmt_location >= 0)
+    {
+        const char *cleaned;
+        char       *tmp;
+
+        cleaned = CleanQuerytext(pstate->p_sourcetext,
+                                 &rawstmt->stmt_location,
+                                 &rawstmt->stmt_len);
+
+        tmp = palloc(rawstmt->stmt_len + 1);
+        strlcpy(tmp, cleaned, rawstmt->stmt_len + 1);
+
+        new_query = tmp;
+    }
 
     /*
      * Create the CachedPlanSource before we do parse analysis, since it needs
      * to see the unmodified raw parse tree.
      */
-    plansource = CreateCachedPlan(rawstmt, pstate->p_sourcetext,
+    plansource = CreateCachedPlan(rawstmt, new_query,
                                   CreateCommandTag(stmt->query));
 
     /* Transform list of TypeNames to array of type OIDs */
@@ -119,6 +141,17 @@ PrepareQuery(ParseState *pstate, PrepareStmt *stmt,
     query_list = pg_analyze_and_rewrite_varparams(rawstmt,
                                                   pstate->p_sourcetext,
                                                   &argtypes, &nargs, NULL);
+
+    /*
+     * Set the stmt location to 0 and update stmt_len, since at this point we
+     * only have the query we care about in case of a multi-statement string.
+     */
+    foreach(lc, query_list)
+    {
+        Query      *query = lfirst_node(Query, lc);
+
+        if (query->stmt_location > 0)
+            query->stmt_location = 0;
+    }
```
--
Sami Imseih
Amazon Web Services (AWS)
Hi,

On Mon, Jan 26, 2026 at 04:48:59PM -0600, Sami Imseih wrote:
> > This was already reported by Tom Lane in his first message, although his
> > complaint was about execution-time error reporting while this happens
> > during parse analysis. However, I think that the exact same approach can
> > be used to fix both: either updating the position of every single element
> > (which no one wants) or teaching the executor (and evidently the
> > planstate) about a new "query offset" so that parser_errposition and
> > executor_errposition report the correct location.
>
> Maybe this is what you also mean, but I think it should be enough to just
> tell Query about the updated location = 0, and we don't need to touch
> anything else (the raw parsetree or the copy of it in plansource). This will
> ensure that the error tracking relies on the original raw parsetree, while
> consumers of the updated statement query location and lengths get the
> updated values based on the cleaned-up query text that is stored in prepared
> statements.
>
> I tested the below with the tests you have in v3 so far and it works. I also
> tested the error position and it looks correct.
>
> Not sure if I am missing something.

It's hard to know what changed exactly, as you show the diff against master
rather than against the patch, but IIUC the problem will remain for
execution-time error location reporting, as reported before. I think that it
will still be wrong, and may now also point to an offset that isn't in the
query text. I still haven't been able to hit that code path, so I can't really
investigate more.

In general I feel like it would be cleaner to have the same "location vs query
offset" handling for both parse analysis and execution, but I'm not opposed to
this approach if it gets the feature accepted.