Обсуждение: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
"Jelte Fennema-Nio"
Дата:
A bunch of frontend tools, including psql, still used PQcancel to send cancel requests to the server. That function is insecure, because it does not use encryption to send the cancel request. This starts using the new cancellation APIs (introduced in 61461a300) for all these frontend tools. These APIs use the same encryption settings as the connection that's being cancelled. Since these APIs are not signal-safe this required a refactor to not send the cancel request in a signal handler anymore, but instead using a dedicated thread. Similar logic was already used for Windows anyway, so this also has the benefit that it makes the cancel logic more uniform across our supported platforms. Because this is fixing a security issue, it would be nice if we could backport it. I'm not sure that's realistic though, given the size/complexity of the change. I'm curious what others think about that. To be clear, I'm only really talking about backporting to PG17 and PG18 because those contain the new cancellation APIs in libpq. Backporting to even older versions would also require backporting the new cancellation APIs in libpq, which seems like a no-go. A possible follow up improvement to pg_dump would be to use threads for its parallel workers on UNIX as well. Then the Windows and Unix implementations would get even more aligned. I started looking into that, but that quickly became quite a big refactor, touching a lot of code unrelated to cancellation. So it seems better to do that in a separate follow-on patch if people are interested.
Вложения
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
"Jelte Fennema-Nio"
Дата:
On Sun Dec 14, 2025 at 3:40 PM CET, Jelte Fennema-Nio wrote: > A bunch of frontend tools, including psql, still used PQcancel to send > cancel requests to the server. That function is insecure, because it > does not use encryption to send the cancel request. This starts using > the new cancellation APIs (introduced in 61461a300) for all these > frontend tools. Fixed conflict after Copyright update.
Вложения
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
"Jelte Fennema-Nio"
Дата:
On Sun Dec 14, 2025 at 3:40 PM CET, Jelte Fennema-Nio wrote: > A bunch of frontend tools, including psql, still used PQcancel to send > cancel requests to the server. That function is insecure, because it > does not use encryption to send the cancel request. This starts using > the new cancellation APIs (introduced in 61461a300) for all these > frontend tools. Small update. Split up the fe_utils and pg_dump changes into separate commits, to make patches easier to review. Also use non-blocking writes to the self-pipe from the signal handler to avoid potential deadlocks (extremely unlikely for such blocks to occur, but better safe than sorry).
Вложения
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
Heikki Linnakangas
Дата:
On 08/02/2026 21:05, Jelte Fennema-Nio wrote: > On Sun Dec 14, 2025 at 3:40 PM CET, Jelte Fennema-Nio wrote: >> A bunch of frontend tools, including psql, still used PQcancel to send >> cancel requests to the server. That function is insecure, because it >> does not use encryption to send the cancel request. This starts using >> the new cancellation APIs (introduced in 61461a300) for all these >> frontend tools. > > Small update. Split up the fe_utils and pg_dump changes into separate > commits, to make patches easier to review. Also use non-blocking writes > to the self-pipe from the signal handler to avoid potential deadlocks > (extremely unlikely for such blocks to occur, but better safe than sorry). Had a brief look at this: It took me a while to get the big picture of how this works. cancel.c could use some high-level comments explaining how to use the facility; it's a real mixed bag right now. The SIGINT handler now does three things: - Set CancelRequested global variable, - call callback if set, and - wake up the cancel thread to send the cancel message, if cancel connection is set. There's no high-level overview documentation or comments on how those three mechanism work or interact. It took me a while to understand that they are really separate, alternative ways to handle SIGINT, all mashed into the same signal handler function. At first read, I thought they're somehow part of the one same mechanism. The cancelConn mechanism is a global variable, which means that it can only be used with one connection in the process. That's OK with the current callers, but seems short-sighted. What if we wanted to use it for pgbench, for example, which uses multiple threads and connections? Or if we changed pg_dump to use multiple threads, like you also suggested as a possible follow-up. The "self-pipe trick" usually refers to interrupting the main thread from select(); this uses it to wake up the other, separate cancellation thread. That's fine, but again it took me a while to understand that that's what it does. Comments! This is racy, if the cancellation thread doesn't immediately process the wakeup. For example, because it's still busy processing a previous wakeup, because there's a network hiccup or something. By the time the cancellation thread runs, the main thread might already be running a different query than it was when the user hit CTRL-C. - Heikki
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
"Jelte Fennema-Nio"
Дата:
On Thu Mar 5, 2026 at 7:30 PM CET, Heikki Linnakangas wrote: > It took me a while to get the big picture of how this works. cancel.c > could use some high-level comments explaining how to use the facility; > it's a real mixed bag right now. Attached is a version with a bunch more comments. I agree this cancel logic is hard to understand without them. It took me quite a while to understand it myself. (I don't think the code got any harder to understand with these changes though, the exact same complexity was already there for Windows. But I agree more commends are good.) > The cancelConn mechanism is a global variable, which means that it can > only be used with one connection in the process. That's OK with the > current callers, but seems short-sighted. What if we wanted to use it > for pgbench, for example, which uses multiple threads and connections? > Or if we changed pg_dump to use multiple threads, like you also > suggested as a possible follow-up. Allowing for multiple callers seems like scope-creep to me, and also hard to do in a generic way. I'd say IF we want that in a generic way I'd first want to see a version of that that solves the problem for pgdench/pg_dump, before generalizing it to cancel.c. > This is racy, if the cancellation thread doesn't immediately process the > wakeup. For example, because it's still busy processing a previous > wakeup, because there's a network hiccup or something. By the time the > cancellation thread runs, the main thread might already be running a > different query than it was when the user hit CTRL-C. I now noted this in one of the new comments. I don't think there's a way around this race condition entirely. It's simply a limitation of our cancel protocol (because it's impossible to specify which query on a connection should be cancelled). In theory we could reduce the window for the race, by having all frontend tools use async connections and have the main thread wait for either the self-pipe or a cancel. That way it would be more similar to the previous signal code in behaviour. That's a much bigger lift though, i.e. all PQexec and PQgetResult calls would need to be modified. My proposed change doesn't require changing the callsites at all. In interactive usage of psql it seems pretty unlikely that people will hit this race condition. In non-interactive use, it doesn't matter because you just want Ctrl-C to cancel the application (whichever query it currently runs). So I'd say it's currently not worth the complexity to do the less racy option.
Вложения
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
Heikki Linnakangas
Дата:
On 06/03/2026 04:12, Jelte Fennema-Nio wrote:
> On Thu Mar 5, 2026 at 7:30 PM CET, Heikki Linnakangas wrote:
>> It took me a while to get the big picture of how this works. cancel.c
>> could use some high-level comments explaining how to use the facility;
>> it's a real mixed bag right now.
>
> Attached is a version with a bunch more comments. I agree this cancel
> logic is hard to understand without them. It took me quite a while to
> understand it myself. (I don't think the code got any harder to
> understand with these changes though, the exact same complexity was
> already there for Windows. But I agree more commends are good.)
Thanks. I agree it was complicated before these patches.
>> This is racy, if the cancellation thread doesn't immediately process
>> the wakeup. For example, because it's still busy processing a previous
>> wakeup, because there's a network hiccup or something. By the time the
>> cancellation thread runs, the main thread might already be running a
>> different query than it was when the user hit CTRL-C.
>
> I now noted this in one of the new comments. I don't think there's a way
> around this race condition entirely. It's simply a limitation of our
> cancel protocol (because it's impossible to specify which query on a
> connection should be cancelled).
That's true, but I still wonder if this could make it much worse.
> In theory we could reduce the window for the race, by having all
> frontend tools use async connections and have the main thread wait for
> either the self-pipe or a cancel. That way it would be more similar to
> the previous signal code in behaviour. That's a much bigger lift though,
> i.e. all PQexec and PQgetResult calls would need to be modified. My
> proposed change doesn't require changing the callsites at all.
Yeah, it does have that advantage..
One simple thing we could is to remember the "generation" in the signal
handler, and store it in another global variable ("cancelledGeneration"
or such). In the cancel thread, check that the generation matches;
otherwise the thread is about to send a cancellation to a query that
already finished, and should not send it.
I worry how this behaves if establishing the cancel connection gets
stuck for a long time. Because of a network hiccup, for example. That's
also not a new problem though; it's perhaps even worse today, if the
signal handler gets stuck for a long time, trying to establish the
connection. Still, would be good to do some testing with a bad network.
- Heikki
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
Jelte Fennema-Nio
Дата:
On Fri, 6 Mar 2026 at 20:51, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > In theory we could reduce the window for the race, by having all
> > frontend tools use async connections and have the main thread wait for
> > either the self-pipe or a cancel. That way it would be more similar to
> > the previous signal code in behaviour. That's a much bigger lift though,
> > i.e. all PQexec and PQgetResult calls would need to be modified. My
> > proposed change doesn't require changing the callsites at all.
>
> Yeah, it does have that advantage..
I let Claude Code take a stab at a POC for doing the same thread
approach. So I could get a more accurate feeling of how big this lift
would be. It's a much bigger change, but the general design is
relatively straight forward. It's attached as the nocfbot patch (it's
not built on top of any of the other patches, it's a separate one).
> One simple thing we could is to remember the "generation" in the signal
> handler, and store it in another global variable ("cancelledGeneration"
> or such). In the cancel thread, check that the generation matches;
> otherwise the thread is about to send a cancellation to a query that
> already finished, and should not send it.
>In the cancel thread, check that the generation matches;
otherwise the thread is about to send a cancellation to a query that
already finished, and should not send it.
I took a look at this, and attached a fixup patch that does this. It
uses C11 atomics because locks cannot be taken in the signal handler
and the signal handler needs to read/write variables from/to two
different threads. I'm not sure if it pulls it's own weight though. It
seems a really unlikely scenario where the signal handler is fired
during one generation, but then the cancel thread wakes up during
another. I'm not sure if we should care about it. And I actually think
it could make us miss cancel a SIGINT for non-interactive use cases of
psql.
> I worry how this behaves if establishing the cancel connection gets
> stuck for a long time. Because of a network hiccup, for example. That's
> also not a new problem though; it's perhaps even worse today, if the
> signal handler gets stuck for a long time, trying to establish the
> connection. Still, would be good to do some testing with a bad network.
To make sure we're talking about the same situation, let me summarize
it differently: Establishing the connection for the cancel is slow,
but the actual query connection is still fast.
I think this is an interesting case to consider (much more interesting
than the kind of race the additional generation check could protect
against). First of all because I think it can definitely happen.
Especially with SSL the cancel needs several round trips, while a
query on an already established connection only needs one direction
latency. I can definitely see how this could cause out-of-order
arrival even on stable high-latency networks. Especilaly if there's
also some unfortunate packet drops.
And as you suggested the failure modes are different:
- With master, psql will become unresponsive until the client gets a
response from the server (or tcp timeouts are hit)
- With this patchset, a later query will get cancelled.
I think for interactive psql usage both are annoying, but both are not
the end of the world. I think I would personally prefer the current
master behaviour. I'm not sure preserving it is worth all the
additional code changes though to make all the applications use
non-blocking APIs. In any case SSL for cancel keys is definitely worth
the patchset behaviour to me (even though it sounds slightly worse).
For non-interactive use (i.e. running scripts in psql or other
frontend tools like vacuumdb). I don't think this situation applies.
You want whatever query to be cancelled.
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
"Jelte Fennema-Nio"
Дата:
On Sat Mar 7, 2026 at 1:01 AM CET, Jelte Fennema-Nio wrote: > I took a look at this, and attached a fixup patch that does this. Now with the actual attachements...
Вложения
- nocfbot.v0-0001-POC-Use-non-blocking-apis-for-frontend-tools.patch
- v5-0001-Move-Windows-pthread-compatibility-functions-to-s.patch
- v5-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patch
- v5-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patch
- v5-0004-fixup-Don-t-use-deprecated-and-insecure-PQcancel-.patch
- v5-0005-fixup-Don-t-use-deprecated-and-insecure-PQcancel-.patch
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
"Jelte Fennema-Nio"
Дата:
On Fri Mar 6, 2026 at 8:51 PM CET, Heikki Linnakangas wrote:
> I worry how this behaves if establishing the cancel connection gets
> stuck for a long time. Because of a network hiccup, for example. That's
> also not a new problem though; it's perhaps even worse today, if the
> signal handler gets stuck for a long time, trying to establish the
> connection. Still, would be good to do some testing with a bad network.
After thinking on this again, I thought of a much easier solution to
this problem than the direction I was exploring in my previous response
to this:
We can have SetCancelConn() and ResetCancelConn() wait for any pending
cancel to complete before letting them replace/remove the cancelConn.
That way even in case of a bad network, we know that an already
in-flight cancel request will never cancel a query from a next
SetCancelConn() call. It does mean that you cannot submit a new query
before we've received a response to the in-flight cancel request (either
because the hiccup is reselved or because TCP timeouts report a
failure). That's the current behaviour too with running PQcancel in the
signal handler, and I also think that's the behaviour that makes the
most sense.
Attached is a patchset that does this.
To ensure that it worked correctly, I mimicked network issues by running
the following iptables command after already having connected with psql:
sudo iptables -A INPUT -p tcp --dport 5432 -m state --state NEW -j DROP
That command drops all new incoming connections to the server, but
allows already established connections to continue working. Which means
that any new cancel connections will not be able to connect.
You can allow the traffic again with:
sudo iptables -D INPUT -p tcp --dport 5432 -m state --state NEW -j DROP
To see what happens when the connection attempt never goes through I
used:
sudo sysctl net.ipv4.tcp_syn_retries=2
sudo sysctl net.ipv4.tcp_retries2=3
This is what happens then:
localhost jelte@postgres:5432-1484255=# select pg_sleep(5);
^CSending cancel request
Time: 5005.833 ms (00:05.006)
Could not send cancel request: connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection timed out
Is the server running on that host and accepting TCP/IP connections?
localhost jelte@postgres:5432-1484255=#
The query succeeds after 5 seconds, but the prompt does not become
interactive until a little while after that when the cancel request
error is also shown.
Вложения
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
Heikki Linnakangas
Дата:
On 15/03/2026 17:09, Jelte Fennema-Nio wrote: > On Fri Mar 6, 2026 at 8:51 PM CET, Heikki Linnakangas wrote: >> I worry how this behaves if establishing the cancel connection gets >> stuck for a long time. Because of a network hiccup, for example. >> That's also not a new problem though; it's perhaps even worse today, >> if the signal handler gets stuck for a long time, trying to establish >> the connection. Still, would be good to do some testing with a bad >> network. > > After thinking on this again, I thought of a much easier solution to > this problem than the direction I was exploring in my previous response > to this: We can have SetCancelConn() and ResetCancelConn() wait for any > pending > cancel to complete before letting them replace/remove the cancelConn. > > That way even in case of a bad network, we know that an already > in-flight cancel request will never cancel a query from a next > SetCancelConn() call. It does mean that you cannot submit a new query > before we've received a response to the in-flight cancel request (either > because the hiccup is reselved or because TCP timeouts report a > failure). That's the current behaviour too with running PQcancel in the > signal handler, and I also think that's the behaviour that makes the > most sense. +1. With a little extra effort, the cancellation can be made abortable too, so that you don't need to wait for the TCP timeout. I.e when ResetCancelConn() is called, the cancellation thread can immediately call PQcancelReset(). One a different topic, is there any guarantee on which thread will receive the SIGINT? It matters because psql's cancel callback sometimes calls longjmp(), which assumes that the signal handler is executed in the main thread. - Heikki
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
От
"Jelte Fennema-Nio"
Дата:
On Mon, 16 Mar 2026 at 10:57, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > +1. With a little extra effort, the cancellation can be made abortable > too, so that you don't need to wait for the TCP timeout. I.e when > ResetCancelConn() is called, the cancellation thread can immediately > call PQcancelReset(). I agree we could do that, but I don't think we should. Then we'd be getting into the exact situation where psql doesn't wait for an already in-flight cancel request to be processed by the server before sending the next query. i.e. while this would be fine if there's a network issue, it would be bad if the server is just slow to respond to the cancel request (e.g. because there's a PgBouncer in the middle that hasn't forwarded the request yet). > One a different topic, is there any guarantee on which thread will > receive the SIGINT? It matters because psql's cancel callback sometimes > calls longjmp(), which assumes that the signal handler is executed in > the main thread. Good point, I had thought about whether this mattered, but hadn't considered the callbacks. Attached is v7 that makes sure the signal is always handled by the main thread by blocking SIGINT before creating the cancel thread. I manually tested that this works by sending signals to specific threads using htop, and logging thread the id from the signal handler. Before this change the thread id would be from different threads, after this change it's always from the same one.