Thread: Add GoAway protocol message for graceful but fast server shutdown/switchover

Add GoAway protocol message for graceful but fast server shutdown/switchover

From
"Jelte Fennema-Nio"
Date:
This change introduces a new GoAway backend-to-frontend protocol
message (byte 'g') that the server can send to the client to politely
request that the client disconnect/reconnect when convenient. This message is
advisory only - the connection remains fully functional and clients may
continue executing queries and starting new transactions. "When
convenient" is obviously not very well defined, but the primary target
clients are clients that maintain a connection pool. Such clients should
disconnect/reconnect a connection in the pool when there's no user of
that connection. This is similar to how such clients often currently
remove a connection from the pool after the connection hits a maximum
lifetime of e.g. 1 hour.

This new message is used by Postgres during the already existing "smart"
shutdown procedure (i.e. when postmaster receives SIGTERM). When
Postgres is in "smart" shutdown mode existing clients can continue to
run queries as usual but new connection attempts are rejected. This mode
is primarily useful when triggering a switchover of a read replica. A
load balancer can route new connections only to the new read replica,
while the old read replica keeps serving the existing connections until
they disconnect. The problem is that this draining of connections could
often take a long time, even when clients only run very short
queries/transactions, because the session can be kept open much longer
(many connection pools use 1 hour max lifetime of a connection by default).
With the introduction of the GoAway message Postgres now sends this
message to all connected clients when it enters smart shutdown mode.
If these clients respond to the message by reconnecting/disconnecting
earlier than their maximum connection lifetime the draining can complete
much quicker. Similar benefits to switchover duration can be achieved
for other applications or proxies implementing the Postgres protocol,
like when switching over a cluster of PgBouncer machines to a newer
version.
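
As a rough illustration of what a client has to recognize on the wire: regular protocol 3 backend messages are framed as a one-byte type followed by a four-byte big-endian length that counts itself. Assuming GoAway follows that framing with an empty body (type 'g', length 4; this is inferred from the description above, not taken from the patch), a client-side check could be sketched as below. The enum and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch. */
typedef enum
{
    MSG_INCOMPLETE,             /* need more bytes */
    MSG_GOAWAY,                 /* the advisory message described above */
    MSG_OTHER,                  /* any other complete message */
    MSG_INVALID                 /* malformed length */
} msg_kind;

/*
 * Classify the first message in buf.  On a complete message, *consumed is
 * set to its total size (type byte + length field + body).
 * Assumption: GoAway is type 'g' with a length field of exactly 4.
 */
static msg_kind
classify_backend_message(const unsigned char *buf, size_t len,
                         size_t *consumed)
{
    uint32_t    msglen;

    *consumed = 0;
    if (len < 5)
        return MSG_INCOMPLETE;

    /* The length is big-endian and includes its own four bytes. */
    msglen = ((uint32_t) buf[1] << 24) | ((uint32_t) buf[2] << 16) |
        ((uint32_t) buf[3] << 8) | (uint32_t) buf[4];
    if (msglen < 4)
        return MSG_INVALID;
    if (len - 1 < msglen)
        return MSG_INCOMPLETE;

    *consumed = 1 + (size_t) msglen;
    if (buf[0] == 'g')
        return (msglen == 4) ? MSG_GOAWAY : MSG_INVALID;
    return MSG_OTHER;
}
```

A driver that sees MSG_GOAWAY would only set a flag on the connection; the connection itself stays usable, which matches the advisory nature of the message.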

Applications/clients that use libpq can periodically check the result of
the new PQgoAwayReceived() function to find out whether they have been
asked to reconnect.
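
For a connection pool, "when convenient" most naturally means the moment a connection is handed back. The sketch below models that decision without linking against libpq: fake_conn and fake_goaway_received() are hypothetical stand-ins for PGconn and the proposed PQgoAwayReceived(), and the pool structure is invented purely for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for PGconn; real code would hold a PGconn * instead. */
typedef struct fake_conn
{
    bool        goaway_seen;    /* what PQgoAwayReceived() would report */
    bool        open;
} fake_conn;

/* Stand-in for the proposed PQgoAwayReceived(conn). */
static int
fake_goaway_received(const fake_conn *conn)
{
    return conn->goaway_seen ? 1 : 0;
}

#define POOL_SIZE 4

/* Hypothetical pool holding only idle connections. */
typedef struct pool
{
    fake_conn  *idle[POOL_SIZE];
    size_t      nidle;
} pool;

/*
 * Hand a connection back to the pool.  If the server asked us to go away
 * (or the pool is full), close the connection instead of keeping it idle.
 * Returns true if the connection was kept.
 */
static bool
pool_release(pool *p, fake_conn *conn)
{
    if (fake_goaway_received(conn) || p->nidle == POOL_SIZE)
    {
        conn->open = false;     /* real code would call PQfinish(conn) */
        return false;
    }
    p->idle[p->nidle++] = conn;
    return true;
}
```

Checking at release time means no query is ever interrupted: the connection is retired only once no user holds it, the same point at which pools already enforce a maximum lifetime.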

Attachments

Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
Kirill Reshke
Date:
On Thu, 23 Oct 2025 at 18:05, Jelte Fennema-Nio <me@jeltef.nl> wrote:
>
> This change introduces a new GoAway backend-to-frontend protocol
> message (byte 'g') that the server can send to the client to politely
> request that the client disconnect/reconnect when convenient. This message is
> advisory only - the connection remains fully functional and clients may
> continue executing queries and starting new transactions. "When
> convenient" is obviously not very well defined, but the primary target
> clients are clients that maintain a connection pool. Such clients should
> disconnect/reconnect a connection in the pool when there's no user of
> that connection. This is similar to how such clients often currently
> remove a connection from the pool after the connection hits a maximum
> lifetime of e.g. 1 hour.
>
> This new message is used by Postgres during the already existing "smart"
> shutdown procedure (i.e. when postmaster receives SIGTERM). When
> Postgres is in "smart" shutdown mode existing clients can continue to
> run queries as usual but new connection attempts are rejected. This mode
> is primarily useful when triggering a switchover of a read replica. A
> load balancer can route new connections only to the new read replica,
> while the old read replica keeps serving the existing connections until
> they disconnect. The problem is that this draining of connections could
> often take a long time, even when clients only run very short
> queries/transactions, because the session can be kept open much longer
> (many connection pools use 1 hour max lifetime of a connection by default).
> With the introduction of the GoAway message Postgres now sends this
> message to all connected clients when it enters smart shutdown mode.
> If these clients respond to the message by reconnecting/disconnecting
> earlier than their maximum connection lifetime the draining can complete
> much quicker. Similar benefits to switchover duration can be achieved
> for other applications or proxies implementing the Postgres protocol,
> like when switching over a cluster of PgBouncer machines to a newer
> version.
>
> Applications/clients that use libpq can periodically check the result of
> the new PQgoAwayReceived() function to find out whether they have been
> asked to reconnect.

Hi!
I'm +1 on this idea. This is something I wanted back in 2020, when
implementing the 'online restart' feature for odyssey[0], but never
bothered to create a thread.
Due to the complexity of its async engine, odyssey cannot simply reuse TCP
connections from the 'old' binary, so we accept new connections in the new
binary and try to drop connections in the old binary at some rate.

About patches:

in 0001:

>+
>+ if (strcmp(value, "latest") == 0)
>+ {
>+ *result = PG_PROTOCOL_LATEST;
>+ return true;
>+ }

Not needed? We already have this check at the beginning of
pqParseProtocolVersion.

In 0002:

> +       The <literal>GoAway</literal> message is sent by the server during a
> +       smart shutdown to politely request that clients disconnect.

I'm not sure this wording is foolproof. First of all, should it be
'client', not 'clients'? It looks like this doc should describe the
interaction between a single client and a single server.
Maybe also change the end of the sentence to '... to instruct clients to
disconnect.'? That wording may not be great either, but I want the doc to
reflect that disconnecting is strongly advised, yet not obligatory.

> +       Applications should check this flag
> +       periodically and disconnect gracefully when possible, such as after
> +       completing the current transaction or unit of work.

What flag? Also, 'Applications should': no, they don't have to. Isn't it
just an option? Maybe we should change the wording to something like

'Applications may decide that it is advisable to close (or maybe
re-open) their connection to the server as soon as they receive at least
one GoAway message.'

Also, can the server send more than one GoAway message? If yes, should
we document this?


> - * notice.  (An ERROR is very possibly the backend telling us why
> + * notice. (An ERROR is very possibly the backend telling us why

This change is unrelated

The other code changes look straightforward and are fine with me.


[0] https://github.com/yandex/odyssey

-- 
Best regards,
Kirill Reshke



Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
"Jelte Fennema-Nio"
Date:
On Fri Oct 24, 2025 at 7:04 AM CEST, Kirill Reshke wrote:
> On Thu, 23 Oct 2025 at 18:05, Jelte Fennema-Nio <me@jeltef.nl> wrote:
> I'm +1 on this idea. This is something I wanted back in 2020, when
> implementing the 'online restart' feature for odyssey[0], but never
> bothered to create a thread.

Yeah, to be clear: A big goal of this is definitely to be used by
poolers/proxies/middleware. Those systems will often be more frequently
restarted than the actual database servers, so being able to do that
quickly without disrupting active connections is much more important
there than with plain PostgreSQL servers.

> About patches:

Thanks for the review. Attached is a new patchset. I think I addressed
all of your comments (I almost fully rewrote the docs). I also fixed
two other issues that I found:
- updating docs for 3.3 in more places
- handling the GoAway message in more code paths on the client side


Attachments

Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
Ajit Awekar
Date:
Hi Jelte,

Thank you for proposing the GoAway protocol message. 

I've developed a patch that serves as a strong, immediate use case for its inclusion. https://www.postgresql.org/message-id/flat/CAER375OvH3_ONmc-SgUFpA6gv_d6eNj2KdZktzo-f_uqNwwWNw%40mail.gmail.com


Thanks & Best Regards,

Ajit


On Fri, 24 Oct 2025 at 17:24, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> On Fri Oct 24, 2025 at 7:04 AM CEST, Kirill Reshke wrote:
> > On Thu, 23 Oct 2025 at 18:05, Jelte Fennema-Nio <me@jeltef.nl> wrote:
> > I'm +1 on this idea. This is something I wanted back in 2020, when
> > implementing the 'online restart' feature for odyssey[0], but never
> > bothered to create a thread.
>
> Yeah, to be clear: A big goal of this is definitely to be used by
> poolers/proxies/middleware. Those systems will often be more frequently
> restarted than the actual database servers, so being able to do that
> quickly without disrupting active connections is much more important
> there than with plain PostgreSQL servers.
>
> > About patches:
>
> Thanks for the review. Attached is a new patchset. I think I addressed
> all of your comments (I almost fully rewrote the docs). I also fixed
> two other issues that I found:
> - updating docs for 3.3 in more places
> - handling the GoAway message in more code paths on the client side

Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
"Jelte Fennema-Nio"
Date:
On Thu Oct 23, 2025 at 3:04 PM CEST, Jelte Fennema-Nio wrote:
> This change introduces a new GoAway backend-to-frontend protocol
> message (byte 'g') that the server can send to the client to politely
> request that the client disconnect/reconnect when convenient.

After pushback on another thread about introducing additional minor
protocol versions[1], I've decided to change this patch to use a
protocol extension instead of a minor version bump.

I personally don't think this patch is any better now, but that's fine.
If this means it has a chance of going into PG19, that's totally worth
it to me. (Also, I'd like to stop spending time on discussions where
clearly neither side will agree with the other.)

The automated test requires the not yet committed pytest changes[2]. I
don't think the automated test is required for a merge, so I don't think
this is blocked on pytest support getting in. It's here mainly as an
example and as a regression test during development, to know I did not
break the goaway functionality while changing the implementation.

[1]: https://www.postgresql.org/message-id/flat/CADK3HHKe1PA1U6aB5-7tWBQ0yZGgNvY7H=ECDD9955Pas_zx_Q@mail.gmail.com
[2]: https://commitfest.postgresql.org/patch/6045/

Attachments

Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
"Jelte Fennema-Nio"
Date:
On Thu Jan 8, 2026 at 10:15 AM CET, Jelte Fennema-Nio wrote:
> After pushback on another thread about introducing additional minor
> protocol versions[1], I've decided to change this patch to use a
> protocol extension instead of a minor version bump.

Turns out there were still some leftovers from using a version bump in
the libpq_pipeline tests. Removed those too now.

Attachments

Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
Zsolt Parragi
Date:
Hello!

I only have a few stylistic comments and one question about the wal
sender part - but maybe I don't understand something there.

+ /*
+ * Only signal regular backends and walsenders. Skip
+ * auxiliary processes and dead-end backends.
+ */
+ if (bp->bkend_type == B_BACKEND ||
+ bp->bkend_type == B_WAL_SENDER)
+ {
+ SendProcSignal(bp->pid, PROCSIG_SMART_SHUTDOWN,
+    INVALID_PROC_NUMBER);

I don't see related changes in walsenders. Am I missing something?
Shouldn't this have some handling in WalSndLoop? Also, shouldn't
walsenders exit later, after normal backends have already stopped? So
I'm not sure how this is supposed to improve them.

+ /*
+ * Parse any available data to see if a GoAway message has arrived.
+ */
+ pqParseInput3(conn);

This is just stylistic, but other places seem to call parseInput instead.

There are also two typos/mistakes in the documentation:

* "rquest"
* "server requests the server"



Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
"Jelte Fennema-Nio"
Date:
On Tue, 27 Jan 2026 at 13:57, Zsolt Parragi <zsolt.parragi@percona.com> wrote:
> I only have a few stylistic comments and one question about the wal
> sender part - but maybe I don't understand something there.

Thanks for the review. I updated the stylistic things. And I agree with
you on the walsender note, so I stopped sending the signal to those
backends. I also rebased to use Jacob's new table and tweaked the docs
in a few places.

Attached is v5

Attachments

Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
"Jelte Fennema-Nio"
Date:
On Sun Feb 8, 2026 at 11:03 PM CET, Jelte Fennema-Nio wrote:
> Attached is v5

Attached is v6 which fixes a rebase conflict.


Attachments

Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
Zsolt Parragi
Date:
+ /*
+ * Only signal regular backends, since those need to notify
+ * their clients using a GoAway message.
+ */
+ if (bp->bkend_type == B_BACKEND)

This condition is slightly different to how SignalChildren works, is
that intentional? I don't think it causes any practical difference,
and I don't see an easy way to reuse SignalChildren for this, but
maybe it could still follow the same pattern.

Otherwise I don't see any other issues, and this also doesn't seem to
be an important comment.

Since the pytest framework seems unlikely to be included in PG19, have
you considered a different test implementation, to have at least some
minimal coverage?



Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
"Jelte Fennema-Nio"
Date:
On Wed Feb 25, 2026 at 4:08 PM CET, Zsolt Parragi wrote:
> + /*
> + * Only signal regular backends, since those need to notify
> + * their clients using a GoAway message.
> + */
> + if (bp->bkend_type == B_BACKEND)
>
> This condition is slightly different to how SignalChildren works, is
> that intentional? I don't think it causes any practical difference,
> and I don't see an easy way to reuse SignalChildren for this, but
> maybe it could still follow the same pattern.

Changed it to be consistent now, and resolved a rebase conflict.

> Since the pytest framework seems unlikely to be included in PG19, have
> you considered a different test implementation, to have at least some
> minimal coverage?

I now included some basic support for GoAway in psql and added a perl
test based on that.

Attachments

Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
Tomas Vondra
Date:
Hi,

I've looked at this patch today, to check if there's something we could
get done in PG19.

I find it a bit unfortunate we didn't get much feedback from people
working on the client/downstream stuff - clients, connection
poolers/middleware, that sort of stuff. OK, we did hear from Kirill, and
he seems to like it.

I'm not very involved in the protocol stuff, so I'm sure there are a lot
of details I'm missing. It'd be very helpful if there was some sort of PoC
support on the pooler/client side, so that I can experiment with it and
see how helpful the new protocol message is. But I realize that's a bit
too much to ask for.

A couple thoughts about this (some of this may be missing what the patch
aims to do).

* Does it make sense to tie this to smart shutdowns? I realize it's just
an example, and it probably makes sense to send the GoAway message
before a shutdown. But isn't this a bit similar to cancel/terminate of a
backend? Why not have a pg_goaway_backend() function, that'd send the
message to a single backend? It might be useful for load-balancing, if
we could pick a "heavy" backend and ask it to reconnect / move to a
different replica. (Could that be handled by a middleware?)

* In fact, does it improve the smart shutdown case in practice? Let's
say we have a single instance, and we're restarting it. It'll send
GoAway to all the clients, the good clients will try to reconnect. But
if there's even a single "bad" client ignoring the GoAway, all the
well-behaved clients will get stuck. Ofc, that can happen without the
GoAway message too - a client may disconnect because of timeout etc. But
it makes it more likely, and it'll affect the well-behaved clients.

* Would it make sense to have some payload in the GoAway message? I'm
thinking about (a) some deadline by which the client should disconnect,
e.g. time of planned restart / shutdown, (b) priority, expressing how
much the client should try to disconnect (and maybe take more drastic
actions).

Also, two minor comments:

* The sgml docs say the function is defined as

  int PQgoAwayReceived(const PGconn *conn);

but in the .h file it's defined without the "const".

* The new entry in protocol.sgml (in the "Supported Protocol Extensions"
 table) says

  <entry><literal>goaway</literal></entry>

but the following table includes "_pq_" in the entry name. Should the
new entry do the same?


regards

-- 
Tomas Vondra




Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
Jacob Champion
Date:
On Fri, Mar 20, 2026 at 12:20 PM Tomas Vondra <tomas@vondra.me> wrote:
> * In fact, does it improve the smart shutdown case in practice? Let's
> say we have a single instance, and we're restarting it. It'll send
> GoAway to all the clients, the good clients will try to reconnect. But
> if there's even a single "bad" client ignoring the GoAway, all the
> well-behaved clients will get stuck. Ofc, that can happen without the
> GoAway message too - a client may disconnect because of timeout etc. But
> it makes it more likely, and it'll affect the well-behaved clients.
>
> * Would it make sense to have some payload in the GoAway message? I'm
> thinking about (a) some deadline by which the client should disconnect,
> e.g. time of planned restart / shutdown, (b) priority, expressing how
> much the client should try to disconnect (and maybe take more drastic
> actions).

I'd been wondering about these as well, but in the context of the
tangential thread [1]. HTTP has much stronger semantics for its GOAWAY
frames, for instance.

--Jacob

[1] https://postgr.es/m/CAOYmi%2BmSn8xQ7ExqY07V6G2oFXN2nY%2B7f4yf_RV2%3D7xNCKwW-Q%40mail.gmail.com



On Fri, Mar 20, 2026 at 2:20 PM Tomas Vondra <tomas@vondra.me> wrote:
> * Does it make sense to tie this to smart shutdowns? I realize it's just
> an example, and it probably makes sense to send the GoAway message
> before a shutdown. But isn't this a bit similar to cancel/terminate of a
> backend? Why not have a pg_goaway_backend() function, that'd send the
> message to a single backend? It might be useful for load-balancing, if
> we could pick a "heavy" backend and ask it to reconnect / move to a
> different replica. (Could that be handled by a middleware?)

+1. Another scenario that comes to mind is asking for a reconnect based on backend memory consumption, since there's a bunch of internal structures (relcache, etc) that can grow in an unbounded fashion. 

Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
"Jelte Fennema-Nio"
Date:
On Fri, 20 Mar 2026 at 20:20, Tomas Vondra <tomas@vondra.me> wrote:
> It'd be very helpful if there was some sort of PoC
> support on the pooler/client side, so that I can experiment with it and
> see how helpful the new protocol message is. But I realize that's a bit
> too much to ask for.

I'll see if I can whip something up, it shouldn't be too hard.

> Why not have a pg_goaway_backend() function, that'd send the
> message to a single backend?

I like this idea a lot. So I added it in the attached v8 patch. This
also allowed me to add low-level tests using the libpq_pipeline
test suite.

> * In fact, does it improve the smart shutdown case in practice? Let's
> say we have a single instance, and we're restarting it. It'll send
> GoAway to all the clients, the good clients will try to reconnect. But
> if there's even a single "bad" client ignoring the GoAway, all the
> well-behaved clients will get stuck. Ofc, that can happen without the
> GoAway message too - a client may disconnect because of timeout etc. But
> it makes it more likely, and it'll affect the well-behaved clients.

For primary server restarts, I don't think anyone should be using smart
shutdown right now either. Any new connections to the database will be
failing for an indeterminate amount of time. I agree that sending GoAway
might worsen the problem in some cases, but it's already terrible to
start with. Fast shutdown is the only sensible restart mode for a
primary server. This seems to be generally accepted knowledge, given
that we use SIGINT (fast shutdown) in our systemd example[1].

Sending a GoAway on smart shutdown makes that shutdown mode very useful
for read replicas during a planned switch-over to another replica. Now
clients can finish their work and quickly reconnect to the new read
replica, minimizing switchover time while preventing errors.

Even when restarting primary servers, triggering a smart shutdown has
benefits, as long as it's followed by a fast shutdown after a short
delay (e.g., 1 second). This causes slightly longer downtime (the
additional delay), but it allows most clients to disconnect on their own
terms instead of in the middle of a query. Connection errors can often
be retried transparently more easily than errors in the middle of a
query. In effect, for many applications, this could mean a reduction in
errors and only an increase in latency during a restart.
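
The two-step restart described above can be sketched as a small helper. This is only an illustration: it relies on the signals mentioned in this thread (SIGTERM for smart shutdown, SIGINT for fast shutdown), while the function names and the injectable sender are hypothetical and exist so the sequence can be dry-run without a real postmaster.

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Signal sender; kill(2) has this shape, and a dry run can pass a fake. */
typedef int (*send_signal_fn) (pid_t pid, int sig);

/*
 * Request a smart shutdown (SIGTERM), give clients grace_seconds to
 * disconnect on their own, then escalate to a fast shutdown (SIGINT).
 * Returns 0 on success, -1 if either signal could not be sent.  With a
 * real kill(), a failure of the second send may simply mean the server
 * has already exited.
 */
static int
smart_then_fast_shutdown(pid_t postmaster_pid, unsigned int grace_seconds,
                         send_signal_fn send_signal)
{
    if (send_signal(postmaster_pid, SIGTERM) != 0)
        return -1;
    sleep(grace_seconds);
    if (send_signal(postmaster_pid, SIGINT) != 0)
        return -1;
    return 0;
}

/* Dry-run sender that only records what would have been sent. */
static int  recorded_signals[2];
static int  recorded_count = 0;

static int
record_signal(pid_t pid, int sig)
{
    (void) pid;
    if (recorded_count >= 2)
        return -1;
    recorded_signals[recorded_count++] = sig;
    return 0;
}
```

In real use the call would look like smart_then_fast_shutdown(pid, 1, kill), where pid is the postmaster's process ID and 1 second is the example delay mentioned above.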

> * Would it make sense to have some payload in the GoAway message? I'm
> thinking about (a) some deadline by which the client should disconnect,
> e.g. time of planned restart / shutdown, (b) priority, expressing how
> much the client should try to disconnect (and maybe take more drastic
> actions).

I thought some more about this, but ultimately, the payloads you suggest
only seem useful if a client has something in between "disconnect hard
now" and "disconnect when the connection is unused". I cannot think of
any such cases, i.e. what other "drastic actions" could a client take
instead of simply closing the connection? If that's the only
possibility, why not simply have the server close the connection in that
case?

Overall, I agree that having no payload in this new message feels a bit
weird. But ultimately, clients don't need any payload to do something
useful.

> Also, two minor comments:

Fixed.

Attachments

Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
"Jelte Fennema-Nio"
Date:
On Fri, 20 Mar 2026 at 21:11, Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> I'd been wondering about these as well, but in the context of the
> tangential thread [1]. HTTP has much stronger semantics for its GOAWAY
> frames, for instance.

I reread the HTTP/3 GOAWAY spec[1], but I think our protocol is too
different from HTTP/3 to take any lessons from it (at the moment at
least). HTTP/3 "streams" are independent, we have no such concept. Our
whole session is a single stream, due to all of our session state. So
the semantics that on a single connection a client cannot open newer
streams do not really mean anything useful for us, i.e. there's no way to
open a new stream. Even the "which messages have definitely not been
processed" feature can already be inferred from the server right now, by
tracking what responses have been received before the server closes the
connection. So I cannot think of any useful payload to add to the GoAway
message.

[1]: https://www.rfc-editor.org/rfc/rfc9114.html#connection-shutdown



Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
Zsolt Parragi
Date:
+ if (!goaway_reported && PQgoAwayReceived(pset.db))
+ {
+ pg_log_info("Server sent GoAway, requesting disconnect when convenient.");
+ goaway_reported = true;
+ }

Shouldn't this variable be reset in pqDropServerData?

+    It is possible for NoticeResponse, ParameterStatus and GoAway messages to be
     interspersed between CopyData messages; frontends must handle these cases,

The patch currently doesn't actually do this. Did you add this as
future-proofing?


> I thought some more about this, but ultimately, the payloads you suggest
> only seem useful if a client has something in between "disconnect hard
> now" and "disconnect when the connection is unused"

Couldn't a client optimize the reconnect time if it knows about the
deadline? If it knows that it still has 10 minutes before the server
kicks it out, it might choose to finish a 3-4 minute task, reconnect,
and then continue.
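
To make the idea concrete, here is a minimal sketch of the client-side decision. It assumes a hypothetical deadline value delivered with GoAway; the patch discussed in this thread sends no payload, so seconds_until_deadline, the safety margin and the function name are all invented for illustration.

```c
#include <stdbool.h>

/*
 * Decide whether the client should finish its current task before
 * reconnecting.  seconds_until_deadline would come from a hypothetical
 * GoAway payload; nothing like it exists in the patch under discussion.
 * The task is finished first only if it fits before the deadline with
 * safety_margin_seconds to spare.
 */
static bool
finish_task_before_reconnect(long seconds_until_deadline,
                             long estimated_task_seconds,
                             long safety_margin_seconds)
{
    if (seconds_until_deadline < 0 ||
        estimated_task_seconds < 0 ||
        safety_margin_seconds < 0)
        return false;

    /* Subtract rather than add so large inputs cannot overflow. */
    return estimated_task_seconds <=
        seconds_until_deadline - safety_margin_seconds;
}
```

With a 10-minute deadline, a 4-minute task and a 1-minute margin, the client would finish the task and then reconnect; without a deadline it can only fall back to "reconnect when the connection is unused".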



Re: Add GoAway protocol message for graceful but fast server shutdown/switchover

From
Tatsuo Ishii
Date:
Reading the proposal, I have some questions. Sorry if they have been
already discussed.

1. If clients do not disconnect a session even if they have received
   the GoAway message, PostgreSQL server will give up the shutdown
   sequence. In this case, shouldn't the PostgreSQL server send a
   message indicating "I have given up the smart shutdown request"?
   Otherwise, the fact that GoAway has been received will remain recorded
   in the client, and if the client does not check for it in a timely
   manner, the client may exit the session unnecessarily.

2. Can we use a NOTICE message instead of the new GoAway protocol message
   for this purpose?  Frontends are expected to receive and handle NOTICE
   messages while processing frontend/backend protocol. So I think it
   is fair to expect clients to find the NOTICE message and behave as
   we discussed. This way, we don't need the GoAway message.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp