Thread: Windows: Wrong error message at connection termination

Windows: Wrong error message at connection termination

From
Lars Kanis
Date:

Dear hackers,

I recently had a hard time finding the root cause of some weird behavior with the async API of libpq when running client and server on Windows. When the connection aborts with an error - most notably with an error at connection setup - it sometimes fails with a wrong error message:

Instead of:

    connection to server at "::1", port 5433 failed: FATAL:  role "a" does not exist

it fails with:

    connection to server at "::1", port 5433 failed: server closed the connection unexpectedly

I found out that the recv() function of the Winsock API has some weird behavior. If the connection receives a TCP RST flag, recv() immediately returns -1, regardless of whether all previous data has been retrieved. So when the connection is closed hard, the behavior on the client side is timing dependent: the last packet may be dropped, or it is delivered to libpq if libpq calls recv() quickly enough.

This behavior is described at closesocket() here:
https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-closesocket

This is called a hard or abortive close, because the socket's virtual circuit is reset immediately, and any unsent data is lost. On Windows, any recv call on the remote side of the circuit will fail with WSAECONNRESET.

Unfortunately each connection is closed hard by a Windows PostgreSQL server with TCP flag RST. That in turn stems from another Winsock API behavior: every socket that wasn't closed by the application is closed hard with the RST flag at process termination. I didn't find any official documentation about this behavior.

Explicitly closing the socket before process termination leads to a graceful close even on Windows. That is done by the attached patch. I think delivering the correct error message to the user is much more important than closing the process in sync with the socket.
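
For readers who want to see the graceful-close path in isolation, here is a minimal Python sketch. It is not taken from the attached patch, and it does not reproduce the Winsock process-exit RST; it only shows, on a loopback connection, that a message sent before an explicit close() can still be read by a client that is slow to call recv(). The function names and the 50 ms delay are illustrative assumptions.

```python
import socket
import threading
import time

MESSAGE = b'FATAL:  role "a" does not exist\n'


def serve_once(listener: socket.socket) -> None:
    """Accept one client, send the final message, then close explicitly."""
    conn, _ = listener.accept()
    conn.sendall(MESSAGE)
    conn.close()  # explicit close: the stack can send FIN after the data


def read_after_delay(delay: float = 0.05) -> bytes:
    """Connect, wait for `delay` seconds, then read until EOF."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()

    client = socket.create_connection(listener.getsockname())
    time.sleep(delay)  # simulate a slow event loop on the client side
    chunks = []
    while True:
        chunk = client.recv(4096)
        if not chunk:  # empty read means the peer closed gracefully
            break
        chunks.append(chunk)

    client.close()
    server.join()
    listener.close()
    return b"".join(chunks)
```

With a graceful close the delay should not matter: the buffered message is delivered first and EOF is reported afterwards.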


Some background: I'm the maintainer of ruby-pg, the PostgreSQL client library for ruby. The next version of ruby-pg will switch to the async API for connection setup. Using this API changes the timing of socket operations and therefore often leads to the above wrong message. Previous versions made use of the sync API, which usually doesn't suffer from this issue. The original issue is here: https://github.com/ged/ruby-pg/issues/404

--

Kind Regards
Lars Kanis


Attachments

Re: Windows: Wrong error message at connection termination

From
Tom Lane
Date:
Lars Kanis <lars@greiz-reinsdorf.de> writes:
> Explicitly closing the socket before process termination leads to a
> graceful close even on Windows. That is done by the attached patch. I
> think delivering the correct error message to the user is much more
> important than closing the process in sync with the socket.

Per the comment immediately above this, it's intentional that we don't
close the socket.  I'm not really convinced that this is an improvement.

Can we get anywhere by using shutdown(2) instead of close(), ie do a
half-close?  I have no idea what Windows thinks the semantics of that
are, but it might be worth trying.

            regards, tom lane



Re: Windows: Wrong error message at connection termination

From
Thomas Munro
Date:
On Thu, Nov 18, 2021 at 10:13 AM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
> Unfortunately each connection is closed hard by a Windows PostgreSQL server with TCP flag RST. That in turn stems
> from another Winsock API behavior: every socket that wasn't closed by the application is closed hard with the
> RST flag at process termination. I didn't find any official documentation about this behavior.

Interesting discovery.  I think you might get the same behaviour from
a Unix system if you set SO_LINGER to 0 before you exit[1].  I suppose
if a TCP implementation is partially in user space (I have no idea if
this is true for Windows, I never use it, but I recall that Winsock
was at some point a DLL) and can't handle the existence of any socket
state after the process is gone, you might want to nuke everything and
tell the peer immediately that you're doing so on exit?
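
The SO_LINGER remark can be tried out with a short Python sketch. It assumes a Unix-like system where struct linger is two C ints (Winsock uses two unsigned shorts, so the packing would differ there), and the helper names are made up for illustration. The server enables linger with a zero timeout and closes; the client then observes a reset rather than a clean EOF.

```python
import socket
import struct
import threading


def abort_once(listener: socket.socket) -> None:
    """Accept one client and close the connection abortively."""
    conn, _ = listener.accept()
    # l_onoff=1, l_linger=0: close() discards unsent data and sends RST.
    # Assumption: struct linger is two C ints, as on Linux and the BSDs.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()


def observe_abortive_close() -> str:
    """Return 'reset', 'eof' or 'data' depending on what the client sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server = threading.Thread(target=abort_once, args=(listener,))
    server.start()

    client = socket.create_connection(listener.getsockname())
    server.join()  # the abortive close has been issued by now
    try:
        data = client.recv(4096)
        outcome = "eof" if data == b"" else "data"
    except ConnectionResetError:
        outcome = "reset"
    finally:
        client.close()
    listener.close()
    return outcome
```

No payload is sent here on purpose: whether data queued ahead of the RST survives is exactly the platform-dependent part under discussion, so the sketch only shows the reset itself.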

I realise now that the experiments we did a while back to try to
understand this across a few different operating systems[2] had missed
this subtlety, because that Python script had an explicit close()
call, whereas PostgreSQL exits.  It still revealed that the client
isn't allowed to read any data after its write failed, which is a
known source of error messages being eaten.  What I missed is that the
client doesn't just get an RST and enter this
no-you-can't-have-the-error-message-I-have-received state in response
to data sent by the client (the usual way you expect to get RST), like
in that test, but it also does so proactively when the server process
exits, as you've explained (in other words, it's not necessary for the
client to try to write to reach this error-eating state).

[1] https://stackoverflow.com/questions/3757289/when-is-tcp-option-so-linger-0-required
[2]
https://www.postgresql.org/message-id/flat/20190306030706.GA3967%40f01898859afd.ant.amazon.com#32f9f16f9be8da5ee5c3b405d6d1829c



Re: Windows: Wrong error message at connection termination

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Interesting discovery.  I think you might get the same behaviour from
> a Unix system if you set SO_LINGER to 0 before you exit[1].  I suppose
> if a TCP implementation is partially in user space (I have no idea if
> this is true for Windows, I never use it, but I recall that Winsock
> was at some point a DLL) and can't handle the existence of any socket
> state after the process is gone, you might want to nuke everything and
> tell the peer immediately that you're doing so on exit?

It's definitely plausible that Windows does this because it can't
handle retransmits once the sender's state is gone.  However, it
seems to me that any such state would be tied to the open socket,
not to the sender process as such.  Which would suggest that an
early close() as Lars suggests would make things worse not better.
This is all just speculation unfortunately.  (Man, I hate dealing
with closed-source software.)

            regards, tom lane



Re: Windows: Wrong error message at connection termination

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> I realise now that the experiments we did a while back to try to
> understand this across a few different operating systems[2] had missed
> this subtlety, because that Python script had an explicit close()
> call, whereas PostgreSQL exits.  It still revealed that the client
> isn't allowed to read any data after its write failed, which is a
> known source of error messages being eaten.

Yeah.  After re-reading that thread, I'm a bit confused about how
to square the results we got then with Lars' report.  The Windows
documentation he pointed to does claim that the default behavior if you
issue closesocket() is to do a "graceful close in the background", which
one would think means allowing sent data to be received.  That's not what
we saw.  It's possible that we would get different results if we re-tested
with a scenario where the client doesn't attempt to send data after the
server-side close; but I'm not sure how much it's worth to improve that
case if the other case still fails hard.  In any case, our previous
results definitely show that issuing an explicit close() is no panacea.

            regards, tom lane



Re: Windows: Wrong error message at connection termination

From
Lars Kanis
Date:
On 18.11.21 at 03:04, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> I realise now that the experiments we did a while back to try to
>> understand this across a few different operating systems[2] had missed
>> this subtlety, because that Python script had an explicit close()
>> call, whereas PostgreSQL exits.  It still revealed that the client
>> isn't allowed to read any data after its write failed, which is a
>> known source of error messages being eaten.
> Yeah.  After re-reading that thread, I'm a bit confused about how
> to square the results we got then with Lars' report.  The Windows
> documentation he pointed to does claim that the default behavior if you
> issue closesocket() is to do a "graceful close in the background", which
> one would think means allowing sent data to be received.  That's not what
> we saw.  It's possible that we would get different results if we re-tested
> with a scenario where the client doesn't attempt to send data after the
> server-side close; but I'm not sure how much it's worth to improve that
> case if the other case still fails hard.

From my experiments, the Winsock implementation has the two issues
which I explained. First, it drops all received but not yet retrieved
data as soon as it receives a RST packet. Second, at process
termination it always sends a RST packet on every socket that wasn't
send-closed, regardless of whether there is any pending data.

Sending data to a socket, that was already closed from the other side is 
only one way to trigger a RST packet, but closing a socket with 
l_linger=0 is another way and process termination is the third. They all 
can lead to data loss on the receiver side, presumably because of the 
RST flag.

An alternative to closesocket() is shutdown(sock, SD_SEND). It doesn't
free the socket resource, but leads to a graceful shutdown. However the
FIN packet is sent when the shutdown() or closesocket() function is
called, and that's still shortly before the process terminates. I did some
more testing with different linger options, but it didn't change the
behavior substantially. So I didn't find any way to close the socket with
a FIN packet at the point in time of the process termination.
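
A half-close can be illustrated with Python's socket.SHUT_WR, the POSIX counterpart of SD_SEND. This is only a sketch on a Unix-style loopback connection with made-up helper names; it says nothing about what Winsock does at process exit. The server sends its message, shuts down the send direction, and can still receive from the client afterwards.

```python
import socket
import threading

MESSAGE = b"FATAL:  terminating connection\n"


def half_close_server(listener: socket.socket, received: list) -> None:
    """Send a message, half-close, then keep reading from the client."""
    conn, _ = listener.accept()
    conn.sendall(MESSAGE)
    conn.shutdown(socket.SHUT_WR)      # FIN goes out, receive side stays open
    received.append(conn.recv(4096))   # still possible after the half-close
    conn.close()


def half_close_demo():
    """Return (bytes the client read, bytes the server read afterwards)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    received = []
    server = threading.Thread(target=half_close_server,
                              args=(listener, received))
    server.start()

    client = socket.create_connection(listener.getsockname())
    chunks = []
    while True:
        chunk = client.recv(4096)
        if not chunk:  # EOF caused by the server's SHUT_WR
            break
        chunks.append(chunk)
    client.sendall(b"ack")  # the other direction is still usable
    client.close()

    server.join()
    listener.close()
    return b"".join(chunks), received[0]
```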

The other way around would be to make sure on the client side, that the 
last message is retrieved before the RST packet arrives, so that no data 
is lost. This works mostly well through the sync API of libpq, but with 
the async API the trigger for data reception is outside of the scope of 
libpq, so that there's no way to ensure recv() is called quickly enough,
after the data was received but before RST arrives. On a local 
client+server combination there is only a gap of 0.5 milliseconds or so. 
I also didn't find a way to retrieve the enqueued data after RST 
arrived. Maybe there's a nasty hack to retrieve the data afterwards, but 
I didn't dig into assembly code and memory layout of Winsock internals.
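
The client-side idea, reading whatever has arrived as soon as the socket is reported readable, might look like the following Python sketch. The drain() helper is hypothetical and is not how libpq is implemented; it only narrows the window in which a later RST could discard buffered data, it cannot close it.

```python
import socket


def drain(sock: socket.socket):
    """Read everything currently buffered without blocking.

    Returns (data, status), where status is 'open' if the connection is
    still up, 'eof' after a graceful close, or 'reset' after an abort.
    """
    sock.setblocking(False)
    chunks = []
    status = "open"
    while True:
        try:
            chunk = sock.recv(4096)
        except BlockingIOError:
            break  # nothing more buffered right now
        except (ConnectionResetError, ConnectionAbortedError):
            status = "reset"  # keep whatever was read before the error
            break
        if not chunk:
            status = "eof"
            break
        chunks.append(chunk)
    return b"".join(chunks), status
```

An event loop would call drain() from its readable callback and hand the returned bytes to the protocol parser before acting on the status.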


> In any case, our previous
> results definitely show that issuing an explicit close() is no panacea.
I don't fully understand the issue with closing the socket before
process termination. Sure, it can be valuable information that the
corresponding backend process has definitely terminated, at least in the
context of regression testing or so. But I think that losing messages
from the backend is way more critical than a non-synchronous process
termination. Am I missing something?

--

Regards,
Lars Kanis





Re: Windows: Wrong error message at connection termination

From
Thomas Munro
Date:
On Mon, Nov 22, 2021 at 8:19 AM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
> The other way around would be to make sure on the client side, that the
> last message is retrieved before the RST packet arrives, so that no data
> is lost. This works mostly well through the sync API of libpq, but with
> the async API the trigger for data reception is outside of the scope of
> libpq, so that there's no way to ensure recv() is called quickly enough,
> after the data was received but before RST arrives. On a local
> client+server combination there is only a gap of 0.5 milliseconds or so.
> I also didn't find a way to retrieve the enqueued data after RST
> arrived. Maybe there's a nasty hack to retrieve the data afterwards, but
> I didn't dig into assembly code and memory layout of Winsock internals.

Hmm.  Well, if I understand how this works (and I'm not too familiar
with this Windows code so maybe I don't), the postmaster duplicates
the socket into the child process (see
{write,read}_inheritable_socket()) and then closes its own handle (see
ServerLoop()'s call to StreamClose(port->sock)).  What if the
postmaster kept the socket open, and then closed its copy after the
child exits?  Then, I guess, maybe, Winsock socket state would live on
with a non-zero reference count and be able to perform the proper
graceful TCP shutdown dance, at least as long as the postmaster itself
is up.  Various other ideas: don't do that, but duplicate the socket
back into the postmaster before exit, or into some other process, or
rewrite PostgreSQL to use threads...
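
The reference-count intuition behind keeping a second handle open can be shown with Unix descriptor semantics in Python. This is an assumption-laden sketch: it uses socket.dup() on a socketpair, not Winsock's socket duplication, and whether Winsock behaves the same way is precisely what is unknown here. The variable names merely echo the roles in the discussion.

```python
import socket


def peer_sees_eof(sock: socket.socket) -> bool:
    """Return True if the peer has closed, False if the link is still open."""
    sock.setblocking(False)
    try:
        return sock.recv(1) == b""
    except BlockingIOError:
        return False  # no data and no EOF: connection still open


def dup_keeps_connection_open():
    """Close one of two handles, then the other, and watch the peer."""
    backend, client = socket.socketpair()
    postmaster_copy = backend.dup()  # second handle on the same connection

    backend.close()                  # stands in for the backend going away
    open_after_backend_exit = not peer_sees_eof(client)

    postmaster_copy.close()          # last handle closed: now the peer sees EOF
    closed_after_last_handle = peer_sees_eof(client)

    client.close()
    return open_after_backend_exit, closed_after_last_handle
```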



Re: Windows: Wrong error message at connection termination

From
Thomas Munro
Date:
On Mon, Nov 22, 2021 at 9:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Hmm.  Well, if I understand how this works (and I'm not too familiar
> with this Windows code so maybe I don't), the postmaster duplicates
> the socket into the child process (see
> {write,read}_inheritable_socket()) and then closes its own handle (see
> ServerLoop()'s call to StreamClose(port->sock)).  What if the
> postmaster kept the socket open, and then closed its copy after the
> child exits?  Then, I guess, maybe, Winsock socket state would live on
> with a non-zero reference count and be able to perform the proper
> graceful TCP shutdown dance, at least as long as the postmaster itself
> is up.  Various other ideas: don't do that, but duplicate the socket
> back into the postmaster before exit, or into some other process, or
> rewrite PostgreSQL to use threads...

Hmm, maybe it's still not enough.  Now that I have coffee, I thought
about the well known failure of idle_in_transaction_timeout to report
errors on Windows[1].  There'd be no RST on timeout with the above
approach, which is good, but the next time you try to send a query,
perhaps a race begins: the server's TCP stack receives the query
packet and replies with RST (the "normal" kind that is a response to
unreceivable data, not the linger=0 kind that is proactively sent),
meanwhile the client begins to read, and *probably* reads the already
buffered idle-in-transaction-timeout error message, but with unlucky
scheduling the RST arrives first and drops the buffered data (unlike
on Unix), right?

[1] https://www.postgresql.org/message-id/CAP3o3PdzM0BLmNBELA5wV6YoN_1yYBVdoOvz9kYbOuK-YQGFAw%40mail.gmail.com



Re: Windows: Wrong error message at connection termination

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Hmm.  Well, if I understand how this works (and I'm not too familiar
> with this Windows code so maybe I don't), the postmaster duplicates
> the socket into the child process (see
> {write,read}_inheritable_socket()) and then closes its own handle (see
> ServerLoop()'s call to StreamClose(port->sock)).  What if the
> postmaster kept the socket open, and then closed its copy after the
> child exits?

Ugh :-(.  For starters, we risk running out of FDs in the postmaster,
don't we?

I did some tracing just now and convinced myself that socket_close is
the first on_proc_exit callback registered in an ordinary backend,
and therefore the last action done by proc_exit_prepare.  The only
things that happen after that are PROFILE_PID_DIR setup (not relevant
in production builds), an elog(DEBUG) call, and any atexit callbacks
that third-party code might have registered.
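
The ordering argument (first callback registered, last one run) is the usual last-in, first-out behaviour of an exit-callback stack. A toy Python model, not PostgreSQL's actual implementation, makes the reasoning concrete; every callback name other than socket_close is hypothetical.

```python
class ExitCallbacks:
    """Toy model of an exit-callback stack that runs in LIFO order."""

    def __init__(self):
        self._stack = []

    def register(self, name, func):
        """Push a callback; later registrations run earlier at exit."""
        self._stack.append((name, func))

    def run(self):
        """Pop and run every callback, returning the names in run order."""
        order = []
        while self._stack:
            name, func = self._stack.pop()
            func()
            order.append(name)
        return order


def demo_order():
    """Register socket_close first and show that it ends up running last."""
    log = []
    callbacks = ExitCallbacks()
    callbacks.register("socket_close", lambda: log.append("close socket"))
    # Hypothetical later registrations, for illustration only.
    callbacks.register("release_shared_state", lambda: log.append("detach"))
    callbacks.register("flush_statistics", lambda: log.append("flush"))
    return callbacks.run()
```

Under this model, anything registered after socket_close is guaranteed to have finished before the socket is closed, which is the property the argument above relies on.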

If you're willing to avert your eyes from the question of what atexit
callbacks might do, then it'd be okay to do closesocket in socket_close,
reasoning that the backend has certainly disconnected itself from shmem
and so on, and thus is effectively done even if it is still a live process
so far as the kernel is concerned.  So maybe Lars' proposed patch is
acceptable after all.  It feels a bit shaky, but when we're sitting atop
a piece-of-junk TCP stack, we can't really have the guarantees we'd like.

The main way in which it's shaky is that future rearrangements of the
shutdown sequence, or additions of new on_proc_exit callbacks, could
create a situation where socket_close is no longer the last interesting
action.  We could imagine doing something to make it less likely for
that to happen accidentally, but I'm not sure it's worth the trouble.

Essentially this is reverting 268313a95 of 2003-05-29.  The commit
message for that fails to cite any mailing-list discussion, but after
some digging in the archives I think I did it in response to

https://www.postgresql.org/message-id/flat/009c01c31ce9%24eeaf00f0%24fb02a8c0%40muskrat

where the complaint was that a DB couldn't be dropped because a
just-closed connection was still live so far as the server was concerned.
We didn't do anything to make PQclose() synchronous, so the problem is
really still there; but the idea was that other client libraries could
make session-close synchronous if they wanted.  For that purpose,
being out of the ProcArray is really sufficient, and I think it's safe
to suppose that socket_close must run after that.

            regards, tom lane



Re: Windows: Wrong error message at connection termination

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Hmm, maybe it's still not enough.  Now that I have coffee, I thought
> about the well known failure of idle_in_transaction_timeout to report
> errors on Windows[1].

Yeah, I think that may well be a manifestation of the same problem:
once the backend exits, Winsock issues RST which prevents the client
from reading the queued data.  We had been analyzing that under the
assumption that Windows obeys the TCP RFCs ... but having now been
disabused of that optimism, it seems to match up pretty well.
It'd be useful to check if Lars' patch cures that symptom.

            regards, tom lane



Re: Windows: Wrong error message at connection termination

From
Thomas Munro
Date:
On Mon, Nov 22, 2021 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Hmm, maybe it's still not enough.  Now that I have coffee, I thought
> > about the well known failure of idle_in_transaction_timeout to report
> > errors on Windows[1].
>
> Yeah, I think that may well be a manifestation of the same problem:
> once the backend exits, Winsock issues RST which prevents the client
> from reading the queued data.  We had been analyzing that under the
> assumption that Windows obeys the TCP RFCs ... but having now been
> disabused of that optimism, it seems to match up pretty well.
> It'd be useful to check if Lars' patch cures that symptom.

Yeah, it sounds like it might solve at least the server-side problem.
Let's call that weird behaviour #1: RST on process exit.  (I wonder if
my keep-the-socket-open-in-another-process thought experiment is
theoretically better: a lingering socket should be capable of
resending data that hasn't been ack'd yet in FIN-WAIT-1 state after
close, which I suspect might not happen if the TCP stack nukes the
socket.  If close() avoids the proactive RST but still doesn't really
follow the shutdown protocol then it's papering over a crack in the
wall, but I'm not planning to argue about that...)

IIUC we'd still have weird behaviour #2 on the client side: TCP stack
drops buffered received data on the floor on receipt of RST.

So yeah, it'd be interesting to know if by avoiding/hiding weird
behaviour #1, idle_in_transaction_timeout works as desired most of the
time by tilting the race in favour of eager clients and favourable
scheduling.  If a client sends a new query and then immediately begins
to read the response, there's a good chance it'll be able to read the
already-buffered error message before the query->RST ping pong...
Which I now understand is exactly what Lars was explaining: that sync
APIs (like the psql command shown in that other thread) might have a
good chance of winning that race, but for async APIs, the author of
the async API has no idea what its client is going to do.



Re: Windows: Wrong error message at connection termination

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Mon, Nov 22, 2021 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It'd be useful to check if Lars' patch cures that symptom.

> Yeah, it sounds like it might solve at least the server-side problem.
> Let's call that weird behaviour #1: RST on process exit.  (I wonder if
> my keep-the-socket-open-in-another-process thought experiment is
> theoretically better: a lingering socket should be capable of
> resending data that hasn't been ack'd yet in FIN-WAIT-1 state after
> close, which I suspect might not happen if the TCP stack nukes the
> socket.  If close() avoids the proactive RST but still doesn't really
> follow the shutdown protocol then it's papering over a crack in the
> wall, but I'm not planning to argue about that...)

The language about "graceful shutdown" in the Windows docs at least
suggests that they finish out the TCP connection cleanly; failing
to retransmit at need would hardly qualify as "graceful".  Of course,
Redmond keeps finding ways to fail to meet reasonable expectations.

> IIUC we'd still have weird behaviour #2 on the client side: TCP stack
> drops buffered received data on the floor on receipt of RST.

Do we know that that actually happens in an arm's-length connection
(ie two separate machines)?  I wonder if the data loss is strictly
an artifact of a localhost connection.  There'd be a lot more pressure
on them to make cross-machine TCP work per spec, one would think.
But in any case, if we can avoid sending RST in this situation,
it seems mostly moot for our usage.

            regards, tom lane



Re: Windows: Wrong error message at connection termination

From
Lars Kanis
Date:
On 22.11.21 at 00:04, Tom Lane wrote:
> Do we know that that actually happens in an arm's-length connection
> (ie two separate machines)?  I wonder if the data loss is strictly
> an artifact of a localhost connection.  There'd be a lot more pressure
> on them to make cross-machine TCP work per spec, one would think.
> But in any case, if we can avoid sending RST in this situation,
> it seems mostly moot for our usage.

Sorry it took some days to get a setup to check this!

The result is as expected:

 1. Windows client to Linux server works without dropping the error message
 2. Linux client to Windows server works without dropping the error message
 3. Windows client to remote Windows server drops the error message,
    depending on the timing of the event loop

In 1. the Linux server doesn't end the connection with a RST packet, so 
that the Windows client enqueues the error message properly and doesn't 
drop it.

In 2. the Linux client doesn't care about the RST packet of the Windows 
server and properly enqueues and raises the error message.

In 3. the combination of the bad RST behavior of client and server leads
to data loss. It depends on the network timing. A delay of 0.5 ms in the
event loop was enough in a localhost setup as well as in some LAN
setup. In contrast, over some slower WLAN connection a delay of less
than 15 ms did not lose data, but higher delays still did.

The idea of running a second process that receives the socket handle,
observes the parent process and closes the socket once the parent has
exited could work, but I guess it's overly complicated and creates more
issues than it solves. The same probably applies if the master process
handles the socket closing.

So I still think it's best to close the socket as proposed in the patch.

--

Regards,
Lars Kanis





Re: Windows: Wrong error message at connection termination

From
Alexander Lakhin
Date:
Hello Lars,
27.11.2021 14:39, Lars Kanis wrote:
>
> So I still think it's best to close the socket as proposed in the patch.
Please see also the previous discussion of the topic:
https://www.postgresql.org/message-id/flat/16678-253e48d34dc0c376%40postgresql.org

Best regards,
Alexander



Re: Windows: Wrong error message at connection termination

From
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes:
> 27.11.2021 14:39, Lars Kanis wrote:
>> So I still think it's best to close the socket as proposed in the patch.

> Please see also the previous discussion of the topic:
> https://www.postgresql.org/message-id/flat/16678-253e48d34dc0c376%40postgresql.org

Hm, yeah, that discussion seems to have slipped through the cracks.
Not sure why it didn't end up in pushing something.

After re-reading that thread and re-studying relevant Windows
documentation [1][2], I think the main open question is whether
we need to issue shutdown() or not, and if so, whether to use
SD_BOTH or just SD_SEND.  I'm inclined to prefer not calling
shutdown(), because [1] is self-contradictory as to whether it
can block, and [2] is pretty explicit that it's not necessary.

            regards, tom lane

[1] https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-shutdown
[2] https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2



Re: Windows: Wrong error message at connection termination

From
Alexander Lakhin
Date:
Hello Tom,
29.11.2021 22:16, Tom Lane wrote:
> Hm, yeah, that discussion seems to have slipped through the cracks.
> Not sure why it didn't end up in pushing something.
>
> After re-reading that thread and re-studying relevant Windows
> documentation [1][2], I think the main open question is whether
> we need to issue shutdown() or not, and if so, whether to use
> SD_BOTH or just SD_SEND.  I'm inclined to prefer not calling
> shutdown(), because [1] is self-contradictory as to whether it
> can block, and [2] is pretty explicit that it's not necessary.
>
>             regards, tom lane
>
> [1] https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-shutdown
> [2] https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2
I've tested the close-only patch with pg_sleep() in pqReadData(), and it
works too. So I wonder how to understand "To assure that all data is
sent and received on a connected socket before it is closed, an
application should use shutdown to close connection before calling
closesocket." in [1].
Maybe they mean that shutdown should be used before, but not after
closesocket :). Or maybe the Windows behaviour somehow evolved over
time. (With the patch I cannot reproduce the FATAL message loss even on
Windows 2012 R2.) So without practical evidence of the importance of
shutdown() I'm inclined toward the simpler solution too.

As to 268313a95, back in 2003 it was possible to compile the server on
Windows only using Cygwin (though you could compile libpq with Visual C,
see [3]). So the "#ifdef WIN32" that is proposed now will not affect that
scenario anyway.

Best regards,
Alexander

[3]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob_plain;f=doc/src/sgml/install-win32.sgml;hb=268313a95



Re: Windows: Wrong error message at connection termination

From
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes:
> 29.11.2021 22:16, Tom Lane wrote:
>> After re-reading that thread and re-studying relevant Windows
>> documentation [1][2], I think the main open question is whether
>> we need to issue shutdown() or not, and if so, whether to use
>> SD_BOTH or just SD_SEND.  I'm inclined to prefer not calling
>> shutdown(), because [1] is self-contradictory as to whether it
>> can block, and [2] is pretty explicit that it's not necessary.

> I've tested the close-only patch with pg_sleep() in pqReadData(), and it
> works too.

Thanks for testing!

> So I wonder how to understand "To assure that all data is
> sent and received on a connected socket before it is closed, an
> application should use shutdown to close connection before calling
> closesocket." in [1].

I suppose their documentation has evolved over time.  This sentence
probably predates their explicit acknowledgement in [2] that you don't
have to call shutdown().  Maybe, once upon a time with very old
versions of Winsock, you did have to do so if you wanted graceful close.

I'll push the close-only change in a little bit.

            regards, tom lane



Re: Windows: Wrong error message at connection termination

From
Sergey Shinderuk
Date:
On 02.12.2021 22:31, Tom Lane wrote:
> I'll push the close-only change in a little bit.

Unexpectedly, this changes the error message:

    postgres=# set idle_session_timeout = '1s';
    SET
    postgres=# select 1;
    could not receive data from server: Software caused connection abort (0x00002745/10053)
    The connection to the server was lost. Succeeded.
    postgres=#

Without shutdown/closesocket it would most likely be:

    server closed the connection unexpectedly
            This probably means the server terminated abnormally
            before or while processing the request.

When the timeout expires, the server sends the error message and 
gracefully closes the connection by sending a FIN.  Later, psql sends 
another query to the server, and the server responds with a RST.  But 
now recv() returns WSAECONNABORTED(10053) instead of WSAECONNRESET(10054).

Without shutdown/closesocket, after the timeout expires, the server 
sends the error message, the client sends an ACK, and the server 
responds with a RST.  Then psql tries to send the next query, but 
nothing is sent at the TCP level, and the next recv() returns WSAECONNRESET.

IIUC, in both cases we may or may not recv() the error message from the
server depending on how fast the RST arrives from the server.

Should we handle ECONNABORTED similarly to ECONNRESET in pqsecure_raw_read?

-- 
Sergey Shinderuk        https://postgrespro.com/



Re: Windows: Wrong error message at connection termination

From
Sergey Shinderuk
Date:
On 14.01.2022 13:01, Sergey Shinderuk wrote:
> When the timeout expires, the server sends the error message and 
> gracefully closes the connection by sending a FIN.  Later, psql sends 
> another query to the server, and the server responds with a RST.  But 
> now recv() returns WSAECONNABORTED(10053) instead of WSAECONNRESET(10054).

On the other hand, I cannot reproduce this behavior with a remote server 
even if I pause psql just before the recv() call to let the RST win the race.

So I get:

postgres=# set idle_session_timeout = '1s';
recv() returned 15 errno 0
SET
recv() returned -1 errno 10035 (WSAEWOULDBLOCK)
postgres=# select 1;
recv() returned 116 errno 0
recv() returned 0 errno 0
recv() returned 0 errno 0
FATAL:  terminating connection due to idle-session timeout
server closed the connection unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.

recv() signals EOF like on Unix.

Here I connected from a Windows virtual machine to the macOS host, but 
the Wireshark dump looks the same (there is a RST) as for a localhost 
connection.

Is this "error-eating" behavior of RST on Windows specific only to 
localhost connections?

-- 
Sergey Shinderuk        https://postgrespro.com/



Re: Windows: Wrong error message at connection termination

From
Sergey Shinderuk
Date:
On 14.01.2022 13:01, Sergey Shinderuk wrote:
> Unexpectedly, this changes the error message:
> 
>      postgres=# set idle_session_timeout = '1s';
>      SET
>      postgres=# select 1;
>      could not receive data from server: Software caused connection abort (0x00002745/10053)

For the record, after more poking I realized that it depends on timing. 
  By injecting delays I can get any of the following from libpq:

* could not receive data from server: Software caused connection abort
* server closed the connection unexpectedly
* no connection to the server


> Should we handle ECONNABORTED similarly to ECONNRESET in pqsecure_raw_read?

So this doesn't make sense anymore.

Sorry for the noise.

-- 
Sergey Shinderuk        https://postgrespro.com/



Re: Windows: Wrong error message at connection termination

From
Tom Lane
Date:
Sergey Shinderuk <s.shinderuk@postgrespro.ru> writes:
> On 14.01.2022 13:01, Sergey Shinderuk wrote:
>> Unexpectedly, this changes the error message:
> ...
> For the record, after more poking I realized that it depends on timing.
>   By injecting delays I can get any of the following from libpq:
> * could not receive data from server: Software caused connection abort
> * server closed the connection unexpectedly
> * no connection to the server

Thanks for the follow-up.  At the moment I'm not planning to do anything
pending the results of the other thread [1].  It seems likely though that
we'll end up reverting this explicit-close behavior in the back branches,
as the other changes involved look too invasive for back-patching.

            regards, tom lane

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com