Thread: Add client connection check during the execution of the query


Add client connection check during the execution of the query

From:
s.cherkashin@postgrespro.ru
Date:
This patch adds verification of the connection with the client during 
the execution of the SQL query. The feature is controlled by the GUC 
variable ‘client_connection_check_interval’. The default check interval 
is 1 second. If you set ‘client_connection_check_interval’ to 0, the 
check is not performed.
The feature is useful in cases when, during the execution of a very 
long query, the client suddenly terminates the connection: this allows 
the backend to cancel further execution of the query and free server 
resources.

Attachments

Re: Add client connection check during the execution of the query

From:
Tom Lane
Date:
s.cherkashin@postgrespro.ru writes:
> This patch adds verification of the connection with the client during 
> the execution of the SQL query. The feature is controlled by the GUC 
> variable ‘client_connection_check_interval’. The default check interval 
> is 1 second. If you set ‘client_connection_check_interval’ to 0, the 
> check is not performed.

I took a quick look through this.

* It won't even compile on non-Linux platforms, because MSG_DONTWAIT
is a Linux-ism.  Perhaps that can be replaced by putting the client
socket into nonblock mode, but I'm not very sure that that'll work
(especially when using OpenSSL or another TLS implementation).

* I'm not convinced that this will reliably detect client connection loss.
AFAICS, if there is any unread data pending, it'd report that all is well
even if the client dropped off the net after sending that data.  It's hard
to evaluate how likely such a situation is, but one really obvious case
is that the client might've sent an 'X' message to try to close the
connection gracefully.  Also, use of TLS would again make things much
harder to reason about, because the TLS layer may send or receive data
that we don't know about.

* The management of the pending timeout interrupt seems like a mess.
Why did you involve ProcessInterrupts in that?  It seems likely to queue
extra timeouts at random times due to unrelated interrupts causing that
bit of code to run, and/or cause weird gaps in the timeout intervals due
to not being run promptly.  I'd be inclined to set this up so that the
timeout handler itself re-queues the timeout (I think that will work, or
if not, we should probably fix timeout.c so that it can).

* BTW, I am not on board with making this enabled-by-default.

This does seem like possibly a useful option if we can make it
work portably/reliably, but I don't have very high hopes for that.

            regards, tom lane


Re: Add client connection check during the execution of the query

From:
Andres Freund
Date:
Hi,

On 2019-01-13 18:05:39 -0500, Tom Lane wrote:
> s.cherkashin@postgrespro.ru writes:
> > This patch adds verification of the connection with the client during 
> > the execution of the SQL query. The feature is controlled by the GUC 
> > variable ‘client_connection_check_interval’. The default check interval 
> > is 1 second. If you set ‘client_connection_check_interval’ to 0, the 
> > check is not performed.
> 
> I took a quick look through this.
> 
> * It won't even compile on non-Linux platforms, because MSG_DONTWAIT
> is a Linux-ism.  Perhaps that can be replaced by putting the client
> socket into nonblock mode, but I'm not very sure that that'll work
> (especially when using OpenSSL or another TLS implementation).
> 
> * I'm not convinced that this will reliably detect client connection loss.
> AFAICS, if there is any unread data pending, it'd report that all is well
> even if the client dropped off the net after sending that data.  It's hard
> to evaluate how likely such a situation is, but one really obvious case
> is that the client might've sent an 'X' message to try to close the
> connection gracefully.  Also, use of TLS would again make things much
> harder to reason about, because the TLS layer may send or receive data
> that we don't know about.
> 
> * The management of the pending timeout interrupt seems like a mess.
> Why did you involve ProcessInterrupts in that?  It seems likely to queue
> extra timeouts at random times due to unrelated interrupts causing that
> bit of code to run, and/or cause weird gaps in the timeout intervals due
> to not being run promptly.  I'd be inclined to set this up so that the
> timeout handler itself re-queues the timeout (I think that will work, or
> if not, we should probably fix timeout.c so that it can).
> 
> * BTW, I am not on board with making this enabled-by-default.
> 
> This does seem like possibly a useful option if we can make it
> work portably/reliably, but I don't have very high hopes for that.

Given that nothing happened since this message, and the commitfest is
ending, I'm going to mark this as returned with feedback.

Greetings,

Andres Freund


Re: Add client connection check during the execution of the query

From:
s.cherkashin@postgrespro.ru
Date:
The purpose of this patch is to stop the execution of continuous 
requests in case of a disconnection from the client. In most cases, the 
client must wait for a response from the server before sending new data 
- which means there should not remain unread data on the socket and we 
will be able to detect a broken connection.
Perhaps exceptions are possible, but I could not think of such a use 
case (except COPY). I would be grateful if someone could offer such 
cases or their solutions.
I added a test for the GUC variable when the client connects via SSL, 
but I'm not sure that this test is really necessary.

Best regards,
Sergey Cherkashin.

Attachments

Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Sat, Feb 9, 2019 at 6:16 AM <s.cherkashin@postgrespro.ru> wrote:
> The purpose of this patch is to stop the execution of continuous
> requests in case of a disconnection from the client. In most cases, the
> client must wait for a response from the server before sending new data
> - which means there should not remain unread data on the socket and we
> will be able to determine a broken connection.
> Perhaps exceptions are possible, but I could not think of such a use
> case (except COPY). I would be grateful if someone could offer such
> cases or their solutions.
> I added a test for the GUC variable when the client connects via SSL,
> but I'm not sure that this test is really necessary.

Hello Sergey,

This seems like a reasonable idea to me.  There is no point in running
a monster 24 hour OLAP query if your client has gone away.  It's using
MSG_PEEK which is POSIX, and I can't immediately think of any reason
why it's not safe to try to peek at a byte in that socket at any time.
Could you add a comment to explain why you have !PqCommReadingMsg &&
!PqCommBusy?  The tests pass on a couple of different Unixoid OSes I
tried.  Is it really necessary to do a bunch of IO and busy CPU work
in $long_query?  pg_sleep(60) can do the job, because it includes a
standard CHECK_FOR_INTERRUPTS/latch loop that will loop around on
SIGALRM.  I just tested that your patch correctly interrupts
pg_sleep() if I kill -9 my psql process.  Why did you call the timeout
SKIP_CLIENT_CHECK_TIMEOUT (I don't understand the "SKIP" part)?  Why
not CLIENT_CONNECTION_CHECK_TIMEOUT or something?
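The check being discussed boils down to a one-byte non-blocking peek. As a hedged, self-contained sketch (this is not the patch's actual pg_check_client_connection(), and it assumes a socket already set to O_NONBLOCK rather than using the Linux-only MSG_DONTWAIT flag):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Sketch of the MSG_PEEK liveness check (not the patch's actual code).
 * Assumes the socket is already in non-blocking mode.  Returns true
 * while the connection still looks alive.
 */
static bool
connection_alive(int sock)
{
    char        c;
    ssize_t     r = recv(sock, &c, 1, MSG_PEEK);

    if (r == 0)
        return false;           /* orderly EOF: the client closed */
    if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return true;            /* no data pending, connection intact */
    if (r < 0)
        return false;           /* ECONNRESET and friends: lost */
    return true;                /* unread data pending (maybe an 'X') */
}
```

Note the last branch is exactly the caveat raised earlier in the thread: pending unread data makes the check report "alive" even if the client has since gone away.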

I wonder if it's confusing to users that you get "connection to client
lost" if the connection goes away while running a query, but nothing
if the connection goes away without saying goodbye ("X") while idle.

The build fails on Windows.  I think it's because
src/tools/msvc/Mkvcbuild.pm is trying to find something to compile
under src/test/modules/connection, and I think the solution is to add
that to the variable @contrib_excludes.  (I wonder if that script
should assume there is nothing to build instead of dying with "Could
not determine contrib module type for connection", otherwise every
Unix hacker is bound to get this wrong every time.)

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.45820

Aside from that problem, my Windows CI building thing isn't smart
enough to actually run those extra tests yet, so I don't know if it
actually works on that platform yet (I really need to teach that thing
to use the full buildfarm scripts...)


--
Thomas Munro
https://enterprisedb.com



Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Fri, Jul 5, 2019 at 5:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Feb 9, 2019 at 6:16 AM <s.cherkashin@postgrespro.ru> wrote:
> > The purpose of this patch is to stop the execution of continuous
> > requests in case of a disconnection from the client. In most cases, the
> > client must wait for a response from the server before sending new data
> > - which means there should not remain unread data on the socket and we
> > will be able to determine a broken connection.
> > Perhaps exceptions are possible, but I could not think of such a use
> > case (except COPY). I would be grateful if someone could offer such
> > cases or their solutions.
> > I added a test for the GUC variable when the client connects via SSL,
> > but I'm not sure that this test is really necessary.
>
> [review]

Email to Sergey is bouncing back.  I've set this to "Waiting on
author" in the Commitfest app.


--
Thomas Munro
https://enterprisedb.com



Re: Add client connection check during the execution of the query

From:
Tatsuo Ishii
Date:
> The purpose of this patch is to stop the execution of continuous
> requests in case of a disconnection from the client.

Pgpool-II already does this by sending a parameter status message to
the client. Clients are expected always to be prepared to receive the
parameter status message. This way I believe we could reliably detect
whether the connection to the client is broken.
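For reference, such an application-level keepalive message is small: per the frontend/backend protocol, a ParameterStatus message is the byte 'S', an int32 length (which includes itself), then two NUL-terminated strings. The helper below is an illustrative sketch, not Pgpool-II's code:

```c
#include <stdint.h>
#include <string.h>

/*
 * Sketch: assemble a ParameterStatus ('S') protocol message into buf.
 * Layout per the PostgreSQL wire protocol: 'S', int32 length (network
 * byte order, including the length field itself), name\0, value\0.
 * Returns the total number of bytes written.
 */
static size_t
build_parameter_status(char *buf, const char *name, const char *value)
{
    size_t      nlen = strlen(name) + 1;    /* include terminating NUL */
    size_t      vlen = strlen(value) + 1;
    uint32_t    len = 4 + nlen + vlen;      /* length includes itself */
    unsigned char *p = (unsigned char *) buf;

    *p++ = 'S';
    *p++ = (len >> 24) & 0xFF;              /* int32, big-endian */
    *p++ = (len >> 16) & 0xFF;
    *p++ = (len >> 8) & 0xFF;
    *p++ = len & 0xFF;
    memcpy(p, name, nlen);
    memcpy(p + nlen, value, vlen);
    return 1 + len;                         /* bytes on the wire */
}
```

Because clients must accept this message at any time between queries, writing one and checking for an error is a protocol-legal way to probe the connection.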

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Add client connection check during the execution of the query

From:
Tatsuo Ishii
Date:
> This seems like a reasonable idea to me.  There is no point in running
> a monster 24 hour OLAP query if your client has gone away.  It's using
> MSG_PEEK which is POSIX, and I can't immediately think of any reason
> why it's not safe to try to peek at a byte in that socket at any time.

I am not familiar with Windows, but I came across this article
written by Microsoft:

https://support.microsoft.com/en-us/help/192599/info-avoid-data-peeking-in-winsock

It seems using MSG_PEEK is not recommended by Microsoft.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Fri, Jul 5, 2019 at 6:42 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> > This seems like a reasonable idea to me.  There is no point in running
> > a monster 24 hour OLAP query if your client has gone away.  It's using
> > MSG_PEEK which is POSIX, and I can't immediately think of any reason
> > why it's not safe to try to peek at a byte in that socket at any time.
>
> I am not familiar with Windows, but I came across this article
> written by Microsoft:
>
> https://support.microsoft.com/en-us/help/192599/info-avoid-data-peeking-in-winsock
>
> It seems using MSG_PEEK is not recommended by Microsoft.

Hmm, interesting.  Using it very infrequently just as a way to detect
that the other end has gone away doesn't seem too crazy based on
anything in that article though, does it?  What they're saying
actually applies to every operating system, not just Windows, AFAICS.
Namely, don't use MSG_PEEK frequently because it's a syscall and takes
locks in the kernel, and don't use it to wait for full messages to
arrive, or you might effectively deadlock if internal buffers are
full.  But Sergey's patch only uses it to check if we could read 1
single byte, and does so very infrequently (the GUC should certainly
be set to at least many seconds).

What else could we do?  Assuming the kernel actually knows the
connection has gone away, the WaitEventSetWait() interface is no help
on its own, I think, because it'll just tell you the socket is ready
for reading when it's closed; you still have to actually try to read
to distinguish a closed connection from a data byte.

I tried this patch using a real network with two machines.  I was able
to get the new "connection to client lost" error by shutting down a
network interface (effectively yanking a cable), but only with TCP
keepalive configured.  That's not too surprising; without that and
without trying to write, there is no way for the kernel to know that
the other end has gone.
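"TCP keepalive configured" here means the standard socket options, which PostgreSQL exposes as the tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count settings. A hedged sketch of the underlying setsockopt() calls (the TCP_KEEP* option names below are the Linux spellings; other platforms differ):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Sketch: enable TCP keepalive so the kernel itself probes a silent
 * peer.  Mirrors PostgreSQL's tcp_keepalives_* settings; TCP_KEEPIDLE
 * etc. are the Linux names for these options.  Returns 0 on success.
 */
static int
enable_keepalive(int sock, int idle_s, int interval_s, int count)
{
    int         on = 1;

    if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_s, sizeof(interval_s)) < 0)
        return -1;
    return setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT,
                      &count, sizeof(count));
}
```

With these set, the kernel eventually notices a vanished peer even if no application data flows, at which point the MSG_PEEK check sees the error.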

-- 
Thomas Munro
https://enterprisedb.com



Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Fri, Jul 5, 2019 at 6:28 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> > The purpose of this patch is to stop the execution of continuous
> > requests in case of a disconnection from the client.
>
> Pgpool-II already does this by sending a parameter status message to
> the client. It is expected that clients are always prepared to receive
> the parameter status message. This way I believe we could reliably
> detect that the connection to the client is broken or not.

Hmm.  If you send a message, it's basically application-level
keepalive.  But it's a lot harder to be sure that the protocol and
socket are in the right state to insert a message at every possible
CHECK_FOR_INTERRUPT() location.  Sergey's proposal of recv(MSG_PEEK)
doesn't require any knowledge of the protocol at all, though it
probably does need TCP keepalive to be configured to be useful for
remote connections.

-- 
Thomas Munro
https://enterprisedb.com



Re: Add client connection check during the execution of the query

From:
Stas Kelvich
Date:

> On 5 Jul 2019, at 11:46, Thomas Munro <thomas.munro@gmail.com> wrote:
> 
> On Fri, Jul 5, 2019 at 6:28 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>>> The purpose of this patch is to stop the execution of continuous
>>> requests in case of a disconnection from the client.
>> 
>> Pgpool-II already does this by sending a parameter status message to
>> the client. It is expected that clients are always prepared to receive
>> the parameter status message. This way I believe we could reliably
>> detect that the connection to the client is broken or not.
> 
> Hmm.  If you send a message, it's basically application-level
> keepalive.  But it's a lot harder to be sure that the protocol and
> socket are in the right state to insert a message at every possible
> CHECK_FOR_INTERRUPT() location.  Sergey's proposal of recv(MSG_PEEK)
> doesn't require any knowledge of the protocol at all, though it
> probably does need TCP keepalive to be configured to be useful for
> remote connections.


Well, indeed, in case of a cable disconnect the only way to detect it
with the proposed approach is to have TCP keepalive. However, if the
disconnection happens due to a client application shutdown, the client
OS should itself properly close that connection, and therefore this
patch will detect such a situation without keepalives configured.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Sat, Jul 6, 2019 at 12:27 AM Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Well, indeed, in case of a cable disconnect the only way to detect it
> with the proposed approach is to have TCP keepalive. However, if the
> disconnection happens due to a client application shutdown, the client
> OS should itself properly close that connection, and therefore this
> patch will detect such a situation without keepalives configured.

Yeah.

+1 for this patch, with a few adjustments including making the test
use pg_sleep() as mentioned.  It does something useful, namely
cancelling very long running queries sooner if the client has gone
away instead of discovering that potentially much later when sending a
response.  It does so with a portable kernel interface (though we
haven't heard from a Windows tester), and I think it's using it in a
safe way (we're not doing the various bad things you can do with
MSG_PEEK, and the fd is expected to be valid for the process's
lifetime, and the socket is always in non-blocking mode*, so I don't
think there is any bad time for pg_check_client_connection() to run).
It reuses the existing timer infrastructure so there isn't really any
new overhead.  One syscall every 10 seconds or whatever at the next
available CFI is basically nothing.  On its own, this patch will
reliably detect clients that closed abruptly or exited/crashed (so
the client kernel sends a FIN packet).  In combination with TCP
keepalive, it'll also detect clients that went away because the
network or client kernel ceased to exist.

*There are apparently no callers of pg_set_block(), so if you survived
pq_init() you have a non-blocking socket.  If you're on Windows, the
code always sets the magic pgwin32_noblock global flag before trying
to peek.  I wondered if it's OK that the CFI would effectively clobber
that with 0 on its way out, but that seems to be OK because every
place in the code that sets that flag does so immediately before an IO
operation without a CFI in between.  As the comment in pgstat.c says
"This is extremely broken and should be fixed someday.".  I wonder if
we even need that flag at all now that all socket IO is non-blocking.

-- 
Thomas Munro
https://enterprisedb.com



Re: Add client connection check during the execution of the query

From:
Tatsuo Ishii
Date:
> Yeah.
> 
> +1 for this patch, with a few adjustments including making the test
> use pg_sleep() as mentioned.  It does something useful, namely
> cancelling very long running queries sooner if the client has gone
> away instead of discovering that potentially much later when sending a
> response.  It does so with a portable kernel interface (though we
> haven't heard from a Windows tester), and I think it's using it in a
> safe way (we're not doing the various bad things you can do with
> MSG_PEEK, and the fd is expected to be valid for the process's
> lifetime, and the socket is always in non-blocking mode*, so I don't
> think there is any bad time for pg_check_client_connection() to run).
> It reuses the existing timer infrastructure so there isn't really any
> new overhead.  One syscall every 10 seconds or whatever at the next
> available CFI is basically nothing.  On its own, this patch will
> reliably detect clients that closed abruptly or exited/crashed (so
> the client kernel sends a FIN packet).  In combination with TCP
> keepalive, it'll also detect clients that went away because the
> network or client kernel ceased to exist.
> 
> *There are apparently no callers of pg_set_block(), so if you survived
> pq_init() you have a non-blocking socket.  If you're on Windows, the
> code always sets the magic pgwin32_noblock global flag before trying
> to peek.  I wondered if it's OK that the CFI would effectively clobber
> that with 0 on its way out, but that seems to be OK because every
> place in the code that sets that flag does so immediately before an IO
> operation without a CFI in between.  As the comment in pgstat.c says
> "This is extremely broken and should be fixed someday.".  I wonder if
> we even need that flag at all now that all socket IO is non-blocking.

I have looked into the patch and tested a little bit.

First of all, I had to grab the February snapshot to test the patch
because it does not apply to the current HEAD. I noticed that there is
some confusion in the doc and code regarding what the new
configuration parameter means. According to the doc:

+        Default value is <literal>zero</literal> - it disables connection
+        checks, so the backend will detect client disconnection only when trying
+        to send a response to the query.

But guc.c comment says:

+            gettext_noop("Sets the time interval for checking connection with the client."),
+            gettext_noop("A value of -1 disables this feature. Zero selects a suitable default value."),

Probably the doc is correct since the actual code does so.

Also I found this in postgresql.conf default:

#client_connection_check_interval = 1000    # set time interval between

So here the default value seems to be 1000. If so, guc.c should be
adjusted and the doc should be changed accordingly. I am not sure.

Next I have tested the patch using standard pgbench.

With the feature enabled with 1000ms check interval:
$ pgbench -c 10 -T 300 -S test
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 300 s
number of transactions actually processed: 19347995
latency average = 0.155 ms
tps = 64493.278581 (including connections establishing)
tps = 64493.811945 (excluding connections establishing)

Without the feature (client-connection-check-interval = 0)
$ pgbench -c 10 -T 300 -S test
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 300 s
number of transactions actually processed: 20314812
latency average = 0.148 ms
tps = 67715.993428 (including connections establishing)
tps = 67717.251843 (excluding connections establishing)

So the performance is about 5% down with the feature enabled in this
case.  For me, a 5% drop is not negligible. Probably we should warn
about this in the doc.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Thu, Jul 18, 2019 at 3:19 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> So the performance is about 5% down with the feature enabled in this
> case.  For me, a 5% drop is not negligible. Probably we should warn
> about this in the doc.

Yeah, the timer logic is wrong.  I didn't have time to look into it
but with truss/strace for some reason I see 3 setitimer() syscalls for
every query, but I think this doesn't even need to set the timer for
every query.


--
Thomas Munro
https://enterprisedb.com



Re: Add client connection check during the execution of the query

From:
Tatsuo Ishii
Date:
> Yeah, the timer logic is wrong.  I didn't have time to look into it
> but with truss/strace for some reason I see 3 setitimer() syscalls for
> every query, but I think this doesn't even need to set the timer for
> every query.

Hum. I see 2 setitimer() calls, instead of 3.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Add client connection check during the execution of the query

From:
Tom Lane
Date:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
>> Yeah, the timer logic is wrong.  I didn't have time to look into it
>> but with truss/strace for some reason I see 3 setitimer() syscalls for
>> every query, but I think this doesn't even need to set the timer for
>> every query.

> Hum. I see 2 setitimer() calls, instead of 3.

src/backend/utils/misc/timeout.c is not really designed for there
to be timeouts that persist across multiple queries.  It can probably
be made better, but this patch doesn't appear to have touched any of
that logic.

To point to just one obvious problem, the error recovery path
(sigsetjmp block) in postgres.c does

        disable_all_timeouts(false);

which cancels *all* timeouts.  Probably don't want that behavior
anymore.

I think the issue you're complaining about may have to do with
the fact that if there's no statement timeout active, both
enable_statement_timeout and disable_statement_timeout will
call "disable_timeout(STATEMENT_TIMEOUT, false);".  That does
nothing, as desired, if there are no other active timeouts ...
but if there is one, ie the client_connection timeout, we'll
end up calling schedule_alarm which will call setitimer even
if the desired time-of-nearest-timeout hasn't changed.
That was OK behavior for the set of timeouts that the code
was designed to handle, but we're going to need to be smarter
now.
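The fix being pointed at here can be sketched as: remember the deadline most recently programmed into the interval timer, and skip the syscall when the nearest pending timeout has not moved. A toy model (the counter stands in for the setitimer() call; timeout.c's real bookkeeping is more involved):

```c
#include <stdint.h>

/*
 * Toy model of deadline caching in schedule_alarm().  The real code
 * would call setitimer() where the counter is incremented; here the
 * counter just lets us observe how many syscalls would be issued.
 */
static int64_t programmed_deadline = -1;    /* -1: no timer armed */
static int setitimer_calls = 0;             /* stand-in for the syscall */

static void
schedule_alarm(int64_t nearest_deadline)
{
    if (nearest_deadline == programmed_deadline)
        return;                 /* deadline unchanged: no syscall needed */

    setitimer_calls++;          /* the real code would setitimer() here */
    programmed_deadline = nearest_deadline;
}
```

With this caching, enabling or disabling a statement timeout while a long-lived client_connection timeout is pending would no longer force a redundant setitimer() when the nearest deadline stays the same.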

            regards, tom lane



Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Thu, Jul 18, 2019 at 5:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> >> Yeah, the timer logic is wrong.  I didn't have time to look into it
> >> but with truss/strace for some reason I see 3 setitimer() syscalls for
> >> every query, but I think this doesn't even need to set the timer for
> >> every query.
>
> > Hum. I see 2 setitimer() calls, instead of 3.
>
> src/backend/utils/misc/timeout.c is not really designed for there
> to be timeouts that persist across multiple queries.  It can probably
> be made better, but this patch doesn't appear to have touched any of
> that logic.
>
> To point to just one obvious problem, the error recovery path
> (sigsetjmp block) in postgres.c does
>
>                 disable_all_timeouts(false);
>
> which cancels *all* timeouts.  Probably don't want that behavior
> anymore.
>
> I think the issue you're complaining about may have to do with
> the fact that if there's no statement timeout active, both
> enable_statement_timeout and disable_statement_timeout will
> call "disable_timeout(STATEMENT_TIMEOUT, false);".  That does
> nothing, as desired, if there are no other active timeouts ...
> but if there is one, ie the client_connection timeout, we'll
> end up calling schedule_alarm which will call setitimer even
> if the desired time-of-nearest-timeout hasn't changed.
> That was OK behavior for the set of timeouts that the code
> was designed to handle, but we're going to need to be smarter
> now.

Ok, I think we like this feature proposal and I think we have a pretty
good handle on what needs to be done next, but without a patch author
that's not going to happen.  I've therefore marked it "Returned with
feedback".  If Sergey or someone else is interested in picking this up
and doing the work in time for CF2, please feel free to change that to
'moved to next'.

I wonder if it'd be possible, along the way, to make it so that
statement_timeout doesn't require calling setitimer() for every
statement, instead letting the existing timer run if there is a
reasonable amount left to run on it, and then resetting it if needed.
I haven't looked into whether that's hard/impossible due to some race
condition I haven't spent the time to think about, but if Ishii-san
sees a 5% slowdown from a couple of setitimer() calls, I guess using
statement_timeout might currently cause a 2.5% slowdown.

-- 
Thomas Munro
https://enterprisedb.com



Re: Add client connection check during the execution of the query

From:
Konstantin Knizhnik
Date:

On 18.07.2019 6:19, Tatsuo Ishii wrote:
>
> I noticed that there is some confusion in the doc and code regarding what the new
> configuration parameter means. According to the doc:
>
> +        Default value is <literal>zero</literal> - it disables connection
> +        checks, so the backend will detect client disconnection only when trying
> +        to send a response to the query.
>
> But guc.c comment says:
>
> +            gettext_noop("Sets the time interval for checking connection with the client."),
> +            gettext_noop("A value of -1 disables this feature. Zero selects a suitable default value."),
>
> Probably the doc is correct since the actual code does so.

Yes, value -1 is not even accepted due to the specified range.

> tps = 67715.993428 (including connections establishing)
> tps = 67717.251843 (excluding connections establishing)
>
> So the performance is about 5% down with the feature enabled in this
> case.  For me, a 5% drop is not negligible. Probably we should warn
> about this in the doc.

I also see some performance degradation, although it is not so large in
my case (I used pgbench with scale factor 10 and ran the same command as
you): the difference, 103k vs. 105k TPS, is smaller than 2%.

It seems to me that it is not necessary to enable the timeout at each command:

@@ -4208,6 +4210,9 @@ PostgresMain(int argc, char *argv[],
           */
          CHECK_FOR_INTERRUPTS();
          DoingCommandRead = false;
+        if (client_connection_check_interval)
+            enable_timeout_after(SKIP_CLIENT_CHECK_TIMEOUT,
+                                 client_connection_check_interval);

          /*
           * (5) turn off the idle-in-transaction timeout


Unlike the statement timeout or the idle-in-transaction timeout, the
precise start of the measured interval is not important here. So it is
possible to enable it once, before the main backend loop:

@@ -3981,6 +3983,10 @@ PostgresMain(int argc, char *argv[],
         if (!IsUnderPostmaster)
                 PgStartTime = GetCurrentTimestamp();

+       if (client_connection_check_interval)
+               enable_timeout_after(SKIP_CLIENT_CHECK_TIMEOUT,
+                                    client_connection_check_interval);
+
         /*
          * POSTGRES main processing loop begins here
          *


But actually I do not see much difference from moving the code that
enables the timeout. Moreover, the difference in performance hardly
depends on the value of the timeout. I set it to 100 seconds with a
30-second pgbench run (so the timeout never fired and recv() was never
called), and there is still a small difference in performance.

After some experiments I found out that just the presence of an active
timer results in a small performance penalty. You can easily check it:
set, for example, statement_timeout to the same large value (100
seconds) and you will get the same small slowdown.

So recv() itself is not the source of the problem. Actually, any system
call (except maybe fsync) performed less than once per second cannot
have a noticeable impact on performance. So I do not think that
recv(MSG_PEEK) can cause any performance problem on Windows or any
other platform.
But I wonder why we cannot just perform a poll() with the POLLOUT flag
and a zero timeout. If the OS has detected a closed connection, it
should return POLLHUP, shouldn't it? I am not sure whether it is a more
portable or more efficient way; it just seems a little more natural
(from my point of view) as a check of whether the connection is still
alive.
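That poll()-based idea can be sketched as follows. This is an assumption-laden illustration: whether a dead peer is reported as POLLHUP, POLLERR, or merely POLLIN-with-EOF varies by platform and socket type, which is one reason recv(MSG_PEEK) is attractive instead:

```c
#include <poll.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Sketch: probe the socket with a zero-timeout poll() and look for
 * POLLHUP/POLLERR.  Illustrative only; how (and whether) a closed peer
 * shows up in revents is platform-dependent.
 */
static bool
peer_hung_up(int sock)
{
    struct pollfd pfd;

    pfd.fd = sock;
    pfd.events = POLLOUT;       /* POLLHUP/POLLERR are reported anyway */
    pfd.revents = 0;

    if (poll(&pfd, 1, 0) <= 0)
        return false;           /* no events (or poll error): assume alive */
    return (pfd.revents & (POLLHUP | POLLERR)) != 0;
}
```

Even where POLLHUP is reported, this still cannot distinguish "peer closed after sending a graceful 'X'" from an abrupt disconnect, so it shares the caveats of the MSG_PEEK approach.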


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Sat, Mar 6, 2021 at 5:50 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> +       if (client_connection_check_interval > 0)
> +           enable_timeout_after(CLIENT_CONNECTION_CHECK_TIMEOUT,
>
> +   /* Start timeout for checking if the client has gone away if necessary. */
> +   if (client_connection_check_interval)

Thanks!  Fixed.

> +        Sets a time interval, in milliseconds, between periodic
>
> I wonder if the interval should be expressed in seconds. Since the description says: while running very long
queries.

Hmm.  Personally I think we should stop emphasising default units
(it's an internal detail) and encourage people to specify the units.
But that was indeed not good wording, so I fixed it.  I've now
rewritten the documentation completely, and moved it to the connection
section near the related TCP keepalive settings.  I tried to be really
explicit about what this feature really does, and in which cases it
helps, and what behaviour you can expect without this feature
configured.

Other changes:

1.  Fixed inconsistencies in .conf.sample and GUC descriptions from
Ishii-san's review.
2.  Added comments to pg_check_client_connection() to explain the
conditional logic therein, justifying each conditional branch and the
interactions between this logic and the PqCommReadingMsg and
ClientConnectionLost variables.
3.  I added an ereport(COMMERROR) message for unexpected errnos in
this path, since otherwise an errno would be discarded making
diagnosis impossible.
4.  Avoided calling enable_timeout_after() multiple times, like we do
for statement timeouts.
5.  cfbot told me "Could not determine contrib module type for
connection" on Windows.  I do not understand this stuff, so I just
added the new test module "connection" to @contrib_excludes, like
everyone else apparently does.
6.  pgindented.

That's enough for today, but here are some things I'm still wondering about:

1.  Might need to rethink the name of the GUC.  By including "client"
in it, it sounds a bit like it affects behaviour of the client, rather
than the server.  Also the internal variable
CheckClientConnectionPending looks funny next to ClientConnectionLost
(could be ClientConnectionCheckPending?).
2.  The tests need tightening up.  The thing with the "sleep 3" will
not survive contact with the build farm, and I'm not sure if the SSL
test is as short as it could be.
3.  Needs testing on Windows.

I've now hacked this code around so much that I've added myself as
co-author in the commit message.

Attachments

Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
On Mon, Mar 22, 2021 at 3:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 2.  The tests need tightening up.  The thing with the "sleep 3" will
> not survive contact with the build farm, and I'm not sure if the SSL
> test is as short as it could be.

I don't think the TAP test can be done in the way Sergey had it,
because of multiple races that would probably fail on
slow/overloaded/valgrind machines.  I'll try to think of a better way,
but for now I've removed those tests.

I realised that this should really be testing DoingCommandRead to
decide when it's time to stop checking and re-arming (originally it
was checking PqReadingMessage, which isn't quite right), so I moved a
tiny bit more of the logic into postgres.c, keeping only the actual
connection-check in pqcomm.c.

That leaves the thorny problem Tom mentioned at the top of this
thread[1]: this socket-level approach can be fooled by an 'X' sitting
in the socket buffer, if a client that did PQsendQuery() and then
PQfinish().  Or perhaps even by SSL messages invisible to our protocol
level.  That can surely only be addressed by moving the 'peeking' one
level up the protocol stack.  I've attached a WIP attemp to do that,
on top of the other patch.  Lookahead happens in our receive buffer,
not the kernel's socket buffer.  It detects the simple 'X' case, but
not deeper pipelines of queries (which would seem to require an
unbounded receive buffer and lookahead that actually decodes messages
instead of just looking at the first byte, which seems way over the
top considering the price of infinite RAM these days).  I think it's
probably safe in terms of protocol synchronisation because it consults
PqCommReadingMsg to avoid looking at non-message-initial bytes, but I
could be wrong, it's a first swing at it...  Maybe it's a little
unprincipled to bother with detecting 'X' at all if you can't handle
pipelining in general... I don't know.
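The lookahead rule can be sketched as a tiny standalone function (the parameter names mirror pqcomm.c's PqRecv* variables, but this is only an illustration of the v7 idea, not the patch code itself):

```c
#include <stdbool.h>

/*
 * After a non-blocking fill of the receive buffer, peek at the first
 * unread byte.  If we are at a message boundary and the next message
 * type is 'X' (Terminate), the client is going away gracefully.
 */
static bool
peeked_terminate_message(const char *recv_buffer,
						 int recv_pointer,	/* next unread byte */
						 int recv_length,	/* bytes in buffer */
						 bool reading_msg)	/* mid-message? */
{
	/* Only message-initial bytes can be interpreted as a message type. */
	if (reading_msg)
		return false;

	return recv_pointer < recv_length &&
		recv_buffer[recv_pointer] == 'X';
}
```

The reading_msg guard is what keeps this protocol-safe: a stray 'X' inside a partially read message body must not be mistaken for a Terminate message.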

Today I learned that there have been other threads[2][3] with people
wanting some variant of this feature over the years.

[1] https://www.postgresql.org/message-id/19003.1547420739%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/flat/e09785e00907271728k4bf4d17kac0e7f5ec9316069%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/20130810.113901.1014453099921841746.t-ishii%40sraoss.co.jp

Attachments

Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
On Tue, Mar 23, 2021 at 11:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> That leaves the thorny problem Tom mentioned at the top of this
> thread[1]: this socket-level approach can be fooled by an 'X' sitting
> in the socket buffer, if a client did PQsendQuery() and then
> PQfinish().  Or perhaps even by SSL messages invisible to our protocol
> level.  That can surely only be addressed by moving the 'peeking' one
> level up the protocol stack.  I've attached a WIP attempt to do that,
> on top of the other patch.  Lookahead happens in our receive buffer,
> not the kernel's socket buffer.

After sleeping on this, I'm still not seeing any problem with this
approach.  Sanity checks welcome.  Of course that function should be
called something like pq_peekmessage() -- done.  I think this patch
addresses all critiques leveled at the earlier versions, and I've
tested this with SSL and non-SSL connections, by killing psql while a
query runs, and using a client that calls PQfinish() after starting a
query, and in an earlier version I did yank-the-cable testing, having
set up TCP keepalive to make that last case work.
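For reference, an aggressive keepalive setup of the kind used for the yank-the-cable test might look like this in postgresql.conf (the values are purely illustrative, not recommendations):

```
# detect dead peers quickly (illustrative values)
tcp_keepalives_idle = 1          # seconds of idleness before the first probe
tcp_keepalives_interval = 1      # seconds between probes
tcp_keepalives_count = 5         # lost probes before the connection is dropped
client_connection_check_interval = 1s   # the new GUC from this patch
```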

Attachments

Re: Add client connection check during the execution of the query

From
Zhihong Yu
Date:
Hi,
In the description:

Provide a new optional GUC that can be used to check whether the client
connection has gone away periodically while running very long queries.

I think moving 'periodically' to the vicinity of 'to check' would make the sentence more readable.

+        the operating system, or the
+        <xref linkend="guc-tcp-keepalives-idle"/>,
+        <xref linkend="guc-tcp-keepalives-idle"/> and

The same guc is listed twice. I am not sure if that was intended.

Cheers

On Tue, Mar 23, 2021 at 2:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Mar 23, 2021 at 11:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> That leaves the thorny problem Tom mentioned at the top of this
> thread[1]: this socket-level approach can be fooled by an 'X' sitting
> in the socket buffer, if a client did PQsendQuery() and then
> PQfinish().  Or perhaps even by SSL messages invisible to our protocol
> level.  That can surely only be addressed by moving the 'peeking' one
> level up the protocol stack.  I've attached a WIP attempt to do that,
> on top of the other patch.  Lookahead happens in our receive buffer,
> not the kernel's socket buffer.

After sleeping on this, I'm still not seeing any problem with this
approach.  Sanity checks welcome.  Of course that function should be
called something like pq_peekmessage() -- done.  I think this patch
addresses all critiques leveled at the earlier versions, and I've
tested this with SSL and non-SSL connections, by killing psql while a
query runs, and using a client that calls PQfinish() after starting a
query, and in an earlier version I did yank-the-cable testing, having
set up TCP keepalive to make that last case work.

Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
Going back a couple of years to something Konstantin said:

On Sat, Aug 3, 2019 at 4:40 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> But I wonder why we can not perform just pool with POLLOUT flag and zero
> timeout.
> If OS detected closed connection, it should return POLLHUP, should not it?
> I am not sure if it is more portable or more efficient way - just seems
> to be a little bit more natural way (from my point of view) to check if
> connection is still alive.

... Andres just asked me the same question, when we were discussing
the pq_peekmessage() patch (v7).  I had remembered that POLLHUP didn't
work for this type of thing, from some earlier attempt at something
similar, and indeed on my first attempt to do that here as an
alternative design, it did not work... with TCP sockets (localhost)...
though it did work with Unix sockets.  Gah!  Then he pointed me at
POLLRDHUP (a Linux only extension) and that did seem to work in all
cases I tried.  But without that, this v8 patch doesn't seem to work
on FreeBSD (TCP), and for the rest of the menagerie, who knows?
Here's a sketch patch like that for discussion.

It's frustrating, because this patch is so simple, and doesn't have
v7's problem with pipelined queries.  Hmm.

(I tried to make it work on Windows too by reading the manual, no idea
if that part compiles or works).

Attachments

Re: Add client connection check during the execution of the query

From
Andres Freund
Date:
Hi,

On 2021-03-24 16:08:13 +1300, Thomas Munro wrote:
> ... Andres just asked me the same question, when we were discussing
> the pq_peekmessage() patch (v7).  I had remembered that POLLHUP didn't
> work for this type of thing, from some earlier attempt at something
> similar, and indeed on my first attempt to do that here as an
> alternative design, it did not work... with TCP sockets (localhost)...
> though it did work with Unix sockets.  Gah!  Then he pointed me at
> POLLRDHUP (a Linux only extension) and that did seem to work in all
> cases I tried.  But without that, this v8 patch doesn't seem to work
> on FreeBSD (TCP), and for the rest of the menagerie, who knows?
> Here's a sketch patch like that for discussion.
> It's frustrating, because this patch is so simple, and doesn't have
> v7's problem with pipelined queries.  Hmm.

It is indeed frustrating. I searched a bit for other OSs and POLLRDHUP
and I'm annoyed by responses from various OS folks of "You don't need
that, just read the data upon POLLIN...".


I don't like the feature not handling pipelining etc., nor does working
undetectably only on Linux seem like a great answer. I guess we could
have the template files tell us whether it works, or configure-test it,
but brrr.

I'm mostly joking, and I've not read the thread, but I assume just
sending an empty NoticeResponse or error message has been brought up and
laughed out of the room?


> (I tried to make it work on Windows too by reading the manual, no idea
> if that part compiles or works).

If the patch had tests, cfbot might tell you :)

Greetings,

Andres Freund



Re: Add client connection check during the execution of the query

From
Maksim Milyutin
Date:
Hi Thomas! Thanks for working on this patch.

I have attached a new version with some typo corrections in the doc entry,
removal of redundant `include` entries and trailing whitespace. Also I
added to the doc the case when a single query transaction with a disconnected
client might eventually be committed upon completion in autocommit mode
if it doesn't return any rows (doesn't communicate with the user) before
sending the final commit message. This behavior might be unexpected for
clients and hence IMO it's worth noting.


 > diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
 > index 4c7b1e7bfd..8cf95d09a4 100644
 > --- a/src/backend/libpq/pqcomm.c
 > +++ b/src/backend/libpq/pqcomm.c
 >
 > @@ -919,13 +923,13 @@ socket_set_nonblocking(bool nonblocking)
 >  }
 >
 >  /* --------------------------------
 > - *        pq_recvbuf - load some bytes into the input buffer
 > + *        pq_recvbuf_ext - load some bytes into the input buffer
 >   *
 >   *        returns 0 if OK, EOF if trouble
 >   * --------------------------------
 >   */
 >  static int
 > -pq_recvbuf(void)
 > +pq_recvbuf_ext(bool nonblocking)
 >  {
 >      if (PqRecvPointer > 0)
 >      {
 > @@ -941,8 +945,7 @@ pq_recvbuf(void)
 >              PqRecvLength = PqRecvPointer = 0;
 >      }
 >
 > -    /* Ensure that we're in blocking mode */
 > -    socket_set_nonblocking(false);
 > +    socket_set_nonblocking(nonblocking);
 >
 >      /* Can fill buffer from PqRecvLength and upwards */
 >      for (;;)
 > @@ -954,6 +957,9 @@ pq_recvbuf(void)
 >
 >          if (r < 0)
 >          {
 > +            if (nonblocking && (errno == EAGAIN || errno == EWOULDBLOCK))
 > +                return 0;
 > +
 >              if (errno == EINTR)
 >                  continue;        /* Ok if interrupted */
 >
 > @@ -981,6 +987,13 @@ pq_recvbuf(void)
 >      }
 >  }
 >
 > +static int
 > +pq_recvbuf(void)
 > +{
 > +    return pq_recvbuf_ext(false);
 > +}
 > +
 > +

AFAICS, the above fragment is not related with primary fix directly.


AFAICS, there are the following open items that don't allow us to treat the
current patch as complete:

* Absence of TAP tests emulating some scenarios of client disconnection.
IIRC, you wanted to rewrite the test case posted by Sergey.

* Concerns about portability of the `pq_check_connection()` implementation.
BTW, on Windows postgres with this patch has not been built [1].

* Absence of benchmark results to show lack of noticeable performance 
regression after applying non-zero timeout for checking client liveness.


1. 
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.131820


-- 
Regards,
Maksim Milyutin


Attachments

Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
On Tue, Mar 30, 2021 at 6:25 AM Maksim Milyutin <milyutinma@gmail.com> wrote:
>
> Hi Thomas! Thanks for working on this patch.
>
> I have attached a new version with some typo corrections in the doc entry,
> removal of redundant `include` entries and trailing whitespace. Also I
> added to the doc the case when a single query transaction with a disconnected
> client might eventually be committed upon completion in autocommit mode
> if it doesn't return any rows (doesn't communicate with the user) before
> sending the final commit message. This behavior might be unexpected for
> clients and hence IMO it's worth noting.

Thanks!

>  > + *        pq_recvbuf_ext - load some bytes into the input buffer

> AFAICS, the above fragment is not related with primary fix directly.

Right, that was some stuff left over from the v7 patch that I forgot to remove.

> AFAICS, there are the following open items that don't allow us to treat the
> current patch as complete:
>
> * Absence of TAP tests emulating some scenarios of client disconnection.
> IIRC, you wanted to rewrite the test case posted by Sergey.

Yeah, it's a bit tricky to write a test that won't fail randomly on
slow animals, but I think I see how now (I think you need to send
SELECT pg_sleep(60), then in another connection, wait until that shows
as running in pg_stat_activity, and then kill the first process, and
then wait until the log contains a disconnected-for-this-reason
message).  If we can answer the big question below, I'll invest the
time to do this.

> * Concerns about portability of the `pq_check_connection()` implementation.

Yes.  This is the main problem for me.  I just did a bit of testing to
refresh my memory of what doesn't work.  On FreeBSD, with TCP
keepalives enabled (I tested with tcp_keepalives_idle=1s,
tcp_keepalives_count=5, tcp_keepalives_interval=1s to make it
aggressive), the POLLHUP patch works if you pull out the ethernet
cable.  But... it doesn't work if you kill -9 the remote psql process
(the remote kernel sends a FIN).  So this is a patch that only works
the way we need it to work on Linux.  Other OSes can only tell you
about this condition when you try to read or write.

> BTW, on windows postgres with this patch have not been built [1].

Huh, the manual says they have struct pollfd, and we include
winsock2.h.  But I guess it doesn't work reliably on Windows anyway.

https://docs.microsoft.com/en-us/windows/win32/api/winsock2/ns-winsock2-wsapollfd

> * Absence of benchmark results to show lack of noticeable performance
> regression after applying non-zero timeout for checking client liveness.

True, we should do that, but I'll be very surprised if it's still a
problem: the regression was caused by calling setitimer() for every
query, but we fixed that in commit 09cf1d52.

If we want to ship this in v14 we have to make a decision ASAP:

1.  Ship the POLLHUP patch (like v9) that only works reliably on
Linux.  Maybe disable the feature completely on other OSes?
2.  Ship the patch that tries to read (like v7).  It should work on
all systems, but it can be fooled by pipelined commands (though it can
detect a pipelined 'X').

Personally, I lean towards #2.

A third option was mentioned, but we have no patch:

3.  We could try to *send* a message.  This should work on all
systems.  Unfortunately our protocol doesn't seem to have an existing
"no-op" or "heartbeat" message we could use for this, and sending a
NOTICE would be noisy.  I suspect this would require some protocol
evolution, and there is no time for that in v14.



Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
On Tue, Mar 30, 2021 at 10:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> If we want to ship this in v14 we have to make a decision ASAP:
>
> 1.  Ship the POLLHUP patch (like v9) that only works reliably on
> Linux.  Maybe disable the feature completely on other OSes?
> 2.  Ship the patch that tries to read (like v7).  It should work on
> all systems, but it can be fooled by pipelined commands (though it can
> detect a pipelined 'X').
>
> Personally, I lean towards #2.

I changed my mind.  Let's commit the pleasingly simple Linux-only
feature for now, and extend it to send some kind of no-op message
in a later release.  So this is the version I'd like to go with.
Objections?

I moved the GUC into tcop/postgres.c and tcop/tcopprot.h, because it
directly controls postgres.c's behaviour, not pqcomm.c's.  The latter
only contains the code to perform the check.

Attachments

RE: Add client connection check during the execution of the query

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Thomas Munro <thomas.munro@gmail.com>
> I changed my mind.  Let's commit the pleasingly simple Linux-only feature for
> now, and extend it to send some kind of no-op message in a later release.
> So this is the version I'd like to go with.
> Objections?

+1, as some of our users experienced the problem that the server kept processing (IIRC, a buggy PL/pgSQL procedure that
loops indefinitely) after they killed the client.

TBH, Linux and Windows will be sufficient.  But I'm for providing a good feature on a specific OS first.


(1)
+    rc = poll(&pollfd, 1, 0);
+    if (rc < 0)
+    {
+        elog(COMMERROR, "could not poll socket: %m");
+        return false;

I think it's customary to use ereport() and errcode_for_socket_access().


(2)
pq_check_connection()

Following PostmasterIsAlive(), maybe it's better to name it as pq_connection_is_alive(), pq_client_is_alive(), or
pq_frontend_is_alive(), like the pqcomm.c head comment uses the word frontend?


(3)
 #include <limits.h>
+#include <poll.h>
 #ifndef WIN32
 #include <sys/mman.h>
 #endif

poll.h should be included between #ifndef WIN32 and #endif?


(4)
I think the new GUC works for walsender as well.  If so, how do we explain the relationship between the new GUC and
wal_receiver_timeout and recommend settings for them?


Regards
Takayuki Tsunakawa



Re: Add client connection check during the execution of the query

From
Bharath Rupireddy
Date:
On Thu, Apr 1, 2021 at 11:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Tue, Mar 30, 2021 at 10:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > If we want to ship this in v14 we have to make a decision ASAP:
> >
> > 1.  Ship the POLLHUP patch (like v9) that only works reliably on
> > Linux.  Maybe disable the feature completely on other OSes?
> > 2.  Ship the patch that tries to read (like v7).  It should work on
> > all systems, but it can be fooled by pipelined commands (though it can
> > detect a pipelined 'X').
> >
> > Personally, I lean towards #2.
>
> I changed my mind.  Let's commit the pleasingly simple Linux-only
> > feature for now, and extend it to send some kind of no-op message
> in a later release.  So this is the version I'd like to go with.
> Objections?
>
> I moved the GUC into tcop/postgres.c and tcop/tcopprot.h, because it
> directly controls postgres.c's behaviour, not pqcomm.c's.  The latter
> only contains the code to perform the check.

Here's a minor comment: it would be good to have an extra line
after variable assignments, and before and after function calls/if
clauses, something like

+    pollfd.revents = 0;
+
+    rc = poll(&pollfd, 1, 0);
+
+    if (rc < 0)

And also here
     }
+
+    if (CheckClientConnectionPending)
+    {
+        CheckClientConnectionPending = false;

And
+        }
+    }
+
     if (ClientConnectionLost)

And
+        0, 0, INT_MAX,
+        check_client_connection_check_interval, NULL, NULL
+    },
+
     /* End-of-list marker */

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
On Thu, Apr 1, 2021 at 10:16 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> From: Thomas Munro <thomas.munro@gmail.com>
> > I changed my mind.  Let's commit the pleasingly simple Linux-only feature for
> > now, and extend it to send some kind of no-op message in a later release.
> > So this is the version I'd like to go with.
> > Objections?
>
> +1, as some of our users experienced the problem that the server kept processing (IIRC, a buggy PL/pgSQL procedure
> that loops indefinitely) after they killed the client.

Cool.  Yeah, I have seen a few variants of that, and several other
complaints on the lists.

> TBH, Linux and Windows will be sufficient.  But I'm for providing a good feature on a specific OS first.

I discovered that at least one other OS has adopted POLLRDHUP, so I
changed the language to something slightly more general:

+       <para>
+        This option is currently available only on systems that support the
+        non-standard <symbol>POLLRDHUP</symbol> extension to the
+        <symbol>poll</symbol> system call, including Linux.
+       </para>

It seems like it must be quite easy for an OS to implement, since the
TCP stack surely has the information... it's just an API problem.
Hopefully that means that there aren't OSes that define the macro but
don't work the same way.  (I read somewhere that the POSIX compliance
test suite explicitly tests this half-shutdown case and fails any OS
that returns POLLHUP "prematurely".  Boo.)

> (1)
> +       rc = poll(&pollfd, 1, 0);
> +       if (rc < 0)
> +       {
> +               elog(COMMERROR, "could not poll socket: %m");
> +               return false;
>
> I think it's customary to use ereport() and errcode_for_socket_access().

Fixed.

> (2)
> pq_check_connection()
>
> Following PostmasterIsAlive(), maybe it's better to name it as pq_connection_is_alive(), pq_client_is_alive(), or
> pq_frontend_is_alive(), like the pqcomm.c head comment uses the word frontend?

I think it's OK, because it matches the name of the GUC.  I'm more
concerned about the name of the GUC.  Will we still be happy with this
name if a future releases sends a heartbeat message?  I think that is
still OK, so I'm happy with these names for now, but if someone has a
better name, please speak up very soon.

> (3)
>  #include <limits.h>
> +#include <poll.h>
>  #ifndef WIN32
>  #include <sys/mman.h>
>  #endif
>
> poll.h should be included between #ifndef WIN32 and #endif?

Oops, I forgot to wrap that in #ifdef HAVE_POLL_H while moving stuff
around.  Fixed.

> (4)
> I think the new GUC works for walsender as well.  If so, how do we explain the relationship between the new GUC and
> wal_receiver_timeout and recommend settings for them?

No, it only works while executing a query.  (Is there something in
logical decoding, perhaps, that I have failed to consider?)

PS The "from" headers in emails received from Fujitsu seem to have
the names stripped, somewhere in the tubes of the internet.  I see the
full version when people from Fujitsu quote other people from Fujitsu.
I copied one of those into the commit message, complete with its
magnificent kanji characters (perhaps these are the cause of the
filtering?), and I hope that's OK with you.

Attachments

Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
On Fri, Apr 2, 2021 at 1:36 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Here's a minor comment: it would be good if we have an extra line
> after variable assignments, before and after function calls/if
> clauses, something like

Done in v11.  Thanks.



RE: Add client connection check during the execution of the query

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Thomas Munro <thomas.munro@gmail.com>
> > Following PostmasterIsAlive(), maybe it's better to name it as
> pq_connection_is_alive() pq_client_is_alive(), or pq_frontend_is_alive() like
> the pqcomm.c's head comment uses the word frontend?
> 
> I think it's OK, because it matches the name of the GUC.  I'm more concerned
> about the name of the GUC.  Will we still be happy with this name if a future
> releases sends a heartbeat message?  I think that is still OK, so I'm happy with
> these names for now, but if someone has a better name, please speak up very
> soon.

OK, agreed.


> > (4)
> > I think the new GUC works for walsender as well.  If so, how do we explain
> the relationship between the new GUC and wal_receiver_timeout and
> recommend the settings of them?
> 
> No, it only works while executing a query.  (Is there something in logical
> decoding, perhaps, that I have failed to consider?)

When I saw the following code, I thought the new GUC takes effect at the same time as wal_sender_timeout.  But they
don't become effective simultaneously.  The new GUC is effective before the logical replication starts.  And
wal_sender_timeout takes effect after the physical/logical replication starts.  So, there's no problem.

[PostgresMain]
                    if (am_walsender)
                    {
                        if (!exec_replication_command(query_string))
                            exec_simple_query(query_string);
                    }
                    else
                        exec_simple_query(query_string);


The patch looks committable to me.


> PS The "from" headers in emails received from Fujitsu seem to have the
> names stripped, somewhere in the tubes of the internet.  I see the full version
> when people from Fujitsu quote other people from Fujitsu.
> I copied one of those into the commit message, complete with its magnificent
> kanji characters (perhaps these are the cause of the filtering?), and I hope
> that's OK with you.

Certainly, it seems that only my email address appears in the sender field on the mailer.  I'll check my Outlook
settings.


Regards
Takayuki Tsunakawa



Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
On Fri, Apr 2, 2021 at 6:18 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> The patch looks committable to me.

I checked for performance impact compared to master with pgbench -S,
and didn't see any problem.  I thought more about how to write a
decent race-free test but struggled with the lack of a good way to
control multiple connections from TAP tests and gave up for now.  I
previously tried to write something to help with that, but abandoned
it because it was terrible[1].  It seems a bit strange to me that psql
has to be involved at all, and we don't have a simple way to connect
one or more sockets and speak the protocol from perl.

Pushed!  Thanks to all who contributed.

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2BFkUuDv-bcBns%3DZ_O-V9QGW0nWZNHOkEPxHZWjegRXvw%40mail.gmail.com,
see v2-0006-Add-TAP-test-for-snapshot-too-old.patch



Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
On Sat, Apr 3, 2021 at 9:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Pushed!  Thanks to all who contributed.

Here's something I wanted to park here to look into for the next
cycle:  it turns out that kqueue's EV_EOF flag also has the right
semantics for this.  That leads to the idea of exposing the event via
the WaitEventSet API, which would bring the
client_connection_check_interval feature to 6/10 of our OSes, up from
2/10.  Maybe Windows' FD_CLOSE event could get us up to 7/10, not
sure.

Attachments

Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
On Fri, Apr 30, 2021 at 2:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's something I wanted to park here to look into for the next
> cycle:  it turns out that kqueue's EV_EOF flag also has the right
> semantics for this.  That leads to the idea of exposing the event via
> the WaitEventSet API, which would bring the
> client_connection_check_interval feature to 6/10 of our OSes, up from
> 2/10.  Maybe Windows' FD_CLOSE event could get us up to 7/10, not
> sure.

Rebased.  Added documentation tweak and a check to reject the GUC on
unsupported OSes.

Attachments

Re: Add client connection check during the execution of the query

From
Zhihong Yu
Date:


On Fri, Jun 11, 2021 at 9:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, Apr 30, 2021 at 2:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's something I wanted to park here to look into for the next
> cycle:  it turns out that kqueue's EV_EOF flag also has the right
> semantics for this.  That leads to the idea of exposing the event via
> the WaitEventSet API, which would bring the
> client_connection_check_interval feature to 6/10 of our OSes, up from
> 2/10.  Maybe Windows' FD_CLOSE event could get us up to 7/10, not
> sure.

Rebased.  Added documentation tweak and a check to reject the GUC on
unsupported OSes.
Hi,

-   Assert(count > 0);
+   /* For WL_SOCKET_READ -> WL_SOCKET_CLOSED, no change needed. */
+   if (count == 0)
+       return;
+
    Assert(count <= 2);
 
It seems that the remaining Assert() should say 1 <= count && count <= 2

+#ifdef POLLRDHUP
+           if ((cur_event->events & WL_SOCKET_CLOSED) &&
+               (cur_pollfd->revents & (POLLRDHUP | errflags)))

It seems the last condition above should be written as:

((cur_pollfd->revents & POLLRDHUP) | (cur_pollfd->revents & errflags))

Cheers

Re: Add client connection check during the execution of the query

From
Thomas Munro
Date:
On Sat, Jun 12, 2021 at 8:31 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> +#ifdef POLLRDHUP
> +           if ((cur_event->events & WL_SOCKET_CLOSED) &&
> +               (cur_pollfd->revents & (POLLRDHUP | errflags)))
>
> It seems the last condition above should be written as:
>
> ((cur_pollfd->revents & POLLRDHUP) | (cur_pollfd->revents & errflags))

Hi Zhihong,

Why?  Isn't (A & B) | (A & C) the same as A & (B | C)?



Re: Add client connection check during the execution of the query

From
Zhihong Yu
Date:


On Thu, Oct 7, 2021 at 8:43 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, Jun 12, 2021 at 8:31 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> +#ifdef POLLRDHUP
> +           if ((cur_event->events & WL_SOCKET_CLOSED) &&
> +               (cur_pollfd->revents & (POLLRDHUP | errflags)))
>
> It seems the last condition above should be written as:
>
> ((cur_pollfd->revents & POLLRDHUP) | (cur_pollfd->revents & errflags))

Hi Zhihong,

Why?  Isn't (A & B) | (A & C) the same as A & (B | C)?
Hi,
My former comment was about 4 months old.

The current way as expressed in the patch should be fine.

Cheers

Re: Add client connection check during the execution of the query

From
Maksim Milyutin
Date:
On 12.06.2021 07:24, Thomas Munro wrote:

> On Fri, Apr 30, 2021 at 2:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Here's something I wanted to park here to look into for the next
>> cycle:  it turns out that kqueue's EV_EOF flag also has the right
>> semantics for this.  That leads to the idea of exposing the event via
>> the WaitEventSet API, which would bring the
>> client_connection_check_interval feature to 6/10 of our OSes, up from
>> 2/10.  Maybe Windows' FD_CLOSE event could get us up to 7/10, not
>> sure.
> Rebased.  Added documentation tweak and a check to reject the GUC on
> unsupported OSes.


Good work. I have tested your patch on Linux and FreeBSD on three basic
cases: killing the client, cable breakdown (via manipulations with the firewall),
and silently closing the client connection before completion of a previously
started query in an asynchronous manner. And all works fine.

Some comments from me:

  bool
  pq_check_connection(void)
  {
-#if defined(POLLRDHUP)
+    WaitEvent events[3];


3 looks like a magic constant. We might want to define a constant for
the number of event groups in FeBeWaitSet.


+    ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET, NULL);
+    ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, WL_SOCKET_CLOSED, NULL);
+    rc = WaitEventSetWait(FeBeWaitSet, 0, events, lengthof(events), 0);
+    ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET, MyLatch);

AFAICS, the side effect on
(FeBeWaitSet->events[FeBeWaitSetSocketPos]).events of setting
WL_SOCKET_CLOSED inside pq_check_connection() doesn't have a negative
impact later, does it? That is, every WaitEventSetWait() call site has
to set up its socket events from scratch.


-- 
Regards,
Maksim Milyutin




Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Tue, Oct 12, 2021 at 3:10 AM Maksim Milyutin <milyutinma@gmail.com> wrote:
> Good work. I have tested your patch on Linux and FreeBSD in three basic
> cases: killing the client, breaking the cable (via firewall
> manipulations), and silently closing the client connection
> asynchronously before a previously started query completes. All three
> work fine.

Thanks for the testing and review!

> +    WaitEvent events[3];
>
> 3 looks like a magic constant. We might want to define a constant for
> the number of event groups in FeBeWaitSet.

Yeah.  In fact, we really just need one event.  Fixed.

> +    ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET, NULL);
> +    ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, WL_SOCKET_CLOSED, NULL);
> +    rc = WaitEventSetWait(FeBeWaitSet, 0, events, lengthof(events), 0);
> +    ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET, MyLatch);
>
> AFAICS, the side effect on
> (FeBeWaitSet->events[FeBeWaitSetSocketPos]).events of setting
> WL_SOCKET_CLOSED inside pq_check_connection() doesn't have a negative
> impact later, does it? That is, every WaitEventSetWait() call site has
> to set up its socket events from scratch.

Correct: every site that waits for FeBeWaitSet explicitly modifies the
socket event to say what it's waiting for (and that is often a no-op
internally), so we don't have to worry about restoring the previous
state.  I've added a comment about that.  We should work harder to
restore the latch than my previous patch did, though.  Now I'm using a
PG_FINALLY() block.

I'm hoping to push this soon, after another round of testing, if
there's no more feedback.

There is one more OS that could be added, but I'll leave it out of the
initial commit, pending further investigation.  Since I recently had
to set up a Windows dev VM up to test some other portability stuff, I
had a chance to test the FD_CLOSE hypothesis.  You just have to do
this to enable it:

@@ -2069,6 +2069,7 @@ WaitEventSetCanReportClosed(void)
 {
 #if (defined(WAIT_USE_POLL) && defined(POLLRDHUP)) || \
        defined(WAIT_USE_EPOLL) || \
+       defined(WAIT_USE_WIN32) || \
        defined(WAIT_USE_KQUEUE)
        return true;
 #else

It seems to work!  I'm not sure why it works, or whether we can count
on it, though.  These sentences from the documentation[1] seem to
contradict each other:

"FD_CLOSE being posted after all data is read from a socket. An
application should check for remaining data upon receipt of FD_CLOSE
to avoid any possibility of losing data."

My test says that the first sentence is wrong, but the second doesn't
exactly say that it has reliable POLLRDHUP nature, and I haven't found
one that does, yet.  Perhaps we can convince ourselves of that in
follow-up work.

For the record, I tested two scenarios.  The client was a Unix system,
the server a Windows 10 VM.

1.  Connecting with psql and running "SELECT pg_sleep(60)" and then
killing the psql process.  I'm not surprised that this one worked; it
would work if we tested for WL_SOCKET_READABLE too, but we already
decided that's not good enough.
2.  Connecting from a C program that does PQsendQuery(conn, "SELECT
pg_sleep(60)") and then immediately PQfinish(conn), to test whether
the FD_CLOSE event is reported even though there is an unconsumed 'X'
in the socket.  I wouldn't want to ship the feature on an OS where
this case doesn't get reported, or doesn't get reported sometimes,
because it'd be unreliable and unlike the behaviour on other OSes.
But it worked for me.

[1] https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsaeventselect

Attachments

Re: Add client connection check during the execution of the query

From:
Andres Freund
Date:
Hi,

On 2021-12-11 17:41:34 +1300, Thomas Munro wrote:
> --- a/src/backend/libpq/pqcomm.c
> +++ b/src/backend/libpq/pqcomm.c
> @@ -1959,22 +1959,36 @@ pq_settcpusertimeout(int timeout, Port *port)
>  bool
>  pq_check_connection(void)
>  {
> -#if defined(POLLRDHUP)
> -    /*
> -     * POLLRDHUP is a Linux extension to poll(2) to detect sockets closed by
> -     * the other end.  We don't have a portable way to do that without
> -     * actually trying to read or write data on other systems.  We don't want
> -     * to read because that would be confused by pipelined queries and COPY
> -     * data. Perhaps in future we'll try to write a heartbeat message instead.
> -     */
> -    struct pollfd pollfd;
> +    WaitEvent    event;
>      int            rc;
>  
> -    pollfd.fd = MyProcPort->sock;
> -    pollfd.events = POLLOUT | POLLIN | POLLRDHUP;
> -    pollfd.revents = 0;
> +    /*
> +     * Temporarily ignore the latch, so that we can poll for just the one
> +     * event we care about.
> +     */
> +    ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET, NULL);
>  
> -    rc = poll(&pollfd, 1, 0);
> +    PG_TRY();
> +    {
> +        /*
> +         * It's OK to clobber the socket event to report only the event we're
> +         * interested in without restoring the previous state afterwards,
> +         * because every FeBeWaitSet wait site does the same.
> +         */
> +        ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, WL_SOCKET_CLOSED,
> +                        NULL);
> +        rc = WaitEventSetWait(FeBeWaitSet, 0, &event, 1, 0);
> +    }
> +    PG_FINALLY();
> +    {
> +        /*
> +         * Restore the latch, so we can't leave FeBeWaitSet in a broken state
> +         * that ignores latches.
> +         */
> +        ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET,
> +                        MyLatch);
> +    }
> +    PG_END_TRY();

Yuck. Is there really no better way to deal with this? What kind of errors is
this trying to handle transparently? Afaics this still changes when we'd
e.g. detect postmaster death.

Am I misunderstanding code or comment, or is the comment saying that it's ok
to clobber the wes, but then we actually unclobber it?

Greetings,

Andres Freund



Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Sat, Dec 11, 2021 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:
> Yuck. Is there really no better way to deal with this? What kind of errors is
> this trying to handle transparently? Afaics this still changes when we'd
> e.g. detect postmaster death.

The problem is that WaitEventSetWait() only reports the latch, if it's
set, so I removed it from the set (setting it to NULL), and then undo
that afterwards.  Perhaps we could fix that root problem instead.
That is, we could make it so that latches aren't higher priority in
that way, ie don't hide other events[1].  Then I wouldn't have to
modify the WES here, I could just ignore it in the output event list
(and make sure that's big enough for all possible events, as I had it
in the last version).  I'll think about that.

> Am I misunderstanding code or comment, or is the comment saying that it's ok
> to clobber the wes, but then we actually unclobber it?

It's explaining that it's OK to clobber the *socket*, but that the
*latch* needs to be unclobbered.

[1]  As described in WaitEventSetWait():

         * Check if the latch is set already. If so, leave the loop
         * immediately, avoid blocking again. We don't attempt to report any
         * other events that might also be satisfied.



Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Sat, Dec 11, 2021 at 7:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Dec 11, 2021 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:
> > Yuck. Is there really no better way to deal with this? What kind of errors is
> > this trying to handle transparently? Afaics this still changes when we'd
> > e.g. detect postmaster death.
>
> The problem is that WaitEventSetWait() only reports the latch, if it's
> set, so I removed it from the set (setting it to NULL), and then undo
> that afterwards.  Perhaps we could fix that root problem instead.
> That is, we could make it so that latches aren't higher priority in
> that way, ie don't hide other events[1].  Then I wouldn't have to
> modify the WES here, I could just ignore it in the output event list
> (and make sure that's big enough for all possible events, as I had it
> in the last version).  I'll think about that.

I tried that.  It seems OK, and gets rid of the PG_FINALLY(), which is
nice.  Latches still have higher priority, and still have the fast
return if already set and you asked for only one event, but now if you
ask for nevents > 1 we'll make the syscall too so we'll see the
WL_SOCKET_CLOSED.

It's possible that someone might want the old behaviour one day (fast
return even for nevents > 1, hiding other events), and then we'd have
to come up with some way to request that, which is the type of tricky
policy question that had put me off "fixing" this before.  But I guess
we could cross that bridge if we come to it.  Everywhere else calls
with nevents == 1, so that's hypothetical.

Better?

Attachments

Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Mon, Dec 13, 2021 at 5:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [...] Everywhere else calls
> with nevents == 1, so that's hypothetical.

Erm, I forgot about ExecAppendAsyncEventWait(), so I'd have to update
the commit message on that point, but it's hard to worry too much
about that case -- it's creating and destroying a WES every time.  I'd
rather worry about fixing that problem.



Re: Add client connection check during the execution of the query

From:
Andres Freund
Date:
Hi,

On 2021-12-13 17:51:00 +1300, Thomas Munro wrote:
> On Sat, Dec 11, 2021 at 7:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Sat, Dec 11, 2021 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:
> > > Yuck. Is there really no better way to deal with this? What kind of errors is
> > > this trying to handle transparently? Afaics this still changes when we'd
> > > e.g. detect postmaster death.
> >
> > The problem is that WaitEventSetWait() only reports the latch, if it's
> > set, so I removed it from the set (setting it to NULL), and then undo
> > that afterwards.  Perhaps we could fix that root problem instead.
> > That is, we could make it so that latches aren't higher priority in
> > that way, ie don't hide other events[1].  Then I wouldn't have to
> > modify the WES here, I could just ignore it in the output event list
> > (and make sure that's big enough for all possible events, as I had it
> > in the last version).  I'll think about that.
> 
> I tried that.  It seems OK, and gets rid of the PG_FINALLY(), which is
> nice.  Latches still have higher priority, and still have the fast
> return if already set and you asked for only one event, but now if you
> ask for nevents > 1 we'll make the syscall too so we'll see the
> WL_SOCKET_CLOSED.

Isn't a certain postgres committer that cares a lot about unnecessary syscalls
going to be upset about this one? Even with the nevents > 1 optimization? Yes,
right now there's no other paths that do so, but I don't like the corner this
paints us in.

From a different angle: Why do we need to perform the client connection check
if the latch is set?

Greetings,

Andres Freund



Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Tue, Dec 14, 2021 at 7:53 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-12-13 17:51:00 +1300, Thomas Munro wrote:
> > I tried that.  It seems OK, and gets rid of the PG_FINALLY(), which is
> > nice.  Latches still have higher priority, and still have the fast
> > return if already set and you asked for only one event, but now if you
> > ask for nevents > 1 we'll make the syscall too so we'll see the
> > WL_SOCKET_CLOSED.
>
> Isn't a certain postgres committer that cares a lot about unnecessary syscalls
> going to be upset about this one? Even with the nevents > 1 optimization? Yes,
> right now there's no other paths that do so, but I don't like the corner this
> paints us in.

Well, I was trying to avoid bikeshedding an API change just for a
hypothetical problem we could solve when the time comes (say, after
fixing the more egregious problems with Append's WES usage), but here
goes: we could do something like AddWaitEventToSet(FeBeWaitSet,
WL_LATCH_SET_LOPRIO, ...) that is translated to WL_LATCH_SET
internally but also sets a flag to enable this
no-really-please-poll-all-the-things-if-there-is-space behaviour.

> From a different angle: Why do we need to perform the client connection check
> if the latch is set?

Imagine a parallel message that arrives just as our connection check
CFI routine runs, and sets the latch.  It'd be arbitrary and bizarre
if that caused us to skip the check.  So we have to ignore it, and the
question is just how.  I presented two different ways.  A third way
would be to create an entirely new WES for this use case; nope, that's
either wasteful of an fd or wasteful of system calls for a temporary
WES and likely more PG_TRY() stuff to avoid leaking it.  A fourth
could be to modify the WES like the simple code in v2/v3, but make the
WES code no-throw or pretend it's no-throw, which I didn't seriously
consider (I mean, if, say, WaitForMultipleObjects() returns
WAIT_FAILED and ERRORs then your session is pretty much hosed and your
WES is probably never going to work correctly again, but it still
seems to break basic rules of programming decency and exception safety
to leave the WES sans latch on non-local exit, which is why I added
the PG_FINALLY() that offended you to v3).



Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Tue, Dec 14, 2021 at 11:18 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Dec 14, 2021 at 7:53 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-12-13 17:51:00 +1300, Thomas Munro wrote:
> > > I tried that.  It seems OK, and gets rid of the PG_FINALLY(), which is
> > > nice.  Latches still have higher priority, and still have the fast
> > > return if already set and you asked for only one event, but now if you
> > > ask for nevents > 1 we'll make the syscall too so we'll see the
> > > WL_SOCKET_CLOSED.
> >
> > Isn't a certain postgres committer that cares a lot about unnecessary syscalls
> > going to be upset about this one? Even with the nevents > 1 optimization? Yes,
> > right now there's no other paths that do so, but I don't like the corner this
> > paints us in.
>
> Well, I was trying to avoid bikeshedding an API change just for a
> hypothetical problem we could solve when the time comes (say, after
> fixing the more egregious problems with Append's WES usage), but here
> goes: we could do something like AddWaitEventToSet(FeBeWaitSet,
> WL_LATCH_SET_LOPRIO, ...) that is translated to WL_LATCH_SET
> internally but also sets a flag to enable this
> no-really-please-poll-all-the-things-if-there-is-space behaviour.

Here's one like that.

Attachments

Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Tue, Dec 14, 2021 at 11:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Dec 14, 2021 at 11:18 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Well, I was trying to avoid bikeshedding an API change just for a
> > hypothetical problem we could solve when the time comes (say, after
> > fixing the more egregious problems with Append's WES usage), but here
> > goes: we could do something like AddWaitEventToSet(FeBeWaitSet,
> > WL_LATCH_SET_LOPRIO, ...) that is translated to WL_LATCH_SET
> > internally but also sets a flag to enable this
> > no-really-please-poll-all-the-things-if-there-is-space behaviour.

That API is probably useless for anything else and is just too
complicated for what it's doing here.

I considered another idea we discussed: if we see a latch event, clear
it and try again so that other events can be revealed (rinse and
repeat), but remember if that happens and set the latch at the end.  I
think that still requires PG_FINALLY() if you want to guarantee not to
eat a latch event if WaitEventSetWait() throws.  This may be a
theoretical point because things must be pretty broken if
WaitEventSetWait() is throwing, but I don't like an egregious lack of
exception safety on principle.

So I think I had it better in the beginning: just mute the latch, and
then unmute it at the end in a PG_FINALLY() block.  I'm back to
proposing that short and sweet version, this time with some minor
cleanup.

Attachments

Re: Add client connection check during the execution of the query

From:
Andres Freund
Date:
Hi,

On 2022-01-11 22:59:13 +1300, Thomas Munro wrote:
> I considered another idea we discussed: if we see a latch event, clear
> it and try again so that other events can be revealed (rinse and
> repeat), but remember if that happens and set the latch at the end.  I
> think that still requires PG_FINALLY() if you want to guarantee not to
> eat a latch event if WaitEventSetWait() throws.  This may be a
> theoretical point because things must be pretty broken if
> WaitEventSetWait() is throwing, but I don't like an egregious lack of
> exception safety on principle.

I don't think this is a problem. Not because of WaitEventSetWait() never
throwing, but because it's "just fine" to have reset the latch in that case.

The error will cause control flow to transfer to the next PG_CATCH site. The
point of latches is to avoid racy checks for events (or sleeps with short
timeouts to handle the races). The documented pattern is:

 * for (;;)
 * {
 *       ResetLatch();
 *       if (work to do)
 *           Do Stuff();
 *       WaitLatch();
 * }


Latches only work if there's a very limited amount of things happening between
the if (work_to_do) and the WaitLatch().

It definitely is *NOT* ok to do something like:

for (;;)
{
    ResetLatch()
    if (work_to_do)
      DoStuff();

    PG_TRY():
       something_that_may_throw();
    PG_CATCH():
       something_not_throwing();

    WaitLatch();
}

For one, elog.c-related code might have actually done network IO, during
which the latch very well might be reset! So there's just no way any
remotely reasonable code can rely on preserving latch state across errors.



> I considered another idea we discussed: if we see a latch event, clear
> it and try again so that other events can be revealed (rinse and
> repeat), but remember if that happens and set the latch at the end.

The more I think about it, the less I see why we *ever* need to re-arm the
latch in pq_check_connection() in this approach. pq_check_connection() is only
used from ProcessInterrupts(), and there are plenty of things inside
ProcessInterrupts() that can cause latches to be reset (e.g. parallel message
processing causing log messages to be sent to the client, causing network IO,
which obviously can do a latch reset).

It makes sense to use CFI() in a latch loop, but it would have to be at the
bottom or top, not between if (work_to_do) and WaitLatch().

Greetings,

Andres Freund



Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Fri, Jan 14, 2022 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:
> The more I think about it, the less I see why we *ever* need to re-arm the
> latch in pq_check_connection() in this approach. pq_check_connection() is only
> used from ProcessInterrupts(), and there are plenty of things inside
> ProcessInterrupts() that can cause latches to be reset (e.g. parallel message
> processing causing log messages to be sent to the client, causing network IO,
> which obviously can do a latch reset).

Thanks for the detailed explanation.  I guess I was being overly
cautious and a little myopic, "leave things exactly the way you found
them", so I didn't have to think about any of that.  I see now that
the scenario I was worrying about would be a bug in whatever
latch-wait loop happens to reach this code.  Alright then, here is
just... one... more... patch, this time consuming any latch that gets
in the way and retrying, with no restore.

Attachments

Re: Add client connection check during the execution of the query

From:
Thomas Munro
Date:
On Fri, Jan 14, 2022 at 7:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jan 14, 2022 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:
> > The more I think about it, the less I see why we *ever* need to re-arm the
> > latch in pq_check_connection() in this approach. pq_check_connection() is only
> used from ProcessInterrupts(), and there are plenty of things inside
> > ProcessInterrupts() that can cause latches to be reset (e.g. parallel message
> > processing causing log messages to be sent to the client, causing network IO,
> > which obviously can do a latch reset).
>
> Thanks for the detailed explanation.  I guess I was being overly
> cautious and a little myopic, "leave things exactly the way you found
> them", so I didn't have to think about any of that.  I see now that
> the scenario I was worrying about would be a bug in whatever
> latch-wait loop happens to reach this code.  Alright then, here is
> just... one... more... patch, this time consuming any latch that gets
> in the way and retrying, with no restore.

And pushed.

My excuse for taking so long to get this into the tree is that it was
tedious to retest this thing across so many OSes and determine that it
really does behave reliably for killed processes AND lost
processes/yanked cables/keepalive timeout, even with buffered data.
In the process I learned a bit more about TCP and got POLLRDHUP added
to FreeBSD (not that it matters for PostgreSQL 15, now that we can use
EV_EOF).  As for the FD_CLOSE behaviour I thought I saw on Windows
upthread: it was a mirage, caused by the RST thing.  There may be some
other way to implement this feature on that TCP implementation, but I
don't know what it is.