Обсуждение: backtrace_on_internal_error
We have backtrace support for server errors.  You can activate that 
either by setting backtrace_functions or by explicitly attaching 
errbacktrace() to an ereport() call.
I would like an additional mode that essentially triggers a backtrace 
anytime elog() (for internal errors) is called.  This is not well 
covered by backtrace_functions, because there are many equally-worded 
low-level errors in many functions.  And if you find out where the error 
is, then you need to manually rewrite the elog() to ereport() to attach 
the errbacktrace(), which is annoying.  Having a backtrace automatically 
on every elog() call would be very helpful during development for 
various kinds of common errors from palloc, syscache, node support, etc.
I think the implementation would be very simple, something like
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 6aeb855e49..45d40abe92 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -498,9 +498,11 @@ errfinish(const char *filename, int lineno, const 
char *funcname)
     /* Collect backtrace, if enabled and we didn't already */
     if (!edata->backtrace &&
-       edata->funcname &&
-       backtrace_functions &&
-       matches_backtrace_functions(edata->funcname))
+       ((edata->funcname &&
+         backtrace_functions &&
+         matches_backtrace_functions(edata->funcname)) ||
+        (edata->sqlerrcode == ERRCODE_INTERNAL_ERROR &&
+         backtrace_on_internal_error)))
         set_backtrace(edata, 2);
     /*
where backtrace_on_internal_error would be a GUC variable.
Would others find this useful?  Any other settings or variants in this 
area that should be considered while we're here?
			
		On Tue, Dec 05, 2023 at 11:55:05AM +0100, Peter Eisentraut wrote: > Would others find this useful? Yes. I think I would use this pretty frequently. > Any other settings or variants in this area > that should be considered while we're here? IMO it would be nice to have a way to turn on backtraces for everything, or at least everything above a certain logging level. That would primarily be useful for when you don't know exactly which C function is producing the error. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Dec 5, 2023 at 12:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Tue, Dec 05, 2023 at 11:55:05AM +0100, Peter Eisentraut wrote: > > Would others find this useful? > > Yes. I think I would use this pretty frequently. I think we should consider unconditionally emitting a backtrace when an elog() is hit, instead of requiring a GUC. Or at least any elog() that's not at a DEBUGn level. If the extra output annoys anybody, that means they're regularly hitting an elog(), and it ought to be promoted to ereport(). -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Dec 05, 2023 at 01:16:22PM -0500, Robert Haas wrote: > I think we should consider unconditionally emitting a backtrace when > an elog() is hit, instead of requiring a GUC. Or at least any elog() > that's not at a DEBUGn level. If the extra output annoys anybody, that > means they're regularly hitting an elog(), and it ought to be promoted > to ereport(). Perhaps this should be a GUC that defaults to LOG or ERROR. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Dec 5, 2023 at 1:28 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Tue, Dec 05, 2023 at 01:16:22PM -0500, Robert Haas wrote: > > I think we should consider unconditionally emitting a backtrace when > > an elog() is hit, instead of requiring a GUC. Or at least any elog() > > that's not at a DEBUGn level. If the extra output annoys anybody, that > > means they're regularly hitting an elog(), and it ought to be promoted > > to ereport(). > > Perhaps this should be a GUC that defaults to LOG or ERROR. Why? -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, 5 Dec 2023 at 19:30, Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Dec 5, 2023 at 1:28 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Dec 05, 2023 at 01:16:22PM -0500, Robert Haas wrote: > > > I think we should consider unconditionally emitting a backtrace when > > > an elog() is hit, instead of requiring a GUC. Or at least any elog() > > > that's not at a DEBUGn level. If the extra output annoys anybody, that > > > means they're regularly hitting an elog(), and it ought to be promoted > > > to ereport(). > > > > Perhaps this should be a GUC that defaults to LOG or ERROR. > > Why? I can't speak for Nathan, but my reason would be that I'm not in the habit to attach a debugger to my program to keep track of state progression, but instead use elog() during patch development. I'm not super stoked for getting my developmental elog(LOG)-s spammed with stack traces, so I'd want to set this at least to ERROR, while in production LOG could be fine. Similarly, there are probably extensions that do not use ereport() directly, but instead use elog(), because of reasons like 'not planning on doing translations' and 'elog() is the easier API'. Forcing a change over to ereport because of stack trace spam in logs caused by elog would be quite annoying. Kind regards, Matthias van de Meent Neon (https://neon.tech)
On Tue, Dec 5, 2023 at 1:47 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I can't speak for Nathan, but my reason would be that I'm not in the > habit to attach a debugger to my program to keep track of state > progression, but instead use elog() during patch development. I'm not > super stoked for getting my developmental elog(LOG)-s spammed with > stack traces, so I'd want to set this at least to ERROR, while in > production LOG could be fine. > > Similarly, there are probably extensions that do not use ereport() > directly, but instead use elog(), because of reasons like 'not > planning on doing translations' and 'elog() is the easier API'. > Forcing a change over to ereport because of stack trace spam in logs > caused by elog would be quite annoying. That does seem like a fair complaint. But I also think it would be really good if we had something that could be enabled unconditionally instead of via a GUC... because if it's gated by aa GUC then it often won't be there when you need it. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Dec 05, 2023 at 07:47:25PM +0100, Matthias van de Meent wrote: > On Tue, 5 Dec 2023 at 19:30, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Dec 5, 2023 at 1:28 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> > Perhaps this should be a GUC that defaults to LOG or ERROR. >> >> Why? Sorry, I should've explained why in my message. > I can't speak for Nathan, but my reason would be that I'm not in the > habit to attach a debugger to my program to keep track of state > progression, but instead use elog() during patch development. I'm not > super stoked for getting my developmental elog(LOG)-s spammed with > stack traces, so I'd want to set this at least to ERROR, while in > production LOG could be fine. > > Similarly, there are probably extensions that do not use ereport() > directly, but instead use elog(), because of reasons like 'not > planning on doing translations' and 'elog() is the easier API'. > Forcing a change over to ereport because of stack trace spam in logs > caused by elog would be quite annoying. My main concern was forcing extra logging that users won't have a great way to turn off (except for maybe raising log_min_messages or something). Also, it'd give us a way to slowly ramp up backtraces over a few years without suddenly spamming everyones logs in v17. For example, maybe this GUC defaults to PANIC or FATAL in v17. Once we've had a chance to address any common backtraces there, we could bump it to ERROR or WARNING in v18. And so on. If we just flood everyone's logs immediately, I worry that folks will just turn it off, and we won't get the reports we are hoping for. I know we already have so many GUCs and would like to avoid adding new ones when possible. Maybe this is one that could eventually be retired as we gain confidence that it won't obliterate the log files. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> On Tue, 5 Dec 2023 at 19:30, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think we should consider unconditionally emitting a backtrace when
>>> an elog() is hit, instead of requiring a GUC.
>> Perhaps this should be a GUC that defaults to LOG or ERROR.
> I can't speak for Nathan, but my reason would be that I'm not in the
> habit to attach a debugger to my program to keep track of state
> progression, but instead use elog() during patch development. I'm not
> super stoked for getting my developmental elog(LOG)-s spammed with
> stack traces, so I'd want to set this at least to ERROR, while in
> production LOG could be fine.
Yeah, I would not be happy either with elog(LOG) suddenly getting
10x more verbose.  I think it might be okay to unconditionally do this
when elevel >= ERROR, though.
(At the same time, I don't have a problem with the idea of a GUC
controlling the minimum elevel to cause the report.  Other people
might have other use-cases than I do.)
            regards, tom lane
			
		On 05.12.23 11:55, Peter Eisentraut wrote: > I think the implementation would be very simple, something like > > diff --git a/src/backend/utils/error/elog.c > b/src/backend/utils/error/elog.c > index 6aeb855e49..45d40abe92 100644 > --- a/src/backend/utils/error/elog.c > +++ b/src/backend/utils/error/elog.c > @@ -498,9 +498,11 @@ errfinish(const char *filename, int lineno, const > char *funcname) > > /* Collect backtrace, if enabled and we didn't already */ > if (!edata->backtrace && > - edata->funcname && > - backtrace_functions && > - matches_backtrace_functions(edata->funcname)) > + ((edata->funcname && > + backtrace_functions && > + matches_backtrace_functions(edata->funcname)) || > + (edata->sqlerrcode == ERRCODE_INTERNAL_ERROR && > + backtrace_on_internal_error))) > set_backtrace(edata, 2); > > /* > > where backtrace_on_internal_error would be a GUC variable. It looks like many people found this idea worthwhile. Several people asked for a way to set the minimum log level for this treatment. Something else to note: I wrote the above code to check the error code; it doesn't check whether the original code write elog() or ereport(). There are some internal errors that are written as ereport() now. Others might be changed from time to time; until now there would have been no external effect from this. I think it would be weird to introduce a difference between these forms now. But then, elog() only uses the error code ERRCODE_INTERNAL_ERROR if the error level is >=ERROR. So this already excludes everything below. Do people want a way to distinguish ERROR/FATAL/PANIC? Or do people want a way to enable backtraces for elog(LOG)? This didn't look too interesting to me. (Many current elog(LOG) calls are behind additional guards like TRACE_SORT or LOCK_DEBUG.) If neither of these two are very interesting, then the above code would already appear to do what was asked.
Peter Eisentraut <peter@eisentraut.org> writes:
> Something else to note:  I wrote the above code to check the error code; 
> it doesn't check whether the original code write elog() or ereport(). 
> There are some internal errors that are written as ereport() now. 
> Others might be changed from time to time; until now there would have 
> been no external effect from this.  I think it would be weird to 
> introduce a difference between these forms now.
Yeah, that was bothering me too.  IIRC, elog is already documented
as being *exactly equivalent* to ereport with a minimal set of
options.  I don't think we should break that equivalence.  So I
agree with driving this off the stated-or-imputed errcode rather
than which function is called.
> Do people want a way to distinguish ERROR/FATAL/PANIC?
> Or do people want a way to enable backtraces for elog(LOG)?
Personally I don't see a need for either.
            regards, tom lane
			
		Here is a patch to play with. I also found a related typo. One possible question for discussion is whether the default for this should be off, on, or possibly something like on-in-assert-builds. (Personally, I'm happy to turn it on myself at run time, but everyone has different workflows.)
Вложения
Peter Eisentraut <peter@eisentraut.org> writes:
> Here is a patch to play with.
Didn't read the patch yet, but ...
> One possible question for discussion is whether the default for this 
> should be off, on, or possibly something like on-in-assert-builds. 
> (Personally, I'm happy to turn it on myself at run time, but everyone 
> has different workflows.)
... there was already opinion upthread that this should be on by
default, which I agree with.  You shouldn't be hitting cases like
this commonly (if so, they're bugs to fix or the errcode should be
rethought), and the failure might be pretty hard to reproduce.
I'm not really sold that we even need YA GUC, for that matter.
How about committing the behavior without a GUC, and then
back-filling one if we get pushback?
            regards, tom lane
			
		Hi, On 2023-12-08 10:05:09 -0500, Tom Lane wrote: > Peter Eisentraut <peter@eisentraut.org> writes: > > One possible question for discussion is whether the default for this > > should be off, on, or possibly something like on-in-assert-builds. > > (Personally, I'm happy to turn it on myself at run time, but everyone > > has different workflows.) > > ... there was already opinion upthread that this should be on by > default, which I agree with. You shouldn't be hitting cases like > this commonly (if so, they're bugs to fix or the errcode should be > rethought), and the failure might be pretty hard to reproduce. FWIW, I did some analysis on aggregated logs on a larger number of machines, and it does look like that'd be a measurable increase in log volume. There are a few voluminous internal errors in core, but the bigger issue is extensions. They are typically much less disciplined about assigning error codes than core PG is. I've been wondering about doing some macro hackery to inform elog.c about whether a log message is from core or an extension. It might even be possible to identify the concrete extension, e.g. by updating the contents of PG_MODULE_MAGIC during module loading, and referencing that. Based on the aforementioned data, the most common, in-core, log messages without assigned error codes are: could not accept SSL connection: %m - with zero errno archive command was terminated by signal %d: %s could not send data to client: %m - with zero errno cache lookup failed for type %u archive command failed with exit code %d tuple concurrently updated could not restore file "%s" from archive: %s archive command was terminated by signal %d: %s %s at file "%s" line %u invalid memory alloc request size %zu could not send data to client: %m could not open directory "%s": %m - errno indicating ENOMEM could not write init file out of relcache_callback_list slots online backup was canceled, recovery cannot continue requested timeline %u does not contain minimum recovery point %X/%X on timeline %u There were a lot more in older PG versions, I tried to filter those out. I'm a bit confused about the huge number of "could not accept SSL connection: %m" with a zero errno. I guess we must be clearing errno somehow, but I don't immediately see where. Or perhaps we need to actually look at what SSL_get_error() returns? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-12-08 10:05:09 -0500, Tom Lane wrote:
>> ... there was already opinion upthread that this should be on by
>> default, which I agree with.  You shouldn't be hitting cases like
>> this commonly (if so, they're bugs to fix or the errcode should be
>> rethought), and the failure might be pretty hard to reproduce.
> FWIW, I did some analysis on aggregated logs on a larger number of machines,
> and it does look like that'd be a measurable increase in log volume. There are
> a few voluminous internal errors in core, but the bigger issue is
> extensions. They are typically much less disciplined about assigning error
> codes than core PG is.
Well, I don't see much wrong with making a push to assign error codes
to more calls.  We've had other discussions about doing that.
Certainly these SSL failures are not "internal" errors.
> could not accept SSL connection: %m - with zero errno
> ...
> I'm a bit confused about the huge number of "could not accept SSL connection:
> %m" with a zero errno. I guess we must be clearing errno somehow, but I don't
> immediately see where. Or perhaps we need to actually look at what
> SSL_get_error() returns?
Hmm, don't suppose you have a way to reproduce that?
            regards, tom lane
			
		Hi, On 2023-12-08 13:23:50 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2023-12-08 10:05:09 -0500, Tom Lane wrote: > >> ... there was already opinion upthread that this should be on by > >> default, which I agree with. You shouldn't be hitting cases like > >> this commonly (if so, they're bugs to fix or the errcode should be > >> rethought), and the failure might be pretty hard to reproduce. > > > FWIW, I did some analysis on aggregated logs on a larger number of machines, > > and it does look like that'd be a measurable increase in log volume. There are > > a few voluminous internal errors in core, but the bigger issue is > > extensions. They are typically much less disciplined about assigning error > > codes than core PG is. > > Well, I don't see much wrong with making a push to assign error codes > to more calls. Oh, very much agreed. But I suspect we won't quickly do the same for out-of-core extensions... > Certainly these SSL failures are not "internal" errors. > > > could not accept SSL connection: %m - with zero errno > > ... > > I'm a bit confused about the huge number of "could not accept SSL connection: > > %m" with a zero errno. I guess we must be clearing errno somehow, but I don't > > immediately see where. Or perhaps we need to actually look at what > > SSL_get_error() returns? > > Hmm, don't suppose you have a way to reproduce that? After a bit of trying, yes. I put an abort() into pgtls_open_client(), after initialize_SSL(). Connecting does result in: LOG: could not accept SSL connection: Success Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
>> Hmm, don't suppose you have a way to reproduce that?
> After a bit of trying, yes.  I put an abort() into pgtls_open_client(), after
> initialize_SSL(). Connecting does result in:
> LOG:  could not accept SSL connection: Success
OK.  I can dig into that, unless you're already on it?
            regards, tom lane
			
		Hi, On 2023-12-08 13:46:07 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2023-12-08 13:23:50 -0500, Tom Lane wrote: > >> Hmm, don't suppose you have a way to reproduce that? > > > After a bit of trying, yes. I put an abort() into pgtls_open_client(), after > > initialize_SSL(). Connecting does result in: > > LOG: could not accept SSL connection: Success > > OK. I can dig into that, unless you're already on it? I think I figured it it out. Looks like we need to translate a closed socket (recvfrom() returning 0) to ECONNRESET or such. Greetings, Andres Freund
Hi,
On 2023-12-08 10:51:01 -0800, Andres Freund wrote:
> On 2023-12-08 13:46:07 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
> > >> Hmm, don't suppose you have a way to reproduce that?
> >
> > > After a bit of trying, yes.  I put an abort() into pgtls_open_client(), after
> > > initialize_SSL(). Connecting does result in:
> > > LOG:  could not accept SSL connection: Success
> >
> > OK.  I can dig into that, unless you're already on it?
>
> I think I figured it it out. Looks like we need to translate a closed socket
> (recvfrom() returning 0) to ECONNRESET or such.
I think we might just need to expand the existing branch for EOF:
                if (r < 0)
                    ereport(COMMERROR,
                            (errcode_for_socket_access(),
                             errmsg("could not accept SSL connection: %m")));
                else
                    ereport(COMMERROR,
                            (errcode(ERRCODE_PROTOCOL_VIOLATION),
                             errmsg("could not accept SSL connection: EOF detected")));
The openssl docs say:
 The following return values can occur:
0
    The TLS/SSL handshake was not successful but was shut down controlled and by the specifications of the TLS/SSL
protocol.Call SSL_get_error() with the return value ret to find out the reason.
 
1
    The TLS/SSL handshake was successfully completed, a TLS/SSL connection has been established.
<0
    The TLS/SSL handshake was not successful because a fatal error occurred either at the protocol level or a
connectionfailure occurred. The shutdown was not clean. It can also occur if action is needed to continue the operation
fornonblocking BIOs. Call SSL_get_error() with the return value ret to find out the reason.
 
Which fits with my reproducer - due to the abort the connection was *not* shut
down via SSL in a controlled manner, therefore r < 0.
Hm, oddly enough, there's this tidbit in the SSL_get_error() manpage:
 On an unexpected EOF, versions before OpenSSL 3.0 returned SSL_ERROR_SYSCALL,
 nothing was added to the error stack, and errno was 0. Since OpenSSL 3.0 the
 returned error is SSL_ERROR_SSL with a meaningful error on the error stack.
But I reproduced this with 3.1.
Seems like we should just treat errno == 0 as a reason to emit the "EOF
detected" message?
I wonder if we should treat send/recv returning 0 different from an error
message perspective during an established connection. Right now we produce
  could not receive data from client: Connection reset by peer
because be_tls_read() sets errno to ECONNRESET - despite that not having been
returned by the OS.  But I guess that's a topic for another day.
Greetings,
Andres Freund
			
		Hi, On 2023-12-08 11:33:16 -0800, Andres Freund wrote: > On 2023-12-08 10:51:01 -0800, Andres Freund wrote: > > On 2023-12-08 13:46:07 -0500, Tom Lane wrote: > > > Andres Freund <andres@anarazel.de> writes: > > > > On 2023-12-08 13:23:50 -0500, Tom Lane wrote: > > > >> Hmm, don't suppose you have a way to reproduce that? > > > > > > > After a bit of trying, yes. I put an abort() into pgtls_open_client(), after > > > > initialize_SSL(). Connecting does result in: > > > > LOG: could not accept SSL connection: Success > > > > > > OK. I can dig into that, unless you're already on it? > > [...] > > Seems like we should just treat errno == 0 as a reason to emit the "EOF > detected" message? I thought it'd be nice to have a test for this, particularly because it's not clear that the behaviour is consistent across openssl versions. I couldn't think of a way to do that with psql. But it's just a few lines of perl to gin up an "SSL" startup packet and then close the socket. I couldn't quite figure out when IO::Socket::INET was added, but I think it's likely been long enough, I see references from 1999. This worked easily on linux and freebsd, but not on windows and macos, where it seems to cause ECONNRESET. I thought that explicitly shutting down the socket might help, but that just additionally caused freebsd to fail. Windows uses an older openssl, so it could also be caused by the behaviour differing back then. To deal with that, I changed the test to instead check if "not accept SSL connection: Success" is not logged. I'm not sure that actually would be logged on windows, it does seem to have different strings for errors than other platforms. Greetings, Andres Freund
Вложения
Andres Freund <andres@anarazel.de> writes:
>> I think I figured it it out. Looks like we need to translate a closed socket
>> (recvfrom() returning 0) to ECONNRESET or such.
> Seems like we should just treat errno == 0 as a reason to emit the "EOF
> detected" message?
Agreed.  I think we want to do that after the initial handshake,
too, so maybe as attached.
            regards, tom lane
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 6b8125695a..f0b35f08c6 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -460,6 +460,7 @@ aloop:
      * per-thread error queue following another call to an OpenSSL I/O
      * routine.
      */
+    errno = 0;
     ERR_clear_error();
     r = SSL_accept(port->ssl);
     if (r <= 0)
@@ -496,7 +497,7 @@ aloop:
                                          WAIT_EVENT_SSL_OPEN_SERVER);
                 goto aloop;
             case SSL_ERROR_SYSCALL:
-                if (r < 0)
+                if (r < 0 && errno != 0)
                     ereport(COMMERROR,
                             (errcode_for_socket_access(),
                              errmsg("could not accept SSL connection: %m")));
@@ -732,7 +733,7 @@ be_tls_read(Port *port, void *ptr, size_t len, int *waitfor)
             break;
         case SSL_ERROR_SYSCALL:
             /* leave it to caller to ereport the value of errno */
-            if (n != -1)
+            if (n != -1 || errno == 0)
             {
                 errno = ECONNRESET;
                 n = -1;
			
		Andres Freund <andres@anarazel.de> writes:
> I thought it'd be nice to have a test for this, particularly because it's not
> clear that the behaviour is consistent across openssl versions.
Perhaps, but ...
> To deal with that, I changed the test to instead check if "not accept SSL
> connection: Success" is not logged.
... testing only that much seems entirely not worth the cycles, given the
shape of the patches we both just made.  If we can't rely on "errno != 0"
to ensure we won't get "Success", there is one heck of a lot of other
code that will be broken worse than this.
            regards, tom lane
			
		On 2023-12-08 17:29:45 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > >> I think I figured it it out. Looks like we need to translate a closed socket > >> (recvfrom() returning 0) to ECONNRESET or such. > > > Seems like we should just treat errno == 0 as a reason to emit the "EOF > > detected" message? > > Agreed. I think we want to do that after the initial handshake, > too, so maybe as attached. I was wondering about that too. But if we do so, why not also do it for writes?
Hi, On 2023-12-08 17:35:26 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I thought it'd be nice to have a test for this, particularly because it's not > > clear that the behaviour is consistent across openssl versions. > > Perhaps, but ... > > > To deal with that, I changed the test to instead check if "not accept SSL > > connection: Success" is not logged. > > ... testing only that much seems entirely not worth the cycles, given the > shape of the patches we both just made. If we can't rely on "errno != 0" > to ensure we won't get "Success", there is one heck of a lot of other > code that will be broken worse than this. I was certainly more optimistic about the usefullness of the test before disocvering the above difficulties... I considered accepting both ECONNRESET and the errno = 0 phrasing, but after discovering that the phrasing differs between platforms that seemed less attractive. I guess the test might still provide some value, by ensuring those paths are reached. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-12-08 17:29:45 -0500, Tom Lane wrote:
>> Agreed.  I think we want to do that after the initial handshake,
>> too, so maybe as attached.
> I was wondering about that too. But if we do so, why not also do it for
> writes?
Writes don't act that way, do they?  EOF on a pipe gives you an error,
not silently reporting that zero bytes were written and leaving you
to retry indefinitely.
What I was wondering about was if we needed similar changes on the
libpq side, but it's still about reads not writes.
            regards, tom lane
			
		Hi, On 2023-12-08 19:39:20 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2023-12-08 17:29:45 -0500, Tom Lane wrote: > >> Agreed. I think we want to do that after the initial handshake, > >> too, so maybe as attached. > > > I was wondering about that too. But if we do so, why not also do it for > > writes? > > Writes don't act that way, do they? EOF on a pipe gives you an error, > not silently reporting that zero bytes were written and leaving you > to retry indefinitely. Err, yes. /me looks for a brown paper bag. > What I was wondering about was if we needed similar changes on the > libpq side, but it's still about reads not writes. Perhaps. It's probably harder to reach in practice. But there seems little reason to have a plausible codepath emitting "SSL SYSCALL error: Success", so instead mapping errno == 0 to "EOF detected" pgtls_read() and open_client_SSL() makes sense to me. I wish there were an easy userspace solution to simulating TCP connection failures. I know how to do it with iptables et al, but that's not great for automated testing in PG... Greetings, Andres Freund
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> I was wondering about that too. But if we do so, why not also do it for
>> writes?
> Writes don't act that way, do they?  EOF on a pipe gives you an error,
> not silently reporting that zero bytes were written and leaving you
> to retry indefinitely.
On further reflection I realized that you're right so far as the SSL
code path goes, because SSL_write() can involve physical reads as well
as writes, so at least in principle it's possible that we'd see EOF
reported this way from that function.
Also, the libpq side does need work of the same sort, leading to the
v2-0001 patch attached.
I also realized that we have more or less the same problem at the
caller level, allowing a similar failure for non-SSL connections.
So I'm also proposing 0002 attached.  Your results from aggregated
logs didn't show "could not receive data from client: Success" as a
common case, but since we weren't bothering to zero errno beforehand,
it's likely that such failures would show up with very random errnos.
I took a quick look at the GSSAPI code path too, but it seems not to
add any new assumptions of this sort.
            regards, tom lane
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 6b8125695a..3ec17fb11a 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -460,6 +460,7 @@ aloop:
      * per-thread error queue following another call to an OpenSSL I/O
      * routine.
      */
+    errno = 0;
     ERR_clear_error();
     r = SSL_accept(port->ssl);
     if (r <= 0)
@@ -496,7 +497,7 @@ aloop:
                                          WAIT_EVENT_SSL_OPEN_SERVER);
                 goto aloop;
             case SSL_ERROR_SYSCALL:
-                if (r < 0)
+                if (r < 0 && errno != 0)
                     ereport(COMMERROR,
                             (errcode_for_socket_access(),
                              errmsg("could not accept SSL connection: %m")));
@@ -732,7 +733,7 @@ be_tls_read(Port *port, void *ptr, size_t len, int *waitfor)
             break;
         case SSL_ERROR_SYSCALL:
             /* leave it to caller to ereport the value of errno */
-            if (n != -1)
+            if (n != -1 || errno == 0)
             {
                 errno = ECONNRESET;
                 n = -1;
@@ -791,7 +792,7 @@ be_tls_write(Port *port, void *ptr, size_t len, int *waitfor)
             break;
         case SSL_ERROR_SYSCALL:
             /* leave it to caller to ereport the value of errno */
-            if (n != -1)
+            if (n != -1 || errno == 0)
             {
                 errno = ECONNRESET;
                 n = -1;
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index e669bdbf1d..9ebebea777 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -200,7 +200,7 @@ rloop:
              */
             goto rloop;
         case SSL_ERROR_SYSCALL:
-            if (n < 0)
+            if (n < 0 && SOCK_ERRNO != 0)
             {
                 result_errno = SOCK_ERRNO;
                 if (result_errno == EPIPE ||
@@ -301,7 +301,7 @@ pgtls_write(PGconn *conn, const void *ptr, size_t len)
             n = 0;
             break;
         case SSL_ERROR_SYSCALL:
-            if (n < 0)
+            if (n < 0 && SOCK_ERRNO != 0)
             {
                 result_errno = SOCK_ERRNO;
                 if (result_errno == EPIPE || result_errno == ECONNRESET)
@@ -1510,11 +1510,12 @@ open_client_SSL(PGconn *conn)
                      * was using the system CA pool. For other errors, log
                      * them using the normal SYSCALL logging.
                      */
-                    if (!save_errno && vcode == X509_V_ERR_UNABLE_TO_GET_ISSUER_CERT_LOCALLY &&
+                    if (save_errno == 0 &&
+                        vcode == X509_V_ERR_UNABLE_TO_GET_ISSUER_CERT_LOCALLY &&
                         strcmp(conn->sslrootcert, "system") == 0)
                         libpq_append_conn_error(conn, "SSL error: certificate verify failed: %s",
                                                 X509_verify_cert_error_string(vcode));
-                    else if (r == -1)
+                    else if (r == -1 && save_errno != 0)
                         libpq_append_conn_error(conn, "SSL SYSCALL error: %s",
                                                 SOCK_STRERROR(save_errno, sebuf, sizeof(sebuf)));
                     else
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 2802efc63f..0084a9bf13 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -936,6 +936,8 @@ pq_recvbuf(void)
     {
         int            r;
+        errno = 0;
+
         r = secure_read(MyProcPort, PqRecvBuffer + PqRecvLength,
                         PQ_RECV_BUFFER_SIZE - PqRecvLength);
@@ -948,10 +950,13 @@ pq_recvbuf(void)
              * Careful: an ereport() that tries to write to the client would
              * cause recursion to here, leading to stack overflow and core
              * dump!  This message must go *only* to the postmaster log.
+             *
+             * If errno is zero, assume it's EOF and let the caller complain.
              */
-            ereport(COMMERROR,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive data from client: %m")));
+            if (errno != 0)
+                ereport(COMMERROR,
+                        (errcode_for_socket_access(),
+                         errmsg("could not receive data from client: %m")));
             return EOF;
         }
         if (r == 0)
@@ -1028,6 +1033,8 @@ pq_getbyte_if_available(unsigned char *c)
     /* Put the socket into non-blocking mode */
     socket_set_nonblocking(true);
+    errno = 0;
+
     r = secure_read(MyProcPort, c, 1);
     if (r < 0)
     {
@@ -1044,10 +1051,13 @@ pq_getbyte_if_available(unsigned char *c)
              * Careful: an ereport() that tries to write to the client would
              * cause recursion to here, leading to stack overflow and core
              * dump!  This message must go *only* to the postmaster log.
+             *
+             * If errno is zero, assume it's EOF and let the caller complain.
              */
-            ereport(COMMERROR,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive data from client: %m")));
+            if (errno != 0)
+                ereport(COMMERROR,
+                        (errcode_for_socket_access(),
+                         errmsg("could not receive data from client: %m")));
             r = EOF;
         }
     }
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
index bd72a87bbb..76b647ce1c 100644
--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -211,6 +211,8 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
     int            result_errno = 0;
     char        sebuf[PG_STRERROR_R_BUFLEN];
+    SOCK_ERRNO_SET(0);
+
     n = recv(conn->sock, ptr, len, 0);
     if (n < 0)
@@ -232,6 +234,7 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
             case EPIPE:
             case ECONNRESET:
+            case 0:                /* treat as EOF */
                 libpq_append_conn_error(conn, "server closed the connection unexpectedly\n"
                                         "\tThis probably means the server terminated abnormally\n"
                                         "\tbefore or while processing the request.");
			
		Hi, On 2023-12-09 12:41:30 -0500, Tom Lane wrote: > I wrote: > > Andres Freund <andres@anarazel.de> writes: > >> I was wondering about that too. But if we do so, why not also do it for > >> writes? > > > Writes don't act that way, do they? EOF on a pipe gives you an error, > > not silently reporting that zero bytes were written and leaving you > > to retry indefinitely. > > On further reflection I realized that you're right so far as the SSL > code path goes, because SSL_write() can involve physical reads as well > as writes, so at least in principle it's possible that we'd see EOF > reported this way from that function. Heh. I'll just claim that's what I was thinking about. > Also, the libpq side does need work of the same sort, leading to the > v2-0001 patch attached. I'd perhaps add a comment explaining why it's plausible that we'd see that that in the write case. > I also realized that we have more or less the same problem at the > caller level, allowing a similar failure for non-SSL connections. > So I'm also proposing 0002 attached. Your results from aggregated > logs didn't show "could not receive data from client: Success" as a > common case, but since we weren't bothering to zero errno beforehand, > it's likely that such failures would show up with very random errnos. I did only look at the top ~100 internal errors, after trying to filter out extensions, i.e. the list wasn't exhaustive. There's also very few non-ssl connections. But I just checked, and for that error message, I do see some XX000, but only in older versions. There's ENETUNREACH, ECONNRESET, ETIMEDOUT, EHOSTUNREACH which these days are all handled as non XX000 by errcode_for_socket_access(). > diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c > index bd72a87bbb..76b647ce1c 100644 > --- a/src/interfaces/libpq/fe-secure.c > +++ b/src/interfaces/libpq/fe-secure.c > @@ -211,6 +211,8 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len) > int result_errno = 0; > char sebuf[PG_STRERROR_R_BUFLEN]; > > + SOCK_ERRNO_SET(0); > + > n = recv(conn->sock, ptr, len, 0); > > if (n < 0) > @@ -232,6 +234,7 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len) > > case EPIPE: > case ECONNRESET: > + case 0: /* treat as EOF */ > libpq_append_conn_error(conn, "server closed the connection unexpectedly\n" > "\tThis probably means the server terminated abnormally\n" > "\tbefore or while processing the request."); If we were treating it as EOF, we'd not "queue" an error message, no? Normally recv() returns 0 in that case, so we'd just return, right? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-12-09 12:41:30 -0500, Tom Lane wrote:
>> On further reflection I realized that you're right so far as the SSL
>> code path goes, because SSL_write() can involve physical reads as well
>> as writes, so at least in principle it's possible that we'd see EOF
>> reported this way from that function.
> Heh. I'll just claim that's what I was thinking about.
> I'd perhaps add a comment explaining why it's plausible that we'd see that
> that in the write case.
Done in v3 attached.
>> I also realized that we have more or less the same problem at the
>> caller level, allowing a similar failure for non-SSL connections.
> If we were treating it as EOF, we'd not "queue" an error message, no? Normally
> recv() returns 0 in that case, so we'd just return, right?
Duh, right, so more like this version.
I'm not actually sure that the fe-secure.c part of v3-0002 is
necessary, because it's guarding plain recv(2) which really shouldn't
return -1 without setting errno.  Still, it's a pretty harmless
addition.
            regards, tom lane
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 6b8125695a..22e3dc5a81 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -460,6 +460,7 @@ aloop:
      * per-thread error queue following another call to an OpenSSL I/O
      * routine.
      */
+    errno = 0;
     ERR_clear_error();
     r = SSL_accept(port->ssl);
     if (r <= 0)
@@ -496,7 +497,7 @@ aloop:
                                          WAIT_EVENT_SSL_OPEN_SERVER);
                 goto aloop;
             case SSL_ERROR_SYSCALL:
-                if (r < 0)
+                if (r < 0 && errno != 0)
                     ereport(COMMERROR,
                             (errcode_for_socket_access(),
                              errmsg("could not accept SSL connection: %m")));
@@ -732,7 +733,7 @@ be_tls_read(Port *port, void *ptr, size_t len, int *waitfor)
             break;
         case SSL_ERROR_SYSCALL:
             /* leave it to caller to ereport the value of errno */
-            if (n != -1)
+            if (n != -1 || errno == 0)
             {
                 errno = ECONNRESET;
                 n = -1;
@@ -790,8 +791,14 @@ be_tls_write(Port *port, void *ptr, size_t len, int *waitfor)
             n = -1;
             break;
         case SSL_ERROR_SYSCALL:
-            /* leave it to caller to ereport the value of errno */
-            if (n != -1)
+
+            /*
+             * Leave it to caller to ereport the value of errno.  However, if
+             * errno is still zero then assume it's a read EOF situation, and
+             * report ECONNRESET.  (This seems possible because SSL_write can
+             * also do reads.)
+             */
+            if (n != -1 || errno == 0)
             {
                 errno = ECONNRESET;
                 n = -1;
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index e669bdbf1d..2b221e7d15 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -200,7 +200,7 @@ rloop:
              */
             goto rloop;
         case SSL_ERROR_SYSCALL:
-            if (n < 0)
+            if (n < 0 && SOCK_ERRNO != 0)
             {
                 result_errno = SOCK_ERRNO;
                 if (result_errno == EPIPE ||
@@ -301,7 +301,13 @@ pgtls_write(PGconn *conn, const void *ptr, size_t len)
             n = 0;
             break;
         case SSL_ERROR_SYSCALL:
-            if (n < 0)
+
+            /*
+             * If errno is still zero then assume it's a read EOF situation,
+             * and report EOF.  (This seems possible because SSL_write can
+             * also do reads.)
+             */
+            if (n < 0 && SOCK_ERRNO != 0)
             {
                 result_errno = SOCK_ERRNO;
                 if (result_errno == EPIPE || result_errno == ECONNRESET)
@@ -1510,11 +1516,12 @@ open_client_SSL(PGconn *conn)
                      * was using the system CA pool. For other errors, log
                      * them using the normal SYSCALL logging.
                      */
-                    if (!save_errno && vcode == X509_V_ERR_UNABLE_TO_GET_ISSUER_CERT_LOCALLY &&
+                    if (save_errno == 0 &&
+                        vcode == X509_V_ERR_UNABLE_TO_GET_ISSUER_CERT_LOCALLY &&
                         strcmp(conn->sslrootcert, "system") == 0)
                         libpq_append_conn_error(conn, "SSL error: certificate verify failed: %s",
                                                 X509_verify_cert_error_string(vcode));
-                    else if (r == -1)
+                    else if (r == -1 && save_errno != 0)
                         libpq_append_conn_error(conn, "SSL SYSCALL error: %s",
                                                 SOCK_STRERROR(save_errno, sebuf, sizeof(sebuf)));
                     else
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 2802efc63f..0084a9bf13 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -936,6 +936,8 @@ pq_recvbuf(void)
     {
         int            r;
+        errno = 0;
+
         r = secure_read(MyProcPort, PqRecvBuffer + PqRecvLength,
                         PQ_RECV_BUFFER_SIZE - PqRecvLength);
@@ -948,10 +950,13 @@ pq_recvbuf(void)
              * Careful: an ereport() that tries to write to the client would
              * cause recursion to here, leading to stack overflow and core
              * dump!  This message must go *only* to the postmaster log.
+             *
+             * If errno is zero, assume it's EOF and let the caller complain.
              */
-            ereport(COMMERROR,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive data from client: %m")));
+            if (errno != 0)
+                ereport(COMMERROR,
+                        (errcode_for_socket_access(),
+                         errmsg("could not receive data from client: %m")));
             return EOF;
         }
         if (r == 0)
@@ -1028,6 +1033,8 @@ pq_getbyte_if_available(unsigned char *c)
     /* Put the socket into non-blocking mode */
     socket_set_nonblocking(true);
+    errno = 0;
+
     r = secure_read(MyProcPort, c, 1);
     if (r < 0)
     {
@@ -1044,10 +1051,13 @@ pq_getbyte_if_available(unsigned char *c)
              * Careful: an ereport() that tries to write to the client would
              * cause recursion to here, leading to stack overflow and core
              * dump!  This message must go *only* to the postmaster log.
+             *
+             * If errno is zero, assume it's EOF and let the caller complain.
              */
-            ereport(COMMERROR,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive data from client: %m")));
+            if (errno != 0)
+                ereport(COMMERROR,
+                        (errcode_for_socket_access(),
+                         errmsg("could not receive data from client: %m")));
             r = EOF;
         }
     }
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
index bd72a87bbb..b2430362a9 100644
--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -211,6 +211,8 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
     int            result_errno = 0;
     char        sebuf[PG_STRERROR_R_BUFLEN];
+    SOCK_ERRNO_SET(0);
+
     n = recv(conn->sock, ptr, len, 0);
     if (n < 0)
@@ -237,6 +239,11 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
                                         "\tbefore or while processing the request.");
                 break;
+            case 0:
+                /* If errno didn't get set, treat it as regular EOF */
+                n = 0;
+                break;
+
             default:
                 libpq_append_conn_error(conn, "could not receive data from server: %s",
                                         SOCK_STRERROR(result_errno,
			
		On 08.12.23 19:14, Andres Freund wrote: > FWIW, I did some analysis on aggregated logs on a larger number of machines, > and it does look like that'd be a measurable increase in log volume. There are > a few voluminous internal errors in core, but the bigger issue is > extensions. They are typically much less disciplined about assigning error > codes than core PG is. Good point. Also, during development, I often just put elog(ERROR, "real error later").
On Fri, Dec 8, 2023 at 1:34 PM Andres Freund <andres@anarazel.de> wrote: > Oh, very much agreed. But I suspect we won't quickly do the same for > out-of-core extensions... I feel like this is a problem that will sort itself out just fine. The rules about using ereport() and elog() could probably be better documented than they are, but doing that won't cause people to follow the practices any more rigorously than they have been. However, a change like this just might. If we make this policy change in core, then extension authors will probably get pressure from users to clean up any calls that are emitting excessively verbose log output, and that seems like a good thing. It's impossible to make an omelet without breaking some eggs, but the thing we're talking about here is, IMHO, extremely important. Users are forever hitting weird errors in production that aren't easy to reproduce on test systems, and because most elog() calls are written with the expectation that they won't be hit, they often contain minimal information, which IME makes it really difficult to understand what went wrong. A lot of these are things like - oh, this function expected a valid value of some sort, say a relkind, and it got some nonsense value, say a zero byte. But where did that nonsense value originate? That elog message can't tell you that, but a stack trace will. The last change we made in this area that, at least for me, massively improved debuggability was the change to log the current query string when a backend crashes. That's such a huge help; I can't imagine going back to the old way where you had basically no idea what made things go boom. I think doing something like this can have a similarly positive impact. It is going to take some work - from us and from extension authors - to tidy things up so that it doesn't produce a bunch of unwanted output, but the payoff will be the ability to actually find and fix the bugs instead of just saying to a customer "hey, sucks that you hit a bug, let us know if you find a reproducer." -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> The last change we made in this area that, at least for me, massively
> improved debuggability was the change to log the current query string
> when a backend crashes. That's such a huge help; I can't imagine going
> back to the old way where you had basically no idea what made things
> go boom. I think doing something like this can have a similarly
> positive impact. It is going to take some work - from us and from
> extension authors - to tidy things up so that it doesn't produce a
> bunch of unwanted output, but the payoff will be the ability to
> actually find and fix the bugs instead of just saying to a customer
> "hey, sucks that you hit a bug, let us know if you find a reproducer."
IMO, we aren't really going to get a massive payoff from this with
the current backtrace output; it's just not detailed enough.  It's
better than nothing certainly, but to really move the goalposts
we'd need something approaching gdb's "bt full" output.  I wonder
if it'd be sane to try to auto-invoke gdb.  That's just blue sky
for now, though.  In the meantime, I agree with the proposal as it
stands (that is, auto-backtrace on any XX000 error).  We'll soon find
out whether it's useless, or needs more detail to be really helpful,
or is just right as it is.  Once we have some practical experience
with it, we can course-correct as needed.
            regards, tom lane
			
		On Tue, Dec 19, 2023 at 11:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > IMO, we aren't really going to get a massive payoff from this with > the current backtrace output; it's just not detailed enough. It's > better than nothing certainly, but to really move the goalposts > we'd need something approaching gdb's "bt full" output. I wonder > if it'd be sane to try to auto-invoke gdb. That's just blue sky > for now, though. In the meantime, I agree with the proposal as it > stands (that is, auto-backtrace on any XX000 error). We'll soon find > out whether it's useless, or needs more detail to be really helpful, > or is just right as it is. Once we have some practical experience > with it, we can course-correct as needed. That all seems fair to me. I'm more optimistic than you are about getting something useful out of the current backtrace output, but (1) I could be wrong, (2) I'd still like to have something better, and (3) improving the backtrace output is a separate project from including backtraces more frequently. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, 10 Dec 2023 at 00:14, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm not actually sure that the fe-secure.c part of v3-0002 is > necessary, because it's guarding plain recv(2) which really shouldn't > return -1 without setting errno. Still, it's a pretty harmless > addition. v3-0002 seems have a very similar goal to v23-0002 in my non-blocking and encrypted cancel request patchset here: https://www.postgresql.org/message-id/flat/CAGECzQQirExbHe6uLa4C-sP%3DwTR1jazR_wgCWd4177QE-%3DVFDw%40mail.gmail.com#0b6cc1897c6d507cef49a3f3797181aa Would it be possible to merge that on instead or at least use the same approach as that one (i.e. return -2 on EOF). Otherwise I have to update that patchset to match the new style of communicating that there is an EOF. Also I personally think a separate return value for EOF clearer when reading the code than checking for errno being 0.
On Tue, 19 Dec 2023 at 17:12, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 8, 2023 at 1:34 PM Andres Freund <andres@anarazel.de> wrote: > > Oh, very much agreed. But I suspect we won't quickly do the same for > > out-of-core extensions... > > I feel like this is a problem that will sort itself out just fine. The > rules about using ereport() and elog() could probably be better > documented than they are, but doing that won't cause people to follow > the practices any more rigorously than they have been. However, a > change like this just might. If we make this policy change in core, > then extension authors will probably get pressure from users to clean > up any calls that are emitting excessively verbose log output, and > that seems like a good thing. As an extension author I wanted to make clear that Andres his concern is definitely not theoretical. Citus (as well as most other extensions me and our team at Microsoft maintains) use ereport without an error code very often. And while we try to use elog actually only for internal errors, there's definitely places where we haven't. We've had "adding error codes to all our errors" on our backlog for years though. I'm guessing this is mostly a combination of it being a boring task, it being a lot of work, and the impact not being particularly huge (i.e. now users can check error codes for all our errors wohoo!). If ereport without an errorcode would suddenly cause a log flood in the next postgres release then suddenly the impact of adding error codes would increase drastically. And as Robert said we'd basically be forced to adopt the pattern. Which I agree isn't necessarily a bad thing. But I'm not sure that smaller extensions that are not maintained by a team that's paid to do so would be happy about this change. Also I think we'd even change our extension to add errror codes to all ereport calls if the stack traces are useful enough, because then the impact of adding error codes suddenly increases a lot. So I think having a way for extensions to opt-in/opt-out of this change for their extension would be very much appreciated by those authors.
Hi, On 2023-12-20 10:08:42 +0100, Jelte Fennema-Nio wrote: > On Sun, 10 Dec 2023 at 00:14, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I'm not actually sure that the fe-secure.c part of v3-0002 is > > necessary, because it's guarding plain recv(2) which really shouldn't > > return -1 without setting errno. Still, it's a pretty harmless > > addition. > > v3-0002 seems have a very similar goal to v23-0002 in my non-blocking > and encrypted cancel request patchset here: > https://www.postgresql.org/message-id/flat/CAGECzQQirExbHe6uLa4C-sP%3DwTR1jazR_wgCWd4177QE-%3DVFDw%40mail.gmail.com#0b6cc1897c6d507cef49a3f3797181aa > > Would it be possible to merge that on instead or at least use the same > approach as that one (i.e. return -2 on EOF). Otherwise I have to > update that patchset to match the new style of communicating that > there is an EOF. Also I personally think a separate return value for > EOF clearer when reading the code than checking for errno being 0. Tom's patch imo doesn't really introduce anything really new - we already deal with EOF that way in other places. And it's how the standard APIs deal with the issue. I'd not design it this way on a green field, but given the current state Tom's approach seems more sensible... Greetings, Andres Freund
On Wed, 20 Dec 2023 at 11:30, Andres Freund <andres@anarazel.de> wrote: > Tom's patch imo doesn't really introduce anything really new - we already deal > with EOF that way in other places. And it's how the standard APIs deal with > the issue. I'd not design it this way on a green field, but given the current > state Tom's approach seems more sensible... Okay, while I think it's a really non-obvious way of checking for EOF, I agree that staying consistent with this non-obvious existing pattern is the best choice here. I also just noticed that the proposed patch is already merged. So I just updated my patchset to use it. For my patchset this does introduce a slight problem though: I'm using pqReadData, instead of pqsecure_read directly. And pqReadData has other reasons for failing without setting an errno than just EOF. Specifically allocation failures or passing an invalid socket. I see three options to handle this: 1. Don't change pqReadData and simply consider all these EOF too from PQcancelPoll 2. Set errno to something non-zero for these non EOF failures in pqReadData 3. Return -2 from pqReadData on EOF Any preference on those? For now I went for option 1.
On 19.12.23 17:29, Tom Lane wrote: > IMO, we aren't really going to get a massive payoff from this with > the current backtrace output; it's just not detailed enough. It's > better than nothing certainly, but to really move the goalposts > we'd need something approaching gdb's "bt full" output. I wonder > if it'd be sane to try to auto-invoke gdb. That's just blue sky > for now, though. In the meantime, I agree with the proposal as it > stands (that is, auto-backtrace on any XX000 error). We'll soon find > out whether it's useless, or needs more detail to be really helpful, > or is just right as it is. Once we have some practical experience > with it, we can course-correct as needed. Based on this, I have committed my original patch.