Discussion: Add non-blocking version of PQcancel
The existing PQcancel API uses blocking IO. This makes PQcancel impossible to use in an event-loop-based codebase without blocking the event loop until the call returns.

This patch adds a new cancellation API to libpq, called PQcancelConnectionStart, which can be used to send cancellation requests in a non-blocking fashion. To do this it internally uses the regular PGconn connection establishment. A downside of this approach is that PQcancelConnectionStart cannot be safely called from a signal handler. Luckily, this should be fine for most usages of this API, since most code that uses an event loop handles signals in that event loop as well (as opposed to calling functions from the signal handler directly). There are also a few advantages to this approach:

1. No need to add and maintain a second non-blocking connection establishment codepath.
2. Cancel connections benefit automatically from any improvements made to the normal connection establishment codepath. Examples of things it currently gets for free are TLS support and keepalive settings.

This patch also includes a test for this new API (and also for the already existing cancellation APIs). The test can easily be run like this:

cd src/test/modules/libpq_pipeline
make && ./libpq_pipeline cancel

NOTE: I have not tested this with GSS for the moment. My expectation is that using this new API with a GSS connection will result in a CONNECTION_BAD status when calling PQcancelStatus. The reason for this is that GSS reads will also need to communicate back that an EOF was found, just like I've done for TLS reads and unencrypted reads, since in the case of a cancel connection an EOF is actually expected and should not be treated as an error.
Attachments
Hi,

On 2022-01-12 15:22:18 +0000, Jelte Fennema wrote:
> This patch also includes a test for this new API (and also the already
> existing cancellation APIs). The test can be easily run like this:
>
> cd src/test/modules/libpq_pipeline
> make && ./libpq_pipeline cancel

Right now the tests fail to build on Windows with:

[15:45:10.518] src/interfaces/libpq/libpqdll.def : fatal error LNK1121: duplicate ordinal number '189' [c:\cirrus\libpq.vcxproj]

and tests fail on other platforms. See https://cirrus-ci.com/build/4791821363576832

> NOTE: I have not tested this with GSS for the moment. My expectation is
> that using this new API with a GSS connection will result in a
> CONNECTION_BAD status when calling PQcancelStatus. The reason for this
> is that GSS reads will also need to communicate back that an EOF was
> found, just like I've done for TLS reads and unencrypted reads. Since in
> case of a cancel connection an EOF is actually expected, and should not
> be treated as an error.

The failures do not seem related to this.

Greetings,

Andres Freund
Attached is an updated patch which I believe fixes Windows and the other test failures.
At least on my machine make check-world passes now when compiled with --enable-tap-tests
I also included a second patch which adds some basic documentation for the libpq tests.
Attachments
On Thu, 2022-01-13 at 14:51 +0000, Jelte Fennema wrote: > Attached is an updated patch which I believe fixes windows and the other test failures. > At least on my machine make check-world passes now when compiled with --enable-tap-tests > > I also included a second patch which adds some basic documentation for the libpq tests. This is not a full review by any means, but here are my thoughts so far: > NOTE: I have not tested this with GSS for the moment. My expectation is > that using this new API with a GSS connection will result in a > CONNECTION_BAD status when calling PQcancelStatus. The reason for this > is that GSS reads will also need to communicate back that an EOF was > found, just like I've done for TLS reads and unencrypted reads. For what it's worth, I did a smoke test with a Kerberos environment via ./libpq_pipeline cancel '... gssencmode=require' and the tests claim to pass. > 2. Cancel connections benefit automatically from any improvements made > to the normal connection establishment codepath. Examples of things > that it currently gets for free currently are TLS support and > keepalive settings. This seems like a big change compared to PQcancel(); one that's not really hinted at elsewhere. Having the async version of an API open up a completely different code path with new features is pretty surprising to me. And does the backend actually handle cancel requests via TLS (or GSS)? It didn't look that way from a quick scan, but I may have missed something. > @@ -1555,6 +1665,7 @@ print_test_list(void) > printf("singlerow\n"); > printf("transaction\n"); > printf("uniqviol\n"); > + printf("cancel\n"); > } This should probably go near the top; it looks like the existing list is alphabetized. The new cancel tests don't print any feedback. It'd be nice to get the same sort of output as the other tests. 
> /* issue a cancel request */ > extern int PQcancel(PGcancel *cancel, char *errbuf, int errbufsize); > +extern PGcancelConn * PQcancelConnectStart(PGconn *conn); > +extern PGcancelConn * PQcancelConnect(PGconn *conn); > +extern PostgresPollingStatusType PQcancelConnectPoll(PGcancelConn * cancelConn); > +extern ConnStatusType PQcancelStatus(const PGcancelConn * cancelConn); > +extern int PQcancelSocket(const PGcancelConn * cancelConn); > +extern char *PQcancelErrorMessage(const PGcancelConn * cancelConn); > +extern void PQcancelFinish(PGcancelConn * cancelConn); That's a lot of new entry points, most of which don't do anything except call their twin after a pointer cast. How painful would it be to just use the existing APIs as-is, and error out when calling unsupported functions if conn->cancelRequest is true? --Jacob
Jacob Champion <pchampion@vmware.com> writes: > On Thu, 2022-01-13 at 14:51 +0000, Jelte Fennema wrote: >> 2. Cancel connections benefit automatically from any improvements made >> to the normal connection establishment codepath. Examples of things >> that it currently gets for free currently are TLS support and >> keepalive settings. > This seems like a big change compared to PQcancel(); one that's not > really hinted at elsewhere. Having the async version of an API open up > a completely different code path with new features is pretty surprising > to me. Well, the patch lacks any user-facing doco at all, so a-fortiori this point is not covered. I trust the plan was to write docs later. I kind of feel that this patch is going in the wrong direction. I do see the need for a version of PQcancel that can encrypt the transmitted cancel request (and yes, that should work on the backend side; see recursion in ProcessStartupPacket). I have not seen requests for a non-blocking version, and this doesn't surprise me. I feel that the whole non-blocking aspect of libpq probably belongs to another era when people didn't trust threads. So what I'd do is make a version that just takes a PGconn, sends the cancel request, and returns success or failure; never mind the non-blocking aspect. One possible long-run advantage of this is that it might be possible to "sync" the cancel request so that we know, or at least can find out afterwards, exactly which query got cancelled; something that's fundamentally impossible if the cancel function works from a clone data structure that is disconnected from the current connection state. (Note that it probably makes sense to make a clone PGconn to pass to fe-connect.c, internally to this function. I just don't want to expose that to the app.) regards, tom lane
Hi, On 2022-03-24 17:41:53 -0400, Tom Lane wrote: > I kind of feel that this patch is going in the wrong direction. > I do see the need for a version of PQcancel that can encrypt the > transmitted cancel request (and yes, that should work on the backend > side; see recursion in ProcessStartupPacket). I have not seen > requests for a non-blocking version, and this doesn't surprise me. > I feel that the whole non-blocking aspect of libpq probably belongs > to another era when people didn't trust threads. That's not a whole lot of fun if you think of cases like postgres_fdw (or citus as in Jelte's case), which run inside the backend. Even with just a single postgres_fdw, we don't really want to end up in an uninterruptible PQcancel() that doesn't even react to pg_terminate_backend(). Even if using threads weren't an issue, I don't really buy the premise - most networking code has moved *away* from using dedicated threads for each connection. It just doesn't scale. Leaving PQcancel aside, we use the non-blocking libpq stuff widely ourselves. I think walreceiver, isolationtester, pgbench etc would be *much* harder to get working equally well if there was just blocking calls. If anything, we're getting to the point where purely blocking functionality shouldn't be added anymore. Greetings, Andres Freund
On Thu, Mar 24, 2022 at 6:49 PM Andres Freund <andres@anarazel.de> wrote: > That's not a whole lot of fun if you think of cases like postgres_fdw (or > citus as in Jelte's case), which run inside the backend. Even with just a > single postgres_fdw, we don't really want to end up in an uninterruptible > PQcancel() that doesn't even react to pg_terminate_backend(). > > Even if using threads weren't an issue, I don't really buy the premise - most > networking code has moved *away* from using dedicated threads for each > connection. It just doesn't scale. > > Leaving PQcancel aside, we use the non-blocking libpq stuff widely > ourselves. I think walreceiver, isolationtester, pgbench etc would be *much* > harder to get working equally well if there was just blocking calls. If > anything, we're getting to the point where purely blocking functionality > shouldn't be added anymore. +1. I think having a non-blocking version of PQcancel() available is a great idea, and I've wanted it myself. See commit ae9bfc5d65123aaa0d1cca9988037489760bdeae. That said, I don't think that this particular patch is going in the right direction. I think Jacob's comment upthread is right on point: "This seems like a big change compared to PQcancel(); one that's not really hinted at elsewhere. Having the async version of an API open up a completely different code path with new features is pretty surprising to me." It seems to me that we want to end up with similar code paths for PQcancel() and the non-blocking version of cancel. We could get there in two ways. One way would be to implement the non-blocking functionality in a manner that matches exactly what PQcancel() does now. I imagine that the existing code from PQcancel() would move, with some amount of change, into a new set of non-blocking APIs. Perhaps PQcancel() would then be rewritten to use those new APIs instead of hand-rolling the same logic. 
The other possible approach would be to first change the blocking version of PQcancel() to use the regular connection code instead of its own idiosyncratic logic, and then as a second step, extend it with non-blocking interfaces that use the regular non-blocking connection code. With either of these approaches, we end up with the functionality working similarly in the blocking and non-blocking code paths.

Leaving the question of approach aside, I think it's fairly clear that this patch cannot be seriously considered for v15. One problem is the lack of user-facing documentation, but there's other stuff that just doesn't look sufficiently well-considered. For example, it updates the comment for pqsecure_read() to say "Returns -1 in case of failures, except in the case of clean connection closure then it returns -2." But that function calls any of three different implementation functions depending on the situation and the patch only updates one of them. And it updates that function to return -2 when the error is ECONNRESET, which seems to fly in the face of the comment's idea that this is the "clean connection closure" case. I think it's probably a bad sign that this patch is tinkering with logic in this sort of low-level function anyway. pqReadData() is a really general function that manages to work with non-blocking I/O already, so why does non-blocking query cancellation need to change its return values, or whether or not it drops data in certain cases?

I'm also skeptical about the fact that we end up with a whole bunch of new functions that are just wrappers around existing functions. That's not a scalable approach. Every function that we have for a PGconn will eventually need a variant that deals with a PGcancelConn. That seems kind of pointless, especially considering that a PGcancelConn is *exactly* a PGconn in disguise.
If we decide to pursue the approach of using the existing infrastructure for PGconn objects to handle query cancellation, we ought to manipulate them using the same functions we currently do, with some kind of mode or flag or switch or something that you can use to turn a regular PGconn into something that cancels a query. Maybe you create the PGconn and call PQsprinkleMagicCancelDust() on it, and then you just proceed using the existing functions, or something like that. Then, not only do the existing functions not need query-cancel analogues, but any new functions we add in the future don't either. I'll set the target version for this patch to 16. I hope work continues. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > That said, I don't think that this particular patch is going in the > right direction. I think Jacob's comment upthread is right on point: > "This seems like a big change compared to PQcancel(); one that's not > really hinted at elsewhere. Having the async version of an API open up > a completely different code path with new features is pretty > surprising to me." It seems to me that we want to end up with similar > code paths for PQcancel() and the non-blocking version of cancel. We > could get there in two ways. One way would be to implement the > non-blocking functionality in a manner that matches exactly what > PQcancel() does now. I imagine that the existing code from PQcancel() > would move, with some amount of change, into a new set of non-blocking > APIs. Perhaps PQcancel() would then be rewritten to use those new APIs > instead of hand-rolling the same logic. The other possible approach > would be to first change the blocking version of PQcancel() to use the > regular connection code instead of its own idiosyncratic logic, and > then as a second step, extend it with non-blocking interfaces that use > the regular non-blocking connection code. With either of these > approaches, we end up with the functionality working similarly in the > blocking and non-blocking code paths. I think you misunderstand where the real pain point is. The reason that PQcancel's functionality is so limited has little to do with blocking vs non-blocking, and everything to do with the fact that it's designed to be safe to call from a SIGINT handler. That makes it quite impractical to invoke OpenSSL, and probably our GSS code as well. If we want support for all connection-time options then we have to make a new function that does not promise signal safety. 
I'm prepared to yield on the question of whether we should provide a non-blocking version, though I still say that (a) an easier-to-call, one-step blocking alternative would be good too, and (b) it should not be designed around the assumption that there's a completely independent state object being used to perform the cancel. Even in the non-blocking case, callers should only deal with the original PGconn. > Leaving the question of approach aside, I think it's fairly clear that > this patch cannot be seriously considered for v15. Yeah, I don't think it's anywhere near fully baked yet. On the other hand, we do have a couple of weeks left. regards, tom lane
On Fri, Mar 25, 2022 at 2:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think you misunderstand where the real pain point is. The reason > that PQcancel's functionality is so limited has little to do with > blocking vs non-blocking, and everything to do with the fact that > it's designed to be safe to call from a SIGINT handler. That makes > it quite impractical to invoke OpenSSL, and probably our GSS code > as well. If we want support for all connection-time options then > we have to make a new function that does not promise signal safety. Well, that's a fair point, but it's somewhat orthogonal to the one I'm making, which is that a non-blocking version of function X might be expected to share code or at least functionality with X itself. Having something that is named in a way that implies asynchrony without other differences but which is actually different in other important ways is no good. > I'm prepared to yield on the question of whether we should provide > a non-blocking version, though I still say that (a) an easier-to-call, > one-step blocking alternative would be good too, and (b) it should > not be designed around the assumption that there's a completely > independent state object being used to perform the cancel. Even in > the non-blocking case, callers should only deal with the original > PGconn. Well, this sounds like you're arguing for the first of the two approaches I thought would be acceptable, rather than the second. > > Leaving the question of approach aside, I think it's fairly clear that > > this patch cannot be seriously considered for v15. > > Yeah, I don't think it's anywhere near fully baked yet. On the other > hand, we do have a couple of weeks left. We do? -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > Well, that's a fair point, but it's somewhat orthogonal to the one I'm > making, which is that a non-blocking version of function X might be > expected to share code or at least functionality with X itself. Having > something that is named in a way that implies asynchrony without other > differences but which is actually different in other important ways is > no good. Yeah. We need to choose a name for these new function(s) that is sufficiently different from "PQcancel" that people won't expect them to behave exactly the same as that does. I lack any good ideas about that, how about you? >> Yeah, I don't think it's anywhere near fully baked yet. On the other >> hand, we do have a couple of weeks left. > We do? Um, you did read the psql-release discussion about setting the feature freeze deadline, no? regards, tom lane
Thanks for all the feedback everyone. I'll try to send a new patch later this week that includes user-facing docs and a simplified API. For now, a few responses:

> Yeah. We need to choose a name for these new function(s) that is
> sufficiently different from "PQcancel" that people won't expect them
> to behave exactly the same as that does. I lack any good ideas about
> that, how about you?

So I guess the names I proposed were not great, since everyone seems to be falling over them. But I'd like to make my intention clear with the current naming. After this patch there would be four different APIs for starting a cancellation:

1. PQrequestCancel: deprecated+old, not signal-safe function for requesting query cancellation, only uses a specific set of connection options
2. PQcancel: Cancel queries in a signal-safe way; to be signal-safe it only uses a limited set of connection options
3. PQcancelConnect: Cancel queries in a non-signal-safe way that uses all connection options
4. PQcancelConnectStart: Cancel queries in a non-signal-safe and non-blocking way that uses all connection options

So the idea was that you should not look at PQcancelConnectStart as the non-blocking version of PQcancel, but as the non-blocking version of PQcancelConnect. I'll try to think of some different names too, but IMHO these names could be acceptable when their differences are addressed sufficiently in the documentation. One other approach to naming that comes to mind now is repurposing PQrequestCancel:

1. PQrequestCancel: Cancel queries in a non-signal-safe way that uses all connection options
2. PQrequestCancelStart: Cancel queries in a non-signal-safe and non-blocking way that uses all connection options
3. PQcancel: Cancel queries in a signal-safe way; to be signal-safe it only uses a limited set of connection options

> I think it's probably a bad sign that this patch is tinkering with
> logic in this sort of low-level function anyway. pqReadData() is a
> really general function that manages to work with non-blocking I/O
> already, so why does non-blocking query cancellation need to change
> its return values, or whether or not it drops data in certain cases?

The reason for this low-level change is that the cancellation part of the Postgres protocol follows a different, much more simplistic design than all the other parts. The client does not expect a response message back from the server after sending the cancellation request. The expectation is that the server signals completion by closing the connection, i.e. sending EOF. For all other parts of the protocol, connection termination should be initiated client-side by sending a Terminate message. So the server closing the connection (sending EOF) is always unexpected and is thus currently considered an error by pqReadData. But since this is not the case for the cancellation protocol, the return value is changed to -2 in case of EOF, to make it possible to distinguish between an EOF and an actual error.

> And it updates that function to return -2 when the error is
> ECONNRESET, which seems to fly in the face of the comment's idea that
> this is the "clean connection closure" case.

The diff sadly does not include the very relevant comment right above these lines. Pasting the whole case statement here to clear up this confusion:

case SSL_ERROR_ZERO_RETURN:

    /*
     * Per OpenSSL documentation, this error code is only returned for
     * a clean connection closure, so we should not report it as a
     * server crash.
     */
    appendPQExpBufferStr(&conn->errorMessage,
                         libpq_gettext("SSL connection has been closed unexpectedly\n"));
    result_errno = ECONNRESET;
    n = -2;
    break;

> For example, it updates the comment for pqsecure_read() to say
> "Returns -1 in case of failures, except in the case of clean
> connection closure then it returns -2." But that function calls any
> of three different implementation functions depending on the
> situation and the patch only updates one of them.
That comment indeed does not describe what is happening correctly, and I'll try to make it clearer. The main reason it is incorrect is that receiving an EOF is handled in different places depending on the encryption method:

1. Unencrypted TCP: EOF is not returned as an error by pqsecure_read, but detected by pqReadData (see comments related to definitelyEOF)
2. OpenSSL: EOF is returned as an error by pqsecure_read (see the copied case statement above)
3. GSS: When writing the patch I was not sure how EOF handling worked here, but given that the tests passed for Jacob on GSS, I'm guessing it works the same as unencrypted TCP.
I attached a new version of this patch, which does three main things:

1. Change the PQrequestCancel implementation to use the regular connection establishment code, to support all connection options including encryption.
2. Add PQrequestCancelStart, which is a thread-safe and non-blocking version of this new PQrequestCancel implementation.
3. Add PQconnectComplete, which completes a connection started by PQrequestCancelStart. This is useful if you want a thread-safe but blocking cancel (without needing signal safety).

This change un-deprecates PQrequestCancel, since now there's actually an advantage to using it over PQcancel. It also includes user-facing documentation for all these functions.

As an API design change from the previous version, PQrequestCancelStart now returns a regular PGconn for the cancel connection. @Tom Lane, regarding this:

> Even in the non-blocking case, callers should only deal with the original PGconn.

This would by definition result in non-thread-safe code (afaict), so I refrained from doing this. The blocking version doesn't expose a PGconn at all, but the non-blocking one now returns a new PGconn.

There are two more changes that I at least want to do before considering this patch mergeable:

1. Go over all the functions that can be called with a PGconn, but should not be called with a cancellation PGconn, and error out or exit early.
2. Copy over the SockAddr from the original connection and always connect to the same socket. I believe with the current code the cancellation could end up at the wrong server if there are multiple hosts listed in the connection string.

And there's a third item that I would like to do as a bonus:

3. Actually use the non-blocking API for the postgres_fdw code to implement a timeout, which would allow this comment to be removed:

/*
 * Issue cancel request. Unfortunately, there's no good way to limit the
 * amount of time that we might block inside PQgetCancel().
 */

So a next version of this patch can be expected sometime later this week, but any feedback on the current version would be appreciated, because these 3 changes won't change the overall design much.

Jelte
Attachments
Note that the patch is still variously failing in cirrus. https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3511 You may already know that it's possible to trigger the cirrus ci tasks using a github branch. See src/tools/ci/README.
Attached is the latest version of this patch, which I think is now in a state in which it could be merged. The changes are:

1. Don't do host and address discovery for cancel connections. It now reuses raddr and whichhost from the original connection. This makes sure the cancel always goes to the right server, even when DNS records change or another server would be chosen now in case of connection strings containing multiple hosts.
2. Fix the Windows CI failure. This is done both by using the thread-safe code in the dblink cancellation code, and by not erroring out a cancellation connection on Windows in case of any errors. This last one is to work around the issue described in this thread: https://www.postgresql.org/message-id/flat/90b34057-4176-7bb0-0dbb-9822a5f6425b%40greiz-reinsdorf.de

I also went over most of the functions that take a PGconn, to see if they needed extra checks to guard against being executed on a cancel connection. So far all seemed fine: either they should be okay to execute against a cancellation connection, or they failed already anyway because a cancellation connection never reaches the CONNECTION_OK state. So I didn't add any checks specifically for cancel connections. I'll do this again next week with a fresh head, to see if I haven't missed any cases.

I'll try to find some time early next week to implement non-blocking cancellation usage in postgres_fdw, i.e. the bonus task I mentioned in my previous email. But I don't think it's necessary to have that implemented before merging.
Attachments
Hereby what I consider the final version of this patch. I don't have any changes planned myself (except for ones that come up during review). Things that changed since the previous iteration: 1. postgres_fdw now uses the non-blocking cancellation API (including test). 2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD.
Attachments
Resending with a problematic email removed from CC... On Mon, Apr 04, 2022 at 03:21:54PM +0000, Jelte Fennema wrote: > 2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD. Apparently there's still an occasional issue. https://cirrus-ci.com/task/6613309985128448 result 232/352 (error): ERROR: duplicate key value violates unique constraint "ppln_uniqviol_pkey" DETAIL: Key (id)=(116) already exists. This shows that the issue is pretty rare: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/38/3511 -- Justin
(resent because it was blocked from the mailing-list due to inclusion of a blocked email address in the To line)

From: Andres Freund <andres@anarazel.de>
> On 2022-04-04 15:21:54 +0000, Jelte Fennema wrote:
> > 2. Added some extra sleeps to the cancellation test, to remove random
> > failures on FreeBSD.
>
> That's extremely extremely rarely the solution to address test reliability
> issues. It'll fail when running test under valgrind etc.
>
> Why do you need sleeps / can you find another way to make the test reliable?

The problem they are solving is racy behaviour between sending the query and sending the cancellation: if the cancellation is handled before the query is started, then the query doesn't get cancelled. To solve this problem I used the sleeps to wait a bit before sending the cancellation request. When I wrote this, I couldn't think of a better way to do it than with sleeps, but I didn't like it either (and I still don't).

These emails made me start to think again about other ways of solving the problem, and I think I've found another solution (see attached patch). The way I solve it now is by using another connection to check the state of the first one.

Jelte
Attachments
On Fri, Jun 24, 2022 at 07:36:16PM -0500, Justin Pryzby wrote: > Resending with a problematic email removed from CC... > > On Mon, Apr 04, 2022 at 03:21:54PM +0000, Jelte Fennema wrote: > > 2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD. > > Apparently there's still an occasional issue. > https://cirrus-ci.com/task/6613309985128448 I think that failure is actually not related to this patch. There are probably others, but I noticed because it also affected one of my patches, which changes nothing relevant. https://cirrus-ci.com/task/5904044051922944
On 2022-Jun-27, Justin Pryzby wrote: > On Fri, Jun 24, 2022 at 07:36:16PM -0500, Justin Pryzby wrote: > > Apparently there's still an occasional issue. > > https://cirrus-ci.com/task/6613309985128448 > > I think that failure is actually not related to this patch. Yeah, it's not -- Kyotaro diagnosed it as a problem in libpq's pipeline mode. I hope to push his fix soon, but there are nearby problems that I haven't been able to track down a good fix for. I'm looking into the whole. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Jelte Fennema <Jelte.Fennema@microsoft.com> writes: > [ non-blocking PQcancel ] I pushed the 0001 patch (libpq_pipeline documentation) with a bit of further wordsmithing. As for 0002, I'm not sure that's anywhere near ready. I doubt it's a great idea to un-deprecate PQrequestCancel with a major change in its behavior. If there is anybody out there still using it, they're not likely to appreciate that. Let's leave that alone and pick some other name. I'm also finding the entire design of PQrequestCancelStart etc to be horribly confusing --- it's not *bad* necessarily, but the chosen function names are seriously misleading. PQrequestCancelStart doesn't actually "start" anything, so the apparent parallel with PQconnectStart is just wrong. It's also fairly unclear what the state of a cancel PQconn is after the request cycle is completed, and whether you can re-use it (especially after a failed request), and whether you have to dispose of it separately. On the whole it feels like a mistake to have two separate kinds of PGconn with fundamentally different behaviors and yet no distinction in the API. I think I'd recommend having a separate struct type (which might internally contain little more than a pointer to a cloned PGconn), and provide only a limited set of operations on it. Seems like create, start/continue cancel request, destroy, and fetch error message ought to be enough. I don't see a reason why we need to support all of libpq's inquiry operations on such objects --- for instance, if you want to know which host is involved, you could perfectly well query the parent PGconn. Nor do I want to run around and add code to every single libpq entry point to make it reject cancel PGconns if it can't support them, but we'd have to do so if there's just one struct type. I'm not seeing the use-case for PQconnectComplete. If you want a non-blocking cancel request, why would you then use a blocking operation to complete the request? 
Seems like it'd be better to have just a monolithic cancel function for those who don't need non-blocking.

This change:

--- a/src/interfaces/libpq/libpq-fe.h
+++ b/src/interfaces/libpq/libpq-fe.h
@@ -59,12 +59,15 @@ typedef enum
 {
 	CONNECTION_OK,
 	CONNECTION_BAD,
+	CONNECTION_CANCEL_FINISHED,
 	/* Non-blocking mode only below here */

is an absolute non-starter: it breaks ABI for every libpq client, even ones that aren't using this facility. Why do we need a new ConnStatusType value anyway? Seems like PostgresPollingStatusType covers what we need: once you reach PGRES_POLLING_OK, the cancel request is done.

The test case is still not very bulletproof on slow machines, as it seems to be assuming that 30 seconds == forever. It would be all right to use $PostgreSQL::Test::Utils::timeout_default, but I'm not sure that that's easily retrievable by C code. Maybe make the TAP test pass it in with another optional switch to libpq_pipeline? Alternatively, we could teach libpq_pipeline to do getenv("PG_TEST_TIMEOUT_DEFAULT") with a fallback to 180, but that feels like it might be overly familiar with the innards of Utils.pm.

regards, tom lane
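To illustrate the ABI hazard described above (with toy enums, not libpq's actual definitions): an application compiled against the old header bakes the numeric values of the enum constants into its binary, so inserting a constant in the middle silently shifts every later value.

```c
/* Toy illustration only -- not libpq's real ConnStatusType. This is the
 * layout an already-compiled client would have baked into its binary: */
typedef enum
{
	OLD_CONNECTION_OK,		/* = 0 */
	OLD_CONNECTION_BAD,		/* = 1 */
	OLD_CONNECTION_STARTED	/* = 2 */
} OldConnStatusType;

/* The same enum after inserting a value in the middle, as the rejected
 * hunk did. Every constant after the insertion point shifts by one: */
typedef enum
{
	NEW_CONNECTION_OK,				/* = 0 */
	NEW_CONNECTION_BAD,				/* = 1 */
	NEW_CONNECTION_CANCEL_FINISHED,	/* = 2, reusing STARTED's old value */
	NEW_CONNECTION_STARTED			/* = 3, moved */
} NewConnStatusType;
```

An old binary that tests `status == 2` (meaning STARTED) would, when run against the new library, silently misread CANCEL_FINISHED as STARTED. Appending new values at the end of the enum avoids the shift, which is why the patch later moved CONNECTION_STARTING there.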
Thanks for all the feedback. I attached a new patch that I think addresses all of it. Below is some additional info.

> On the whole it feels like a mistake to have two separate kinds of
> PGconn with fundamentally different behaviors and yet no distinction
> in the API. I think I'd recommend having a separate struct type
> (which might internally contain little more than a pointer to a
> cloned PGconn), and provide only a limited set of operations on it.

In my first version of this patch, this is exactly what I did. But then I got this feedback from Jacob, so I changed it to reusing PGconn:

> > /* issue a cancel request */
> > extern int PQcancel(PGcancel *cancel, char *errbuf, int errbufsize);
> > +extern PGcancelConn * PQcancelConnectStart(PGconn *conn);
> > +extern PGcancelConn * PQcancelConnect(PGconn *conn);
> > +extern PostgresPollingStatusType PQcancelConnectPoll(PGcancelConn * cancelConn);
> > +extern ConnStatusType PQcancelStatus(const PGcancelConn * cancelConn);
> > +extern int PQcancelSocket(const PGcancelConn * cancelConn);
> > +extern char *PQcancelErrorMessage(const PGcancelConn * cancelConn);
> > +extern void PQcancelFinish(PGcancelConn * cancelConn);
>
> That's a lot of new entry points, most of which don't do anything
> except call their twin after a pointer cast. How painful would it be to
> just use the existing APIs as-is, and error out when calling
> unsupported functions if conn->cancelRequest is true?

I changed it back to use PGcancelConn as per your suggestion, and I agree that the API got better because of it.

> + CONNECTION_CANCEL_FINISHED,
> /* Non-blocking mode only below here */
>
> is an absolute non-starter: it breaks ABI for every libpq client,
> even ones that aren't using this facility.

I removed this now. The main reason for adding it was to make clear that no queries can be sent over this connection, unlike what CONNECTION_OK normally indicates.
I don't think this is as useful anymore now that this patch has a dedicated PQcancelStatus function.

NOTE: The CONNECTION_STARTING ConnStatusType is still necessary, but to keep ABI compatibility I moved it to the end of the enum.

> Alternatively, we could teach libpq_pipeline
> to do getenv("PG_TEST_TIMEOUT_DEFAULT") with a fallback to 180,
> but that feels like it might be overly familiar with the innards
> of Utils.pm.

I went with this approach, because this environment variable was already used in two places other than Utils.pm:
- contrib/test_decoding/sql/twophase.sql
- src/test/isolation/isolationtester.c

So one more place seemed quite harmless.

P.S. I noticed a logical conflict between this patch and my libpq load balancing patch: this patch depends on the connhost array being constructed exactly the same way on the second invocation of connectOptions2, and the libpq load balancing patch breaks that assumption. I'm making a mental (and public) note that whichever of these patches gets merged last should address this issue.
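The getenv() approach adopted above can be sketched as a small helper (hypothetical helper name; the actual patch wires this into libpq_pipeline's option handling):

```c
#include <stdlib.h>

/* Hypothetical helper: read the test timeout from the environment,
 * falling back to the same 180-second default that Utils.pm uses. */
int
test_timeout_default(void)
{
	const char *env = getenv("PG_TEST_TIMEOUT_DEFAULT");

	if (env == NULL || *env == '\0')
		return 180;
	return atoi(env);
}
```

This keeps libpq_pipeline usable standalone (outside the TAP harness) while still honoring the slow-machine timeout override that the buildfarm/CI environments set.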
Attachments
On 10/5/22 06:23, Jelte Fennema wrote: > In my first version of this patch, this is exactly what I did. But then > I got this feedback from Jacob, so I changed it to reusing PGconn: > >> [snip] > > I changed it back to use PGcancelConn as per your suggestion and I > agree that the API got better because of it. Sorry for the whiplash! Is the latest attachment the correct version? I don't see any difference between the latest 0001 and the previous version's 0002 -- it has no references to PG_TEST_TIMEOUT_DEFAULT, PGcancelConn, etc. Thanks, --Jacob
Ugh, it indeed seems like I somehow messed up sending the new patch. Here's the correct one.
Attachments
> On 15 Nov 2022, at 12:38, Jelte Fennema <Jelte.Fennema@microsoft.com> wrote: > Here's the correct one.<0001-Add-non-blocking-version-of-PQcancel.patch> This version of the patch no longer applies, a rebased version is needed. -- Daniel Gustafsson https://vmware.com/
> This version of the patch no longer applies, a rebased version is needed.

Attached is a patch that applies cleanly again and is also changed to use the recently introduced libpq_append_conn_error. I also attached a patch that runs pgindent after the introduction of libpq_append_conn_error; I noticed that this hadn't been done when I tried to run pgindent on my own changes.
Attachments
Is there anything that is currently blocking this patch? I'd quite like it to get into PG16. Especially since I recently ran into another use case that I would want to use this patch for: adding an async cancel function to Python's psycopg3 library. This library exposes both a Connection class and an AsyncConnection class (using Python's asyncio feature). But one downside of the AsyncConnection type is that it doesn't have a cancel method. I ran into this while changing the PgBouncer tests to use Python. The cancellation tests were the only tests that required me to use a ThreadPoolExecutor instead of simply being able to use async-await style programming: https://github.com/pgbouncer/pgbouncer/blob/master/test/test_cancel.py#LL9C17-L9C17
After discussing this patch privately with Andres I created a new version of this patch. The main changes are:

1. Build on top of a refactor to addrinfo handling I had done for another patch of mine (libpq load balancing). This allows creation of a fake addrinfo list, which made it possible to remove lots of special cases for cancel requests from PQconnectPoll.
2. Move the -2 return value of pqReadData to a separate commit.
3. Move usage of the new cancel APIs to a separate commit.
4. Move most of the logic that's specific to cancel requests to cancel-related functions, e.g. PQcancelPoll now does more than simply forwarding to PQconnectPoll.
5. Copy over the connhost data from the original connection, instead of assuming that it will be rebuilt identically in the cancel connection. The main reason for this is that when/if the load balancing patch gets merged, it won't necessarily be rebuilt identically anymore.
Attachments
Another small update. Mostly some trivial cleanup in the comments/docs/code. But also change patch 0005 to call PQcancelFinish in more error cases.
Attachments
This looks like it needs a rebase.

=== Applying patches on top of PostgreSQL commit ID 71a75626d5271f2bcdbdc43b8c13065c4634fd9f ===
=== applying patch ./v11-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch
patching file src/interfaces/libpq/fe-auth-scram.c
patching file src/interfaces/libpq/fe-auth.c
patching file src/interfaces/libpq/fe-connect.c
Hunk #35 FAILED at 3216.
Hunk #36 succeeded at 3732 (offset 27 lines).
Hunk #37 succeeded at 3782 (offset 27 lines).
Hunk #38 succeeded at 3795 (offset 27 lines).
Hunk #39 succeeded at 7175 (offset 27 lines).
1 out of 39 hunks FAILED -- saving rejects to file src/interfaces/libpq/fe-connect.c.rej
patching file src/interfaces/libpq/fe-exec.c
patching file src/interfaces/libpq/fe-lobj.c
patching file src/interfaces/libpq/fe-misc.c
patching file src/interfaces/libpq/fe-protocol3.c
patching file src/interfaces/libpq/fe-secure-common.c
patching file src/interfaces/libpq/fe-secure-gssapi.c
Hunk #3 succeeded at 590 (offset 2 lines).
patching file src/interfaces/libpq/fe-secure-openssl.c
Hunk #3 succeeded at 415 (offset 5 lines).
Hunk #4 succeeded at 967 (offset 5 lines).
Hunk #5 succeeded at 993 (offset 5 lines).
Hunk #6 succeeded at 1037 (offset 5 lines).
Hunk #7 succeeded at 1089 (offset 5 lines).
Hunk #8 succeeded at 1122 (offset 5 lines).
Hunk #9 succeeded at 1140 (offset 5 lines).
Hunk #10 succeeded at 1239 (offset 5 lines).
Hunk #11 succeeded at 1250 (offset 5 lines).
Hunk #12 succeeded at 1265 (offset 5 lines).
Hunk #13 succeeded at 1278 (offset 5 lines).
Hunk #14 succeeded at 1315 (offset 5 lines).
Hunk #15 succeeded at 1326 (offset 5 lines).
Hunk #16 succeeded at 1383 (offset 5 lines).
Hunk #17 succeeded at 1399 (offset 5 lines).
Hunk #18 succeeded at 1452 (offset 5 lines).
Hunk #19 succeeded at 1494 (offset 5 lines).
patching file src/interfaces/libpq/fe-secure.c
patching file src/interfaces/libpq/libpq-int.h
On Tue, 28 Feb 2023 at 15:59, Gregory Stark <stark@postgresql.org> wrote: > > This looks like it needs a rebase. So I'm updating the patch to Waiting on Author
On Wed, 1 Mar 2023 at 20:09, Greg S <stark.cfm@gmail.com> wrote: > > On Tue, 28 Feb 2023 at 15:59, Gregory Stark <stark@postgresql.org> wrote: > > > > This looks like it needs a rebase. done
Attachments
On Wed, 1 Mar 2023 at 14:48, Jelte Fennema <postgres@jeltef.nl> wrote: > > > This looks like it needs a rebase. > > done Great. Please update the CF entry to Needs Review or Ready for Committer as appropriate :) -- Gregory Stark As Commitfest Manager
On Wed, 1 Mar 2023 at 20:51, Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote:
> Great. Please update the CF entry to Needs Review or Ready for
> Committer as appropriate :)

I realised I rebased a slightly outdated version of my branch (thanks to git's --force-with-lease flag). Attached is the newest version rebased (only patch 0004 changed slightly). And I updated the CF entry to Ready for Committer now.
Attachments
Updated wording in the docs slightly. On Wed, 1 Mar 2023 at 21:00, Jelte Fennema <postgres@jeltef.nl> wrote: > > On Wed, 1 Mar 2023 at 20:51, Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote: > > Great. Please update the CF entry to Needs Review or Ready for > > Committer as appropriate :) > > I realised I rebased a slightly outdated version of my branch (thanks > to git its --force-with-lease flag). Attached is the newest version > rebased (only patch 0004 changed slightly). > > And I updated the CF entry to Ready for Committer now.
Attachments
It looks like this needs a big rebase in fe-auth.c, fe-auth-scram.c and fe-connect.c. Every hunk is failing, which perhaps means the code you're patching has been moved or refactored?
"Gregory Stark (as CFM)" <stark.cfm@gmail.com> writes: > It looks like this needs a big rebase in fea-uth.c fe-auth-scram.c and > fe-connect.c. Every hunk is failing which perhaps means the code > you're patching has been moved or refactored? The cfbot is giving up after v14-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch fails, but that's been superseded (at least in part) by b6dfee28f. regards, tom lane
On Tue, 14 Mar 2023 at 13:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "Gregory Stark (as CFM)" <stark.cfm@gmail.com> writes: > > It looks like this needs a big rebase in fea-uth.c fe-auth-scram.c and > > fe-connect.c. Every hunk is failing which perhaps means the code > > you're patching has been moved or refactored? > > The cfbot is giving up after > v14-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch fails, > but that's been superseded (at least in part) by b6dfee28f. Ah, same with Jelte Fennema's patch for load balancing in libpq. -- greg
The rebase was indeed trivial (git handled everything automatically), because my first patch was doing a superset of the changes that were committed in b6dfee28f. Attached are the new patches. On Tue, 14 Mar 2023 at 19:04, Greg Stark <stark@mit.edu> wrote: > > On Tue, 14 Mar 2023 at 13:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > "Gregory Stark (as CFM)" <stark.cfm@gmail.com> writes: > > > It looks like this needs a big rebase in fea-uth.c fe-auth-scram.c and > > > fe-connect.c. Every hunk is failing which perhaps means the code > > > you're patching has been moved or refactored? > > > > The cfbot is giving up after > > v14-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch fails, > > but that's been superseded (at least in part) by b6dfee28f. > > Ah, same with Jelte Fennema's patch for load balancing in libpq. > > -- > greg
Attachments
Rebased after conflicts with bfc9497ece01c7c45437bc36387cb1ebe346f4d2 Also included the fix for feedback from Daniel on patch 2, which he had shared in the load balancing thread. On Wed, 15 Mar 2023 at 09:49, Jelte Fennema <postgres@jeltef.nl> wrote: > > The rebase was indeed trivial (git handled everything automatically), > because my first patch was doing a superset of the changes that were > committed in b6dfee28f. Attached are the new patches. > > On Tue, 14 Mar 2023 at 19:04, Greg Stark <stark@mit.edu> wrote: > > > > On Tue, 14 Mar 2023 at 13:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > "Gregory Stark (as CFM)" <stark.cfm@gmail.com> writes: > > > > It looks like this needs a big rebase in fea-uth.c fe-auth-scram.c and > > > > fe-connect.c. Every hunk is failing which perhaps means the code > > > > you're patching has been moved or refactored? > > > > > > The cfbot is giving up after > > > v14-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch fails, > > > but that's been superseded (at least in part) by b6dfee28f. > > > > Ah, same with Jelte Fennema's patch for load balancing in libpq. > > > > -- > > greg
Attachments
Hi Jelte,

I had a look into your patchset (v16), did a quick review and played a bit with the feature.

Patch 2 is missing the documentation about PQcancelSocket() and contains a few typos; please find attached a (fixup) patch to correct these.

--- a/src/interfaces/libpq/libpq-fe.h
+++ b/src/interfaces/libpq/libpq-fe.h
@@ -321,16 +328,28 @@ extern PostgresPollingStatusType PQresetPoll(PGconn *conn);
 /* Synchronous (blocking) */
 extern void PQreset(PGconn *conn);

+/* issue a cancel request */
+extern PGcancelConn * PQcancelSend(PGconn *conn);
[...]

Maybe I'm missing something, but this function above seems a bit strange. Namely, I wonder why it returns a PGcancelConn and what's the point of requiring the user to call PQcancelStatus() to see if something went wrong. Maybe it could be defined as:

    int PQcancelSend(PGcancelConn *cancelConn);

where the return value would be the status? And the user would only need to call PQcancelErrorMessage() in case of error. This would leave only one single way to create a PGcancelConn value (i.e. PQcancelConn()), which seems less confusing to me.

Jelte Fennema wrote:
> Especially since I ran into another use case that I would want to use
> this patch for recently: Adding an async cancel function to Python
> it's psycopg3 library. This library exposes both a Connection class
> and an AsyncConnection class (using python its asyncio feature). But
> one downside of the AsyncConnection type is that it doesn't have a
> cancel method.

As part of my testing, I've implemented non-blocking cancellation in Psycopg, based on v16 of this patchset. Overall this worked fine and seems useful; if you want to try it: https://github.com/dlax/psycopg3/tree/pg16/non-blocking-pqcancel

(The only thing I found slightly inconvenient is the need to convey the connection encoding (from PGconn) when handling error messages from the PGcancelConn.)

Cheers,
Denis
Attachments
On Tue, 28 Mar 2023 at 16:54, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> I had a look into your patchset (v16), did a quick review and played a
> bit with the feature.
>
> Patch 2 is missing the documentation about PQcancelSocket() and contains
> a few typos; please find attached a (fixup) patch to correct these.

Thanks, I applied that patch and attached a new patchset.

> Namely, I wonder why it returns a PGcancelConn and what's the
> point of requiring the user to call PQcancelStatus() to see if something
> got wrong. Maybe it could be defined as:
>
>     int PQcancelSend(PGcancelConn *cancelConn);
>
> where the return value would be status? And the user would only need to
> call PQcancelErrorMessage() in case of error. This would leave only one
> single way to create a PGcancelConn value (i.e. PQcancelConn()), which
> seems less confusing to me.

To clarify what you mean, the API would then be like this:

    PGcancelConn *cancelConn = PQcancelConn(conn);
    if (PQcancelSend(cancelConn) == CONNECTION_BAD) {
        printf("ERROR %s\n", PQcancelErrorMessage(cancelConn));
        exit(1);
    }

Instead of:

    PGcancelConn *cancelConn = PQcancelSend(conn);
    if (PQcancelStatus(cancelConn) == CONNECTION_BAD) {
        printf("ERROR %s\n", PQcancelErrorMessage(cancelConn));
        exit(1);
    }

Those are so similar that I have no preference either way. If more people prefer one over the other I'm happy to change it, but for now I'll keep it as is.

> As part of my testing, I've implemented non-blocking cancellation in
> Psycopg, based on v16 on this patchset. Overall this worked fine and
> seems useful; if you want to try it:
>
> https://github.com/dlax/psycopg3/tree/pg16/non-blocking-pqcancel

That's great to hear! I'll try to take a closer look at that change tomorrow.

> (The only thing I found slightly inconvenient is the need to convey the
> connection encoding (from PGconn) when handling error message from the
> PGcancelConn.)

Could you expand a bit more on this?
And if you have any idea on how to improve the API with regards to this?
Attachments
Jelte Fennema wrote:
> > Namely, I wonder why it returns a PGcancelConn and what's the
> > point of requiring the user to call PQcancelStatus() to see if something
> > got wrong. Maybe it could be defined as:
> >
> >     int PQcancelSend(PGcancelConn *cancelConn);
> >
> > where the return value would be status? And the user would only need to
> > call PQcancelErrorMessage() in case of error. This would leave only one
> > single way to create a PGcancelConn value (i.e. PQcancelConn()), which
> > seems less confusing to me.
>
> To clarify what you mean, the API would then be like this:
>     PGcancelConn *cancelConn = PQcancelConn(conn);
>     if (PQcancelSend(cancelConn) == CONNECTION_BAD) {
>         printf("ERROR %s\n", PQcancelErrorMessage(cancelConn));
>         exit(1);
>     }

I'm not sure it's worth returning the connection status; maybe just an int value (the return value of connectDBComplete() for instance). More importantly, not having PQcancelSend() create the PGcancelConn makes reuse of that value, passing through PQcancelReset(), more intuitive.
E.g., in the tests:

diff --git a/src/test/modules/libpq_pipeline/libpq_pipeline.c b/src/test/modules/libpq_pipeline/libpq_pipeline.c
index 6764ab513b..91363451af 100644
--- a/src/test/modules/libpq_pipeline/libpq_pipeline.c
+++ b/src/test/modules/libpq_pipeline/libpq_pipeline.c
@@ -217,17 +217,18 @@ test_cancel(PGconn *conn, const char *conninfo)
 		pg_fatal("failed to run PQrequestCancel: %s", PQerrorMessage(conn));
 	confirm_query_cancelled(conn);

+	cancelConn = PQcancelConn(conn);
+
 	/* test PQcancelSend */
 	send_cancellable_query(conn, monitorConn);
-	cancelConn = PQcancelSend(conn);
-	if (PQcancelStatus(cancelConn) == CONNECTION_BAD)
+	if (PQcancelSend(cancelConn) == CONNECTION_BAD)
 		pg_fatal("failed to run PQcancelSend: %s", PQcancelErrorMessage(cancelConn));
 	confirm_query_cancelled(conn);
-	PQcancelFinish(cancelConn);
+
+	PQcancelReset(cancelConn);

 	/* test PQcancelConn and then polling with PQcancelPoll */
 	send_cancellable_query(conn, monitorConn);
-	cancelConn = PQcancelConn(conn);
 	if (PQcancelStatus(cancelConn) == CONNECTION_BAD)
 		pg_fatal("bad cancel connection: %s", PQcancelErrorMessage(cancelConn));
 	while (true)

Otherwise, it's not clear if the PGcancelConn created by PQcancelSend() should be reused or not. But maybe that's a matter of documentation?

> > As part of my testing, I've implemented non-blocking cancellation in
> > Psycopg, based on v16 on this patchset. Overall this worked fine and
> > seems useful; if you want to try it:
> >
> > https://github.com/dlax/psycopg3/tree/pg16/non-blocking-pqcancel
>
> That's great to hear! I'll try to take a closer look at that change tomorrow.

See also https://github.com/psycopg/psycopg/issues/534 if you want to discuss this.

> > (The only thing I found slightly inconvenient is the need to convey the
> > connection encoding (from PGconn) when handling error message from the
> > PGcancelConn.)
>
> Could you expand a bit more on this? And if you have any idea on how
> to improve the API with regards to this?
The thing is that we need the connection encoding (client_encoding) when eventually forwarding the result of PQcancelErrorMessage(), decoded, to the user. More specifically, it seems to me that we'd need the encoding of the *cancel connection*, but since PQparameterStatus() cannot be used with a PGcancelConn, I use that of the PGconn. Roughly, in Python:

    encoding = conn.parameter_status(b"client_encoding")
    # i.e., in C: char *encoding = PQparameterStatus(conn, "client_encoding");
    cancel_conn = conn.cancel_conn()
    # i.e., in C: PGcancelConn *cancelConn = PQcancelConn(conn);
    # [... then work with cancel_conn ...]
    if cancel_conn.status == ConnStatus.BAD:
        raise OperationalError(cancel_conn.error_message().decode(encoding))

This feels a bit non-atomic to me; isn't there a risk that client_encoding gets changed between the PQparameterStatus(conn) and PQcancelConn(conn) calls? So maybe PQcancelParameterStatus(PGcancelConn *cancelConn, char *name) is needed?
On Wed, 29 Mar 2023 at 10:43, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> More importantly, not having PQcancelSend() creating the PGcancelConn
> makes reuse of that value, passing through PQcancelReset(), more
> intuitive. E.g., in the tests:

You convinced me. Attached is an updated patch where PQcancelSend takes the PGcancelConn and returns 1 or 0.

> The thing is that we need the connection encoding (client_encoding) when
> eventually forwarding the result of PQcancelErrorMessage(), decoded, to
> the user.

Cancel connections don't have an encoding specified. They never receive an error from the server. All errors come from the machine that libpq is on. So I think you're making the decoding more complicated than it needs to be.
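For readers following along, here is a minimal sketch of how the non-blocking API discussed in this thread fits together. Function names are as in the patches under review (so they may change before commit), and this is untested pseudocode in C form: a real event loop would wait on PQcancelSocket() with poll()/select() instead of spinning.

```c
#include <stdio.h>
#include <stdbool.h>
#include <libpq-fe.h>

/* Sketch, assuming the patch's API: start a cancel request for the
 * query running on "conn" and poll it to completion. */
static bool
cancel_current_query(PGconn *conn)
{
	PGcancelConn *cancelConn = PQcancelConn(conn);
	bool		ok = false;

	if (PQcancelStatus(cancelConn) == CONNECTION_BAD)
	{
		fprintf(stderr, "cancel setup failed: %s",
				PQcancelErrorMessage(cancelConn));
		PQcancelFinish(cancelConn);
		return false;
	}

	for (;;)
	{
		PostgresPollingStatusType st = PQcancelPoll(cancelConn);

		if (st == PGRES_POLLING_OK)
		{
			/* per upthread discussion, reaching PGRES_POLLING_OK
			 * means the cancel request has been delivered */
			ok = true;
			break;
		}
		if (st == PGRES_POLLING_FAILED)
		{
			fprintf(stderr, "cancel failed: %s",
					PQcancelErrorMessage(cancelConn));
			break;
		}
		/* PGRES_POLLING_READING/WRITING: a real event loop would now
		 * wait for PQcancelSocket(cancelConn) to become ready. */
	}

	PQcancelFinish(cancelConn);	/* or PQcancelReset() to reuse it */
	return ok;
}
```

The blocking-style equivalent in the patch is PQcancelSend(), which runs this same state machine to completion internally.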
Attachments
Jelte Fennema a écrit : > On Wed, 29 Mar 2023 at 10:43, Denis Laxalde <denis.laxalde@dalibo.com> wrote: > > More importantly, not having PQcancelSend() creating the PGcancelConn > > makes reuse of that value, passing through PQcancelReset(), more > > intuitive. E.g., in the tests: > > You convinced me. Attached is an updated patch where PQcancelSend > takes the PGcancelConn and returns 1 or 0. Patch 5 is missing respective changes; please find attached a fixup patch for these.
Attachments
On Thu, 30 Mar 2023 at 10:07, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> Patch 5 is missing respective changes; please find attached a fixup
> patch for these.

Thanks, attached are newly rebased patches that include this change. I also cast the result of PQcancelSend to void in the one case where it's ignored on purpose. Note that the patchset shrunk by one, since the original patch 0002 has been committed now.
Attachments
The patch set does not apply any more. I tried to rebase locally; even leaving out patch 1 ("libpq: Run pgindent after a9e9a9f32b3"), patch 4 ("Start using new libpq cancel APIs") is harder to resolve following 983ec23007b (I suppose). Apart from that, the implementation in v19 sounds good to me, and seems worthwhile. FWIW, as said before, I also implemented it in Psycopg as a sort of end-to-end validation.
Okay, I rebased again. Indeed 983ec23007b gave the most problems. On Fri, 7 Apr 2023 at 10:02, Denis Laxalde <denis.laxalde@dalibo.com> wrote: > > The patch set does not apply any more. > > I tried to rebase locally; even leaving out 1 ("libpq: Run pgindent > after a9e9a9f32b3"), patch 4 ("Start using new libpq cancel APIs") is > harder to resolve following 983ec23007b (I suppose). > > Appart from that, the implementation in v19 sounds good to me, and seems > worthwhile. FWIW, as said before, I also implemented it in Psycopg in a > sort of an end-to-end validation.
Attachments
I noticed that cfbot was unable to run tests due to a rebase conflict. It seems the pgindent changes from patch 1 have now been made, so I'm adding the rebased patches without patch 1 to unblock cfbot.
Attachments
Rebased again to resolve some conflicts.
Attachments
Trivial observation: these patches obviously introduce many instances of words derived from "cancel", but they don't all conform to established project decisions (cf 21f1e15a) about how to spell them. We follow the common en-US usage: "canceled", "canceling" but "cancellation". Blame Webstah et al. https://english.stackexchange.com/questions/176957/cancellation-canceled-canceling-us-usage
On Mon, 13 Nov 2023 at 03:39, Thomas Munro <thomas.munro@gmail.com> wrote:
> We follow the common en-US usage: "canceled", "canceling" but
> "cancellation". Blame Webstah et al.

I changed all the places that were not adhering to those spellings. There were also a few such places in parts of the codebase that these changes didn't touch; I included a new 0001 patch to fix those. I do feel like this patchset is pretty much in a committable state, so it would be very much appreciated if any committer could help push it over the finish line.
Attachments
On Thu, 14 Dec 2023 at 13:57, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> I changed all the places that were not adhering to those spellings.

It seems I forgot a /g on my sed command when doing this, so I missed one occurrence, which caused the test to fail to compile... Attached is a fixed version. I also updated the patchset to use the EOF detection provided by 0a5c46a7a488f2f4260a90843bb9de6c584c7f4e instead of introducing a new way of EOF detection using a -2 return value.
Attachments
On Wed, 20 Dec 2023 at 19:17, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> On Thu, 14 Dec 2023 at 13:57, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > I changed all the places that were not adhering to those spellings.
>
> It seems I forgot a /g on my sed command to do this so it turned out I
> missed one that caused the test to fail to compile... Attached is a
> fixed version.
>
> I also updated the patchset to use the EOF detection provided by
> 0a5c46a7a488f2f4260a90843bb9de6c584c7f4e instead of introducing a new
> way of EOF detection using a -2 return value.

CFBot shows that the patch does not apply anymore as in [1]:

patching file doc/src/sgml/libpq.sgml
...
patching file src/interfaces/libpq/exports.txt
Hunk #1 FAILED at 191.
1 out of 1 hunk FAILED -- saving rejects to file src/interfaces/libpq/exports.txt.rej
patching file src/interfaces/libpq/fe-connect.c

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_3511.log

Regards,
Vignesh
On Fri, 26 Jan 2024 at 02:59, vignesh C <vignesh21@gmail.com> wrote: > Please post an updated version for the same. Done.
Attachments
Pushed 0001. I wonder, would it make sense to put all these new functions in a separate file fe-cancel.c? -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "World domination is proceeding according to plan" (Andrew Morton)
On Fri, 26 Jan 2024 at 13:11, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I wonder, would it make sense to put all these new functions in a > separate file fe-cancel.c? Okay I tried doing that. I think the end result is indeed quite nice, having all the cancellation related functions together in a file. But it did require making a bunch of static functions in fe-connect extern, and adding them to libpq-int.h. On one hand that seems fine to me, on the other maybe that indicates that this cancellation logic makes sense to be in the same file as the other connection functions (in a sense, connecting is all that a cancel request does).
Attachments
On 2024-Jan-26, Jelte Fennema-Nio wrote: > Okay I tried doing that. I think the end result is indeed quite nice, > having all the cancellation related functions together in a file. But > it did require making a bunch of static functions in fe-connect > extern, and adding them to libpq-int.h. On one hand that seems fine to > me, on the other maybe that indicates that this cancellation logic > makes sense to be in the same file as the other connection functions > (in a sense, connecting is all that a cancel request does). Yeah, I see that point of view as well. I like the end result; the additional protos in libpq-int.h don't bother me. Does anybody else wants to share their opinion on it? If none, then I'd consider going ahead with this version. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "We’ve narrowed the problem down to the customer’s pants being in a situation of vigorous combustion" (Robert Haas, Postgres expert extraordinaire)
On Fri, 26 Jan 2024 at 18:19, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Yeah, I see that point of view as well. I like the end result; the > additional protos in libpq-int.h don't bother me. Does anybody else > wants to share their opinion on it? If none, then I'd consider going > ahead with this version. To be clear, I'm +1 on the new file structure (although if people feel strongly against it, I don't care enough to make a big deal out of it). @Alvaro did you have any other comments on the contents of the patch btw?
On Fri, 26 Jan 2024 at 22:22, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> On Fri, 26 Jan 2024 at 13:11, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I wonder, would it make sense to put all these new functions in a
> > separate file fe-cancel.c?
>
> Okay I tried doing that. I think the end result is indeed quite nice,
> having all the cancellation related functions together in a file. But
> it did require making a bunch of static functions in fe-connect
> extern, and adding them to libpq-int.h. On one hand that seems fine to
> me, on the other maybe that indicates that this cancellation logic
> makes sense to be in the same file as the other connection functions
> (in a sense, connecting is all that a cancel request does).

CFBot shows that the patch has a few compilation errors as in [1]:

[17:07:07.621] /usr/bin/ld: ../../../src/fe_utils/libpgfeutils.a(cancel.o): in function `handle_sigint':
[17:07:07.621] cancel.c:(.text+0x50): undefined reference to `PQcancel'
[17:07:07.621] /usr/bin/ld: ../../../src/fe_utils/libpgfeutils.a(cancel.o): in function `SetCancelConn':
[17:07:07.621] cancel.c:(.text+0x10c): undefined reference to `PQfreeCancel'
[17:07:07.621] /usr/bin/ld: cancel.c:(.text+0x114): undefined reference to `PQgetCancel'
[17:07:07.621] /usr/bin/ld: ../../../src/fe_utils/libpgfeutils.a(cancel.o): in function `ResetCancelConn':
[17:07:07.621] cancel.c:(.text+0x148): undefined reference to `PQfreeCancel'
[17:07:07.621] /usr/bin/ld: ../../../src/fe_utils/libpgfeutils.a(connect_utils.o): in function `disconnectDatabase':
[17:07:07.621] connect_utils.c:(.text+0x2fc): undefined reference to `PQcancelConn'
[17:07:07.621] /usr/bin/ld: connect_utils.c:(.text+0x307): undefined reference to `PQcancelSend'
[17:07:07.621] /usr/bin/ld: connect_utils.c:(.text+0x30f): undefined reference to `PQcancelFinish'
[17:07:07.623] /usr/bin/ld: ../../../src/interfaces/libpq/libpq.so: undefined reference to `PQcancelPoll'
[17:07:07.626] collect2: error: ld returned 1 exit status
[17:07:07.626] make[3]: *** [Makefile:31: pg_amcheck] Error 1
[17:07:07.626] make[2]: *** [Makefile:45: all-pg_amcheck-recurse] Error 2
[17:07:07.626] make[2]: *** Waiting for unfinished jobs....
[17:07:08.126] /usr/bin/ld: ../../../src/interfaces/libpq/libpq.so: undefined reference to `PQcancelPoll'
[17:07:08.130] collect2: error: ld returned 1 exit status
[17:07:08.131] make[3]: *** [Makefile:42: initdb] Error 1
[17:07:08.131] make[2]: *** [Makefile:45: all-initdb-recurse] Error 2
[17:07:08.492] /usr/bin/ld: ../../../src/interfaces/libpq/libpq.so: undefined reference to `PQcancelPoll'
[17:07:08.495] collect2: error: ld returned 1 exit status
[17:07:08.496] make[3]: *** [Makefile:50: pg_basebackup] Error 1
[17:07:08.496] make[2]: *** [Makefile:45: all-pg_basebackup-recurse] Error 2
[17:07:09.060] /usr/bin/ld: parallel.o: in function `sigTermHandler':
[17:07:09.060] parallel.c:(.text+0x1aa): undefined reference to `PQcancel'

Please post an updated version for the same.

[1] - https://cirrus-ci.com/task/6210637211107328

Regards,
Vignesh
On Sun, 28 Jan 2024 at 04:15, vignesh C <vignesh21@gmail.com> wrote:
> CFBot shows that the patch has a few compilation errors, as in [1]:
> [17:07:07.621] /usr/bin/ld:
> ../../../src/fe_utils/libpgfeutils.a(cancel.o): in function
> `handle_sigint':
> [17:07:07.621] cancel.c:(.text+0x50): undefined reference to `PQcancel'

I forgot to update ./configure-based builds with the new file; only meson was working. Also it seems I trimmed the header list of fe-cancel.c a bit too much for OSX, so I added unistd.h back. Both of those are fixed now.
Attachments
On Sun, 28 Jan 2024 at 10:51, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> Both of those are fixed now.

Okay, there turned out to also be an issue on Windows with setKeepalivesWin32 not being available in fe-cancel.c. That's fixed now too (as well as some minor formatting issues).
Attachments
On 2024-Jan-28, Jelte Fennema-Nio wrote:
> On Sun, 28 Jan 2024 at 10:51, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > Both of those are fixed now.
>
> Okay, there turned out to also be an issue on Windows with
> setKeepalivesWin32 not being available in fe-cancel.c. That's fixed
> now too (as well as some minor formatting issues).

Thanks! I committed 0001 now. I also renamed the new pq_parse_int_param to pqParseIntParam, for consistency with other routines there. Please rebase the other patches.

Thanks,

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)
On Mon, 29 Jan 2024 at 12:44, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Thanks! I committed 0001 now. I also renamed the new
> pq_parse_int_param to pqParseIntParam, for consistency with other
> routines there. Please rebase the other patches.

Awesome! Rebased, and renamed pq_release_conn_hosts to pqReleaseConnHosts for the same consistency reasons.
Attachments
On 2024-Jan-29, Jelte Fennema-Nio wrote:
> On Mon, 29 Jan 2024 at 12:44, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Thanks! I committed 0001 now. I also renamed the new
> > pq_parse_int_param to pqParseIntParam, for consistency with other
> > routines there. Please rebase the other patches.
>
> Awesome! Rebased, and renamed pq_release_conn_hosts to
> pqReleaseConnHosts for the same consistency reasons.

Thank you, looks good. I propose the following minor/trivial fixes over your initial 3 patches.

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu. Five minutes later I realize that it's also talking
about food" (Donald Knuth)
Attachments
On Fri, 2 Feb 2024 at 13:19, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Thank you, looks good.
>
> I propose the following minor/trivial fixes over your initial 3 patches.

All of those seem like good fixes. Attached is an updated patchset where they are all applied, as well as adding a missing word ("been") in a comment that I noticed while reading your fixes.
Attachments
Hello,

The patched docs claim that PQrequestCancel is insecure, but neither the code nor the docs explain why. The docs for PQcancel, on the other hand, do mention that encryption is not used; does that apply to PQrequestCancel as well, and is that the reason? If so, I think we should copy the warning and perhaps include a code comment about that. Also, maybe that final phrase in PQcancel should be a <caution> box: remove from "So, for example" and add

  <caution><para>Because gssencmode and sslencmode are not preserved from
  the original connection, the cancel request is not
  encrypted.</para></caution>

or something like that.

I wonder if Section 33.7 Canceling Queries in Progress should be split in three subsections, and I propose the following order:

33.7.1 PGcancelConn-based Cancellation API
  PQcancelConn -- we first document the basics
  PQcancelSend
  PQcancelFinish
  PQcancelPoll -- the nonblocking interface is documented next
  PQcancelReset -- reuse a cancelconn, later in docs because it's more advanced
  PQcancelStatus -- accessors go last
  PQcancelSocket
  PQcancelErrorMessage

33.7.2 Obsolete interface
  PQgetCancel
  PQfreeCancel
  PQcancel

33.7.3 Deprecated and Insecure Methods
  PQrequestCancel

I have a hard time coming up with good subsection titles though.

Now, looking at this list, I think it's surprising that the nonblocking request for a cancellation is called PQcancelPoll. PQcancelSend() is at odds with the asynchronous query API, which uses the verb "send" for the asynchronous variants. This would suggest that PQcancelPoll should actually be called PQcancelSend or maybe PQcancelStart (mimicking PQconnectStart). I'm not sure what's a good alternative name for the blocking one, which you have called PQcancelSend.

I see upthread that the names of these functions were already quite heavily debated. Sorry to beat that dead horse some more ... I'm just not sure it's a decided matter.

Lastly -- the doc blurbs that say simply "a version of XYZ that can be used for cancellation connections" are a bit underwhelming. Shouldn't we document these more fully instead of making users go read the docs for the other functions and wonder what the differences might be, if any?

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"Before you were born your parents weren't as boring as they are now. They
got that way paying your bills, cleaning up your room and listening to you
tell them how idealistic you are." -- Charles J. Sykes' advice to teenagers
On Fri, 2 Feb 2024 at 16:06, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Now, looking at this list, I think it's surprising that the nonblocking
> request for a cancellation is called PQcancelPoll. PQcancelSend() is at
> odds with the asynchronous query API, which uses the verb "send" for the
> asynchronous variants. This would suggest that PQcancelPoll should
> actually be called PQcancelSend or maybe PQcancelStart (mimicking
> PQconnectStart). I'm not sure what's a good alternative name for the
> blocking one, which you have called PQcancelSend.

I agree that Send is an unfortunate suffix. I'd love to use PQcancel for this, but obviously that one is already taken. Some other options that I can think of are (from favorite to least favorite):
- PQcancelBlocking
- PQcancelAndWait
- PQcancelGo
- PQcancelNow

Finally, another option would be to rename PQcancelConn to PQgetCancelConn and then rename PQcancelSend to PQcancelConn.

Regarding PQcancelPoll, I think it's a good name for the polling function, but I agree it's a bit confusing to use it to also start sending the connection. Even the code of PQcancelPoll basically admits that this is confusing behaviour:

    /*
     * Before we can call PQconnectPoll we first need to start the connection
     * using pqConnectDBStart. Non-cancel connections already do this whenever
     * the connection is initialized. But cancel connections wait until the
     * caller starts polling, because there might be a large delay between
     * creating a cancel connection and actually wanting to use it.
     */
    if (conn->status == CONNECTION_STARTING)
    {
        if (!pqConnectDBStart(&cancelConn->conn))
        {
            cancelConn->conn.status = CONNECTION_STARTED;
            return PGRES_POLLING_WRITING;
        }
    }

The only reasonable thing I can think of to make that situation better is to move that part of the function outside of PQcancelPoll and create a dedicated PQcancelStart function for it. It introduces an extra function, but it does seem more in line with how we do regular connection establishment. Basically you would have code like this then, which looks quite nice honestly:

    cancelConn = PQcancelConn(conn);
    if (!PQcancelStart(cancelConn))
        pg_fatal("bad cancel connection: %s", PQcancelErrorMessage(cancelConn));
    while (true)
    {
        // polling using PQcancelPoll here
    }
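Spelled out a bit further, the full non-blocking flow with such a dedicated start function could look roughly like the following sketch. This uses the function names as proposed in this patchset (PQcancelConn, PQcancelStart, PQcancelPoll, PQcancelSocket, PQcancelStatus, PQcancelErrorMessage, PQcancelFinish), which are still under discussion, so treat it as illustrative rather than final API:

```c
/* Sketch only: cancel the query running on "conn" without blocking
 * indefinitely inside libpq. In a real event loop the select() calls
 * would be replaced by the loop's own readiness notifications. */
#include <stdio.h>
#include <sys/select.h>
#include "libpq-fe.h"

static void
cancel_query_nonblocking(PGconn *conn)
{
    PGcancelConn *cancelConn = PQcancelConn(conn);
    PostgresPollingStatusType pollres;

    if (!PQcancelStart(cancelConn))
    {
        fprintf(stderr, "bad cancel connection: %s",
                PQcancelErrorMessage(cancelConn));
        PQcancelFinish(cancelConn);
        return;
    }

    /* same polling protocol as PQconnectStart/PQconnectPoll */
    pollres = PGRES_POLLING_WRITING;
    while (pollres != PGRES_POLLING_OK && pollres != PGRES_POLLING_FAILED)
    {
        int     sock = PQcancelSocket(cancelConn);
        fd_set  input_mask;
        fd_set  output_mask;

        FD_ZERO(&input_mask);
        FD_ZERO(&output_mask);
        if (pollres == PGRES_POLLING_READING)
            FD_SET(sock, &input_mask);
        else
            FD_SET(sock, &output_mask);

        /* wait for the socket to become ready, then poll again */
        if (select(sock + 1, &input_mask, &output_mask, NULL, NULL) < 0)
            break;

        pollres = PQcancelPoll(cancelConn);
    }

    if (PQcancelStatus(cancelConn) != CONNECTION_OK)
        fprintf(stderr, "cancel failed: %s",
                PQcancelErrorMessage(cancelConn));
    PQcancelFinish(cancelConn);
}
```

(Not runnable standalone: it needs a libpq build containing this patchset and a live connection with a query in flight.)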
On 2024-Feb-02, Jelte Fennema-Nio wrote:
> The only reasonable thing I can think of to make that situation better
> is to move that part of the function outside of PQcancelPoll and
> create a dedicated PQcancelStart function for it. It introduces an
> extra function, but it does seem more in line with how we do the
> regular connection establishment. Basically you would have code like
> this then, which looks quite nice honestly:
>
>     cancelConn = PQcancelConn(conn);
>     if (!PQcancelStart(cancelConn))
>         pg_fatal("bad cancel connection: %s", PQcancelErrorMessage(cancelConn));
>     while (true)
>     {
>         // polling using PQcancelPoll here
>     }

Maybe this is okay? I'll have a look at the whole final situation more carefully later; or if somebody else wants to share an opinion, please do so.

In the meantime I pushed your 0002 and 0003 patches, so you can take this as an opportunity to rebase the remaining ones.

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom." (Isaac Asimov)
On Sun, 4 Feb 2024 at 16:39, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Maybe this is okay? I'll have a look at the whole final situation more
> carefully later; or if somebody else wants to share an opinion, please
> do so.

Attached is a new version of the final patches, with much improved docs (imho) and the new function names: PQcancelStart and PQcancelBlocking.
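For illustration, a minimal use of the blocking variant would then look something like this (a sketch against this patchset's function names, not committed API):

```c
/* Sketch: blocking cancellation using this patchset's names.
 * PQcancelBlocking connects and sends the request in one call;
 * the status check afterwards reports whether it was delivered. */
PGcancelConn *cancelConn = PQcancelConn(conn);

if (!PQcancelBlocking(cancelConn) ||
    PQcancelStatus(cancelConn) != CONNECTION_OK)
    fprintf(stderr, "cancel request failed: %s",
            PQcancelErrorMessage(cancelConn));
PQcancelFinish(cancelConn);
```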
Attachments
On 2024-Feb-14, Jelte Fennema-Nio wrote:
> Attached is a new version of the final patches, with much improved
> docs (imho) and the new function names: PQcancelStart and
> PQcancelBlocking.

Hmm, I think the changes to libpq_pipeline in 0005 should be in 0004.

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
On Wed, 14 Feb 2024 at 18:41, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Hmm, I think the changes to libpq_pipeline in 0005 should be in 0004.

Yeah, you're correct. Fixed that now.
Attachments
In patch 0004, I noticed a couple of typos in the documentation; please find attached a fixup patch correcting these.

Still in the documentation, same patch, the last paragraph documenting PQcancelPoll() ends as:

    + indicate the current stage of the connection procedure and might be useful
    + to provide feedback to the user for example. These statuses are:
    + </para>

while not actually listing the "statuses". Should we list them? Adjust the wording? Or refer to the PQconnectPoll() documentation (since the paragraph seems to have been copied from there)?

Otherwise, the feature still works fine as far as I can tell.
Attachments
On Wed, 6 Mar 2024 at 15:03, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> In patch 0004, I noticed a couple of typos in the documentation; please
> find attached a fixup patch correcting these.

Thanks, applied.

> while not actually listing the "statuses". Should we list them?

I listed the relevant statuses now and updated the PQcancelStatus docs to look more like the PQstatus ones. I didn't list any statuses that a cancel connection could never have (but a normal connection can). While going over the list of statuses possible for a cancel connection, I realized that the docs for PQconnectStart were not listing all relevant statuses, so I fixed that in patch 0001.
Attachments
Docs: one bogus "that that".

Did we consider having PQcancelConn() instead be called PQcancelCreate()? I think this better conveys that what we're doing is create an object that can be used to do something, and that nothing else is done with it by default. Also, the comment still says "Asynchronously cancel a query on the given connection. This requires polling the returned PGcancelConn to actually complete the cancellation of the query." but this is no longer a good description of what this function does.

Why do we return a non-NULL pointer from PQcancelConn in the first three cases where we return errors (original conn was NULL, original conn is PGINVALID_SOCKET, pqCopyPGconn returns failure)? Wouldn't it make more sense to free the allocated object and return NULL? Actually, I wonder if there's any reason at all to return a valid pointer in any failure case; I mean, do we really expect that application authors are going to read/report the error message from a PGcancelConn that failed to be fully created? Anyway, maybe there are reasons for this; but in any case we should set ->cancelRequest in all cases, not only after the first tests for errors.

I think the extra PGconn inside pg_cancel_conn is useless; it would be simpler to typedef PGcancelConn to PGconn in fe-cancel.c, and remove the indirection through the extra struct. You're actually dereferencing the object in two ways in the new code, both by casting the outer object straight to PGconn (taking advantage of the struct member being first in the struct), and by using PGcancelConn->conn. This seems pointless. I mean, if we're going to cast to "PGconn *" in some places anyway, then we may as well access all members directly. Perhaps, if you want, you could add asserts that ->cancelRequest is set true in all the fe-cancel.c functions. Anyway, we'd still have compiler support to tell you that you're passing the wrong struct to the function. (I didn't actually try to change the code this way, so I might be wrong.)

We could move the definition of struct pg_cancel to fe-cancel.c. Nobody outside that needs to know that definition anyway.

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"XML!" Exclaimed C++. "What are you doing here? You're not a programming
language." "Tell that to the people who use me," said XML.
https://burningbird.net/the-parable-of-the-languages/
On Wed, 6 Mar 2024 at 19:22, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Docs: one bogus "that that".

will fix

> Did we consider having PQcancelConn() instead be called
> PQcancelCreate()?

Fine by me

> Also, the comment still says
> "Asynchronously cancel a query on the given connection. This requires
> polling the returned PGcancelConn to actually complete the cancellation
> of the query." but this is no longer a good description of what this
> function does.

will fix

> Why do we return a non-NULL pointer from PQcancelConn in the first three
> cases where we return errors? (original conn was NULL, original conn is
> PGINVALID_SOCKET, pqCopyPGconn returns failure) Wouldn't it make more
> sense to free the allocated object and return NULL? Actually, I wonder
> if there's any reason at all to return a valid pointer in any failure
> cases; I mean, do we really expect that application authors are going to
> read/report the error message from a PGcancelConn that failed to be fully
> created?

I think having a useful error message when possible is quite nice. And I do think people will read/report this error message. Especially since many people will simply pass it to PQcancelBlocking, whether it's NULL or not, and then check the status, and then report the error if the status was CONNECTION_BAD.

> but in any case we
> should set ->cancelRequest in all cases, not only after the first tests
> for errors.

makes sense

> I think the extra PGconn inside pg_cancel_conn is useless; it would be
> simpler to typedef PGcancelConn to PGconn in fe-cancel.c, and remove the
> indirection through the extra struct.

That sounds nice indeed. I'll try it out.

> We could move the definition of struct pg_cancel to fe-cancel.c. Nobody
> outside that needs to know that definition anyway.

will do
Attached is a new patchset with various changes. I created a dedicated 0002 patch to add tests for the already existing cancellation functions, because that seemed useful for another thread where changes to the cancellation protocol are being proposed[1].

[1]: https://www.postgresql.org/message-id/flat/508d0505-8b7a-4864-a681-e7e5edfe32aa%40iki.fi

On Wed, 6 Mar 2024 at 19:22, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Docs: one bogus "that that".

This was already fixed by my previous doc changes in v32; I guess that email crossed with this one.

> Did we consider having PQcancelConn() instead be called
> PQcancelCreate()?

Done

> "Asynchronously cancel a query on the given connection. This requires
> polling the returned PGcancelConn to actually complete the cancellation
> of the query." but this is no longer a good description of what this
> function does.

Fixed

> Anyway, maybe there are reasons for this; but in any case we
> should set ->cancelRequest in all cases, not only after the first tests
> for errors.

Done

> I think the extra PGconn inside pg_cancel_conn is useless; it would be
> simpler to typedef PGcancelConn to PGconn in fe-cancel.c, and remove the
> indirection through the extra struct. You're actually dereferencing the
> object in two ways in the new code, both by casting the outer object
> straight to PGconn (taking advantage that the struct member is first in
> the struct), and by using PGcancelConn->conn. This seems pointless. I
> mean, if we're going to cast to "PGconn *" in some places anyway, then
> we may as well access all members directly. Perhaps, if you want, you
> could add asserts that ->cancelRequest is set true in all the
> fe-cancel.c functions. Anyway, we'd still have compiler support to tell
> you that you're passing the wrong struct to the function. (I didn't
> actually try to change the code this way, so I might be wrong.)

Turns out you were wrong about the compiler support telling us we're passing the wrong struct: when both the PGconn and PGcancelConn typedefs refer to the same struct, the compiler allows passing a PGconn to PGcancelConn functions and vice versa without complaining. This seems enough reason for me to keep the indirection through the extra struct. So instead of adding the proposed typedef, I chose to add a comment to pg_cancel_conn explaining its purpose, as well as not casting PGcancelConn to PGconn but always accessing the conn field for consistency.

> We could move the definition of struct pg_cancel to fe-cancel.c. Nobody
> outside that needs to know that definition anyway.

Done in 0003
Attachments
Here's a last one for the cfbot.

I have a question about this one:

    int
    PQcancelStart(PGcancelConn *cancelConn)
    {
        [...]
        if (cancelConn->conn.status != CONNECTION_ALLOCATED)
        {
            libpq_append_conn_error(&cancelConn->conn,
                                    "cancel request is already being sent on this connection");
            cancelConn->conn.status = CONNECTION_BAD;
            return 0;
        }

If we do this and we see conn.status is not ALLOCATED, meaning a cancel is already ongoing, shouldn't we leave conn.status alone instead of changing it to CONNECTION_BAD? I mean, we shouldn't be juggling the elbow of whoever's doing that, should we? Maybe just add the error message and return 0?

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"If it is not right, do not do it. If it is not true, do not say it."
(Marcus Aurelius, Meditations)
Attachments
On Tue, 12 Mar 2024 at 10:19, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> If we do this and we see conn.status is not ALLOCATED, meaning a cancel
> is already ongoing, shouldn't we leave conn.status alone instead of
> changing to CONNECTION_BAD? I mean, we shouldn't be juggling the elbow
> of whoever's doing that, should we? Maybe just add the error message
> and return 0?

I'd rather fail as hard as possible when someone is using the API wrongly. Not doing so is bound to cause confusion imho. E.g. if the state is still CONNECTION_OK because the user forgot to call PQcancelReset, then keeping the connection status "as is" might seem as if the cancel request succeeded, even though nothing happened. So if the user uses the API incorrectly, then I'd rather use all the avenues possible to indicate that there was an error.

Especially since in all other cases where PQcancelStart returns false, CONNECTION_BAD is the status, and this in turn means that PQconnectPoll will return PGRES_POLLING_FAILED. So I doubt people will always check the actual return value of the function to check if an error happened. They might check PQcancelStatus or PQconnectPoll instead, because that integrates easier with the rest of their code.
On Tue, 12 Mar 2024 at 10:53, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> I'd rather fail as hard as possible when someone is using the API
> wrongly.

To be clear, this is my way of looking at it. If you feel strongly that we should not change conn.status, I'm fine with making that change to the patchset.
On Tue, 12 Mar 2024 at 10:19, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Here's a last one for the cfbot.

Thanks for committing the first 3 patches btw. Attached a tiny change to 0001, which adds "(backing struct for PGcancelConn)" to the comment on pg_cancel_conn.
Attachments
On Tue, 12 Mar 2024 at 15:04, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> Attached a tiny change to 0001

One more tiny comment change, stating that pg_cancel is used by the deprecated PQcancel function.
Attachments
On 2024-Mar-12, Jelte Fennema-Nio wrote:
> On Tue, 12 Mar 2024 at 10:19, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Here's a last one for the cfbot.
>
> Thanks for committing the first 3 patches btw. Attached a tiny change
> to 0001, which adds "(backing struct for PGcancelConn)" to the comment
> on pg_cancel_conn.

Thanks, I included it. I hope there were no other changes, because I didn't verify :-) but if there were, please let me know to incorporate them.

I made a number of other small changes, mostly to the documentation, nothing fundamental. (Someday we should stop using <listentry> to document the libpq functions and use refentry's instead ... it'd be useful to have manpages for these functions.)

One thing I don't like very much is release_conn_addrinfo(), which is called conditionally in two places but unconditionally in other places. Maybe it'd make more sense to put this conditionality inside the function itself, possibly with a "bool force" flag to suppress that in the cases where it is not desired.

In pqConnectDBComplete, we cast the PGconn * to PGcancelConn * in order to call PQcancelPoll, which is a bit abusive, but I don't know how to do better. Maybe we just accept this ... but if PQcancelStart is the only way to have pqConnectDBStart called from a cancel connection, maybe it'd be saner to duplicate pqConnectDBStart for cancel conns.

Thanks!

--
Álvaro Herrera        48°01'N 7°57'E — https://www.EnterpriseDB.com/
Hmm, buildfarm member kestrel (which uses -fsanitize=undefined,alignment) failed:

# Running: libpq_pipeline -r 700 cancel port=49975 host=/tmp/dFh46H7YGc dbname='postgres'
test cancellations...
libpq_pipeline:260: query did not fail when it was expected

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-12%2016%3A41%3A27

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom." (Isaac Asimov)
On 2024-Mar-12, Alvaro Herrera wrote:
> Hmm, buildfarm member kestrel (which uses
> -fsanitize=undefined,alignment) failed:
>
> # Running: libpq_pipeline -r 700 cancel port=49975 host=/tmp/dFh46H7YGc
> dbname='postgres'
> test cancellations...
> libpq_pipeline:260: query did not fail when it was expected

Hm, I tried using the same compile flags, couldn't reproduce.

--
Álvaro Herrera        48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)
On Tue, 12 Mar 2024 at 19:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Mar-12, Alvaro Herrera wrote:
> > Hmm, buildfarm member kestrel (which uses
> > -fsanitize=undefined,alignment) failed:
> >
> > # Running: libpq_pipeline -r 700 cancel port=49975 host=/tmp/dFh46H7YGc
> > dbname='postgres'
> > test cancellations...
> > libpq_pipeline:260: query did not fail when it was expected
>
> Hm, I tried using the same compile flags, couldn't reproduce.

Okay, it passed now it seems, so I guess this test is flaky somehow. The error message and the timing difference between the failed and succeeded buildfarm runs clearly indicate that the pg_sleep ran its 180 seconds to completion (so the cancel was never processed for some reason).

**failed case**
282/285 postgresql:libpq_pipeline / libpq_pipeline/001_libpq_pipeline ERROR 191.56s exit status 1

**succeeded case**
252/285 postgresql:libpq_pipeline / libpq_pipeline/001_libpq_pipeline OK 10.01s 21 subtests passed

I don't see any obvious reason for how this test can be flaky, but I'll think a bit more about it tomorrow.
Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> On Tue, 12 Mar 2024 at 19:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>>> Hmm, buildfarm member kestrel (which uses
>>> -fsanitize=undefined,alignment) failed:
>> Hm, I tried using the same compile flags, couldn't reproduce.
> Okay, it passed now it seems so I guess this test is flaky somehow.

Two more intermittent failures:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bushmaster&dt=2024-03-13%2003%3A15%3A09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-03-13%2003%3A15%3A31

These animals all belong to Andres' flotilla, but other than that I'm not seeing a connection. I suspect it's basically just a timing dependency. Have you thought about the fact that a cancel request is a no-op if it arrives after the query's done?

regards, tom lane
On Wed, 13 Mar 2024 at 04:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I suspect it's basically just a timing dependency. Have you thought
> about the fact that a cancel request is a no-op if it arrives after
> the query's done?

I agree it's probably a timing issue. The cancel being received after the query is done seems very unlikely, since the query takes 180 seconds (assuming PG_TEST_TIMEOUT_DEFAULT is not lowered for these animals). I think it's more likely that the cancel request arrives too early, and is thus ignored because no query is running yet. The test already had logic to wait until the query backend was in the "active" state before sending a cancel, to solve that issue. But my guess is that that somehow isn't enough.

Sadly I'm having a hard time reliably reproducing this race condition locally, so it's hard to be sure what is happening here. Attached is a patch with a wild guess as to what the issue might be (i.e. seeing an outdated "active" state and thus passing the check even though the query is not running yet).
Attachments
On 2024-Mar-13, Jelte Fennema-Nio wrote:
> I agree it's probably a timing issue. The cancel being received after
> the query is done seems very unlikely, since the query takes 180
> seconds (assuming PG_TEST_TIMEOUT_DEFAULT is not lowered for these
> animals). I think it's more likely that the cancel request arrives too
> early, and thus being ignored because no query is running yet. The
> test already had logic to wait until the query backend was in the
> "active" state, before sending a cancel to solve that issue. But my
> guess is that that somehow isn't enough.
>
> Sadly I'm having a hard time reliably reproducing this race condition
> locally. So it's hard to be sure what is happening here. Attached is a
> patch with a wild guess as to what the issue might be (i.e. seeing an
> outdated "active" state and thus passing the check even though the
> query is not running yet)

I tried leaving the original running in my laptop to see if I could reproduce it, but got no hits ... and we didn't get any other failures apart from the three ones already reported ... so it's not terribly high probability. Anyway I pushed your patch now since the theory seems plausible; let's see if we still get the issue to reproduce. If it does, we could make the script more verbose to hunt for further clues.

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"Here's a general engineering tip: if the non-fun part is too complex for
you to figure out, that might indicate the fun part is too ambitious."
(John Naylor)
https://postgr.es/m/CAFBsxsG4OWHBbSDM%3DsSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g%40mail.gmail.com
On Wed, Mar 13, 2024 at 12:01 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Mar-13, Jelte Fennema-Nio wrote:
> > Sadly I'm having a hard time reliably reproducing this race condition
> > locally. So it's hard to be sure what is happening here. Attached is a
> > patch with a wild guess as to what the issue might be (i.e. seeing an
> > outdated "active" state and thus passing the check even though the
> > query is not running yet)
>
> I tried leaving the original running in my laptop to see if I could
> reproduce it, but got no hits ... and we didn't get any other failures
> apart from the three ones already reported ... so it's not terribly high
> probability. Anyway I pushed your patch now since the theory seems
> plausible; let's see if we still get the issue to reproduce. If it
> does, we could make the script more verbose to hunt for further clues.

I hit this on my machine. With the attached diff I can reproduce constantly (including with the most recent test patch); I think the cancel must be arriving between the bind/execute steps?

Thanks,
--Jacob
Attachments
On Wed, 13 Mar 2024 at 20:08, Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> I hit this on my machine. With the attached diff I can reproduce
> constantly (including with the most recent test patch); I think the
> cancel must be arriving between the bind/execute steps?

Nice find! Your explanation makes total sense. Attached a patchset that fixes/works around this issue by using the simple query protocol in the cancel test.
Attachments
On 2024-Mar-14, Jelte Fennema-Nio wrote:
> On Wed, 13 Mar 2024 at 20:08, Jacob Champion
> <jacob.champion@enterprisedb.com> wrote:
> > I hit this on my machine. With the attached diff I can reproduce
> > constantly (including with the most recent test patch); I think the
> > cancel must be arriving between the bind/execute steps?
>
> Nice find! Your explanation makes total sense. Attached a patchset
> that fixes/works around this issue by using the simple query protocol
> in the cancel test.

Hmm, isn't this basically saying that we're giving up on reliably canceling queries altogether? I mean, maybe we'd like to instead fix the bug about canceling queries in extended query protocol ... Isn't that something you're worried about?

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"World domination is proceeding according to plan" (Andrew Morton)
On Thu, 14 Mar 2024 at 11:33, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Hmm, isn't this basically saying that we're giving up on reliably > canceling queries altogether? I mean, maybe we'd like to instead fix > the bug about canceling queries in extended query protocol ... > Isn't that something you're worried about? In any case I think it's worth having (non-flaky) test coverage of our libpq cancellation sending code. So I think it makes sense to commit the patch I proposed, even if the backend code that handles those cancellations is arguably buggy. Regarding the question of whether the backend code is actually buggy or not: the way cancel requests are defined to work is a bit awkward. They cancel whatever operation is running on the session when they arrive. So if the session is just in the middle of a Bind and Execute message, there is nothing to cancel. While surprising and probably not what someone would want, I don't think this behaviour is too horrible in practice in this case. Most of the time people cancel queries while the Execute message is being processed. The new test really only runs into this problem because it sends a cancel request immediately after sending the query. I definitely think it's worth rethinking the way we do query cancellations though. I think what we would probably want is a way to cancel a specific query/message on a session, instead of cancelling whatever is running at the moment the cancel request is processed by Postgres. This "cancel whatever is running" behaviour is fraught with issues, the Bind/Execute issue being only one of them. One really annoying race condition, in which a cancel request cancels a different query than intended, can happen with this flow (which I spent a lot of time addressing in PgBouncer):

1. You send query A on session 1
2. You send a cancel request for session 1 (intending to cancel query A)
3. Query A completes by itself
4. You now send query B
5. The cancel request is now processed
6. Query B is now cancelled

But solving that race condition would involve changing the postgres protocol, which I'm trying to make possible with the first few commits in [1]. And while those first few commits might still land in PG17, I don't think a large protocol change like adding query identifiers to cancel requests is feasible for PG17 anymore.
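The race in that six-step flow can be made concrete with a toy model (hypothetical code, not anything from PostgreSQL or PgBouncer): because a cancel request identifies only a session, not a query, it kills whatever happens to be running when it is finally processed.

```c
/* Toy model of the cancel race: cancel requests target a session, not a
 * specific query, so a delayed cancel can hit the wrong query. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct session
{
    const char *current_query;   /* query running right now, or NULL */
    const char *last_cancelled;  /* last query hit by a cancel request */
};

static void
start_query(struct session *s, const char *q)
{
    s->current_query = q;
}

static void
finish_query(struct session *s)
{
    s->current_query = NULL;
}

/* The cancel request carries no query identifier: it cancels whatever
 * happens to be running at the moment it is processed. */
static void
process_cancel_request(struct session *s)
{
    if (s->current_query != NULL)
    {
        s->last_cancelled = s->current_query;
        s->current_query = NULL;
    }
}
```

Walking the model through the numbered steps above (query A starts, the cancel is delayed, A finishes, B starts, the cancel lands) ends with `last_cancelled` pointing at query B, the query the client never meant to cancel.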
I enabled the test again and also pushed the changes to dblink, isolationtester and fe_utils (which AFAICS is used by pg_dump, pg_amcheck, reindexdb and vacuumdb). I chickened out of committing the postgres_fdw changes though, so here they are again. Not sure I'll find the courage to get these done by tomorrow, or whether I should just leave them for Fujita-san or Noah, who have been the last committers to touch this. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "No renuncies a nada. No te aferres a nada."
Attachments
On Mon, Mar 18, 2024 at 07:40:10PM +0100, Alvaro Herrera wrote: > I enabled the test again and also pushed the changes to dblink, > isolationtester and fe_utils (which AFAICS is used by pg_dump, I recommend adding a libpqsrv_cancel() function to libpq-be-fe-helpers.h, to use from dblink and postgres_fdw. pgxn modules calling PQcancel() from the backend (citus pg_bulkload plproxy pmpp) then have a better chance to adopt the new way.
On Thu, 21 Mar 2024 at 03:54, Noah Misch <noah@leadboat.com> wrote: > > On Mon, Mar 18, 2024 at 07:40:10PM +0100, Alvaro Herrera wrote: > > I enabled the test again and also pushed the changes to dblink, > > isolationtester and fe_utils (which AFAICS is used by pg_dump, > > I recommend adding a libpqsrv_cancel() function to libpq-be-fe-helpers.h, to > use from dblink and postgres_fdw. pgxn modules calling PQcancel() from the > backend (citus pg_bulkload plproxy pmpp) then have a better chance to adopt > the new way. Done
Attachments
On 2024-Mar-22, Jelte Fennema-Nio wrote: > On Thu, 21 Mar 2024 at 03:54, Noah Misch <noah@leadboat.com> wrote: > > > > On Mon, Mar 18, 2024 at 07:40:10PM +0100, Alvaro Herrera wrote: > > > I enabled the test again and also pushed the changes to dblink, > > > isolationtester and fe_utils (which AFAICS is used by pg_dump, > > > > I recommend adding a libpqsrv_cancel() function to libpq-be-fe-helpers.h, to > > use from dblink and postgres_fdw. pgxn modules calling PQcancel() from the > > backend (citus pg_bulkload plproxy pmpp) then have a better chance to adopt > > the new way. > > Done Nice, thanks. I played with it a bit, mostly trying to figure out if the chosen API is usable. I toyed with making it return boolean success and the error message as an output argument, because I was nervous about what'd happen on OOM. But since this is a backend environment, what actually happens is that we elog(ERROR) anyway, so we never return a NULL error message. So after the detour I think Jelte's API is okay. I changed it so that the error messages are returned as translated phrases, and was bothered by the fact that if errors happen repeatedly, the memory for them might be leaked. Maybe this is fine depending on the caller's memory context, but since it's only at most one string each time, it's quite easy to just keep track of it so that we can release it on the next call. I ended up reducing the two PG_TRY blocks to a single one. I see no reason to split them up, and this way it looks more legible. What do you think? -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Tiene valor aquel que admite que es un cobarde" (Fernandel)
Attachments
On 2024-Mar-27, Alvaro Herrera wrote: > I changed it so that the error messages are returned as translated > phrases, and was bothered by the fact that if errors happen repeatedly, > the memory for them might be leaked. Maybe this is fine depending on > the caller's memory context, but since it's only at most one string each > time, it's quite easy to just keep track of it so that we can release it > on the next. (Actually this sounds clever but fails pretty obviously if the caller does free the string, such as in a memory context reset. So I guess we have to just accept the potential leakage.) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La conclusión que podemos sacar de esos estudios es que no podemos sacar ninguna conclusión de ellos" (Tanenbaum)
On Wed, 27 Mar 2024 at 19:46, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Mar-27, Alvaro Herrera wrote: > > > I changed it so that the error messages are returned as translated > > phrases, and was bothered by the fact that if errors happen repeatedly, > > the memory for them might be leaked. Maybe this is fine depending on > > the caller's memory context, but since it's only at most one string each > > time, it's quite easy to just keep track of it so that we can release it > > on the next. > > (Actually this sounds clever but fails pretty obviously if the caller > does free the string, such as in a memory context reset. So I guess we > have to just accept the potential leakage.) Your changes look good, apart from the prverror stuff indeed. If you remove the prverror stuff again I think this is ready to commit.
On Wed, 27 Mar 2024 at 19:27, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I ended up reducing the two PG_TRY blocks to a single one. I see no > reason to split them up, and this way it looks more legible. I definitely agree this looks better. Not sure why I hadn't done that, maybe it wasn't possible in one of the earlier iterations of the API.
On 2024-Mar-28, Jelte Fennema-Nio wrote: > Your changes look good, apart from the prverror stuff indeed. If you > remove the prverror stuff again I think this is ready to commit. Great, thanks for looking. Pushed now, I'll be closing the commitfest entry shortly. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Always assume the user will do much worse than the stupidest thing you can imagine." (Julien PUYDT)
Hm, indri failed:

ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -g -O2 -fno-common -Werror -fvisibility=hidden -bundle -o dblink.dylib dblink.o -L../../src/port -L../../src/common -L../../src/interfaces/libpq -lpq -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk -L/opt/local/libexec/llvm-15/lib -L/opt/local/lib -L/opt/local/lib -L/opt/local/lib -L/opt/local/lib -Wl,-dead_strip_dylibs -Werror -fvisibility=hidden -bundle_loader ../../src/backend/postgres
Undefined symbols for architecture arm64:
  "_libintl_gettext", referenced from:
      _libpqsrv_cancel in dblink.o
      _libpqsrv_cancel in dblink.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[1]: *** [dblink.dylib] Error 1
make: *** [all-dblink-recurse] Error 2

-- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
On 2024-Mar-28, Alvaro Herrera wrote: > Undefined symbols for architecture arm64: > "_libintl_gettext", referenced from: > _libpqsrv_cancel in dblink.o > _libpqsrv_cancel in dblink.o > ld: symbol(s) not found for architecture arm64 > clang: error: linker command failed with exit code 1 (use -v to see invocation) > make[1]: *** [dblink.dylib] Error 1 > make: *** [all-dblink-recurse] Error 2 I just removed the _() from the new function. There's not much point in wasting more time on this, given that contrib doesn't have translation support anyway, and we're not using this in libpqwalreceiver. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Crear es tan difícil como ser libre" (Elsa Triolet)
Eh, kestrel has also failed[1], apparently every query after the large JOIN that this commit added as test fails with a statement timeout error. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-28%2016%3A01%3A14 -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "No deja de ser humillante para una persona de ingenio saber que no hay tonto que no le pueda enseñar algo." (Jean B. Say)
On Thu, 28 Mar 2024 at 17:34, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Eh, kestrel has also failed[1], apparently every query after the large > JOIN that this commit added as test fails with a statement timeout error. > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-28%2016%3A01%3A14 Ugh that's annoying, the RESET is timing out too I guess. That can hopefully be easily fixed by changing the new test to:

BEGIN;
SET LOCAL statement_timeout = '10ms';
select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
ROLLBACK;
On 2024-Mar-28, Jelte Fennema-Nio wrote: > Ugh that's annoying, the RESET is timing out too I guess. Hah, you're right, I can reproduce with a smaller timeout, and using SET LOCAL works as a fix. If we're doing that, why not reduce the timeout to 1ms? We don't need to wait extra 9ms ... -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ “Cuando no hay humildad las personas se degradan” (A. Christie)
On Thu, 28 Mar 2024 at 17:43, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Hah, you're right, I can reproduce with a smaller timeout, and using SET > LOCAL works as a fix. If we're doing that, why not reduce the timeout > to 1ms? We don't need to wait extra 9ms ... I think we don't really want to make the timeout too short. Otherwise the query might get cancelled before we push any query down to the FDW. I guess that means that for some slow machines even 10ms is not enough to make the test do the intended purpose. I'd keep it at 10ms, which seems long enough for normal systems, while still being pretty short.
Jelte Fennema-Nio <postgres@jeltef.nl> writes: > On Thu, 28 Mar 2024 at 17:43, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> Hah, you're right, I can reproduce with a smaller timeout, and using SET >> LOCAL works as a fix. If we're doing that, why not reduce the timeout >> to 1ms? We don't need to wait extra 9ms ... > I think we don't really want to make the timeout too short. Otherwise > the query might get cancelled before we push any query down to the > FDW. I guess that means that for some slow machines even 10ms is not > enough to make the test do the intended purpose. I'd keep it at 10ms, > which seems long enough for normal systems, while still being pretty > short. If the test fails both when the machine is too slow and when it's too fast, then there's zero hope of making it stable and we should just remove it. regards, tom lane
On 2024-Mar-28, Tom Lane wrote: > Jelte Fennema-Nio <postgres@jeltef.nl> writes: > > > I think we don't really want to make the timeout too short. Otherwise > > the query might get cancelled before we push any query down to the > > FDW. I guess that means that for some slow machines even 10ms is not > > enough to make the test do the intended purpose. I'd keep it at 10ms, > > which seems long enough for normal systems, while still being pretty > > short. > > If the test fails both when the machine is too slow and when it's > too fast, then there's zero hope of making it stable and we should > just remove it. It doesn't fail when it's too fast -- it's just that it doesn't cover the case we want to cover. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > On 2024-Mar-28, Tom Lane wrote: >> If the test fails both when the machine is too slow and when it's >> too fast, then there's zero hope of making it stable and we should >> just remove it. > It doesn't fail when it's too fast -- it's just that it doesn't cover > the case we want to cover. That's hardly better, because then you think you have test coverage but maybe you don't. Could we make this test bulletproof by using an injection point? If not, I remain of the opinion that we're better off without it. regards, tom lane
On Thu, 28 Mar 2024 at 19:03, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > It doesn't fail when it's too fast -- it's just that it doesn't cover > > the case we want to cover. > > That's hardly better, because then you think you have test > coverage but maybe you don't. Honestly, that seems quite a lot better. Instead of having randomly failing builds, you have a test that creates coverage 80+% of the time. And that also seems a lot better than having no coverage at all (which is what we had for the last 7 years since introduction of cancellations to postgres_fdw). It would be good to expand the comment in the test though saying that the test might not always cover the intended code path, due to timing problems. > Could we make this test bulletproof by using an injection point? > If not, I remain of the opinion that we're better off without it. Possibly, and if so, I agree that would be better than the currently added test. But I honestly don't feel like spending the time on creating such a test. And given 7 years have passed without someone adding any test for this codepath at all, I don't expect anyone else will either. If you both feel we're better off without the test, feel free to remove it. This was just some small missing test coverage that I noticed while working on this patch, that I thought I'd quickly address. I don't particularly care a lot about the specific test.
On Fri, Mar 29, 2024 at 09:17:55AM +0100, Jelte Fennema-Nio wrote: > On Thu, 28 Mar 2024 at 19:03, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Could we make this test bulletproof by using an injection point? > > If not, I remain of the opinion that we're better off without it. > > Possibly, and if so, I agree that would be better than the currently > added test. But I honestly don't feel like spending the time on > creating such a test. The SQL test is more representative of real applications, and it's way simpler to understand. In general, I prefer 6-line SQL tests that catch a problem 10% of the time over injection point tests that catch it 100% of the time. For low detection rate to be exciting, it needs to be low enough to have a serious chance of all buildfarm members reporting green for the bad commit. With ~115 buildfarm members running in the last day, 0.1% detection rate would have been low enough to bother improving, but 4% would be high enough to call it good.
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Great, thanks for looking. Pushed now, I'll be closing the commitfest > entry shortly. On my machine, headerscheck does not like this:

$ src/tools/pginclude/headerscheck --cplusplus
In file included from /tmp/headerscheck.4gTaW5/test.cpp:3:
./src/include/libpq/libpq-be-fe-helpers.h: In function 'char* libpqsrv_cancel(PGconn*, TimestampTz)':
./src/include/libpq/libpq-be-fe-helpers.h:393:10: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings]
  return "out of memory";
         ^~~~~~~~~~~~~~~
./src/include/libpq/libpq-be-fe-helpers.h:421:13: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings]
  error = "cancel request timed out";
          ^~~~~~~~~~~~~~~~~~~~~~~~~~

The second part of that could easily be fixed by declaring "error" as "const char *". As for the first part, can we redefine the whole function as returning "const char *"? (If not, this coding is very questionable anyway.) regards, tom lane
On 2024-Apr-03, Tom Lane wrote: > On my machine, headerscheck does not like this: > > $ src/tools/pginclude/headerscheck --cplusplus > In file included from /tmp/headerscheck.4gTaW5/test.cpp:3: > ./src/include/libpq/libpq-be-fe-helpers.h: In function 'char* libpqsrv_cancel(PGconn*, TimestampTz)': > ./src/include/libpq/libpq-be-fe-helpers.h:393:10: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings] > return "out of memory"; > ^~~~~~~~~~~~~~~ > ./src/include/libpq/libpq-be-fe-helpers.h:421:13: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings] > error = "cancel request timed out"; > ^~~~~~~~~~~~~~~~~~~~~~~~~~ > > The second part of that could easily be fixed by declaring "error" as > "const char *". As for the first part, can we redefine the whole > function as returning "const char *"? (If not, this coding is very > questionable anyway.) Yeah, this seems to work and I no longer get that complaint from headerscheck. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
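The shape of the fix can be illustrated with a simplified sketch (invented function, not the actual libpqsrv_cancel code): with a plain `char *` return type, both string-literal returns draw the C++ `-Wwrite-strings` warning that headerscheck reported; declaring the return type and the local variable as `const char *` resolves both sites.

```c
/* Simplified shape of the headerscheck fix: string literals have type
 * "const char[]" in C++, so a function returning them must be declared
 * to return "const char *", and any local holding one must be const. */
#include <stddef.h>

static const char *
cancel_error_sketch(int out_of_memory, int timed_out)
{
    const char *error = NULL;   /* was: char *error */

    if (out_of_memory)
        return "out of memory";
    if (timed_out)
        error = "cancel request timed out";
    return error;
}
```

Since the caller only reads the message (or passes it to ereport), the const-qualified return type costs nothing and compiles cleanly under both C and C++.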
Attachments
On Thu, 4 Apr 2024 at 10:45, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Yeah, this seems to work and I no longer get that complaint from > headerscheck. patch LGTM
[ from a week ago ] Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Hm, indri failed: > ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -g -O2 -fno-common -Werror -fvisibility=hidden -bundle -o dblink.dylib dblink.o -L../../src/port -L../../src/common -L../../src/interfaces/libpq -lpq -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk -L/opt/local/libexec/llvm-15/lib -L/opt/local/lib -L/opt/local/lib -L/opt/local/lib -L/opt/local/lib -Wl,-dead_strip_dylibs -Werror -fvisibility=hidden -bundle_loader ../../src/backend/postgres > Undefined symbols for architecture arm64: > "_libintl_gettext", referenced from: > _libpqsrv_cancel in dblink.o > _libpqsrv_cancel in dblink.o > ld: symbol(s) not found for architecture arm64 > clang: error: linker command failed with exit code 1 (use -v to see invocation) > make[1]: *** [dblink.dylib] Error 1 > make: *** [all-dblink-recurse] Error 2 Having just fixed the same issue for test_json_parser, I now realize what's going on there: dblink's link command doesn't actually mention any of the external libraries that we might need, such as libintl. You can get away with that on some platforms, but not macOS. It would probably be possible to fix that if anyone cared to. I'm not sufficiently excited about it to do so right now --- as you say, we don't support translation in contrib anyway. regards, tom lane