Обсуждение: pgsql: postgres_fdw: reestablish new connection if cached one is detect
postgres_fdw: reestablish new connection if cached one is detected as broken. In postgres_fdw, once remote connections are established, they are cached and re-used for subsequent queries and transactions. There can be some cases where those cached connections are unavaiable, for example, by the restart of remote server. In these cases, previously an error was reported and the query accessing to remote server failed if new remote transaction failed to start because the cached connection was broken. This commit improves postgres_fdw so that new connection is remade if broken connection is detected when starting new remote transaction. This is useful to avoid unnecessary failure of queries when connection is broken but can be reestablished. Author: Bharath Rupireddy, tweaked a bit by Fujii Masao Reviewed-by: Ashutosh Bapat, Tatsuhito Kasahara, Fujii Masao Discussion: https://postgr.es/m/CALj2ACUAi23vf1WiHNar_LksM9EDOWXcbHCo-fD4Mbr1d=78YQ@mail.gmail.com Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/32a9c0bdf493cf5fc029ab44a22384d547290eff Modified Files -------------- contrib/postgres_fdw/connection.c | 54 ++++++++++++++++++++------ contrib/postgres_fdw/expected/postgres_fdw.out | 48 +++++++++++++++++++++++ contrib/postgres_fdw/sql/postgres_fdw.sql | 41 +++++++++++++++++++ 3 files changed, 131 insertions(+), 12 deletions(-)
Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect
От
Michael Paquier
Дата:
Hi Fujii-san, On Tue, Oct 06, 2020 at 01:52:55AM +0000, Fujii Masao wrote: > postgres_fdw: reestablish new connection if cached one is detected as broken. > > In postgres_fdw, once remote connections are established, they are cached > and re-used for subsequent queries and transactions. There can be some > cases where those cached connections are unavaiable, for example, > by the restart of remote server. In these cases, previously an error was > reported and the query accessing to remote server failed if new remote > transaction failed to start because the cached connection was broken. > > This commit improves postgres_fdw so that new connection is remade > if broken connection is detected when starting new remote transaction. > This is useful to avoid unnecessary failure of queries when connection is > broken but can be reestablished. lorikeet is telling that the test introduced by this commit is unstable: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-10-06%2008%3A28%3A36 Some details: BEGIN; SELECT 1 FROM ft1 LIMIT 1; - ?column? ----------- - 1 -(1 row) - +ERROR: could not receive data from server: Software caused connection abort +CONTEXT: remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ -- Michael
Вложения
On 2020/10/07 11:13, Michael Paquier wrote: > Hi Fujii-san, > > On Tue, Oct 06, 2020 at 01:52:55AM +0000, Fujii Masao wrote: >> postgres_fdw: reestablish new connection if cached one is detected as broken. >> >> In postgres_fdw, once remote connections are established, they are cached >> and re-used for subsequent queries and transactions. There can be some >> cases where those cached connections are unavaiable, for example, >> by the restart of remote server. In these cases, previously an error was >> reported and the query accessing to remote server failed if new remote >> transaction failed to start because the cached connection was broken. >> >> This commit improves postgres_fdw so that new connection is remade >> if broken connection is detected when starting new remote transaction. >> This is useful to avoid unnecessary failure of queries when connection is >> broken but can be reestablished. > > lorikeet is telling that the test introduced by this commit is > unstable: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-10-06%2008%3A28%3A36 Thanks for letting me know this! > > Some details: > BEGIN; > SELECT 1 FROM ft1 LIMIT 1; > - ?column? > ----------- > - 1 > -(1 row) > - > +ERROR: could not receive data from server: Software caused connection abort > +CONTEXT: remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ This error means that new connection was successfully reestablished after the cached connection was terminated, and then the above connection error occurred when issuing "START TRANSACTION" command on that new connection. There seems no suspicious relevant log messages in the logfile. So I'm not sure why this error happened, yet. Per the previous discusson at [1], lorikeet sometimes seems to cause connection-relation failure in the regression test. So the cause of error that we faced today also may be lorikeet itself. [1] https://www.postgresql.org/message-id/CA+hUKGL3Son9iAeqgjPbXCpU_6hhZhw9X24uNO14mOC4bG0cCA@mail.gmail.com Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020/10/07 12:54, Fujii Masao wrote: > > > On 2020/10/07 11:13, Michael Paquier wrote: >> Hi Fujii-san, >> >> On Tue, Oct 06, 2020 at 01:52:55AM +0000, Fujii Masao wrote: >>> postgres_fdw: reestablish new connection if cached one is detected as broken. >>> >>> In postgres_fdw, once remote connections are established, they are cached >>> and re-used for subsequent queries and transactions. There can be some >>> cases where those cached connections are unavaiable, for example, >>> by the restart of remote server. In these cases, previously an error was >>> reported and the query accessing to remote server failed if new remote >>> transaction failed to start because the cached connection was broken. >>> >>> This commit improves postgres_fdw so that new connection is remade >>> if broken connection is detected when starting new remote transaction. >>> This is useful to avoid unnecessary failure of queries when connection is >>> broken but can be reestablished. >> >> lorikeet is telling that the test introduced by this commit is >> unstable: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-10-06%2008%3A28%3A36 > > Thanks for letting me know this! > >> >> Some details: >> BEGIN; >> SELECT 1 FROM ft1 LIMIT 1; >> - ?column? >> ----------- >> - 1 >> -(1 row) >> - >> +ERROR: could not receive data from server: Software caused connection abort >> +CONTEXT: remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ > > This error means that new connection was successfully reestablished > after the cached connection was terminated, and then the above connection > error occurred when issuing "START TRANSACTION" command on that > new connection. There seems no suspicious relevant log messages in the > logfile. So I'm not sure why this error happened, yet. > > Per the previous discusson at [1], lorikeet sometimes seems to cause > connection-relation failure in the regression test. So the cause of error > that we faced today also may be lorikeet itself. Since it's not good to keep the buildfarm member red, I will revert the commit unless I come up with something even after further investigation. My current just guess is that PQstatus(conn) doesn't indicate CONNECTION_BAD when the above error occurs, and which prevents new connection from being reestablished because of the following check. + if (PQstatus(entry->conn) != CONNECTION_BAD || + entry->xact_depth > 0 || + retry_conn) + PG_RE_THROW(); Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020/10/07 22:25, Fujii Masao wrote: > > > On 2020/10/07 12:54, Fujii Masao wrote: >> >> >> On 2020/10/07 11:13, Michael Paquier wrote: >>> Hi Fujii-san, >>> >>> On Tue, Oct 06, 2020 at 01:52:55AM +0000, Fujii Masao wrote: >>>> postgres_fdw: reestablish new connection if cached one is detected as broken. >>>> >>>> In postgres_fdw, once remote connections are established, they are cached >>>> and re-used for subsequent queries and transactions. There can be some >>>> cases where those cached connections are unavaiable, for example, >>>> by the restart of remote server. In these cases, previously an error was >>>> reported and the query accessing to remote server failed if new remote >>>> transaction failed to start because the cached connection was broken. >>>> >>>> This commit improves postgres_fdw so that new connection is remade >>>> if broken connection is detected when starting new remote transaction. >>>> This is useful to avoid unnecessary failure of queries when connection is >>>> broken but can be reestablished. >>> >>> lorikeet is telling that the test introduced by this commit is >>> unstable: >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-10-06%2008%3A28%3A36 >> >> Thanks for letting me know this! >> >>> >>> Some details: >>> BEGIN; >>> SELECT 1 FROM ft1 LIMIT 1; >>> - ?column? >>> ----------- >>> - 1 >>> -(1 row) >>> - >>> +ERROR: could not receive data from server: Software caused connection abort >>> +CONTEXT: remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ >> >> This error means that new connection was successfully reestablished >> after the cached connection was terminated, and then the above connection >> error occurred when issuing "START TRANSACTION" command on that >> new connection. There seems no suspicious relevant log messages in the >> logfile. So I'm not sure why this error happened, yet. >> >> Per the previous discusson at [1], lorikeet sometimes seems to cause >> connection-relation failure in the regression test. So the cause of error >> that we faced today also may be lorikeet itself. > > Since it's not good to keep the buildfarm member red, I will revert > the commit unless I come up with something even after further > investigation. > > My current just guess is that PQstatus(conn) doesn't indicate > CONNECTION_BAD when the above error occurs, and which > prevents new connection from being reestablished because of > the following check. > > + if (PQstatus(entry->conn) != CONNECTION_BAD || > + entry->xact_depth > 0 || > + retry_conn) > + PG_RE_THROW(); The error message in discussion is reported when recv() fails and errno=ECONNABORTED. As far as I read the code, pqReadData() marks the connection as CONNECTION_BAD when errno=ECONNRESET, but not when errno=ECONNABORTED. So since PQstatus(entry->conn) doesn't indicate CONNECTION_BAD in ECONNABORTED case, the above check is passed through, an error is re-thrown and new connection is not reestablished. Therefore, the easy fix is to make libpq mark the connection as CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET. But is it safe to do that? Thought? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020/10/08 0:48, Fujii Masao wrote: > > > On 2020/10/07 22:25, Fujii Masao wrote: >> >> >> On 2020/10/07 12:54, Fujii Masao wrote: >>> >>> >>> On 2020/10/07 11:13, Michael Paquier wrote: >>>> Hi Fujii-san, >>>> >>>> On Tue, Oct 06, 2020 at 01:52:55AM +0000, Fujii Masao wrote: >>>>> postgres_fdw: reestablish new connection if cached one is detected as broken. >>>>> >>>>> In postgres_fdw, once remote connections are established, they are cached >>>>> and re-used for subsequent queries and transactions. There can be some >>>>> cases where those cached connections are unavaiable, for example, >>>>> by the restart of remote server. In these cases, previously an error was >>>>> reported and the query accessing to remote server failed if new remote >>>>> transaction failed to start because the cached connection was broken. >>>>> >>>>> This commit improves postgres_fdw so that new connection is remade >>>>> if broken connection is detected when starting new remote transaction. >>>>> This is useful to avoid unnecessary failure of queries when connection is >>>>> broken but can be reestablished. >>>> >>>> lorikeet is telling that the test introduced by this commit is >>>> unstable: >>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-10-06%2008%3A28%3A36 >>> >>> Thanks for letting me know this! >>> >>>> >>>> Some details: >>>> BEGIN; >>>> SELECT 1 FROM ft1 LIMIT 1; >>>> - ?column? >>>> ----------- >>>> - 1 >>>> -(1 row) >>>> - >>>> +ERROR: could not receive data from server: Software caused connection abort >>>> +CONTEXT: remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ >>> >>> This error means that new connection was successfully reestablished >>> after the cached connection was terminated, and then the above connection >>> error occurred when issuing "START TRANSACTION" command on that >>> new connection. There seems no suspicious relevant log messages in the >>> logfile. So I'm not sure why this error happened, yet. >>> >>> Per the previous discusson at [1], lorikeet sometimes seems to cause >>> connection-relation failure in the regression test. So the cause of error >>> that we faced today also may be lorikeet itself. >> >> Since it's not good to keep the buildfarm member red, I will revert >> the commit unless I come up with something even after further >> investigation. >> >> My current just guess is that PQstatus(conn) doesn't indicate >> CONNECTION_BAD when the above error occurs, and which >> prevents new connection from being reestablished because of >> the following check. >> >> + if (PQstatus(entry->conn) != CONNECTION_BAD || >> + entry->xact_depth > 0 || >> + retry_conn) >> + PG_RE_THROW(); > > The error message in discussion is reported when recv() fails and > errno=ECONNABORTED. As far as I read the code, pqReadData() marks > the connection as CONNECTION_BAD when errno=ECONNRESET, > but not when errno=ECONNABORTED. So since PQstatus(entry->conn) > doesn't indicate CONNECTION_BAD in ECONNABORTED case, > the above check is passed through, an error is re-thrown and > new connection is not reestablished. > > Therefore, the easy fix is to make libpq mark the connection as > CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET. Patch attached. This patch also changes errcode_for_socket_access() so that it uses ERRCODE_CONNECTION_FAILURE rather than ERRCODE_INTERNAL_ERROR as sqlerrorcode in ECONNABORTED case like ECONNRESET. Is this sane? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Вложения
Fujii Masao <masao.fujii@oss.nttdata.com> writes: >> Therefore, the easy fix is to make libpq mark the connection as >> CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET. > Patch attached. This patch also changes errcode_for_socket_access() > so that it uses ERRCODE_CONNECTION_FAILURE rather than > ERRCODE_INTERNAL_ERROR as sqlerrorcode in ECONNABORTED case > like ECONNRESET. Is this sane? Some digging around on the net suggests that Windows will only return WSAECONNRESET for the case where it gets a TCP RST packet from the remote host. Other cases of TCP connection loss (e.g. timeout) are reported as WSAECONNABORTED. I think that our code generally only expects ECONNRESET to be reported for a lost connection. Based on this info, we clearly need to handle ECONNABORTED as being equivalent to that. I don't see any reason why we'd care to distinguish the two cases. So it seems like (a) we'd better make sure that ECONNABORTED is recognized anyplace that we're testing for ECONNRESET, and (b) that seems like a back-patchable bug fix, independently of the immediate issue. On the other hand, it seems a bit odd that we've not recognized this issue long since. Maybe the preference for ECONNABORTED only exists on a few Windows versions? I've not checked your patch in detail. regards, tom lane
I wrote: > I've not checked your patch in detail. Bread crumb for the archives: I started a -hackers thread about this: https://www.postgresql.org/message-id/flat/2621622.1602184554%40sss.pgh.pa.us regards, tom lane
Fujii Masao <masao.fujii@oss.nttdata.com> writes: >>> Therefore, the easy fix is to make libpq mark the connection as >>> CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET. So in the wake of commit fe27009cb, this is what lorikeet is doing: @@ -9028,9 +9028,7 @@ CALL terminate_backend_and_wait('fdw_retry_check'); SAVEPOINT s; SELECT 1 FROM ft1 LIMIT 1; -- should fail -ERROR: server closed the connection unexpectedly - This probably means the server terminated abnormally - before or while processing the request. +ERROR: could not receive data from server: Software caused connection abort CONTEXT: remote SQL command: SAVEPOINT s2 COMMIT; -- Clean up which is better --- the connection recovery is working --- but obviously still not a "pass". The only way to make this test pass as-is is to hack libpq so that it deems ECONNABORTED to be a server crash, which frankly I do not think is a good change. I don't see any principled reason to think that it means that rather than a network connectivity failure. What I've done instead as a stopgap is to adjust the test case to be insensitive to the exact error message text. Maybe there's a better way, but I can't think of anything besides having two (or more?) expected-output files. That would be quite unpalatable as things stand, though perhaps we could make it tolerable by splitting this test off into a second test script. Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very happy with it: * The control flow seems rather forced. I think it was designed on the assumption that reindenting the existing code is forbidden. Why not use an actual loop, instead of a goto? I also think that it's far from obvious that the loop isn't an infinite loop; it took me quite a while to glom onto the fact that the test inside the PG_CATCH avoids that by checking retry_conn. Maybe a comment would be enough to improve that, but I feel the control logic could stand a rethink. * The CATCH case is completely failing to clean up after itself. At the minimum, there has to be a FlushErrorState() there. I don't think it'd be terribly hard to drive this code to an error-stack-overflow PANIC. * As is so often true of proposed patches in which PG_CATCH is thought to be an adequate way to catch an error, this is just unbelievably fragile. It will break, and badly so, if it catches an error that is anything other than what it is expecting ... and it's not even particularly trying to verify that the error is what it's expecting. It might be okay, or at least a little bit harder to break, if it verified that the error's SQLSTATE is ERRCODE_CONNECTION_FAILURE. regards, tom lane
On 2020/10/11 9:16, Tom Lane wrote: > Fujii Masao <masao.fujii@oss.nttdata.com> writes: >>>> Therefore, the easy fix is to make libpq mark the connection as >>>> CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET. > > So in the wake of commit fe27009cb, Many thanks for the commit fe27009cb !! > this is what lorikeet is doing: > > @@ -9028,9 +9028,7 @@ > CALL terminate_backend_and_wait('fdw_retry_check'); > SAVEPOINT s; > SELECT 1 FROM ft1 LIMIT 1; -- should fail > -ERROR: server closed the connection unexpectedly > - This probably means the server terminated abnormally > - before or while processing the request. > +ERROR: could not receive data from server: Software caused connection abort > CONTEXT: remote SQL command: SAVEPOINT s2 > COMMIT; > -- Clean up > > which is better --- the connection recovery is working --- but > obviously still not a "pass". > > The only way to make this test pass as-is is to hack libpq so that > it deems ECONNABORTED to be a server crash, which frankly I do not > think is a good change. I don't see any principled reason to think > that it means that rather than a network connectivity failure. > > What I've done instead as a stopgap is to adjust the test case to be > insensitive to the exact error message text. For now I've not come up with better idea than current this fix. > Maybe there's a better > way, but I can't think of anything besides having two (or more?) > expected-output files. That would be quite unpalatable as things > stand, though perhaps we could make it tolerable by splitting this > test off into a second test script. > > Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very > happy with it: > > * The control flow seems rather forced. I think it was designed > on the assumption that reindenting the existing code is forbidden. > Why not use an actual loop, instead of a goto? I also think that > it's far from obvious that the loop isn't an infinite loop; it > took me quite a while to glom onto the fact that the test inside the > PG_CATCH avoids that by checking retry_conn. Maybe a comment would > be enough to improve that, but I feel the control logic could stand > a rethink. Isn't it simpler and easier-to-read to just reestablish new connection again in the retry case instead of using a loop because we retry only once? Patch attached. > > * The CATCH case is completely failing to clean up after itself. > At the minimum, there has to be a FlushErrorState() there. > I don't think it'd be terribly hard to drive this code to an > error-stack-overflow PANIC. You're right. Sorry I forgot to take into consideration this :( I fixed this issue in the attached patch. > > * As is so often true of proposed patches in which PG_CATCH is > thought to be an adequate way to catch an error, this is just > unbelievably fragile. It will break, and badly so, if it catches > an error that is anything other than what it is expecting ... > and it's not even particularly trying to verify that the error is > what it's expecting. It might be okay, or at least a little bit > harder to break, if it verified that the error's SQLSTATE is > ERRCODE_CONNECTION_FAILURE. "PQstatus()==CONNECTION_BAD" was checked for that purpose. But that's not enough? Anyway, in the patch, I changed the code so that sqlstate==ERRCODE_CONNECTION_FAILURE is checked. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Вложения
Fujii Masao <masao.fujii@oss.nttdata.com> writes: > On 2020/10/11 9:16, Tom Lane wrote: >> Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very >> happy with it: >> >> * The control flow seems rather forced. I think it was designed >> on the assumption that reindenting the existing code is forbidden. > Isn't it simpler and easier-to-read to just reestablish new connection again > in the retry case instead of using a loop because we retry only once? > Patch attached. Yeah, that seems better. >> * As is so often true of proposed patches in which PG_CATCH is >> thought to be an adequate way to catch an error, this is just >> unbelievably fragile. It will break, and badly so, if it catches >> an error that is anything other than what it is expecting ... >> and it's not even particularly trying to verify that the error is >> what it's expecting. It might be okay, or at least a little bit >> harder to break, if it verified that the error's SQLSTATE is >> ERRCODE_CONNECTION_FAILURE. > "PQstatus()==CONNECTION_BAD" was checked for that purpose. But that's not enough? > Anyway, in the patch, I changed the code so that sqlstate==ERRCODE_CONNECTION_FAILURE > is checked. The reason I'm concerned about this is that we could potentially throw an elog(ERROR) somewhere between return from libpq and the expected ereport call in pgfdw_report_error. It wouldn't be wise to ignore such a condition (it might be out-of-memory or some such). Personally I'd code the check with explicit tests for *both* PQstatus()==CONNECTION_BAD and sqlstate==ERRCODE_CONNECTION_FAILURE, rather than just asserting that the latter must imply the former. By my reading, pgfdw_report_error will report ERRCODE_CONNECTION_FAILURE for any libpq-originated error condition, but not all of those will set CONNECTION_BAD. Other than that, this seems reasonable by eyeball (I didn't do any testing). regards, tom lane
On 2020/10/14 3:34, Tom Lane wrote: > Fujii Masao <masao.fujii@oss.nttdata.com> writes: >> On 2020/10/11 9:16, Tom Lane wrote: >>> Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very >>> happy with it: >>> >>> * The control flow seems rather forced. I think it was designed >>> on the assumption that reindenting the existing code is forbidden. > >> Isn't it simpler and easier-to-read to just reestablish new connection again >> in the retry case instead of using a loop because we retry only once? >> Patch attached. > > Yeah, that seems better. > >>> * As is so often true of proposed patches in which PG_CATCH is >>> thought to be an adequate way to catch an error, this is just >>> unbelievably fragile. It will break, and badly so, if it catches >>> an error that is anything other than what it is expecting ... >>> and it's not even particularly trying to verify that the error is >>> what it's expecting. It might be okay, or at least a little bit >>> harder to break, if it verified that the error's SQLSTATE is >>> ERRCODE_CONNECTION_FAILURE. > >> "PQstatus()==CONNECTION_BAD" was checked for that purpose. But that's not enough? >> Anyway, in the patch, I changed the code so that sqlstate==ERRCODE_CONNECTION_FAILURE >> is checked. > > The reason I'm concerned about this is that we could potentially throw an > elog(ERROR) somewhere between return from libpq and the expected ereport > call in pgfdw_report_error. It wouldn't be wise to ignore such a > condition (it might be out-of-memory or some such). Understood. > Personally I'd code the check with explicit tests for *both* > PQstatus()==CONNECTION_BAD and sqlstate==ERRCODE_CONNECTION_FAILURE, > rather than just asserting that the latter must imply the former. > By my reading, pgfdw_report_error will report ERRCODE_CONNECTION_FAILURE > for any libpq-originated error condition, but not all of those will > set CONNECTION_BAD. Yes, this makes sense. I updated the patch so that both sqlstate and PQstatus() are checked. Patch attached. > Other than that, this seems reasonable by eyeball (I didn't do any > testing). Thanks! Barring any objection, I will commit it. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Вложения
On 2020/10/15 16:21, Fujii Masao wrote: > > > On 2020/10/14 3:34, Tom Lane wrote: >> Fujii Masao <masao.fujii@oss.nttdata.com> writes: >>> On 2020/10/11 9:16, Tom Lane wrote: >>>> Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very >>>> happy with it: >>>> >>>> * The control flow seems rather forced. I think it was designed >>>> on the assumption that reindenting the existing code is forbidden. >> >>> Isn't it simpler and easier-to-read to just reestablish new connection again >>> in the retry case instead of using a loop because we retry only once? >>> Patch attached. >> >> Yeah, that seems better. >> >>>> * As is so often true of proposed patches in which PG_CATCH is >>>> thought to be an adequate way to catch an error, this is just >>>> unbelievably fragile. It will break, and badly so, if it catches >>>> an error that is anything other than what it is expecting ... >>>> and it's not even particularly trying to verify that the error is >>>> what it's expecting. It might be okay, or at least a little bit >>>> harder to break, if it verified that the error's SQLSTATE is >>>> ERRCODE_CONNECTION_FAILURE. >> >>> "PQstatus()==CONNECTION_BAD" was checked for that purpose. But that's not enough? >>> Anyway, in the patch, I changed the code so that sqlstate==ERRCODE_CONNECTION_FAILURE >>> is checked. >> >> The reason I'm concerned about this is that we could potentially throw an >> elog(ERROR) somewhere between return from libpq and the expected ereport >> call in pgfdw_report_error. It wouldn't be wise to ignore such a >> condition (it might be out-of-memory or some such). > > Understood. > > >> Personally I'd code the check with explicit tests for *both* >> PQstatus()==CONNECTION_BAD and sqlstate==ERRCODE_CONNECTION_FAILURE, >> rather than just asserting that the latter must imply the former. >> By my reading, pgfdw_report_error will report ERRCODE_CONNECTION_FAILURE >> for any libpq-originated error condition, but not all of those will >> set CONNECTION_BAD. > > Yes, this makes sense. I updated the patch so that both sqlstate and > PQstatus() are checked. Patch attached. > > >> Other than that, this seems reasonable by eyeball (I didn't do any >> testing). > > Thanks! Barring any objection, I will commit it. Pushed. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION