Re: OOM in libpq and infinite loop with getCopyStart()
From:        Michael Paquier
Subject:     Re: OOM in libpq and infinite loop with getCopyStart()
Date:
Msg-id:      CAB7nPqTDQN4PvPsQax3O5-NDY2KouCVZEiwExEcpH4qNdnV6ow@mail.gmail.com
In reply to: Re: OOM in libpq and infinite loop with getCopyStart()  (Michael Paquier <michael.paquier@gmail.com>)
Responses:   Re: OOM in libpq and infinite loop with getCopyStart()  (Michael Paquier <michael.paquier@gmail.com>)
List:        pgsql-hackers
On Wed, Apr 6, 2016 at 3:09 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Apr 2, 2016 at 12:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wrote:
>>> So the core of my complaint is that we need to fix things so that, whether
>>> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
>>> better consider the behavior when we cannot), ...
>>
>> BTW, the real Achilles' heel of any attempt to ensure sane behavior at
>> the OOM limit is this possibility of being unable to create a PGresult
>> with which to inform the client that we failed.
>>
>> I wonder if we could make things better by keeping around an emergency
>> backup PGresult struct.  Something along these lines:
>>
>> 1. Add a field "PGresult *emergency_result" to PGconn.
>>
>> 2. At the very beginning of any PGresult-returning libpq function, check
>> to see if we have an emergency_result, and if not make one, ensuring
>> there's room in it for a reasonable-size error message; or maybe even
>> preload it with "out of memory" if we assume that's the only condition
>> it'll ever be used for.  If malloc fails at this point, just return NULL
>> without doing anything or changing any libpq state.  (Since a NULL result
>> is documented as possibly caused by OOM, this isn't violating any API.)
>>
>> 3. Subsequent operations never touch the emergency_result unless we're
>> up against an OOM, but it can be used to return a failure indication
>> to the client so long as we leave libpq in a state where additional
>> calls to PQgetResult would return NULL.
>>
>> Basically this shifts the point where an unreportable OOM could happen
>> from somewhere in the depths of libpq to the very start of an operation,
>> where we're presumably in a clean state and OOM failure doesn't leave
>> us with a mess we can't clean up.

Sorry for the late reply here. I am looking back at this thing more seriously. So, if I understand that correctly:
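If I read steps 1-3 right, the reservation scheme could be sketched as a minimal stand-alone simulation. The types and function names below are simplified, hypothetical stand-ins, not libpq's actual internals:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for libpq's internal types (illustrative only). */
typedef struct PGresult
{
    char        errmsg[64];     /* preloaded error text */
} PGresult;

typedef struct PGconn
{
    PGresult   *emergency_result;   /* step 1: reserved backup result */
} PGconn;

/*
 * Step 2: at the very start of any PGresult-returning call, make sure the
 * emergency result exists, preloading it with "out of memory".  Returns 0
 * if even this small allocation fails, in which case the caller returns
 * NULL without changing any state.
 */
static int
ensure_emergency_result(PGconn *conn)
{
    if (conn->emergency_result == NULL)
    {
        conn->emergency_result = malloc(sizeof(PGresult));
        if (conn->emergency_result == NULL)
            return 0;
        strcpy(conn->emergency_result->errmsg, "out of memory");
    }
    return 1;
}

/*
 * Step 3: on an OOM deep inside the operation, hand back the reserved
 * result instead of failing to report anything; it gets re-reserved at
 * the start of the next call.
 */
static PGresult *
use_emergency_result(PGconn *conn)
{
    PGresult   *res = conn->emergency_result;

    conn->emergency_result = NULL;
    return res;
}
```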
As the emergency result is pre-allocated, it guards against any future OOM error: we can always fall back to it when a problem occurs, so this would indeed help. And as it is independent of the rest of pg_conn's state, any subsequent call to a PQ* routine returning a PGresult would leave the connection in the same state.

On top of this emergency code path, don't you think we also need a flag that is switched on once an OOM error is hit? On a code path where an OOM happens, the flag is set; any subsequent call to a routine returning a PGresult then checks the flag, resets it, and returns the emergency result. In the end, it seems to me that the places to check the emergency flag are the beginnings of PQgetResult and PQexecFinish. In the case of PQexecFinish(), pending results would be cleared the next time PQexecStart is called. For PQgetResult, this gives the application the possibility to report the OOM properly. However, successive calls to PQgetResult() would still make the application wait indefinitely for more data if it does not catch the error.

Another thing that I think needs to be patched, by the way, is libpqrcv_PQexec, so that we check for any errors caught along the way when calling PQgetResult multiple times; otherwise this code path could remain stuck in an infinite wait.

As a result, it seems to me that this offers no real way to completely fix applications doing something like this:

    PQsendQuery(conn);
    for (;;)
    {
        while (PQisBusy(conn))
        {
            /* wait here for some event */
        }
        result = PQgetResult(conn);
        if (!result)
            break;
    }

The issue is that we would wait for a NULL result to be received from PQgetResult, but even with this patch, asyncStatus remaining set to PGASYNC_BUSY would make libpq wait forever in pqWait for data that will never show up.
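The single-flag idea described above could look roughly like this; the struct fields and the function name are hypothetical simplifications, not libpq code:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins, not libpq's real structures. */
typedef struct PGresult
{
    const char *errmsg;
} PGresult;

typedef struct PGconn
{
    int         oom_flag;           /* raised where the OOM actually hit */
    PGresult   *emergency_result;   /* pre-allocated, see upthread */
} PGconn;

/*
 * Sketch of the check at the beginning of PQgetResult: if an OOM was
 * recorded, reset the flag and return the reserved result instead of
 * waiting for protocol data that will never arrive.
 */
static PGresult *
pqGetResultSketch(PGconn *conn)
{
    if (conn->oom_flag)
    {
        conn->oom_flag = 0;             /* consumed on first report */
        return conn->emergency_result;  /* report OOM to the caller */
    }
    /* ... normal path: wait for input, parse a message, etc. ... */
    return NULL;
}
```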
We could add a new asyncStatus value to be able to skip cleanly the loops on asyncStatus == PGASYNC_BUSY, and make PQisBusy smarter, but this would actually break the semantics of calling PQgetResult() successively, if I am following the last posts of this thread correctly. Another, perhaps better, idea is to add some extra state to pg_conn, as follows:

    bool        is_emergency;
    PGresult   *emergency_result;
    bool        is_emergency_consumed;

When an OOM shows up, is_emergency is set to true. A first call of PQgetResult then returns the PGresult with the OOM in emergency_result, setting is_emergency_consumed to true and switching is_emergency back to false. A second call to PQgetResult enforces the consumption of any remaining results without waiting for them, at the end returning NULL to the caller and resetting is_emergency_consumed to false. That differs from the extra failure state PGASYNC_COPY_FAILED in that it does not break PQisBusy(). Thoughts?
-- 
Michael
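The two-call consumption scheme proposed above can be simulated with a minimal state machine; again, the types and the function name are hypothetical stand-ins, not libpq's implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins, not libpq's real structures. */
typedef struct PGresult
{
    const char *errmsg;
} PGresult;

typedef struct PGconn
{
    int         is_emergency;           /* set where the OOM happened */
    PGresult   *emergency_result;       /* pre-allocated OOM result */
    int         is_emergency_consumed;  /* OOM already reported once */
} PGconn;

/*
 * First call after an OOM: report the emergency result and remember that
 * it was consumed.  Second call: discard any pending results without
 * waiting for them and return NULL so the caller's result loop ends.
 */
static PGresult *
pqGetResultSketch(PGconn *conn)
{
    if (conn->is_emergency)
    {
        conn->is_emergency = 0;
        conn->is_emergency_consumed = 1;
        return conn->emergency_result;
    }
    if (conn->is_emergency_consumed)
    {
        /* drop pending results here instead of blocking on them */
        conn->is_emergency_consumed = 0;
        return NULL;
    }
    /* ... normal path ... */
    return NULL;
}
```

Note how the second call returns NULL without touching the socket, which is what lets the PQgetResult loop shown earlier terminate instead of hanging in pqWait.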