Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds
Дата
Msg-id 20240209000734.st2xwkfsrz7izbnx@awork3.anarazel.de
обсуждение исходный текст
Ответ на Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-bugs
Hi,

On 2024-02-08 18:50:32 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think we ought to understand *why* we are getting the "Too many open
> > files". The AcquireExternalFD() in CreateWaitEventSet() should prevent
> > that.
>
> Actually, I think the AcquireExternalFD() in CreateWaitEventSet() is
> *causing* that and needs to be removed.
>
> What is happening in Alexander's new example is that we are doing
> AcquireExternalFD() for each postgres_fdw connection
> (cf. libpqsrv_connect in libpq-be-fe-helpers.h), and the example
> is tuned to bring that exactly up to the limit of what
> AcquireExternalFD() allows.  Then the next WaitLatchOrSocket call
> fails, because it does
>
>     WaitEventSet *set = CreateWaitEventSet(CurrentResourceOwner, 3);
>
> Then when pgfdw_abort_cleanup tries to clean up the connections'
> state, it needs to do WaitLatchOrSocket again, and that fails again,
> and we PANIC because we're already in abort state.
>
> Since WaitLatchOrSocket is going to free this WaitEventSet before it
> returns, it's not apparent to me why we need to count it as a
> long-lived FD: we could just as well assume that it can slide in under
> the NUM_RESERVED_FDS limit.

Well, the AcquireExternalFD() is in more general code, that's also used for
long-lived WaitEventSets - for those it's the right thing to count it as a
long lived FD.


> Or perhaps use ReserveExternalFD instead of AcquireExternalFD.  We'd need
> some API extension to tell latch.c to do that, but that doesn't seem hard.
> (Unless we could consider that all WaitEventSets should use
> ReserveExternalFD?  Not sure I want to argue for that though.)

Yea, I don't think we want that.


> I guess a third possibility is that WaitLatchOrSocket could just
> permanently hang onto the WaitEventSet once it's got one.

Right now we don't support changing the socket FD associated with the WES, so
that'd not easily work. We've been talking about adding support for that for a
while though.


I might be missing something here, but leaving the concrete crash aside, why
is it ok for pgfdw_get_cleanup_result() etc to block during abort processing?
If I read the code right, we can end up waiting for up to 2x30s for each
connection, and there can be many connections.  The code also has a bunch of
memory allocations, so if we are in abort processing after an out-of-memory
error, we can easily cause failures again as well.

Greetings,

Andres Freund



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds