Re: Crash during backend start when low on memory

Поиск
Список
Период
Сортировка
От Mats Kindahl
Тема Re: Crash during backend start when low on memory
Дата
Msg-id CA+144260BBLj2JQ93CxEbDvULkW9M2GUWvKcVOujEwByALo49g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Crash during backend start when low on memory  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Список pgsql-bugs
On Fri, Jan 13, 2023 at 12:58 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2023-Jan-13, Mats Kindahl wrote:

> I have an improved patch based on raising an error when malloc fails. I
> used the version of mallocfail that is available at
> https://github.com/ralight/mallocfail which allows you to run the program
> with incrementally failing memory allocations but since the count is around
> 6886 or so when the postmaster starts to allow backend starts and other
> processes are starting in between, this count is not always accurate and
> something "smarter" is needed to get a reliable failure test.

Hmm, but I'm not sure I like what you did to BackendStartup:

@@ -4054,14 +4073,7 @@ BackendStartup(Port *port)
         * Create backend data structure.  Better before the fork() so we can
         * handle failure cleanly.
         */
-       bn = (Backend *) malloc(sizeof(Backend));
-       if (!bn)
-       {
-               ereport(LOG,
-                               (errcode(ERRCODE_OUT_OF_MEMORY),
-                                errmsg("out of memory")));
-               return STATUS_ERROR;
-       }
+       CHECKED_CALL(bn = malloc(sizeof(Backend)));

With this code, now we will have postmaster raise ERROR if this malloc
fails, rather than returning STATUS_ERROR as previously.  How does
postmaster react to this?  I think it won't go very well.

Well.. this error is caught in the caller using this code:

PG_TRY();
{
  port = ConnCreate(ListenSocket[i]);
  BackendStartup(port);
}
PG_FINALLY();
{
  /*
   * We no longer need the open socket or port structure
   * in this process, and definitely not if we failed
   * spawning a backend.
   */
  if (port)
  {
     if (port->sock != PGINVALID_SOCKET)
     StreamClose(port->sock);
     ConnFree(port);
  }
}
PG_END_TRY();

Unless I am mistaken, this should mean that the error message is printed, the streams closed and memory free'ed, and PostmasterMain continues as before. It could just be logged using LOG level, but an ERROR stands out better and is searchable so it is easy to surface in systems monitoring the server. (Digging for specific log messages is cumbersome, just printing out all ERROR messages in the log is easy.)


(Stylistically, I'd rather have the test and corresponding reaction at
each call site rather than using a macro to hide it.  I especially don't
like the fact that the errdetail you propose would report a line of C
code to the user, rather than an intelligible explanation of what that
code is trying to do.)

I agree this is ugly, but since this was rare and the error message is mostly to understand what went wrong to be able to debug it, I thought this would be sufficient for that. It is also easier to maintain. Trivial to replace.


> It is quite straightforward to test it by attaching a debugger and setting
> the return value of malloc, but to be able to have some "smarter" failures
> only when, for example, starting a backend it would be necessary to have
> something better integrated with the postgres code. Being able to attach a
> uprobe using bpftrace and override the return value would be the most
> elegant solution, but that is currently not possible, so considering
> whether to add something like the mallocfail above to the regular testing.

Hmm, I'm not sure that this type of testing is something to do on an
ongoing basis.

Me neither, which is why I didn't spend time on it right now. The argument for doing it is that it would be good to have some test that can at least be easily used to check that it behaves correctly on startup even if memory failures occur.
 

> Note: there are a bunch of other core dumps that occur during startup
> before the postmaster is ready to accept connections, but I think that is
> less of a problem since it fails during startup before the system is ready
> to accept connections.

That sounds reasonable.  The service watcher should restart postmaster
appropriately in that case (be it old SysV init, systemd, your container
management layer, or whatever.)

Yes, that was my thinking as well. Core dumps are ugly, but provide useful information to figure out where things went wrong.
 

--
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Ellos andaban todos desnudos como su madre los parió, y también las mujeres,
aunque no vi más que una, harto moza, y todos los que yo vi eran todos
mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: BUG #17698: On SIGTERM, psql terminates, but leaves the statement running