回复: Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?

Поиск
Список
Период
Сортировка
От Thomas wen
Тема 回复: Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?
Дата
Msg-id SEZPR01MB4575F1A160B20047023891DE86B2A@SEZPR01MB4575.apcprd01.prod.exchangelabs.com
обсуждение исходный текст
Ответ на Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?  (yuansong <yyuansong@126.com>)
Ответы Re:Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?  (yuansong <yyuansong@126.com>)
Список pgsql-hackers
Hi yuansong
      there is connnection pool path (https://commitfest.postgresql.org/34/3043/) ,but it  has been dormant for few years,You can check this patch to get what you want to need

发件人: yuansong <yyuansong@126.com>
发送时间: 2023年11月13日 17:13
收件人: Laurenz Albe <laurenz.albe@cybertec.at>
抄送: pgsql-hackers@lists.postgresql.org <pgsql-hackers@lists.postgresql.org>
主题: Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?
 

Enhancing the overall fault tolerance of the entire system for this feature is quite important. No one can avoid bugs, and I don't believe that this approach is a more advanced one. It might be worth considering adding it to the roadmap so that interested parties can conduct relevant research.

The current major issue is that when one process crashes, resetting all connections has a significant impact on other connections. Is it possible to only disconnect the crashed connection and have the other connections simply roll back the current transaction without reconnecting? Perhaps this problem can be mitigated through the use of a connection pool.

If we want to implement this feature, would it be sufficient to clean up or restore the shared memory and disk changes caused by the crashed backend? Besides clearing various known locks, what else needs to be changed? Does anyone have any insights? Please help me. Thank you.







At 2023-11-13 13:53:29, "Laurenz Albe" <laurenz.albe@cybertec.at> wrote: >On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote: >> yuansong <yyuansong@126.com> writes: >> > In PostgreSQL, when a backend process crashes, it can cause other backend >> > processes to also require a restart, primarily to ensure data consistency. >> > I understand that the correct approach is to analyze and identify the >> > cause of the crash and resolve it. However, it is also important to be >> > able to handle a backend process crash without affecting the operation of >> > other processes, thus minimizing the scope of negative impact and >> > improving availability. To achieve this goal, could we mimic the Oracle >> > process by introducing a "pmon" process dedicated to rolling back crashed >> > process transactions and performing resource cleanup? I wonder if anyone >> > has attempted such a strategy or if there have been previous discussions >> > on this topic. >> >> The reason we force a database-wide restart is that there's no way to >> be certain that the crashed process didn't corrupt anything in shared >> memory. (Even with the forced restart, there's a window where bad >> data could reach disk before we kill off the other processes that >> might write it. But at least it's a short window.) "Corruption" >> here doesn't just involve bad data placed into disk buffers; more >> often it's things like unreleased locks, which would block other >> processes indefinitely. >> >> I seriously doubt that anything like what you're describing >> could be made reliable enough to be acceptable. "Oracle does >> it like this" isn't a counter-argument: they have a much different >> (and non-extensible) architecture, and they also have an army of >> programmers to deal with minutiae like undoing resource acquisition. >> Even with that, you'd have to wonder about the number of bugs >> existing in such necessarily-poorly-tested code paths. > >Yes. >I think that PostgreSQL's approach is superior: rather than investing in >code to mitigate the impact of data corruption caused by a crash, invest >in quality code that doesn't crash in the first place. > >Euphemistically naming a crash "ORA-600 error" seems to be part of >their strategy. > >Yours, >Laurenz Albe >

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: mxid_age() and age(xid) appear undocumented
Следующее
От: vignesh C
Дата:
Сообщение: Re: pg_upgrade and logical replication