Discussion: Re: pgsql: Teach DSM registry to ERROR if attaching to an uninitialized entry
On Wed, Nov 12, 2025 at 3:31 PM Nathan Bossart <nathan@postgresql.org> wrote:
> Teach DSM registry to ERROR if attaching to an uninitialized entry.
>
> If DSM entry initialization fails, backends could try to use an
> uninitialized DSM segment, DSA, or dshash table (since the entry is
> still added to the registry). To fix, keep track of whether
> initialization completed, and ERROR if a backend tries to attach to
> an uninitialized entry. We could instead retry initialization as
> needed, but that seemed complicated, error prone, and unlikely to
> help most cases. Furthermore, such problems probably indicate a
> coding error.

Having read the thread that led to this commit, I agree that this is an improvement, but it still doesn't seem like a great situation to me. Maybe I'm misunderstanding, but it seems like once the initialization has failed, there's no way to retry: you can't retry the initialization on the existing segment, and you can't delete it and create a new segment. If so, that means your only option for restoring proper function after an initialization failure is to bounce the entire server.

Now, perhaps the chances of an initialization failure in practice are quite low. Maybe a typical initialization function won't do anything that could fail. But it just seems weird to me to design something that thinks errors are likely enough that it's worth having some amount of mechanism to deal with them, but unlikely enough that it's not worth making that mechanism more robust.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 20, 2025 at 10:44:31AM -0500, Robert Haas wrote:
> Now, perhaps the chances of an initialization failure in practice are
> quite low. Maybe a typical initialization function won't do anything
> that could fail. But it just seems weird to me to design something
> that thinks errors are likely enough that it's worth having some
> amount of mechanism to deal with them, but unlikely enough that it's
> not worth making that mechanism more robust.

Well, I will attempt to convince you that that is indeed the case, although I'm not sure I will succeed...

ISTM that besides pure coding errors, the most likely reasons for an ERROR during DSM segment initialization are things like out-of-memory, too many LWLock tranches registered, etc. In those cases, do we want every single backend that tries to attach to the segment to retry and most likely fail again? And what if the initialization callback completed halfway before failing? Do we expect every extension author to carefully craft their callbacks to handle that?

On the other hand, if we don't handle ERRORs and someone does run into an OOM or whatnot, do we want other backends to attach and use the uninitialized segment (leading to things like seg-faults)? Tracking whether an entry was initialized seems like cheap insurance against these problems.

Finally, the only report from the field about any of this involves SQLsmith allocating tons of LWLock tranches via test_dsa, which has been fixed independently. So I have no evidence to suggest that we need to make it more robust. In fact, without this one arguably contrived report, I probably wouldn't have done anything at all.

I'll grant you that this is a judgment call. I tried to find a good middle ground, but I would be unsurprised to learn that other committers would have done things differently.

--
nathan
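[Editor's note: for readers less familiar with the API under discussion, the sketch below shows roughly how an extension hands an initialization callback to GetNamedDSMSegment(). It is only an illustration of the failure mode Nathan describes; MyExtState and the myext_* names are invented for the example and do not come from the patch or any real extension.]

/*
 * Illustrative sketch only: MyExtState and the myext_* names are made up
 * for this example.
 */
#include "postgres.h"
#include "storage/dsm_registry.h"

typedef struct MyExtState
{
    int         counter;
    /* ... locks, queues, etc. ... */
} MyExtState;

static void
myext_init(void *ptr)
{
    MyExtState *state = (MyExtState *) ptr;

    /*
     * Anything here can ERROR (out of memory, no LWLock tranche IDs left,
     * ...), leaving the registry entry behind in a half-initialized state.
     */
    state->counter = 0;
}

static MyExtState *
myext_attach(void)
{
    bool        found;

    /*
     * Before the commit above, later backends could attach to such a
     * half-initialized entry; now they ERROR instead.
     */
    return (MyExtState *) GetNamedDSMSegment("myext",
                                             sizeof(MyExtState),
                                             myext_init,
                                             &found);
}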
On Thu, Nov 20, 2025 at 11:12 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> ISTM that besides pure coding errors, the most likely reasons for an ERROR
> during DSM segment initialization are things like out-of-memory, too many
> LWLock tranches registered, etc. In those cases, do we want every single
> backend that tries to attach to the segment to retry and most likely fail
> again? And what if the initialization callback completed halfway before
> failing? Do we expect every extension author to carefully craft their
> callbacks to handle that?

Well, I don't want to pretend like I know all the answers here. I do think it's pretty clear that making it the initialization callback's job to deal with a partially-failed initialization would be crazy. What I'd be inclined to do after a failure is tear down the entire segment. Even if there's no mechanism to retry, you're saving the memory that the segment would have consumed. And maybe there is a mechanism to retry. If these initialization callbacks are cheap enough, maybe having every backend try is not really a big deal -- maybe it's good, insofar as an interactive session would be more likely to see an error indicating the real cause of failure than a fake error that basically tells you at some point in the past something went wrong. If repeated failures are too painful, another idea could be to just mark the entry as being in a failed state (maybe with a timestamp) and let the extension decide when it wants to press the reset button.

Or maybe that's all too complicated for too little utility. I'm not sure. My perception is that this is setting a significantly lower standard for error tolerance than what we typically seek to achieve, but sometimes my perceptions are wrong, and sometimes better options are hard to come by.

> On the other hand, if we don't handle ERRORs and someone does run into an
> OOM or whatnot, do we want other backends to attach and use the uninitialized
> segment (leading to things like seg-faults)? Tracking whether an entry was
> initialized seems like cheap insurance against these problems.

That's fair.

> Finally, the only report from the field about any of this involves SQLsmith
> allocating tons of LWLock tranches via test_dsa, which has been fixed
> independently. So I have no evidence to suggest that we need to make it
> more robust. In fact, without this one arguably contrived report, I
> probably wouldn't have done anything at all.

Yeah, I'm not clear who is using this or how, so that makes it hard to judge.

> I'll grant you that this is a judgment call. I tried to find a good
> middle ground, but I would be unsurprised to learn that other committers
> would have done things differently.

Such is life!

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 20, 2025 at 01:18:56PM -0500, Robert Haas wrote:
> What I'd be inclined to do after a failure is tear down the entire
> segment. Even if there's no mechanism to retry, you're saving the
> memory that the segment would have consumed.

Unpinning/detaching the segment/DSA/dshash table and deleting the DSM registry entry in a PG_CATCH block scares me a little, but it might be doable.

> Or maybe that's all too complicated for too little utility. I'm not sure.

That's ultimately where I landed.

> My perception is that this is setting a significantly lower standard for
> error tolerance than what we typically seek to achieve, but sometimes my
> perceptions are wrong, and sometimes better options are hard to come by.

Another thing that might be subconsciously guiding my decisions here is the existing behavior when a shmem request/startup hook ERRORs (server startup fails). I'd expect DSM registry users to be doing similar things in their initialization callbacks, and AFAIK this behavior hasn't been a source of complaints.

--
nathan
On Thu, Nov 20, 2025 at 6:09 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Unpinning/detaching the segment/DSA/dshash table and deleting the DSM
> registry entry in a PG_CATCH block scares me a little, but it might be
> doable.

It seems a bit weird to be doing explicit unpinning in a PG_CATCH block. Ideally you'd want to postpone the pinning until initialization has succeeded, so that if you fail before that, transaction cleanup takes care of it automatically. Alternatively, destroying what existed before could be deferred until later, when an as-yet-unfailed transaction stumbles across the tombstone.

> Another thing that might be subconsciously guiding my decisions here is the
> existing behavior when a shmem request/startup hook ERRORs (server startup
> fails). I'd expect DSM registry users to be doing similar things in their
> initialization callbacks, and AFAIK this behavior hasn't been a source of
> complaints.

What bugs me here is the fact that you have a perfectly good working server that you can't "salvage". If the server had failed at startup, that would have sucked, but you would have been forced into correcting the problem (or giving up on the extension) and restarting, so once the server gets up and running, it's always in a good state. With this mechanism, you can get a running server that's stuck in this failed-initialization state, and there's no way for the DBA to do anything except by bouncing the entire server. That seems like it could be really frustrating.

Now, if it's the case that resetting this mechanism wouldn't fundamentally fix anything because the reinitialization attempt would inevitably fail for the same reason, then I suppose it's not so bad, but I'm not sure that would always be true. I just feel like one-way state transitions from good->bad are undesirable. One way of viewing log levels like ERROR, FATAL, and PANIC is that they force you to do a reset of the transaction, session, or entire server to get back to a good state, but here you just get stuck in the bad one.

Am I worrying too much? Possibly! But as I said to David on another thread this morning, it's better to worry on pgsql-hackers before any problem happens than to start worrying after something bad happens in a customer situation.

--
Robert Haas
EDB: http://www.enterprisedb.com
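[Editor's note: a minimal sketch of the "pin only after the callback succeeds" ordering Robert describes. This is not the committed code; the function name and shape are invented, though the dsm_* calls are the standard backend APIs. An ERROR in the callback leaves an unpinned segment that transaction-abort cleanup releases automatically.]

#include "postgres.h"
#include "storage/dsm.h"

/*
 * Sketch only: create the segment, run the init callback, and pin it
 * only once initialization has succeeded.
 */
static dsm_handle
create_named_segment(size_t size, void (*init_callback) (void *ptr))
{
    dsm_segment *seg = dsm_create(size, 0);

    /*
     * May ERROR; the segment is still unpinned, so transaction abort
     * releases it automatically.
     */
    if (init_callback)
        (*init_callback) (dsm_segment_address(seg));

    /* Only now make the segment outlive this backend and transaction. */
    dsm_pin_segment(seg);
    dsm_pin_mapping(seg);

    return dsm_segment_handle(seg);
}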
On Sat, Nov 22, 2025 at 10:25:48AM -0500, Robert Haas wrote:
> On Thu, Nov 20, 2025 at 6:09 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Unpinning/detaching the segment/DSA/dshash table and deleting the DSM
>> registry entry in a PG_CATCH block scares me a little, but it might be
>> doable.
>
> It seems a bit weird to be doing explicit unpinning in a PG_CATCH
> block. Ideally you'd want to postpone the pinning until initialization
> has succeeded, so that if you fail before that, transaction cleanup
> takes care of it automatically. Alternatively, destroying what existed
> before could be deferred until later, when an as-yet-unfailed
> transaction stumbles across the tombstone.

Oh, yeah, good idea.

> Am I worrying too much? Possibly! But as I said to David on another
> thread this morning, it's better to worry on pgsql-hackers before any
> problem happens than to start worrying after something bad happens in
> a customer situation.

I'll give what you suggested a try. It seems reasonable enough.

--
nathan
On Sat, Nov 22, 2025 at 01:41:21PM -0600, Nathan Bossart wrote:
> I'll give what you suggested a try. It seems reasonable enough.

Here is a mostly-untested first try.

--
nathan
Attachments
On Mon, Nov 24, 2025 at 12:36:25PM -0600, Nathan Bossart wrote:
> On Sat, Nov 22, 2025 at 01:41:21PM -0600, Nathan Bossart wrote:
>> I'll give what you suggested a try. It seems reasonable enough.
>
> Here is a mostly-untested first try.

I think 0002 was more complicated than necessary. Here's a second try.

--
nathan
Attachments
On Mon, Nov 24, 2025 at 2:03 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I think 0002 was more complicated than necessary. Here's a second try.

I haven't done a full review of this and I'm not sure whether you want me to spend more time on it, but something I notice about this version is that the PG_CATCH() blocks only contain dshash_delete_entry() calls. That seems good, because that means that all the other cleanup is being handled by transaction abort (assuming that the patch isn't buggy, which I haven't attempted to verify). However, it does make me wonder if we could also find a way to postpone adding the dshash entries so that we don't need to do anything in PG_CATCH() blocks at all. I'm guessing that the reason why that doesn't easily work is because you're relying on those locks to prevent multiple backends from doing the same initialization?

--
Robert Haas
EDB: http://www.enterprisedb.com
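[Editor's note: for concreteness, the pattern being described looks roughly like the sketch below. The "registry" table, "initialize_entry" callback, and the 64-byte key buffer are placeholders, not the identifiers or layout used in the patch.]

#include "postgres.h"
#include "lib/dshash.h"

/*
 * Rough sketch of the PG_CATCH pattern described above: if the init
 * callback throws, drop the just-inserted dshash entry and re-throw,
 * leaving everything else to transaction-abort cleanup.
 */
static void *
find_or_create_entry(dshash_table *registry, const char *name,
                     void (*initialize_entry) (void *entry))
{
    bool        found;
    void       *entry;
    char        key[64];        /* assumes a fixed-size string key */

    strlcpy(key, name, sizeof(key));

    entry = dshash_find_or_insert(registry, key, &found);
    if (!found)
    {
        PG_TRY();
        {
            /* May ERROR partway through. */
            initialize_entry(entry);
        }
        PG_CATCH();
        {
            /*
             * Drop the half-built entry; unpinned segments, DSAs, etc.
             * are cleaned up by normal transaction abort.
             */
            dshash_delete_entry(registry, entry);
            PG_RE_THROW();
        }
        PG_END_TRY();
    }

    return entry;
}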
On Mon, Nov 24, 2025 at 04:02:18PM -0500, Robert Haas wrote:
> I haven't done a full review of this and I'm not sure whether you want
> me to spend more time on it...

I'd appreciate an eyeball check if you have the time.

> I'm guessing that the reason why that doesn't easily work is
> because you're relying on those locks to prevent multiple backends
> from doing the same initialization?

For GetNamedDSMSegment(), I bet we could avoid needing a PG_CATCH by taking the DSMRegistryLock exclusively when accessing the registry. But for GetNamedDSA() and GetNamedDSHash(), we want to keep the half-initialized entry so that we don't leak LWLock tranche IDs. I initially thought that might be okay, but if every backend is retrying, you could quickly run out of tranche IDs.

--
nathan
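[Editor's note: one possible reading of the no-PG_CATCH idea for GetNamedDSMSegment(), sketched only to make the trade-off concrete. This is an interpretation, not the patch: the helper names, the create_and_init callback, and the key buffer are hypothetical; only DSMRegistryLock and the dshash/LWLock calls are real APIs. Because DSMRegistryLock is held exclusively, no other backend can create the same entry concurrently, so the entry is inserted only after initialization succeeds; an ERROR before the insert leaves nothing to clean up, and the lock is released by abort processing.]

#include "postgres.h"
#include "lib/dshash.h"
#include "storage/lwlock.h"

/*
 * Hypothetical sketch: register a named segment without needing a
 * PG_CATCH block, by holding DSMRegistryLock exclusively and inserting
 * the dshash entry only after initialization has succeeded.
 */
static void
register_named_segment(dshash_table *registry, const char *name,
                       void (*create_and_init) (const char *name))
{
    bool        found;
    void       *entry;
    char        key[64];        /* assumes a fixed-size string key */

    strlcpy(key, name, sizeof(key));

    LWLockAcquire(DSMRegistryLock, LW_EXCLUSIVE);

    entry = dshash_find(registry, key, false);
    if (entry == NULL)
    {
        /* May ERROR; nothing has been added to the registry yet. */
        create_and_init(name);

        /* Record the entry only once initialization has succeeded. */
        entry = dshash_find_or_insert(registry, key, &found);
        /* ... fill in the entry's handle, size, etc. ... */
    }

    dshash_release_lock(registry, entry);
    LWLockRelease(DSMRegistryLock);
}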