Re: Changing shared_buffers without restart
From | Dmitry Dolgov
Subject | Re: Changing shared_buffers without restart
Date |
Msg-id | eqs6v4rsboazl67xz3wxc6xjkgrpfybitpl45y3lmb2br67wbj@o7czebb3rlgd
In reply to | Re: Changing shared_buffers without restart (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses | Re: Changing shared_buffers without restart
Responses | Re: Changing shared_buffers without restart
List | pgsql-hackers
> On Mon, Apr 07, 2025 at 11:50:46AM GMT, Ashutosh Bapat wrote:
> This is because the BarrierArriveAndWait() only waits for all the
> attached backends. It doesn't wait for backends which are yet to
> attach. I think what we want is *all* the backends should execute all
> the phases synchronously and wait for others to finish. If we don't do
> that, there's a possibility that some of them would see inconsistent
> buffer states or even worse may not have necessary memory mapped and
> resized - thus causing segfaults. Am I correct?
>
> I think what needs to be done is that every backend should wait for other
> backends to attach themselves to the barrier before moving to the
> first phase. One way I can think of is we use two signal barriers -
> one to ensure that all the backends have attached themselves and
> second for the actual resizing. But then the postmaster needs to wait for
> all the processes to process the first signal barrier. A postmaster can
> not wait on anything. Maybe there's a way to poll, but I didn't find
> it. Does that mean that we have to make some other backend a coordinator?

Yes, you're right, a plain dynamic Barrier does not ensure that all available
processes will be synchronized. I was aware of the scenario you describe; it's
mentioned in the comments for the resize function. I was under the impression
that this should be enough, but after some more thinking I'm not so sure
anymore. Let me try to structure it as a list of possible corner cases that we
need to worry about:

* A new backend is spawned while we're busy resizing shared memory. It should
  wait until the resizing is complete and pick up the new size as well.

* An old backend receives the resize message, but exits before attempting to
  resize. It should be excluded from coordination.

* A backend is blocked and not responding, before or after the
  ProcSignalBarrier message was sent. I'm thinking about a failure situation
  where one rogue backend is doing something without checking for interrupts.
  We need to wait for those to become responsive, and potentially abort the
  shared memory resize after some timeout.

* Backends join the barrier in disjoint groups with some time in between,
  longer than what it takes to resize shared memory. That means relying only
  on the shared dynamic barrier is not enough -- it will only synchronize the
  resize procedure within those groups.

Out of those I think the third poses real problems, e.g. if we're shrinking
the shared memory but one backend is accessing the buffer pool without
checking for interrupts. In the v3 implementation this won't be handled
correctly; other backends will ignore such a rogue process.

Independently of that, we could reason about the logic much more easily if it
were guaranteed that all the processes resizing shared memory wait for each
other and start simultaneously. It looks like to achieve that we need a
slightly different combination of a global Barrier and the ProcSignalBarrier
mechanism. We can't use ProcSignalBarrier as it is, because processes need to
wait for each other, and at the same time finish processing to bump the
generation. We also can't use a simple dynamic Barrier due to the possibility
of disjoint groups of processes. A static Barrier is not easier either,
because we would somehow need to know the exact number of processes, which
might change over time.

I think a relatively elegant solution is to extend the ProcSignalBarrier
mechanism to track not only pss_barrierGeneration, as a sign that everything
was processed, but also something like pss_barrierReceivedGeneration,
indicating that the message was received everywhere but not processed yet.
That would be enough to allow processes to wait until the resize message has
been received everywhere, then use a global Barrier to wait until all
processes have finished. It's somewhat similar to your proposal to use two
signals, but has less implementation overhead.

This would also allow different solutions regarding error handling. E.g.
we could wait unboundedly for all the processes we expect to resize, assuming
the user will be able to intervene and fix the issue if there is one. Or we
could do a timed wait, and abort the resize after some timeout if not all
processes are ready yet. The new v4 version of the patch implements the first
option. On top of that there are the following changes:

* Shared memory address space is now reserved for future use, making shared
  memory segment clashes (e.g. due to memory allocation) impossible. There is
  a new GUC to control how much space to reserve, called
  max_available_memory -- on the assumption that most of the time it makes
  sense to set its value to the total amount of memory on the machine. I'm
  open to suggestions regarding the name.

* There is one more patch to address hugepages remapping. As mentioned in
  this thread above, the Linux kernel has certain limitations when it comes
  to mremap for segments allocated with huge pages. To work around that,
  it's possible to replace mremap with a sequence of munmap and mmap,
  relying on the anonymous file behind the segment to keep the memory
  contents. I haven't found any downsides to this approach so far, but it
  makes the anonymous file patch 0007 mandatory.
Attachments
- v4-0001-Allow-to-use-multiple-shared-memory-mappings.patch
- v4-0002-Address-space-reservation-for-shared-memory.patch
- v4-0003-Introduce-multiple-shmem-segments-for-shared-buff.patch
- v4-0004-Introduce-pending-flag-for-GUC-assign-hooks.patch
- v4-0005-Introduce-pss_barrierReceivedGeneration.patch
- v4-0006-Allow-to-resize-shared-memory-without-restart.patch
- v4-0007-Use-anonymous-files-to-back-shared-memory-segment.patch
- v4-0008-Support-resize-for-hugetlb.patch