Re: Changing shared_buffers without restart
| From | Ashutosh Bapat |
|---|---|
| Subject | Re: Changing shared_buffers without restart |
| Date | |
| Msg-id | CAExHW5sVxEwQsuzkgjjJQP9-XVe0H2njEVw1HxeYFdT7u7J+eQ@mail.gmail.com |
| In reply to | Re: Changing shared_buffers without restart (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>) |
| List | pgsql-hackers |
Hi,

PFA new patchset with some TODOs from the previous email addressed.

On Mon, Oct 13, 2025 at 9:28 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> 1. New backends join while the synchronization is going on.

Done. The solution is explained below in detail.

> An existing backend exiting.

Not tested specifically, but should work.

> 2. Failure or crash in the backend which is executing pg_resize_buffer_pool()

Still a TODO.

> 3. Fix crashes in the tests.

The core regression suite passes, the pg_buffercache regression tests pass,
and the tests for buffer resizing pass most of the time. So far I have seen
three issues:
1. An assertion failure from an AIO worker, which happened only once and
which I couldn't reproduce again. I need to study the interaction of AIO
workers with buffer resizing.
2. Checkpointer crashes, which is one of the TODOs listed below.
3. A shared memory id related failure, which I don't understand yet but
which happens more frequently than the first one. I need to look into that.

> go through Tomas's detailed comments and address those
> which still apply.

Still a TODO. But since many of those patches have been revised heavily, I
think many of the comments may have been addressed already and some may not
apply anymore.

> And the patches are still WIP, with many TODOs. But I wanted to get some
> feedback on the proposed UI and synchronization

This is still a request.

> Patches 0001 to 0016 are the same as the previous patchset. I haven't
> touched them in case someone would like to see an incremental change.
> However, it's getting unwieldy at this point, so I will squash
> relevant patches together and provide a patchset with fewer patches
> next.

I have squashed the patches into 3 so that they are easier to review, read
and work with. The work is still WIP and there are many TODOs in the
patches.

Patch 0001:
SQL interface to read the contents of the buffer lookup table. It was
present in the previous patchset as 0001, but in this patchset I have moved
the SQL function to the pg_buffercache module and renamed it accordingly. I
added this change because I found it useful for debugging issues I hit
while testing the buffer resizing patches. The issues were related to
page->buffer mappings which existed in the buffer lookup table but were not
present in the buffer descriptor array or buffer blocks. pg_buffercache,
which traverses just the buffer descriptor array, isn't enough for that.
Even without the resizing functionality, this will help us catch situations
where the buffer descriptor array and the buffer lookup table go out of
sync. I plan to keep it in this patchset as a debugging tool. If other
developers feel that it could be useful, I will propose it in a separate
thread.

Patch 0002:
This is a single patch squashing all the patches related to shared memory
management and address space reservation (0005, 0006, 0007, 0008, 0009 and
0010). It allows the creation of multiple shared memory segments and also
lays them out so as to make them resizable. The actual code to resize the
segments is in the next patch. The APIs used for memory management and
address space reservation are described later. Prominent changes from the
previous patches are:
1. CalculateShmemSize() is modified so that it can work with multiple
shared memory segments.
2. The AnonymousMapping and ShmemSegment structures are combined together
as suggested by Tomas upthread. The merger is still in progress: there are
some old comments and variable names referring to memory mappings where
they should refer to shared memory segments. I will work on that when I
start polishing this patch.
3. The GUC specifying the maximum size of the buffer pool has been renamed
and moved to the next patch, which deals with the actual resizing.
4. Changes to process config reload in AIO workers are removed. Those are
not needed after 55b454d0e14084c841a034073abbf1a0ea937a45.

Patch 0003:
Implements the UI and synchronization described in the previous email [1],
with additional improvements to support a new backend joining while
resizing is in progress. This patch squashes patches 0002 - 0004 and the
patches from 0011 onwards from the previous patchset, but it also gets rid
of a lot of code related to the old synchronization method and the old UI.
The code related to resizing, including the implementation of
pg_resize_shared_buffers(), is moved to storage/buffer/buf_resize.c, a new
file.

There is no change to the UI. Buffer resizing still looks as described in
the previous email:

> SHOW shared_buffers; -- default
>  shared_buffers
> ----------------
>  128MB
> (1 row)
>
> ALTER SYSTEM SET shared_buffers = '64MB';
> SELECT pg_reload_conf();
>  pg_reload_conf
> ----------------
>  t
> (1 row)
>
> SHOW shared_buffers;
>     shared_buffers
> -----------------------
>  128MB (pending: 64MB)
> (1 row)
>
> SELECT pg_resize_shared_buffers();
>  pg_resize_shared_buffers
> --------------------------
>  t
> (1 row)
>
> SHOW shared_buffers;
>  shared_buffers
> ----------------
>  64MB
> (1 row)
>
> ALTER SYSTEM SET shared_buffers = '256MB';
> SELECT pg_reload_conf();
>  pg_reload_conf
> ----------------
>  t
> (1 row)
>
> SHOW shared_buffers;
>     shared_buffers
> -----------------------
>  64MB (pending: 256MB)
> (1 row)
>
> SELECT pg_resize_shared_buffers();
>  pg_resize_shared_buffers
> --------------------------
>  t
> (1 row)
>
> SHOW shared_buffers;
>  shared_buffers
> ----------------
>  256MB
> (1 row)

The implementation uses a strategy similar to the one described in the
previous email, with the changes described below.

A new backend inherits the address space of the shared memory segments and
the local variable NBuffers through the Postmaster. These are changed when
resizing the buffer pool, and the same changes need to be applied to the
Postmaster so that a new backend inherits them. Since the Postmaster is not
part of the ProcSignalBarrier mechanism, the coordinator would have to
signal the Postmaster separately. This has the following drawbacks:
1. Additional code to signal the Postmaster.
2. The coordinator has to wait for the Postmaster to apply the changes
separately, adding extra delays.
3. Platforms which use fork() + exec() add more complexity to transfer the
state to a new child.
4. If the Postmaster is signaled after sending the barrier to other
backends, a newly joined backend will miss the state update as well as the
barrier. If the Postmaster is signaled before sending the barrier to other
backends, a newly joining backend will receive the barrier as well as the
state update from the Postmaster. This means the barrier handling code is
required to be idempotent, which makes it more complex and more
constrained.

Instead, the approach taken by Thomas Munro in [2] does not require
updating the address space. It uses shared memory variables instead of
process-local variables to save the state of the shared buffer pool. This
patchset uses a similar approach, which
1. avoids involving the Postmaster in the resizing process, and
2. additionally makes the barrier handling code super thin.
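To make the shared-state idea concrete, here is a minimal sketch under my
own naming assumptions (BufferPoolShmemState, GetActiveNBuffers and the
empty barrier handler below are illustrative, not the patch's actual code):
because the authoritative sizes live in shared memory, a backend that
attaches after a resize sees them immediately, and the barrier handler has
essentially nothing to copy into backend-local state.

#include "postgres.h"
#include "port/atomics.h"

/*
 * Illustrative only: a shared struct holding the buffer pool sizes, so no
 * process-local copy (like today's NBuffers) needs to be refreshed.
 */
typedef struct BufferPoolShmemState
{
	pg_atomic_uint32 currentNBuffers;	/* buffers that are currently valid */
	pg_atomic_uint32 activeNBuffers;	/* buffers new allocations may use */
} BufferPoolShmemState;

static BufferPoolShmemState *BufPoolState;	/* attached during shmem init */

/* Readers consult shared memory instead of a process-local variable. */
static inline uint32
GetActiveNBuffers(void)
{
	return pg_atomic_read_u32(&BufPoolState->activeNBuffers);
}

/*
 * With the state in shared memory, the ProcSignalBarrier handler only has
 * to acknowledge the barrier; there is no local state to rebuild.
 */
static bool
ProcessBarrierShmemResize(void)
{
	return true;
}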
Shared Memory and address space management
==========================================

An fd is created using memfd_create(), and the size of the shared memory
segment is managed using ftruncate() and fallocate() on that fd. The fd is
passed to mmap(), which reserves the maximum required address space and
maps the anonymous file (and the backing memory) into that address space.
mmap() uses MAP_NORESERVE so that no memory is allocated for the mapping
itself; the size of the anonymous file controls the amount of memory
actually allocated. For the main shared memory segment, the size of the
reserved space is the same as the amount of memory required. But for the
shared buffer pool related segments, the size of the reserved space is
decided by the GUC max_shared_buffers (mentioned in the previous email).
When resizing shared buffers, only the anonymous file is resized, not the
address space. I tested this protocol with the attached small program
(mfdtruncate.c); sharing it in case somebody finds it useful. (A minimal
illustrative sketch of the same protocol appears after the references at
the end of this mail.)

Saving shared buffer pool sizes in shared memory
================================================

When resizing, we need to track two ranges of buffers:
1. active buffers, the range of buffers from which new allocations happen
at a given time, and
2. valid buffers, the range of buffers which are valid at a given time.

When shrinking, the active range is set to the new size while the valid
range remains at the old size until all the buffers outside the new size
have been evicted. When expanding, the valid and active ranges are both
changed to the new size after the memory is resized and the expanded data
structures are initialized. The current global variable NBuffers is
insufficient to track these two numbers. Instead, a new member
StrategyControl::activeNBuffers tracks the active buffer range, and the
shared memory structure controlling the resizing operation (ShmemCtrl) has
a member currentNBuffers which gives the range of valid shared buffers at a
given point in time. (I am planning to merge ShmemCtrl and StrategyControl,
so that all the metadata about shared buffers lives in one place in shared
memory.) These two numbers are kept in shared memory for the reasons
explained above and replace the current NBuffers. They are modified by the
coordinator as the resizing progresses. Some usages of NBuffers have been
replaced by one of the two variables as appropriate, but more work is
required.

Next I will be working on:
1. Background writer synchronization.
2. Checkpoint synchronization.
3. Making all the shared buffer pool structures, except buffer blocks,
static and maximally allocated, as suggested by Andres earlier. [3]
4. Replacing NBuffers usages as explained above.
5. Merging ShmemCtrl and StrategyControl as explained above.
6. Handling failures in resizing.
7. There have been concerns raised earlier that anonymous-file-backed
memory is not dumped with core. I am thinking of not using an anonymous
file for the main memory segment so that it gets dumped with core. But
shared buffers will still be dumped. However, I am skeptical as to whether
we need GBs (say) of shared buffers being dumped along with the core, or
whether we should leave that choice to users.

[1] https://www.postgresql.org/message-id/CAExHW5sOu8+9h6t7jsA5jVcQ--N-LCtjkPnCw+rpoN0ovT6PHg@mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BhUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/qltuzcdxapofdtb5mrd4em3bzu2qiwhp3cdwdsosmn7rhrtn4u%40yaogvphfwc4h

--
Best Wishes,
Ashutosh Bapat
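As promised above, here is a minimal illustrative sketch of the
reserve-once, resize-the-file protocol. This is not the attached
mfdtruncate.c; it is a standalone Linux-only toy whose sizes are arbitrary
stand-ins for max_shared_buffers and shared_buffers.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
	size_t	reserved = 1024UL * 1024 * 1024;	/* max_shared_buffers worth */
	size_t	initial = 128UL * 1024 * 1024;		/* current shared_buffers */
	size_t	expanded = 256UL * 1024 * 1024;		/* new shared_buffers */
	int		fd;
	char   *base;

	/* Anonymous file whose size controls how much memory is allocated. */
	fd = memfd_create("buffers", 0);
	if (fd < 0 || ftruncate(fd, initial) < 0)
	{
		perror("memfd_create/ftruncate");
		return 1;
	}

	/* Reserve the maximum address space once; memory is not committed. */
	base = mmap(NULL, reserved, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_NORESERVE, fd, 0);
	if (base == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	/* Touching pages within the current file size is fine. */
	memset(base, 0, initial);

	/* Growing: only the file is resized; the mapping stays in place. */
	if (ftruncate(fd, expanded) < 0)
	{
		perror("ftruncate (grow)");
		return 1;
	}
	memset(base, 0, expanded);

	/* Shrinking works the same way; pages beyond the new size go away. */
	if (ftruncate(fd, initial) < 0)
	{
		perror("ftruncate (shrink)");
		return 1;
	}

	printf("resized mapping at %p within %zu bytes of reserved space\n",
		   (void *) base, reserved);
	return 0;
}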