Discussion: Fix size estimation for parallel B-Tree scans with skip arrays
Hi folks.
This commit introduced parallel scan skip support; however, it underestimates the required memory, causing it to write past the allocated shared memory boundary. This can corrupt any entity using the adjacent shared memory segment, leading to unpredictable behavior.
I reproduced the issue manually on stock postgres and attached a patch that fixes it, along with regression tests. In my repro, the issue manifested as the postgres server crashing unexpectedly.
Root cause:
In src/backend/access/nbtree/nbtree.c, the loop in btestimateparallelscan assumes that every index column might require a skip array, and adds sizeof(int) per column to the estimated size.
However, every skip array actually needs space for its slot in the btps_arrElems array AND space to store its scan key's sk_flags. Therefore, it requires sizeof(int) * 2 per column.
The attached patch fixes this by allocating sizeof(int) * 2 per attribute in btestimateparallelscan.
Please let me know your thoughts.
Thanks,
Siddharth Kothari
Attachments
Thank you Peter!
On Wed, Apr 29, 2026 at 8:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Apr 29, 2026 at 2:54 AM Siddharth Kothari <sidkot@google.com> wrote:
> Root cause:
>
> In src/backend/access/nbtree/nbtree.c, the loop in btestimateparallelscan assumes that every index column might require a skip array and adds sizeof(int) to the estimated size:
>
> However, every skip array actually needs space for its slot in the btps_arrElems array AND space to store its scan key's sk_flags.
Your diagnosis looks correct to me. As you said, the problem is that
we only add btps_arrElems space overhead for input scan keys -- we
neglect to do the same for any skip array scan key that might be
output by nbtree preprocessing later on.
I'll commit this patch later today.
Thanks!
--
Peter Geoghegan
On 4/29/26 08:54, Siddharth Kothari wrote:
> Hi folks.
>
> This commit <https://github.com/postgres/postgres/commit/92fe23d93aa3bbbc40fca669cabc4a4d7975e327#diff-db0039b5ba12b5915e91ed6eedd78744e3cf7a77082af072d9626a5ae306c579>
> introduced parallel scan skip support, however it underestimates the
> required memory, causing it to write past the allocated shared memory
> boundary. This can corrupt any entity using the adjacent shared memory
> segment, leading to unpredictable behavior.
>
> I reproduced the issue manually on stock postgres and raised a patch
> that fixes it along with regress tests. In my repro, the issue
> manifested as postgres server crashing unexpectedly.

Thanks for the report. I'm able to reproduce the crash using your
reproducer script. At first I was confused why you need a BRIN index
when this report is about btree, but I suppose that's just to force a
parallel index scan. There are easier ways to do that, though, e.g. by
increasing cpu_tuple_cost. Then it's enough to query just the one rel.

How did you discover this issue? I don't think anyone else reported such
crashes, so presumably it's not quite common.

> Root cause:
>
> In src/backend/access/nbtree/nbtree.c, the loop
> in btestimateparallelscan assumes that every index column might require
> a skip array and adds sizeof(int) to the estimated size:
>
> However, every skip array actually needs space for its slot in
> the btps_arrElems array AND space to store its scan key's sk_flags.
> Therefore, it requires sizeof(int) * 2.
>
> The attached patch fixes this by allocating sizeof(int) * 2 per
> attribute in btestimateparallelscan.

It does fix it for me, but I don't know enough about the skip scan
internals to say if the fix is right.

Is there something we could do to deal with this class of bugs (buffer
overflow in shared memory)? For buffers in private memory we have tools
like valgrind and sentinels to make these issues more obvious, but for
shared memory that's not the case ... :-(

regards

--
Tomas Vondra
On Wed, Apr 29, 2026 at 11:42 AM Tomas Vondra <tomas@vondra.me> wrote:
> Thanks for the report. I'm able to reproduce the crash using your
> reproducer script. At first I was confused why you need a BRIN index
> when this report is about btree, but I suppose that's just to force a
> parallel index scan. There are easier ways to do that, though, e.g. by
> increasing cpu_tuple_cost. Then it's enough to query just the one rel.

I pushed the fix a short while ago, but didn't include the tests. I
don't think that the added test cycles would have paid for themselves.

> How did you discover this issue? I don't think anyone else reported such
> crashes, so presumably it's not quite common.

There were many ways that this issue could accidentally fail to fail.
For example, if even one of the skip arrays happened to be on a text
column, there'd almost certainly have been no crash.

In general we're very conservative about the space we request. We have
to be, because the request is made only once, long before we really know
what nbtree preprocessing will do/how many arrays it'll output.

> It does fix it for me, but I don't know enough about the skip scan
> internals to say if the fix is right.

_bt_parallel_serialize_arrays assumes that btscan->btps_arrElems[] has
so->numArrayKeys-many elements -- with and without the fix. The easiest
way to see that the fix is correct is by noticing that
_bt_parallel_serialize_arrays expects a certain layout in shared memory
that btestimateparallelscan wasn't fully handling.

When btestimateparallelscan estimated the amount of shared memory that
the scan will require, it previously neglected to account for how skip
arrays could contribute to the size of so->numArrayKeys. With
Siddharth's fix in place, we conservatively assume that preprocessing
will add the maximum possible number of skip arrays/use the largest
possible so->numArrayKeys when we determine btscan->btps_arrElems[]
space overhead, making _bt_parallel_serialize_arrays agree with
btestimateparallelscan.

> Is there something we could do to deal with this class of bugs (buffer
> overflow in shared memory)? For buffers in private memory we have tools
> like valgrind and sentinels to make these issues more obvious, but for
> shared memory that's not the case ... :-(

I'm not sure that Valgrind style instrumentation would have actually
caught this issue. As I said, our conservative approach could mask the
issue in many ways. Plus the test case involved an index with the
maximum 32 index columns, and an input scan key on the very last index
column, which is obviously very atypical.

--
Peter Geoghegan